AI and ML Network Blog

Class Imbalance in Computer Vision Datasets: The Technical Fix Guide for Engineers

Your YOLO model detects 'car' at 91% mAP but misses 'hard hat' entirely. That's class imbalance killing your rare classes. This guide covers every production-proven fix: RFS, focal loss, mosaic augmentation, annotation strategy, and how to diagnose the problem before training starts.

May 30, 2026 By AI and ML Network

Your model ships. It detects cars at 0.91 mAP. Pedestrians at 0.87. You’re proud.

Then someone runs it on a real construction site. Hardhats, your critical safety class, get detected 40% of the time. Your AV prototype fails on the rare-but-critical scenario. Your manufacturing defect detector misses the exact defect type that causes product recalls.

Class imbalance. The silent assassin of production CV models.

Every team hits this. Most of them diagnose it too late — after training, after failed validation, after a model that looked good on paper performs like garbage on anything that wasn’t in the head classes. This guide is for engineers who want to catch it early, fix it properly, and ship models that actually work on the rare stuff.

What Class Imbalance Actually Is (And the Two Types You Need to Know)

Class imbalance in object detection is not just “some classes have fewer images.” It’s a structural problem with two completely different manifestations — and most tutorials conflate them. Fix the wrong one and you’ve wasted your time.

Type 1: Foreground-Foreground Imbalance. Your dataset has 8,000 images of cars, 1,200 images of motorcycles, and 80 images of cargo bikes. All three are labeled objects. Cars dominate the gradient signal at every training step. The model learns cars efficiently. Motorcycles OK-ish. Cargo bikes? The model just ignores them or mispredicts them as motorcycles.

This is the one everyone worries about. It is real and destructive.

Type 2: Foreground-Background Imbalance. Most pixels in every image are background. In a typical street scene, maybe 5% of pixels belong to labeled objects. The other 95% (road, sky, buildings, texture) are negative samples. One-stage detectors like YOLO score thousands of candidate locations per image (anchor boxes in older versions, grid cells in anchor-free ones). The overwhelming majority of those candidates land on background. The model can reach a low loss just by getting really good at saying “that’s background.”
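
A rough back-of-the-envelope sketch of how lopsided this gets. It assumes an anchor-free head with one prediction per grid cell at strides 8, 16 and 32 on a 640x640 input, and that each labeled object is matched to at most 10 positive cells. Both numbers are illustrative assumptions, not properties of every YOLO version, and the function names are made up for this example.

```python
def candidate_locations(imgsz: int, strides=(8, 16, 32)) -> int:
    """Number of prediction locations: one per grid cell at each stride (assumed layout)."""
    return sum((imgsz // s) ** 2 for s in strides)


def positive_fraction(imgsz: int, n_objects: int, positives_per_object: int = 10) -> float:
    """Fraction of candidate locations assigned to an object (assumed 10 per object)."""
    total = candidate_locations(imgsz)
    positives = min(n_objects * positives_per_object, total)
    return positives / total


total = candidate_locations(640)
frac = positive_fraction(640, n_objects=5)
print(f"{total} candidate locations, {frac:.2%} of them positive")
```

Under those assumptions, well under 1% of the candidates in an image with five objects carry a positive label; everything else is background.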

This is the one that rarely gets discussed in production. RetinaNet was designed specifically to address it with Focal Loss. If you’re using YOLO and getting weird recall behavior on low-count classes, this is probably involved.

Understanding which type you’re fighting determines which fix you reach for. They require different interventions.

Diagnose Before You Fix: How to Audit Your Class Distribution

Don’t skip this step. Most engineers look at their dataset briefly, see it’s “a bit unbalanced,” and jump straight to oversampling. You need hard numbers, not intuitions.

Step 1: Count Instances Per Class (Not Images)

Images per class is a misleading metric for object detection. One image can contain 12 cars and 1 bicycle. If you count by image, it looks like a balanced dataset. If you count by instance, it’s 12:1.

Here’s a quick Python audit for YOLO-format datasets:

import os
from pathlib import Path
from collections import Counter
import matplotlib.pyplot as plt

def audit_yolo_dataset(labels_dir: str, class_names: list[str]) -> dict:
    """
    Count instance distribution across all label files in a YOLO dataset.
    Returns per-class instance counts, image counts, and imbalance ratio.
    """
    labels_path = Path(labels_dir)
    instance_counts = Counter()
    image_counts = Counter()  # how many images contain each class

    for label_file in labels_path.glob("*.txt"):
        classes_in_image = set()
        with open(label_file) as f:
            for line in f:
                parts = line.strip().split()
                if parts:
                    class_id = int(parts[0])
                    instance_counts[class_id] += 1
                    classes_in_image.add(class_id)
        for cls in classes_in_image:
            image_counts[cls] += 1

    # Compute imbalance ratio (max/min instances) over every declared class,
    # so a class with zero labels is not silently left out of the ratio
    per_class = [instance_counts.get(i, 0) for i in range(len(class_names))]
    max_count = max(per_class, default=0)
    min_count = min(per_class, default=0)
    imbalance_ratio = max_count / max(min_count, 1)

    results = {}
    for i, name in enumerate(class_names):
        results[name] = {
            "instances": per_class[i],
            "images": image_counts.get(i, 0),
            # Share relative to the largest class; safe when the dataset is empty
            "instance_ratio": per_class[i] / max(max_count, 1)
        }

    print(f"\n=== Dataset Audit ===")
    print(f"Total classes: {len(class_names)}")
    print(f"Imbalance ratio (max/min instances): {imbalance_ratio:.1f}x\n")
    print(f"{'Class':<25} {'Instances':>10} {'Images':>10} {'Ratio':>8}")
    print("-" * 55)
    for name, data in sorted(results.items(), key=lambda x: x[1]['instances'], reverse=True):
        print(f"{name:<25} {data['instances']:>10,} {data['images']:>10,} {data['instance_ratio']:>7.1%}")

    # Visual distribution
    names = list(results.keys())
    counts = [results[n]['instances'] for n in names]
    sorted_pairs = sorted(zip(counts, names), reverse=True)
    counts_sorted, names_sorted = zip(*sorted_pairs)

    plt.figure(figsize=(14, 5))
    bars = plt.bar(range(len(names_sorted)), counts_sorted, color='steelblue')
    plt.xticks(range(len(names_sorted)), names_sorted, rotation=45, ha='right', fontsize=8)
    plt.ylabel("Instance Count")
    plt.title("Class Instance Distribution — Long Tail Visualization")
    plt.axhline(y=sum(counts_sorted)/len(counts_sorted), color='red',
                linestyle='--', label='Mean instance count')
    plt.legend()
    plt.tight_layout()
    plt.savefig("class_distribution.png", dpi=150)
    plt.close()
    print("\nDistribution chart saved to class_distribution.png")

    return results

# Usage
class_names = ["car", "truck", "pedestrian", "cyclist", "hardhat", "vest", "forklift"]
stats = audit_yolo_dataset("path/to/labels/train", class_names)

Run this before you touch any training config. What you’re looking for:

An imbalance ratio above 10x needs active intervention. Between 3x and 10x you might get away with augmentation alone. Below 3x you’re probably fine with standard training.

The long-tail visualization immediately shows whether you have a handful of dominant classes (head) with a steep drop-off (tail). That curve shape tells you which approach to pick.

Step 2: Check Spatial and Size Distribution Per Class

A class can have adequate instance counts but still confuse the model because all its instances appear in the bottom-left corner of images, or are all tiny. YOLO is particularly sensitive to scale distribution.

import numpy as np

def analyze_bbox_stats(labels_dir: str, class_id: int, class_name: str):
    """
    For a given class, analyze bounding box size and position distribution.
    YOLO format: class cx cy w h (all normalized 0-1)
    """
    labels_path = Path(labels_dir)
    widths, heights, cx_vals, cy_vals = [], [], [], []

    for label_file in labels_path.glob("*.txt"):
        with open(label_file) as f:
            for line in f:
                parts = line.strip().split()
                if parts and int(parts[0]) == class_id:
                    cx, cy, w, h = float(parts[1]), float(parts[2]), float(parts[3]), float(parts[4])
                    cx_vals.append(cx)
                    cy_vals.append(cy)
                    widths.append(w)
                    heights.append(h)

    if not widths:
        print(f"No instances found for class '{class_name}' (id={class_id})")
        return

    widths = np.array(widths)
    heights = np.array(heights)

    print(f"\n=== BBox Analysis: {class_name} ===")
    print(f"Total instances: {len(widths):,}")
    print(f"BBox width  — mean: {widths.mean():.3f}, std: {widths.std():.3f}, "
          f"min: {widths.min():.3f}, max: {widths.max():.3f}")
    print(f"BBox height — mean: {heights.mean():.3f}, std: {heights.std():.3f}, "
          f"min: {heights.min():.3f}, max: {heights.max():.3f}")

    tiny_pct = (widths < 0.05).mean() * 100
    small_pct = ((widths >= 0.05) & (widths < 0.15)).mean() * 100
    medium_pct = ((widths >= 0.15) & (widths < 0.4)).mean() * 100
    large_pct = (widths >= 0.4).mean() * 100

    print(f"\nSize distribution (by normalized width):")
    print(f"  Tiny  (<5%  image width): {tiny_pct:.1f}%")
    print(f"  Small (5-15% image width): {small_pct:.1f}%")
    print(f"  Medium (15-40%): {medium_pct:.1f}%")
    print(f"  Large (>40%):    {large_pct:.1f}%")

    if tiny_pct > 30:
        print(f"\n  WARNING: {tiny_pct:.0f}% of '{class_name}' instances are tiny.")
        print("  Consider training at higher resolution (imgsz=1280) so tiny objects keep enough pixels.")

If a critical class has 80% tiny instances and everything else has 80% medium-to-large, your model will systematically miss the tiny ones regardless of instance count.

The Seven Fixes (In Order of When to Apply Them)

These are ordered deliberately. Start with the data-level fixes before touching model architecture or loss functions. Data fixes compound with everything else. Loss function hacks compound with nothing.

Fix 1: Targeted Data Collection for Tail Classes

The uncomfortable truth that no sampling algorithm will tell you: if your tail classes have under 300 instances each, you need more labeled data. Full stop.

Ultralytics recommends at least 1,500 images per class and 10,000 instances per class for production-quality detection. For tail classes, you’re almost never there. The math is simple — if your head class has 8,000 instances and your tail class has 80, Focal Loss and RFS can help, but they can’t synthesize signal that doesn’t exist in your data distribution.

The fix is targeted annotation. If you’re training a warehouse model and your “pallet jack” class is your tail class, you don’t need more images of forklifts. You need 600 more labeled pallet jacks, in varied lighting, from varied angles, with varied occlusion levels.

This is exactly what a specialized annotation team does differently from crowdsourcing. Crowdsourcing gets you more of whatever’s already common. Targeted annotation fills specific gaps in your tail. We’ve done this repeatedly for industrial CV clients — systematic gap analysis followed by targeted data collection and annotation. The mAP improvement on tail classes is not incremental. It’s transformational.

Fix 2: Repeat Factor Sampling (RFS)

When you genuinely can’t get more data for tail classes, RFS is your first algorithmic weapon. It was introduced with the LVIS benchmark and remains one of the simplest and most effective sampling strategies.

The idea: oversample images containing rare-class instances based on their frequency. Images with common-class instances appear once per epoch at normal rate. Images containing rare-class instances get repeated more frequently. The model sees rare examples proportionally more often during training without any architectural changes.

The repeat factor for each image is:

r(c) = max(1, sqrt(t / f(c)))
rf(i) = max over all classes c in image i of r(c)

Where f(c) is the fraction of images containing class c, and t is a threshold (typically 0.001). If a class appears in 0.1% of images, its repeat factor contribution is sqrt(0.001 / 0.001) = 1.0. If it appears in 0.01%, it contributes sqrt(0.001 / 0.0001) = 3.16. That image gets repeated ~3x per epoch.
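
To make the arithmetic concrete, here is a minimal pure-Python sketch of the per-image repeat factor. It follows the formula as described in this guide; the function name and the toy dataset are made up for illustration, and it is not the Detectron2 implementation.

```python
import math
from collections import Counter


def repeat_factors(image_classes: list[set[int]], t: float = 0.001) -> list[float]:
    """Per-image repeat factor: max over classes of max(1, sqrt(t / f(c)))."""
    n_images = len(image_classes)
    # f(c): fraction of images that contain class c at least once
    image_counts = Counter(c for classes in image_classes for c in classes)
    freq = {c: n / n_images for c, n in image_counts.items()}
    # Frequent classes are floored at 1.0, so nothing is down-sampled
    class_rf = {c: max(1.0, math.sqrt(t / f)) for c, f in freq.items()}
    # The rarest class in an image decides how often the image repeats
    return [max((class_rf[c] for c in classes), default=1.0) for classes in image_classes]


# Toy dataset: class 0 is in every image, class 1 is in 1 of 10,000 images
images = [{0}] * 9999 + [{0, 1}]
rf = repeat_factors(images, t=0.001)
print(rf[0], round(rf[-1], 2))
print(round(repeat_factors(images, t=0.01)[-1], 2))
```

The single image containing class 1 gets a factor of about 3.16 at t=0.001 and about 10 at t=0.01, while the common-only images stay at 1.0. Raising t is what makes the upsampling more aggressive.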

In Detectron2 / PyTorch:

from detectron2.data.samplers import RepeatFactorTrainingSampler
from detectron2.data import get_detection_dataset_dicts

dataset_dicts = get_detection_dataset_dicts(cfg.DATASETS.TRAIN)

repeat_factors = RepeatFactorTrainingSampler.repeat_factors_from_category_frequency(
    dataset_dicts,
    repeat_thresh=0.001   # t — tune this based on your tail class frequency
)

sampler = RepeatFactorTrainingSampler(repeat_factors)

For Ultralytics YOLO, RFS is not built in, but you can implement a weighted dataloader using PyTorch’s WeightedRandomSampler:

import torch
from torch.utils.data import WeightedRandomSampler
from ultralytics.data import YOLODataset
from collections import Counter
import numpy as np

class ClassBalancedYOLODataset(YOLODataset):
    def get_sample_weights(self, t: float = 0.001) -> list[float]:
        """
        Compute per-image repeat factors following RFS logic.
        t: frequency threshold (default 0.001 from LVIS paper)
        """
        # Count how many images contain each class
        class_image_counts = Counter()
        for label in self.labels:
            for cls_id in set(label['cls'].astype(int).flatten().tolist()):
                class_image_counts[cls_id] += 1

        total_images = len(self.labels)
        class_freq = {c: count / total_images for c, count in class_image_counts.items()}

        # Per-image weight = max repeat factor across all classes in the image
        weights = []
        for label in self.labels:
            class_ids = set(label['cls'].astype(int).flatten().tolist())
            if not class_ids:
                weights.append(1.0)
                continue
            repeat_factor = max(
                np.sqrt(t / max(class_freq.get(c, 1e-6), 1e-6))
                for c in class_ids
            )
            weights.append(max(1.0, repeat_factor))  # never under-sample below 1.0x

        return weights


# In your training script:
dataset = ClassBalancedYOLODataset(img_path="data/images/train", ...)
weights = dataset.get_sample_weights(t=0.001)

sampler = WeightedRandomSampler(
    weights=weights,
    num_samples=len(weights),
    replacement=True
)

# Pass sampler to DataLoader

Tune the t threshold. Because the repeat factor is sqrt(t / f(c)), a higher t means more aggressive upsampling of rare classes and more risk of overfitting on rare-class images, while a lower t leaves more classes at the 1.0 floor. Start with t=0.001 and evaluate per-class AP on your validation set before adjusting.

RFS limitation to know: standard RFS weights by image frequency, not instance frequency. Two images can have the same class present, but one image has 1 instance and another has 12. RFS treats them identically. Research from Bosch (IRFS) showed that incorporating instance counts gives ~50% relative AP improvement on rare classes compared to image-only RFS. If you’re dealing with severe tail class imbalance, implement IRFS instead.

Fix 3: Mosaic and MixUp Augmentation

Research on COCO-ZIPF with YOLOv5 found something counterintuitive: sampling strategies and loss reweighting did not significantly move the needle for single-stage detectors like YOLO. What did move the needle was augmentation — specifically Mosaic and MixUp.

Mosaic stitches four images into one training sample. If each of those four images is carefully selected to include tail-class instances, you quadruple rare-class exposure per training step without changing the dataset at all. This is why YOLO’s built-in mosaic is so important for imbalanced scenarios.

Mosaic is on by default in Ultralytics YOLO. The lever you want is mixup:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

results = model.train(
    data="custom.yaml",
    epochs=150,
    imgsz=640,
    mosaic=1.0,      # probability 0.0-1.0, default 1.0 — keep this on
    mixup=0.15,      # 0.0 by default — try 0.1-0.2 for imbalanced datasets
    copy_paste=0.1,  # copy-paste augmentation — excellent for rare classes
    degrees=10.0,    # rotation helps rare classes in varied orientations
    fliplr=0.5,
)

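copy_paste is underused. It cuts object instances out of one image and pastes them into another. It’s the closest thing to synthetic data generation you can do without a generative model. Two caveats: in Ultralytics it relies on segmentation (polygon) labels, so it has no effect on box-only datasets, and it picks instances at random rather than targeting rare classes, so it pays off most when combined with oversampling of tail-class images. For safety-critical rare classes (hardhats, vests, fire extinguishers), copy-paste augmentation can be the difference between a deployed model and a failed one.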

Fix 4: cls_pw — YOLO’s Native Class Weight Control

Ultralytics YOLO has a built-in mechanism for addressing foreground-foreground imbalance through inverse-frequency class weighting. The cls_pw hyperparameter controls this.

# How cls_pw works internally in Ultralytics:
# class_weights = (1.0 / instance_counts_per_class) ^ cls_pw
# Then normalized so mean(weights) = 1.0

# Moderate imbalance (3x-10x ratio):
results = model.train(data="custom.yaml", epochs=150, cls_pw=0.25)

# Severe imbalance (>10x ratio):
results = model.train(data="custom.yaml", epochs=150, cls_pw=0.5)

# Extreme imbalance (>50x ratio):
results = model.train(data="custom.yaml", epochs=150, cls_pw=1.0)

Start at cls_pw=0.25. Check the training logs — Ultralytics prints the computed class weights so you can verify the distribution looks reasonable. If your rarest class still underperforms after 100 epochs with 0.25, move to 0.5. Be careful with cls_pw=1.0 on moderately imbalanced datasets — full inverse-frequency weighting can hurt your head classes.
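
To get a feel for what each setting does before committing to a training run, here is a minimal sketch that applies the formula quoted in the comment above (inverse instance count raised to a power, normalized to mean 1.0). It is written from that description only, not from the Ultralytics source, and the function name is hypothetical.

```python
def class_weights(instance_counts: list[int], power: float) -> list[float]:
    """(1 / count) ** power per class, rescaled so the mean weight is 1.0."""
    # Treat zero-count classes as count 1 so we never divide by zero
    raw = [(1.0 / max(n, 1)) ** power for n in instance_counts]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


counts = [8000, 1200, 80]  # cars, motorcycles, cargo bikes from the earlier example
for power in (0.0, 0.25, 0.5, 1.0):
    weights = class_weights(counts, power)
    print(power, [round(w, 2) for w in weights])
```

A power of 0.0 leaves every class at weight 1.0; as the power grows, weight shifts from the head class to the tail class while the mean stays fixed.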

The important thing: cls_pw only reweights the classification loss across classes. It does not address foreground-background imbalance, where easy background locations swamp the loss. For that, you need Focal Loss.

Fix 5: Focal Loss for Foreground-Background Imbalance

Focal Loss was introduced in the RetinaNet paper precisely to solve foreground-background imbalance in dense one-stage detectors. Instead of treating all negative (background) proposals equally, it down-weights easy negatives so the gradient is dominated by hard examples — usually the actual objects, and especially the rare ones.

The formula:

FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)
  • γ (gamma) is the focusing parameter. At γ=0, Focal Loss reduces to standard (α-weighted) cross-entropy. At γ=2 (the RetinaNet default), an easy example with p_t = 0.95 has its loss scaled by (1 - 0.95)^2 = 0.0025, so rare and hard examples dominate.
  • α is a class weighting factor that balances positives against negatives. It can be set from inverse class frequency; RetinaNet uses α = 0.25 together with γ = 2.
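
A small worked example of the modulating factor, using only the formula above with α_t fixed at 1 to isolate the effect of γ. The function name is made up for this sketch.

```python
import math


def focal_term(p_t: float, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for a single example."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.95, 0.5, 0.1):
    ce = -math.log(p_t)                  # plain cross-entropy for comparison
    fl = focal_term(p_t, gamma=2.0)
    factor = (1.0 - p_t) ** 2            # how much the example is scaled down
    print(f"p_t={p_t:.2f}  CE={ce:.4f}  FL={fl:.6f}  factor={factor:.4f}")
```

The well-classified example (p_t = 0.95) keeps only a tiny fraction of its cross-entropy loss, while the badly classified one (p_t = 0.1) keeps most of it.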

In PyTorch if you’re building a custom detector:

import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, gamma: float = 2.0, alpha: float = 0.25, reduction: str = 'mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        p_t = torch.exp(-bce_loss)  # predicted probability for the true class
        focal_weight = (1 - p_t) ** self.gamma
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * focal_weight * bce_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

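For YOLO specifically: recent Ultralytics detection models (YOLOv8, YOLO11) are anchor-free and have no separate objectness head; the class head is trained with BCE over all candidate locations, while older YOLOv5-style heads use BCE for both objectness and class. You can override the loss function by subclassing the trainer, but for most production cases, cls_pw + careful augmentation gets you most of the Focal Loss benefit without architectural surgery.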

If you’re using Detectron2 or MMDetection and training RetinaNet, Focal Loss is already built in. Tune γ in the range 1.5-3.0 based on your imbalance severity.

Fix 6: Stratified Train/Val Split

This one is purely operational but kills models regularly. Engineers split their dataset randomly: 80% train, 20% val. When you have a tail class with 80 instances, a random split will put 64 in train and 16 in val. Maybe. Could be 70/10 or 75/5. Your val set is now too small to measure tail class performance reliably.
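
A quick simulation makes the spread visible. This sketch assumes 8,000 images of which 80 contain the tail class (one instance each, a simplification), and repeats a random 80/20 split 200 times; the function name and the numbers are illustrative.

```python
import random


def tail_images_in_val(n_total=8000, n_tail=80, val_ratio=0.2, trials=200, seed=0):
    """Count how many tail-class images land in val across repeated random splits."""
    rng = random.Random(seed)
    is_tail = [True] * n_tail + [False] * (n_total - n_tail)
    n_val = int(n_total * val_ratio)
    counts = []
    for _ in range(trials):
        rng.shuffle(is_tail)
        counts.append(sum(is_tail[:n_val]))  # first n_val items play the val set
    return counts


counts = tail_images_in_val()
mean = sum(counts) / len(counts)
print(f"tail images in val: min={min(counts)}, max={max(counts)}, mean={mean:.1f}")
```

The mean sits near the expected 16, but individual splits drift well away from it, which is exactly the noise you don’t want in a tail-class metric.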

Stratified splitting ensures each class is represented proportionally across splits. In Python:

import os
import shutil
import random
from pathlib import Path
from collections import defaultdict

def stratified_split(
    images_dir: str,
    labels_dir: str,
    output_dir: str,
    train_ratio: float = 0.8,
    val_ratio: float = 0.15,
    test_ratio: float = 0.05,
    seed: int = 42
):
    """
    Stratified train/val/test split for YOLO-format object detection data.
    Each image can contain multiple classes. We group by the rarest class
    present in each image to ensure tail classes are distributed properly.
    """
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6, "Ratios must sum to 1.0"
    random.seed(seed)

    images_path = Path(images_dir)
    labels_path = Path(labels_dir)

    # Group images by their rarest class (lowest overall frequency)
    class_image_freq = defaultdict(int)
    image_classes = {}

    for label_file in labels_path.glob("*.txt"):
        stem = label_file.stem
        classes_in_image = set()
        with open(label_file) as f:
            for line in f:
                parts = line.strip().split()
                if parts:
                    classes_in_image.add(int(parts[0]))
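        # Keep background-only images (empty label files) under pseudo-class -1
        # so they are split proportionally instead of being silently dropped
        image_classes[stem] = classes_in_image or {-1}
        for c in image_classes[stem]:
            class_image_freq[c] += 1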

    # Assign each image to a "stratum" based on its rarest class
    strata = defaultdict(list)
    for stem, classes in image_classes.items():
        rarest = min(classes, key=lambda c: class_image_freq[c])
        strata[rarest].append(stem)

    # Split each stratum proportionally
    train_stems, val_stems, test_stems = [], [], []
    for cls_id, stems in strata.items():
        random.shuffle(stems)
        n = len(stems)
        n_train = max(1, int(n * train_ratio))
        n_val = max(1, int(n * val_ratio))
        train_stems.extend(stems[:n_train])
        val_stems.extend(stems[n_train:n_train + n_val])
        test_stems.extend(stems[n_train + n_val:])

    splits = {"train": train_stems, "val": val_stems, "test": test_stems}

    # Copy files to output
    exts = [".jpg", ".jpeg", ".png", ".bmp"]
    for split_name, stems in splits.items():
        img_out = Path(output_dir) / split_name / "images"
        lbl_out = Path(output_dir) / split_name / "labels"
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)

        for stem in stems:
            # Copy label
            label_src = labels_path / f"{stem}.txt"
            if label_src.exists():
                shutil.copy(label_src, lbl_out / f"{stem}.txt")
            # Copy image (try each ext)
            for ext in exts:
                img_src = images_path / f"{stem}{ext}"
                if img_src.exists():
                    shutil.copy(img_src, img_out / f"{stem}{ext}")
                    break

        print(f"{split_name}: {len(stems)} images")

    return splits

# Usage
splits = stratified_split(
    images_dir="data/images",
    labels_dir="data/labels",
    output_dir="data/stratified",
    train_ratio=0.80,
    val_ratio=0.15,
    test_ratio=0.05
)

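This puts tail classes in both your training and validation sets roughly in proportion instead of by chance. Two limits to keep in mind: each image is stratified only by its rarest class, so per-class proportions are approximate, and a stratum with a single image lands entirely in train.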
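
It is worth verifying the result rather than trusting it. The sketch below counts, per class, how many images ended up in each split and flags classes missing from train or val. It assumes you have a splits dict like the one returned above plus a stem-to-classes mapping rebuilt from the label files; both helper names are hypothetical.

```python
from collections import defaultdict


def split_class_coverage(splits: dict[str, list[str]],
                         image_classes: dict[str, set[int]]) -> dict[int, dict[str, int]]:
    """Images per class per split, e.g. {4: {"train": 60, "val": 12, "test": 3}}."""
    coverage = defaultdict(lambda: {name: 0 for name in splits})
    for split_name, stems in splits.items():
        for stem in stems:
            for cls_id in image_classes.get(stem, set()):
                coverage[cls_id][split_name] += 1
    return dict(coverage)


def missing_classes(coverage: dict[int, dict[str, int]], required=("train", "val")) -> list[int]:
    """Class ids that have zero images in at least one required split."""
    return sorted(c for c, per_split in coverage.items()
                  if any(per_split.get(s, 0) == 0 for s in required))


# Toy example: class 1 appears only in train
splits = {"train": ["a", "b"], "val": ["c"], "test": []}
image_classes = {"a": {0}, "b": {0, 1}, "c": {0}}
coverage = split_class_coverage(splits, image_classes)
print(coverage)
print("missing from train or val:", missing_classes(coverage))
```

Any class id reported here cannot be evaluated (or cannot be learned) with the current split, so fix the split before training.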

Fix 7: Per-Class AP Monitoring During Training

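Monitoring overall mAP during training is almost useless when you have class imbalance. mAP is a single average over classes, so a few healthy head classes can mask a failing tail class. A model where cars reach 0.92 AP, pedestrians reach 0.88, and hardhats reach 0.12 reports one aggregate number (0.64 for just those three classes, higher once more healthy classes are averaged in) that says nothing about which class is failing.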

You need per-class AP logging on every validation run.

from ultralytics import YOLO

class PerClassAPCallback:
    """Logs per-class AP50 after each validation run."""

    def __init__(self, class_names: list[str], alert_threshold: float = 0.50):
        self.class_names = class_names
        self.alert_threshold = alert_threshold

    def on_val_end(self, validator):
        # Ultralytics passes the validator itself to "on_val_end" callbacks
        metrics = validator.metrics
        box = getattr(metrics, 'box', None)  # detection metrics live under .box
        if box is not None and len(box.ap_class_index):
            print("\n=== Per-Class AP50 ===")
            failed_classes = []
            # box.ap50 holds AP at IoU 0.5 per class, aligned with box.ap_class_index
            for i, cls_idx in enumerate(box.ap_class_index):
                if cls_idx < len(self.class_names):
                    name = self.class_names[cls_idx]
                    ap = float(box.ap50[i])
                    flag = " <<< BELOW THRESHOLD" if ap < self.alert_threshold else ""
                    print(f"  {name:<25} AP50: {ap:.4f}{flag}")
                    if ap < self.alert_threshold:
                        failed_classes.append((name, ap))

            if failed_classes:
                print(f"\n  {len(failed_classes)} class(es) below AP={self.alert_threshold}:")
                for name, ap in failed_classes:
                    print(f"    {name}: {ap:.4f} — consider more training data or targeted augmentation")


model = YOLO("yolo11n.pt")
callback = PerClassAPCallback(
    class_names=["car", "truck", "pedestrian", "cyclist", "hardhat", "vest"],
    alert_threshold=0.60
)
model.add_callback("on_val_end", callback.on_val_end)

results = model.train(data="custom.yaml", epochs=150, imgsz=640, cls_pw=0.5, mixup=0.15)

When a class stays below your threshold across multiple consecutive epochs despite sampling and augmentation fixes, that’s your signal that you need more labeled data — not more hyperparameter tuning.
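
One way to turn “multiple consecutive epochs” into something you can act on is a small streak counter fed with the per-class AP values from each validation run. This is a sketch with hypothetical names and an arbitrary patience value, independent of any particular training framework.

```python
from collections import defaultdict


class StalledClassTracker:
    """Counts consecutive validation runs in which a class stays below threshold."""

    def __init__(self, threshold: float = 0.60, patience: int = 10):
        self.threshold = threshold
        self.patience = patience
        self._streak = defaultdict(int)

    def update(self, per_class_ap: dict[str, float]) -> list[str]:
        """Record one validation run; return classes stalled for >= patience runs."""
        for name, ap in per_class_ap.items():
            # Any run at or above the threshold resets the streak
            self._streak[name] = self._streak[name] + 1 if ap < self.threshold else 0
        return sorted(n for n, s in self._streak.items() if s >= self.patience)


tracker = StalledClassTracker(threshold=0.60, patience=3)
history = [
    {"car": 0.90, "hardhat": 0.20},
    {"car": 0.91, "hardhat": 0.30},
    {"car": 0.92, "hardhat": 0.40},
]
for epoch, per_class_ap in enumerate(history, start=1):
    stalled = tracker.update(per_class_ap)
    print(epoch, stalled)
```

In this toy history, hardhat is reported as stalled on the third run; a single run above the threshold would clear it again.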

The Decision Tree: Which Fix to Apply When

Here’s how to sequence your response based on what your audit found:

Imbalance ratio under 5x (mild): Mosaic augmentation (on by default in YOLO) + cls_pw=0.25. Stratified split. Monitor per-class AP. You probably don’t need anything else.

Imbalance ratio 5x-20x (moderate): Add RFS with t=0.001. Increase mixup=0.15 and copy_paste=0.10. Bump cls_pw=0.5. If tail classes still underperform after 100 epochs, seriously assess whether you need more tail class annotations before adjusting further.

Imbalance ratio 20x-100x (severe): You need more data. RFS, copy-paste, and Focal Loss can help, but they cannot compensate for 20-100x under-representation. Get targeted annotation for your tail classes. Even 300-500 additional instances per tail class will outperform any algorithmic fix on existing data.

Imbalance ratio over 100x (extreme): Your model effectively cannot learn the tail class from your current data regardless of technique. Two options: temporarily exclude the class and treat it as out-of-distribution detection, or launch a dedicated data collection + annotation effort for that class before attempting detection.
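
The tiers above can be sketched as a small lookup you run right after the audit. The thresholds and recommendations are copied from this section; the function name and the handling of the exact boundary values are assumptions.

```python
def imbalance_tier(ratio: float) -> tuple[str, list[str]]:
    """Map a max/min instance ratio to the tier and fixes described above."""
    if ratio < 5:
        return "mild", ["mosaic (default)", "cls_pw=0.25", "stratified split",
                        "per-class AP monitoring"]
    if ratio < 20:
        return "moderate", ["RFS with t=0.001", "mixup=0.15", "copy_paste=0.10",
                            "cls_pw=0.5"]
    if ratio <= 100:
        return "severe", ["targeted annotation for tail classes", "RFS",
                          "copy-paste", "Focal Loss"]
    return "extreme", ["exclude the class for now, or run a dedicated "
                       "collection + annotation effort"]


for ratio in (3, 15, 50, 250):
    tier, actions = imbalance_tier(ratio)
    print(f"{ratio:>4}x -> {tier}: {', '.join(actions)}")
```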

What Good Looks Like After the Fix

Here’s what per-class AP should look like after applying targeted annotation + RFS + mosaic augmentation together for a 15x imbalanced 6-class dataset (real numbers from a recent industrial CV deployment):

Class               Before Fixes (mAP50)   After Targeted Data + RFS + Augmentation (mAP50)
Forklift            0.91                   0.93
Pallet              0.87                   0.88
Worker              0.83                   0.85
Hardhat             0.61                   0.84
Safety Vest         0.48                   0.79
Fire Extinguisher   0.22                   0.71

The head classes barely moved. The tail classes improved by 23 to 49 AP points. The aggregate mAP went from 0.65 to 0.83. This is what targeting the real problem looks like versus just tuning architecture.
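
A quick sanity check on those numbers: the sketch below recomputes the aggregate (an unweighted mean over the six classes) and the per-class gains from the table values. The variable and function names are just for this example.

```python
before = {"Forklift": 0.91, "Pallet": 0.87, "Worker": 0.83,
          "Hardhat": 0.61, "Safety Vest": 0.48, "Fire Extinguisher": 0.22}
after = {"Forklift": 0.93, "Pallet": 0.88, "Worker": 0.85,
         "Hardhat": 0.84, "Safety Vest": 0.79, "Fire Extinguisher": 0.71}


def macro_average(per_class: dict[str, float]) -> float:
    """Unweighted mean over classes, which is how mAP aggregates per-class AP."""
    return sum(per_class.values()) / len(per_class)


gains = {name: after[name] - before[name] for name in before}

print(f"aggregate before: {macro_average(before):.2f}")
print(f"aggregate after:  {macro_average(after):.2f}")
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {name:<18} +{gain * 100:.0f} points")
```

Sorting by gain puts the three tail classes at the top, with the head classes moving by only a point or two.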

The Annotation-First Principle

Every algorithmic fix in this guide improves how your model learns from the data it has. None of them improve the data itself.

The ceiling on what any sampling strategy, loss function, or augmentation technique can do is set by the information content of your annotated tail class instances. If those 80 hardhat annotations are all from the same angle, same lighting, same distance — RFS and copy-paste will just repeat those same limited examples with higher frequency. The model will overfit to those specific contexts and generalize poorly.

Quality tail class annotation means: varied lighting, varied occlusion levels, varied distances, varied image conditions, varied co-occurrence with other classes. This is exactly what makes the difference between a tail class that can be learned and one that can’t.

This is what we focus on when a client comes to us with a class imbalance problem. Not just “label 500 more hardhats.” Label 500 hardhats distributed across the conditions your model will actually encounter in production. That’s the annotation work that compounds with all the algorithmic fixes above.

Your First Step Right Now

Run the audit script on your current dataset. Get the imbalance ratio and the per-class instance and image counts. Then look at which tier you’re in.

If you’re over 20x imbalance on any class that matters, you need a targeted annotation plan before you run another training experiment. We can build that plan with you.

We label your first batch for free — 50 images minimum, 99%+ accuracy, full QA, with per-class distribution analysis included. If the quality holds up, we continue. If not, you’ve lost nothing.

The tail class problem is solvable. It starts with an honest audit of what your data actually looks like.

Talk to us about your class imbalance problem → hello@aiandml.net


AI and ML Network provides production-ready data annotation for computer vision teams across the world. Services include 2D/3D bounding box annotation, keypoint labeling, polygon segmentation, semantic segmentation, and video tracking — all with strict QA and class distribution auditing.

Website: aiandml.net