AI and ML Network Blog

Convert CVAT XML to YOLO and COCO to YOLO Segmentation: Complete Technical Guide

Step-by-step guide to convert CVAT XML annotations to YOLO format and COCO segmentation to YOLO. Includes Python scripts, validation code, and production-ready batch processing for computer vision projects.

May 26, 2026 | By AI and ML Network | 12 min read | Intermediate
Tags: CVAT, YOLO, COCO Format, Data Conversion, Annotation Tools

When you’re scaling computer vision projects, format compatibility blocks your pipeline. Your team annotates in CVAT. Your training expects YOLO. Someone sends you COCO segmentation data. Now what?

Re-annotate everything? Waste three days debugging coordinate systems? Hope your intern doesn’t silently break the dataset?

I’ve converted 50,000+ annotated images at AI and ML Network. Here’s the production-grade process that prevents data loss, silent errors, and training failures.

Why Format Conversion Breaks Your Pipeline

Every annotation tool picks its own coordinate system. CVAT outputs XML with absolute pixels. YOLO wants normalized .txt files. COCO uses complex JSON with polygon arrays.

Your options:

  1. Re-annotate from scratch (expensive, slow, demoralizing)
  2. Modify training code for multiple formats (maintenance nightmare)
  3. Convert formats correctly (fast, scales, works)

Option three wins. But only if your conversion doesn’t introduce silent coordinate errors that destroy model accuracy.

The Format Breakdown: What You’re Actually Converting

CVAT XML Structure:

  • XML-based with nested <image> and <box> tags
  • Absolute pixel coordinates: xtl, ytl, xbr, ybr
  • Supports bounding boxes, polygons, keypoints, attributes
  • One XML file per annotation task

YOLO Format:

  • One .txt file per image
  • Space-separated: class_id x_center y_center width height
  • All coordinates normalized 0-1 range
  • Requires data.yaml for class mapping

COCO JSON Format:

  • Single JSON for entire dataset
  • Arrays: images, annotations, categories
  • Segmentation as polygon coordinates or RLE
  • Bounding box format: [x, y, width, height] in absolute pixels

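The critical difference? Coordinate normalization and box anchoring. CVAT and COCO store absolute pixels measured from the image's top-left corner, while YOLO stores the box center and size as fractions of the image dimensions. Get this wrong once, and every bounding box in your dataset is misaligned.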
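To make the difference concrete, here is a minimal sketch that converts one box from each pixel-based format into YOLO's normalized center format. The function names and the sample numbers are illustrative assumptions, not part of any library.

```python
def cvat_box_to_yolo(xtl, ytl, xbr, ybr, img_w, img_h):
    """CVAT corners (absolute pixels) -> YOLO (x_center, y_center, width, height)."""
    x_center = (xtl + xbr) / 2 / img_w
    y_center = (ytl + ybr) / 2 / img_h
    return x_center, y_center, (xbr - xtl) / img_w, (ybr - ytl) / img_h


def coco_bbox_to_yolo(x, y, w, h, img_w, img_h):
    """COCO [x, y, width, height] (top-left corner, pixels) -> YOLO normalized."""
    return (x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h


# The same 200x100 px box on a 1000x500 image, described both ways
print(cvat_box_to_yolo(100, 50, 300, 150, 1000, 500))   # (0.2, 0.2, 0.2, 0.2)
print(coco_bbox_to_yolo(100, 50, 200, 100, 1000, 500))  # (0.2, 0.2, 0.2, 0.2)
```

Both calls describe the same box, so they should agree. If they don't on your own data, the conversion math is the first place to look.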

Converting CVAT XML to YOLO Format

Environment Setup (Ubuntu 24.04)

pip install opencv-python lxml tqdm --break-system-packages

The Conversion Script

Create cvat_to_yolo.py:

import xml.etree.ElementTree as ET
import os
from pathlib import Path

def parse_cvat_xml(xml_path):
    """Extract annotations from CVAT XML and convert coordinate system"""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    
    annotations = []
    
    for image in root.findall('image'):
        img_name = image.get('name')
        img_width = int(image.get('width'))
        img_height = int(image.get('height'))
        
        image_annotations = []
        
        for box in image.findall('box'):
            label = box.get('label')
            xtl = float(box.get('xtl'))
            ytl = float(box.get('ytl'))
            xbr = float(box.get('xbr'))
            ybr = float(box.get('ybr'))
            
            # CRITICAL: Convert absolute pixels to normalized YOLO format
            x_center = ((xtl + xbr) / 2) / img_width
            y_center = ((ytl + ybr) / 2) / img_height
            width = (xbr - xtl) / img_width
            height = (ybr - ytl) / img_height
            
            image_annotations.append({
                'label': label,
                'x_center': x_center,
                'y_center': y_center,
                'width': width,
                'height': height
            })
        
        annotations.append({
            'image_name': img_name,
            'image_width': img_width,
            'image_height': img_height,
            'boxes': image_annotations
        })
    
    return annotations

def write_yolo_labels(annotations, output_dir, class_mapping):
    """Write YOLO format label files with proper formatting"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    for img_data in annotations:
        img_name = Path(img_data['image_name']).stem
        label_path = os.path.join(output_dir, f"{img_name}.txt")
        
        with open(label_path, 'w') as f:
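            for box in img_data['boxes']:
                if box['label'] not in class_mapping:
                    # Fail loudly: a default of 0 would silently relabel unknown classes
                    raise KeyError(f"Label {box['label']!r} missing from class_mapping")
                class_id = class_mapping[box['label']]
                # 6 decimals keep rounding error well below one pixel
                line = f"{class_id} {box['x_center']:.6f} {box['y_center']:.6f} {box['width']:.6f} {box['height']:.6f}\n"
                f.write(line)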

# Production usage
xml_path = "/path/to/annotations.xml"
output_dir = "/path/to/labels"
class_mapping = {'person': 0, 'vehicle': 1, 'bicycle': 2}

annotations = parse_cvat_xml(xml_path)
write_yolo_labels(annotations, output_dir, class_mapping)

Create data.yaml (Required for YOLO Training)

# data.yaml
path: /path/to/dataset
train: images/train
val: images/val
test: images/test

nc: 3  # number of classes
names: ['person', 'vehicle', 'bicycle']

Common Conversion Errors That Break Training

1. Coordinate Overflow (Values > 1.0)

  • Root cause: Annotation boxes extend outside image bounds
  • Fix: Audit CVAT project for edge-case annotations
  • Detection: Run validation before training
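If auditing every edge-case box in CVAT isn't practical, one option is to clip box corners to the image before normalizing. The sketch below assumes clipping is acceptable for your use case; `clamp_box_to_image` is a hypothetical helper, and dropping or flagging such boxes instead may suit some datasets better.

```python
def clamp_box_to_image(xtl, ytl, xbr, ybr, img_w, img_h):
    """Clip box corners to the image bounds.

    Returns (xtl, ytl, xbr, ybr), or None when the box lies entirely
    outside the image or has no area left after clipping.
    """
    left = max(0.0, min(xtl, xbr))
    right = min(float(img_w), max(xtl, xbr))
    top = max(0.0, min(ytl, ybr))
    bottom = min(float(img_h), max(ytl, ybr))
    if right <= left or bottom <= top:
        return None
    return left, top, right, bottom


print(clamp_box_to_image(-10, 20, 110, 80, 100, 100))  # (0.0, 20, 100.0, 80)
print(clamp_box_to_image(120, 10, 150, 40, 100, 100))  # None
```

In parse_cvat_xml, a call like this would sit right after the four corner values are read and before the normalization step.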

2. Missing Label Files

  • Root cause: Images without annotations don’t get .txt files
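  • Impact: Depending on the training framework, these images are skipped, raise loader errors, or get treated as background without you having decided that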
  • Fix: Create empty .txt files for annotation-free images
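A minimal sketch of that fix, assuming images and labels live in two flat directories and share file stems. The helper name and the extension list are assumptions to adjust for your dataset.

```python
from pathlib import Path

# Assumption: extend this set to match the image types in your dataset
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def create_missing_labels(images_dir, labels_dir):
    """Create an empty .txt for every image without a label file.

    Existing label files are never touched. Returns the stems created.
    """
    labels = Path(labels_dir)
    labels.mkdir(parents=True, exist_ok=True)
    created = []
    for img in sorted(Path(images_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        label_path = labels / f"{img.stem}.txt"
        if not label_path.exists():
            label_path.touch()
            created.append(img.stem)
    return created
```

Review the returned list before training: an image that should have had annotations but ended up with an empty file is a conversion bug, not a background sample.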

3. Class Mapping Mismatch

  • Root cause: CVAT label names don’t match data.yaml class order
  • Impact: Model trains on wrong classes, accuracy tanks
  • Fix: Explicit class mapping dictionary, verify before batch processing
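One way to verify the mapping before a batch run is to compare it against the names list from data.yaml. This is a sketch with a hypothetical helper name; it assumes names is the plain list as written in data.yaml.

```python
def check_class_mapping(class_mapping, names):
    """Return a list of mismatches between a label->id mapping and data.yaml names."""
    problems = []
    for label, class_id in class_mapping.items():
        if not 0 <= class_id < len(names):
            problems.append(f"{label!r}: id {class_id} is outside 0..{len(names) - 1}")
        elif names[class_id] != label:
            problems.append(f"{label!r}: id {class_id} is {names[class_id]!r} in data.yaml")
    missing = sorted(set(names) - set(class_mapping))
    if missing:
        problems.append(f"names without a mapping: {missing}")
    return problems


names = ['person', 'vehicle', 'bicycle']
print(check_class_mapping({'person': 0, 'vehicle': 1, 'bicycle': 2}, names))  # []
print(check_class_mapping({'person': 0, 'vehicle': 2, 'bicycle': 1}, names))  # two mismatches
```

An empty list means every label points at the index data.yaml expects.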

Converting COCO Segmentation to YOLO Segmentation Format

YOLO segmentation replaces the four bounding box values with a list of normalized polygon vertices:

Format: class_id x1 y1 x2 y2 x3 y3 ... xn yn

All coordinates normalized 0-1, representing polygon vertices.
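As a quick illustration of that layout, the sketch below reads one segmentation line back into pixel coordinates. The function name and the sample line are made up for the example.

```python
def parse_yolo_seg_line(line, img_w, img_h):
    """Split one YOLO segmentation line into (class_id, [(x_px, y_px), ...])."""
    parts = line.split()
    # class_id plus at least 3 x/y pairs means an odd count of 7 or more
    if len(parts) < 7 or len(parts) % 2 == 0:
        raise ValueError("expected class_id followed by at least 3 x/y pairs")
    class_id = int(parts[0])
    coords = [float(v) for v in parts[1:]]
    points = [(coords[i] * img_w, coords[i + 1] * img_h)
              for i in range(0, len(coords), 2)]
    return class_id, points


print(parse_yolo_seg_line("2 0.25 0.5 0.75 0.5 0.5 1.0", 640, 480))
# (2, [(160.0, 240.0), (480.0, 240.0), (320.0, 480.0)])
```

Reading a few converted lines back this way is a cheap sanity check that x and y were not swapped during conversion.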

Setup Requirements

pip install pycocotools opencv-python numpy scikit-image --break-system-packages

COCO to YOLO Segmentation Script

Create coco_to_yolo_seg.py:

import json
import os
from pathlib import Path
import numpy as np

def load_coco_json(json_path):
    """Load and validate COCO format JSON"""
    with open(json_path, 'r') as f:
        coco_data = json.load(f)
    
    # Validate required fields
    required = ['images', 'annotations', 'categories']
    if not all(key in coco_data for key in required):
        raise ValueError(f"Invalid COCO JSON. Missing: {[k for k in required if k not in coco_data]}")
    
    return coco_data

def coco_to_yolo_segmentation(coco_data, output_dir):
    """Convert COCO polygon segmentation to YOLO normalized format"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
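    # Build lookup tables
    images = {img['id']: img for img in coco_data['images']}
    # Sort by category id so class indices match the order written by create_yaml()
    categories = {cat['id']: idx for idx, cat in enumerate(
        sorted(coco_data['categories'], key=lambda c: c['id']))}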
    
    # Group annotations by image
    image_annotations = {}
    for ann in coco_data['annotations']:
        img_id = ann['image_id']
        if img_id not in image_annotations:
            image_annotations[img_id] = []
        image_annotations[img_id].append(ann)
    
    # Process every image so annotation-free images still get an empty label file
    conversion_stats = {'total': 0, 'polygons': 0, 'rle': 0, 'errors': 0}
    
    for img_id, img_info in images.items():
        annotations = image_annotations.get(img_id, [])
        img_name = Path(img_info['file_name']).stem
        img_width = img_info['width']
        img_height = img_info['height']
        
        label_path = os.path.join(output_dir, f"{img_name}.txt")
        
        with open(label_path, 'w') as f:
            for ann in annotations:
                if 'segmentation' not in ann or not ann['segmentation']:
                    continue
                
                class_id = categories[ann['category_id']]
                conversion_stats['total'] += 1
                
                # Handle polygon segmentation (list format)
                if isinstance(ann['segmentation'], list):
                    conversion_stats['polygons'] += 1
                    
                    for seg in ann['segmentation']:
                        # seg format: [x1, y1, x2, y2, ..., xn, yn]
                        if len(seg) < 6:  # Need at least 3 points
                            conversion_stats['errors'] += 1
                            continue
                        
                        # Normalize all coordinates
                        normalized_points = []
                        for i in range(0, len(seg), 2):
                            x_norm = seg[i] / img_width
                            y_norm = seg[i + 1] / img_height
                            
                            # Clamp to valid range
                            x_norm = max(0.0, min(1.0, x_norm))
                            y_norm = max(0.0, min(1.0, y_norm))
                            
                            normalized_points.extend([x_norm, y_norm])
                        
                        # Write YOLO segmentation line
                        points_str = ' '.join([f"{p:.6f}" for p in normalized_points])
                        f.write(f"{class_id} {points_str}\n")
                
                # Handle RLE format (dict format)
                elif isinstance(ann['segmentation'], dict):
                    conversion_stats['rle'] += 1
                    print(f"Warning: RLE format detected for {img_name}. Requires conversion to polygon.")
    
    print(f"Conversion complete:")
    print(f"  Total annotations: {conversion_stats['total']}")
    print(f"  Polygons converted: {conversion_stats['polygons']}")
    print(f"  RLE format (needs handling): {conversion_stats['rle']}")
    print(f"  Errors: {conversion_stats['errors']}")

def create_yaml(coco_data, output_path):
    """Generate data.yaml from COCO categories"""
    categories = [cat['name'] for cat in sorted(coco_data['categories'], key=lambda x: x['id'])]
    
    yaml_content = f"""# Generated from COCO dataset
path: /path/to/dataset
train: images/train
val: images/val

nc: {len(categories)}
names: {categories}
"""
    
    with open(output_path, 'w') as f:
        f.write(yaml_content)
    
    print(f"Created {output_path} with {len(categories)} classes")

# Production usage
coco_json = "/path/to/annotations.json"
output_labels = "/path/to/labels"

coco_data = load_coco_json(coco_json)
coco_to_yolo_segmentation(coco_data, output_labels)
create_yaml(coco_data, "data.yaml")

Handling RLE (Run-Length Encoding) in COCO

Some COCO datasets encode masks as RLE instead of polygons. Here’s the conversion:

from pycocotools import mask as maskUtils
from skimage import measure
import numpy as np

def rle_to_polygon(rle, img_height, img_width):
    """Convert COCO RLE mask to polygon coordinates for YOLO"""
    
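    # Decode RLE to binary mask
    if not isinstance(rle, dict):
        rle = {'size': [img_height, img_width], 'counts': rle}
    if isinstance(rle['counts'], list):
        # Uncompressed RLE (counts is a list of ints) must be compressed before decoding
        rle = maskUtils.frPyObjects(rle, img_height, img_width)
    binary_mask = maskUtils.decode(rle)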
    
    # Extract contours from binary mask
    contours = measure.find_contours(binary_mask, 0.5)
    
    polygons = []
    for contour in contours:
        # Simplify contour (reduce points while preserving shape)
        contour = measure.approximate_polygon(contour, tolerance=1.0)
        
        if len(contour) < 3:  # Need minimum 3 points
            continue
        
        # Convert from (row, col) to (x, y) and flatten
        polygon = []
        for point in contour:
            x = point[1]  # column = x
            y = point[0]  # row = y
            polygon.extend([x, y])
        
        polygons.append(polygon)
    
    return polygons

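# Integrate into main conversion
def handle_rle_annotation(ann, img_width, img_height, categories):
    """Process RLE annotation within COCO to YOLO pipeline"""
    rle = ann['segmentation']
    polygons = rle_to_polygon(rle, img_height, img_width)
    
    yolo_lines = []
    # Map the COCO category_id to the zero-based YOLO index
    # (categories is the lookup built in coco_to_yolo_segmentation)
    class_id = categories[ann['category_id']]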
    
    for polygon in polygons:
        # Normalize coordinates
        normalized = []
        for i in range(0, len(polygon), 2):
            x_norm = polygon[i] / img_width
            y_norm = polygon[i + 1] / img_height
            normalized.extend([x_norm, y_norm])
        
        points_str = ' '.join([f"{p:.6f}" for p in normalized])
        yolo_lines.append(f"{class_id} {points_str}")
    
    return yolo_lines
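
For intuition about what the decoder does, here is a pure-Python sketch of uncompressed COCO RLE, where counts is a list of run lengths that alternate between background and foreground (starting with background) and pixels are laid out column by column. It is a teaching aid only: the helper name is hypothetical, and compressed RLE (where counts is a string) still needs pycocotools.

```python
def decode_uncompressed_rle(counts, height, width):
    """Expand COCO-style run lengths into a height x width grid of 0/1 values."""
    if sum(counts) != height * width:
        raise ValueError("run lengths do not cover the mask")
    flat = []
    value = 0  # runs alternate, starting with background (0)
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # Pixels are stored column by column, so index = col * height + row
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


mask = decode_uncompressed_rle([1, 2, 3], height=2, width=3)
print(mask)  # [[0, 1, 0], [1, 0, 0]]
```

The column-major layout is why a mask decoded with row-major assumptions comes out transposed or scrambled.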

Validation: Catch Errors Before Training

import cv2
import numpy as np

def validate_yolo_format(label_path, img_width, img_height):
    """Validate YOLO label file format and coordinate ranges"""
    errors = []
    
    with open(label_path, 'r') as f:
        for line_num, line in enumerate(f, 1):
            parts = line.strip().split()
            
            if len(parts) < 5:
                errors.append(f"Line {line_num}: Insufficient values")
                continue
            
            try:
                class_id = int(parts[0])
                coords = [float(x) for x in parts[1:]]
            except ValueError as e:
                errors.append(f"Line {line_num}: Parse error - {e}")
                continue
            
            # Validate bounding box (5 values)
            if len(parts) == 5:
                x_center, y_center, width, height = coords
                
                # Check normalization
                if not (0 <= x_center <= 1 and 0 <= y_center <= 1):
                    errors.append(f"Line {line_num}: Center out of bounds")
                
                if not (0 < width <= 1 and 0 < height <= 1):
                    errors.append(f"Line {line_num}: Size out of bounds")
            
            # Validate segmentation (even number of coordinates after class_id)
            else:
                if len(coords) % 2 != 0:
                    errors.append(f"Line {line_num}: Odd coordinate count")
                
                if len(coords) < 6:
                    errors.append(f"Line {line_num}: Polygon needs ≥3 points")
                
                # Check all points normalized
                for i, coord in enumerate(coords):
                    if coord < 0 or coord > 1:
                        errors.append(f"Line {line_num}: Coord {i} = {coord:.4f} out of range")
    
    return errors

def visualize_yolo_annotations(img_path, label_path, class_names):
    """Draw YOLO annotations on image for visual verification"""
    img = cv2.imread(img_path)
    if img is None:
        return None
    
    h, w = img.shape[:2]
    
    with open(label_path, 'r') as f:
        for line in f:
            parts = line.strip().split()
            if not parts:  # skip blank lines instead of crashing on parts[0]
                continue
            class_id = int(parts[0])
            class_name = class_names[class_id] if class_id < len(class_names) else f"Class{class_id}"
            
            # Bounding box
            if len(parts) == 5:
                x_center, y_center, width, height = map(float, parts[1:])
                
                # Convert to pixel coordinates
                x1 = int((x_center - width/2) * w)
                y1 = int((y_center - height/2) * h)
                x2 = int((x_center + width/2) * w)
                y2 = int((y_center + height/2) * h)
                
                cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(img, class_name, (x1, y1-10),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
            
            # Segmentation polygon
            else:
                coords = list(map(float, parts[1:]))
                points = []
                for i in range(0, len(coords), 2):
                    x = int(coords[i] * w)
                    y = int(coords[i+1] * h)
                    points.append([x, y])
                
                points = np.array(points, dtype=np.int32)
                cv2.polylines(img, [points], True, (0, 255, 0), 2)
                
                # Add label at polygon centroid
                M = cv2.moments(points)
                if M["m00"] != 0:
                    cx = int(M["m10"] / M["m00"])
                    cy = int(M["m01"] / M["m00"])
                    cv2.putText(img, class_name, (cx, cy),
                               cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    
    return img

# Run validation pipeline
errors = validate_yolo_format("label.txt", 1920, 1080)
if errors:
    print("Validation errors found:")
    for error in errors:
        print(f"  - {error}")
else:
    print("✓ Validation passed")

# Visual spot check
img = visualize_yolo_annotations("image.jpg", "label.txt", ['person', 'vehicle'])
if img is not None:
    cv2.imwrite("verification.jpg", img)

Batch Processing for Production Pipelines

import glob
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

def batch_convert_cvat_to_yolo(xml_files, output_dir, class_mapping, workers=4):
    """Multi-threaded batch conversion with progress tracking"""
    
    def process_file(xml_file):
        try:
            annotations = parse_cvat_xml(xml_file)
            write_yolo_labels(annotations, output_dir, class_mapping)
            return len(annotations), None
        except Exception as e:
            return 0, str(e)
    
    total_images = 0
    errors = []
    
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(tqdm(
            executor.map(process_file, xml_files),
            total=len(xml_files),
            desc="Converting CVAT to YOLO"
        ))
    
    for result, error in results:
        if error:
            errors.append(error)
        else:
            total_images += result
    
    print(f"\nConversion complete:")
    print(f"  Files processed: {len(xml_files)}")
    print(f"  Images converted: {total_images}")
    print(f"  Errors: {len(errors)}")
    
    if errors:
        print("\nError log:")
        for err in errors[:10]:  # Show first 10
            print(f"  - {err}")
    
    return total_images, errors

# Usage for entire dataset
xml_files = glob.glob("/path/to/annotations/*.xml")
class_mapping = {'person': 0, 'vehicle': 1, 'bicycle': 2}

total, errors = batch_convert_cvat_to_yolo(
    xml_files,
    "/path/to/output/labels",
    class_mapping,
    workers=8
)
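
One thing batch conversion can hide: label files are named after the image stem, so two images with the same stem overwrite each other's labels. This can happen when two tasks both contain a frame_001.jpg, or when image names carry subfolder prefixes. A small pre-flight check, sketched here with a hypothetical helper name:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(image_names):
    """Group image names that would be written to the same <stem>.txt label file."""
    by_stem = defaultdict(list)
    for name in image_names:
        by_stem[PurePosixPath(name).stem].append(name)
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}


names = ["cam1/frame_001.jpg", "cam2/frame_001.jpg", "frame_002.png"]
print(find_stem_collisions(names))
# {'frame_001': ['cam1/frame_001.jpg', 'cam2/frame_001.jpg']}
```

To use it on a real batch, collect the image_name of every entry returned by parse_cvat_xml across all XML files and pass that list in before writing any labels.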

Pre-Training Verification Checklist

Run this before training to prevent silent failures:

1. File Count Verification

# Count images vs labels
image_count=$(find images/ -type f | wc -l)
label_count=$(find labels/ -type f -name "*.txt" | wc -l)
echo "Images: $image_count | Labels: $label_count"
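
Matching counts can still hide offsetting mismatches, for example one image without a label and one label without an image. The sketch below compares file stems instead; the helper name and default extensions are assumptions, and it expects flat directories.

```python
from pathlib import Path


def find_unpaired(images_dir, labels_dir, image_exts=(".jpg", ".jpeg", ".png")):
    """Return (images without a label, labels without an image) as sorted stem lists."""
    image_stems = {p.stem for p in Path(images_dir).iterdir()
                   if p.suffix.lower() in image_exts}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Both lists should be empty before training; anything else points at a naming or conversion problem that the raw counts would not reveal.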

2. Format Validation

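import glob
import os
from pathlib import Path

import cv2
from tqdm import tqdm

# Reuses validate_yolo_format() from the validation section
def verify_dataset(labels_dir, images_dir, class_names):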
    """Comprehensive dataset verification"""
    label_files = glob.glob(f"{labels_dir}/*.txt")
    
    issues = {
        'missing_images': [],
        'format_errors': [],
        'coordinate_errors': [],
        'empty_files': []
    }
    
    for label_file in tqdm(label_files, desc="Verifying"):
        # Check corresponding image exists
        img_name = Path(label_file).stem
        img_path = f"{images_dir}/{img_name}.jpg"  # Adjust extension
        
        if not os.path.exists(img_path):
            issues['missing_images'].append(img_name)
            continue
        
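        # Get image dimensions
        img = cv2.imread(img_path)
        if img is None:
            # Unreadable image: report it instead of skipping silently
            issues['format_errors'].append((img_name, ["Image could not be read"]))
            continue
        h, w = img.shape[:2]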
        
        # Validate format
        errors = validate_yolo_format(label_file, w, h)
        if errors:
            issues['format_errors'].append((img_name, errors))
        
        # Check for empty files
        if os.path.getsize(label_file) == 0:
            issues['empty_files'].append(img_name)
    
    # Print summary
    print("\n=== Dataset Verification Report ===")
    print(f"Total labels checked: {len(label_files)}")
    print(f"Missing images: {len(issues['missing_images'])}")
    print(f"Format errors: {len(issues['format_errors'])}")
    print(f"Empty label files: {len(issues['empty_files'])}")
    
    return issues

# Run verification
issues = verify_dataset("labels/train", "images/train", ['person', 'vehicle'])

3. Visual Spot Checks

import random

def random_spot_check(labels_dir, images_dir, class_names, samples=10):
    """Randomly verify visual alignment"""
    label_files = glob.glob(f"{labels_dir}/*.txt")
    samples = random.sample(label_files, min(samples, len(label_files)))
    
    for label_file in samples:
        img_name = Path(label_file).stem
        img_path = f"{images_dir}/{img_name}.jpg"
        
        if os.path.exists(img_path):
            vis = visualize_yolo_annotations(img_path, label_file, class_names)
            if vis is not None:
                output = f"spot_check_{img_name}.jpg"
                cv2.imwrite(output, vis)
                print(f"Generated: {output}")

random_spot_check("labels/train", "images/train", ['person', 'vehicle'])

What Happens When You Get This Wrong

Symptom 1: Model trains but accuracy stays at 0%

  • Root cause: Class IDs misaligned between labels and data.yaml
  • Fix: Verify class mapping, rebuild labels

Symptom 2: Boxes appear in wrong locations

  • Root cause: Coordinate normalization error or origin mismatch
  • Fix: Verify conversion math, check for absolute vs. normalized coordinates

Symptom 3: Training crashes with “invalid bbox” errors

  • Root cause: Coordinates outside 0-1 range
  • Fix: Run validation, clamp coordinates during conversion

Symptom 4: Some classes never get detected

  • Root cause: Some classes have few or no instances in the label files, often because their labels were dropped or remapped during conversion
  • Fix: Verify all classes present in training data, check label distribution
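
Checking the label distribution can be as simple as counting the class ID at the start of every label line. A minimal sketch, assuming one flat labels directory; the helper name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def class_distribution(labels_dir, class_names):
    """Count label lines per class name across all .txt files in labels_dir."""
    counts = Counter({name: 0 for name in class_names})
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            class_id = int(parts[0])
            if 0 <= class_id < len(class_names):
                counts[class_names[class_id]] += 1
            else:
                # IDs outside data.yaml usually mean a mapping bug
                counts[f"unknown_{class_id}"] += 1
    return dict(counts)
```

A class that shows up with a count of zero, or any unknown_ entry, is worth tracing back to the class mapping before spending GPU time.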

How AI and ML Network Handles Format Conversions

We’ve converted 50,000+ annotated images for computer vision teams. Our conversion pipeline includes:

  • Automated validation that catches coordinate errors before they reach training
  • Visual spot-checks on random samples to verify alignment
  • Format-specific QA for CVAT, Label Studio, COCO, YOLO, Pascal VOC
  • Same-day turnaround for urgent project needs

We guarantee 99%+ conversion accuracy. Your model trains on correct data, or we fix it free.

Need a free 50-image sample conversion? Send us your dataset. We’ll convert it and return a validation report so you can verify quality before committing to the full batch.

Final Production Checklist

Before deploying your converted dataset:

  • Label count matches image count
  • All coordinates between 0 and 1
  • Class IDs match data.yaml order
  • Visual spot-checks pass (random 10-20 samples)
  • No empty label files (or intentionally empty for no-annotation images)
  • data.yaml paths point to correct directories
  • Train/val split exists and is balanced
  • Backup of original annotations stored safely

Format conversion isn’t glamorous work. But done right, it saves your team days of debugging and prevents silent training failures that waste GPU hours.


Ready to scale your annotation workflow? We handle the tedious conversion work so your ML team can focus on model development. Contact us at aiandml.net for a free sample batch.
