Skip to content

ArgoHA/D-FINE-seg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

244 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

D-FINE-seg

Real-Time Object Detection and Instance Segmentation

Quick StartUsageExportInferenceBenchmarksVideo TutorialColab

License Contact me


D-FINE-seg extends the D-FINE real-time transformer based object detector with instance segmentation. It adds a lightweight mask head, segmentation-aware training (box-cropped BCE and dice mask losses, auxiliary and denoising mask supervision), and mask-aware Hungarian matching. On the TACO and VisDrone datasets, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency.

The framework covers the full workflow — from data preparation and training (with DDP, EMA, AMP, mosaic) through export (ONNX, TensorRT, OpenVINO) to optimized multi-backend inference for both object detection and instance segmentation tasks.

This is not a fork. The detection core is based on the original D-FINE paper; everything else — segmentation head, training pipeline, export, inference, augmentations — was reimplemented from scratch.

Paper: D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment (coming soon)

Highlights

  • Instance segmentation via a lightweight mask head on top of D-FINE's HybridEncoder PAN outputs — fuses stride 8/16/32 features to 1/4 resolution, then dot-product between per-query mask embeddings (3-layer MLP) and shared mask features produces per-instance masks
  • New losses: box-cropped BCE + Dice mask losses computed only inside GT boxes and normalized by ROI area
  • Mask-aware denoising: contrastive denoising training extended with mask supervision for faster convergence (adds no inference cost)
  • Mask-aware matching: Hungarian matcher augmented with Dice overlap cost and sigmoid focal mask cost alongside classification, L1, and GIoU costs
  • 5 model sizes — Nano, Small, Medium, Large, Extra-Large — with HGNetv2 backbones
  • Production-ready: export to ONNX / TensorRT / OpenVINO, optimized inference with Torch / TRT / OV / ONNX backends

Quick Start

Installation

git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txt

For larger models (L, X), download pretrained backbone weights from Google Drive and place them in the pretrained/ folder.

Prepare Your Data

Organize your dataset in the following structure with YOLO style annotations:

data/dataset/
├── images/    # all images: .jpg, .png, etc.
└── labels/    # all labels: one .txt per image (same filename stem)

Detection labels: class_id xc yc w h (normalized)

Segmentation labels: class_id x1 y1 x2 y2 ... xN yN (normalized polygon coordinates)

Configure

Edit config.yaml — key settings:

task: detect  # "detect" or "segment"
exp_name: my_exp  # experiment name (used in output paths)
model_name: s  # n / s / m / l / x

train:
  root: /path/to/project  # project root, will be used for outputs
  data_path: /path/to/dataset  # folder with images/ and labels/
  label_to_name:
    0: class_a
    1: class_b
  epochs: 75
  batch_size: 8
  img_size: [640, 640]  # (h, w)

Usage

make split           # create train/val CSV splits (test split if configured)
make train           # train the model
make export          # export to ONNX, TensorRT, OpenVINO
make bench           # benchmark all exported models on the val set

make infer           # run on test folder, save visualizations + YOLO txt predictions
make check_errors    # compare predictions against GT, save only mismatches (FP/FN)
make test_batching   # find optimal batch size for your GPU

make ov_int8         # INT8 accuracy-aware quantization for OpenVINO (can take hours)

Notes:

  • make train requires train.csv and val.csv in train.data_path (generated by make split).
  • make infer runs Torch inference on train.path_to_test_data and writes to train.infer_path.

Or run in sequence:

make                 # train -> export -> bench (does not run split)

Or run overwriting configs from CLI

python -m src.dl.train exp_name=my_exp

Enable DDP (multi-GPU) by setting train.ddp.enabled: True and train.ddp.n_gpus: N in config. Then just run make train — it auto-launches with torchrun.

Training Features

Feature Description
DDP Multi-GPU distributed training with SyncBatchNorm
AMP Automatic mixed precision (~40% less VRAM, ~15% faster)
EMA Exponential moving average of weights
Gradient accumulation Effective batch size = batch_size x b_accum_steps
Gradient clipping Configurable max norm
Mosaic augmentation 4-image mosaic with affine transforms (recommended for detection)
Albumentations Rotation, flip, blur, noise, gamma, grayscale, coarse dropout, multiscale
OneCycleLR scheduler Separate learning rates for backbone and head
Early stopping Configurable patience
WandB integration Automatic experiment tracking
Optimal threshold search Auto-finds best confidence threshold after training
Background warm-up Ignore background-only images for N initial epochs

Export

Format Half Precision Notes
ONNX With optional fused postprocessor
TensorRT FP16 Must be exported on the target GPU
OpenVINO FP16, INT8 Single export for FP32 or FP16 (pick during inference) and separate INT8 quantization script

Tip: FP16 is the best latency/accuracy trade-off. For GPU use TensorRT, for CPU - OpenVINO.

Inference

Backends

Four inference backends in src/infer/:

Backend Format Devices
Torch .pt CUDA, MPS, CPU
TensorRT .engine CUDA
OpenVINO .xml CPU, iGPU
ONNX Runtime .onnx CUDA, CPU

Gradio Demo

python -m demo.demo

A web UI for uploading images and running inference interactively.

Benchmarks

VisDrone - object detection

VisDrone dataset - a large-scale drone-captured benchmark with 10 categories across diverse urban and rural scenes (~6500 train / ~550 val / ~1600 test-dev images). YOLO26 trained for 100 epochs, D-FINE for 75. YOLO26 confidence threshold - 0.25, D-FINE - 0.5. F1-score measured with IoU threshold 0.5. Preserved original dataset split (VisDrone2019-DET-train, VisDrone2019-DET-val, VisDrone2019-DET-test-dev). Metrics are reported on test-dev set. Latency measured end-to-end (preprocessing + forward pass + postprocessing) on RTX 5070 Ti with TensorRT FP16 at 640x640, batch size 1.

Model F1-score IoU Precision Recall Latency (ms)
D-FINE N 0.513 0.275 0.665 0.417 2.7
YOLO26 N 0.455 0.226 0.631 0.356 2.8
D-FINE S 0.563 0.315 0.681 0.479 3.2
YOLO26 S 0.510 0.264 0.652 0.419 3.1
D-FINE M 0.587 0.335 0.684 0.514 3.8
YOLO26 M 0.562 0.301 0.667 0.485 3.6
D-FINE L 0.584 0.333 0.669 0.518 4.4
YOLO26 L 0.568 0.308 0.676 0.490 4.1
D-FINE X 0.592 0.338 0.672 0.530 5.7
YOLO26 X 0.584 0.319 0.682 0.510 5.3

D-FINE outperforms YOLO26 in fine-tuning setting on VisDrone dataset in F1-score across every model size. D-FINE achieves ~6% higher mean relative F1-score with ~4% latency overhead. Notably, IoU is ~13% higher (mean relative improvement across all models).

Details

VisDrone

TACO - object detection and instance segmentation

TACO dataset (1500 images, 59 effective classes of waste in diverse environments, 86/14 train/val split by batch ID). The benchmarking environment is the same as for VisDrone.

Instance Segmentation

Model Params (M) F1-score IoU Precision Recall Latency (ms)
D-FINE-seg N 5.1 0.213 0.095 0.272 0.175 4.3
YOLO26-seg N 2.7 0.062 0.027 0.272 0.035 3.8
D-FINE-seg S 11.9 0.263 0.125 0.339 0.215 5.0
YOLO26-seg S 10.4 0.177 0.080 0.278 0.130 4.3
D-FINE-seg M 21.2 0.284 0.134 0.316 0.258 5.8
YOLO26-seg M 23.6 0.267 0.128 0.365 0.210 5.3
D-FINE-seg L 32.8 0.317 0.152 0.369 0.278 6.4
YOLO26-seg L 28.0 0.287 0.137 0.394 0.226 5.8
D-FINE-seg X 64.3 0.350 0.172 0.391 0.318 7.8
YOLO26-seg X 62.8 0.300 0.146 0.408 0.238 7.6

Object Detection

Model Params (M) F1-score IoU Precision Recall Latency (ms)
D-FINE N 3.8 0.223 0.108 0.295 0.180 3.2
YOLO26 N 2.4 0.072 0.033 0.274 0.042 3.4
D-FINE S 10.3 0.274 0.140 0.327 0.240 3.6
YOLO26 S 9.5 0.170 0.081 0.279 0.122 3.5
D-FINE M 19.6 0.282 0.147 0.342 0.239 4.3
YOLO26 M 20.4 0.232 0.115 0.303 0.188 4.2
D-FINE L 31.2 0.342 0.180 0.409 0.294 4.9
YOLO26 L 24.8 0.250 0.128 0.356 0.193 4.7
D-FINE X 62.6 0.364 0.195 0.394 0.339 6.2
YOLO26 X 55.7 0.303 0.158 0.412 0.239 6.1

D-FINE-seg outperforms YOLO26 in fine-tuning setting on TACO dataset in F1-score across every model size (N/S/M/L/X). In segmentation task - ~65% higher mean relative F1-score and 10% latency overhead. In detection task - ~70% higher F1-score and 1% latency overhead.

COCO-style APs

Mask AP (Segmentation)
Model Mask mAP@50-95 Mask mAP@50
D-FINE-seg N 0.094 0.141
YOLO26-seg N 0.041 0.058
D-FINE-seg S 0.177 0.250
YOLO26-seg S 0.111 0.165
D-FINE-seg M 0.157 0.229
YOLO26-seg M 0.195 0.270
D-FINE-seg L 0.212 0.310
YOLO26-seg L 0.174 0.242
D-FINE-seg X 0.242 0.340
YOLO26-seg X 0.210 0.291
Box AP (Detection)
Model Box mAP@50-95 Box mAP@50
D-FINE N 0.123 0.169
YOLO26 N 0.060 0.075
D-FINE S 0.202 0.244
YOLO26 S 0.098 0.124
D-FINE M 0.204 0.246
YOLO26 M 0.172 0.214
D-FINE L 0.256 0.314
YOLO26 L 0.230 0.272
D-FINE X 0.269 0.336
YOLO26 X 0.256 0.300

AP computed with confidence threshold 0.01, max 100 detections per image. D-FINE-seg wins on 4 of 5 mask AP sizes (YOLO26 leads at M) and all 5 box AP sizes.

Format Comparisons

Measured on TACO with D-FINE-seg S / D-FINE S at 640x640. Latency = preprocessing + inference + postprocessing.

Desktop: Intel i5-12400F + RTX 5070 Ti
Model Format F1-score Latency (ms)
D-FINE-seg S Torch FP32 0.263 20.4
D-FINE-seg S TensorRT FP32 0.264 6.5
D-FINE-seg S TensorRT FP16 0.263 5.0
D-FINE S Torch FP32 0.276 18.0
D-FINE S TensorRT FP32 0.272 4.5
D-FINE S TensorRT FP16 0.274 3.6

TensorRT FP16 -> ~4x faster than Torch FP32, no F1 drop

Edge: Intel N150 (OpenVINO)
Model Format F1-score Latency (ms)
D-FINE-seg S FP32 0.264 431.2
D-FINE-seg S FP16 0.264 272.2
D-FINE-seg S INT8 0.243 205.0
D-FINE S FP32 0.272 188.4
D-FINE S FP16 0.271 120.8
D-FINE S INT8 0.250 76.3

FP16 -> ~60% faster than FP32, no F1 drop. INT8 -> ~2x faster than FP32 but noticeable F1 drop

Outputs

Output Location Description
Models + logs output/models/{exp_name}_{date}/ Weights, training metrics, confusion matrix, F1 vs threshold plots, per-class metrics, bench metrics
Debug images output/debug_images/ Preprocessed training images (with augmentations)
Eval predictions output/eval_preds/ Val set predictions with GT (green) and preds (blue)
Bench images output/bench_imgs/ Predictions from all exported models
Infer output/infer/ Visualizations + YOLO txt annotations
Check errors output/check_errors/ FP and FN only — for finding mislabeled samples

Result examples

Training

Training

Benchmarking

Benchmarking

WandB dashboard

WandB

Inference

Citation

If you use D-FINE-seg in your research, please cite:

D-FINE-seg - comming soon

And the original D-FINE paper:

@misc{peng2024dfine,
      title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
      author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
      year={2024},
      eprint={2410.13842},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is licensed under the Apache 2.0 License.

Acknowledgement

The detection core is based on the D-FINE paper and architecture. The mask head design follows the Mask DINO paradigm. Thank you to both teams for their excellent work.

Benchmarks in this project use the VisDrone and TACO datasets. We thank the authors for making these datasets publicly available.

About

D-FINE-seg Object Detection and Segmentation Framework (Train, Export, Inference)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •