Real-Time Object Detection and Instance Segmentation
Quick Start • Usage • Export • Inference • Benchmarks • Video Tutorial • Colab
D-FINE-seg extends the D-FINE real-time transformer based object detector with instance segmentation. It adds a lightweight mask head, segmentation-aware training (box-cropped BCE and dice mask losses, auxiliary and denoising mask supervision), and mask-aware Hungarian matching. On the TACO and VisDrone datasets, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency.
The framework covers the full workflow — from data preparation and training (with DDP, EMA, AMP, mosaic) through export (ONNX, TensorRT, OpenVINO) to optimized multi-backend inference for both object detection and instance segmentation tasks.
This is not a fork. The detection core is based on the original D-FINE paper; everything else — segmentation head, training pipeline, export, inference, augmentations — was reimplemented from scratch.
Paper: D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment (coming soon)
- Instance segmentation via a lightweight mask head on top of D-FINE's HybridEncoder PAN outputs — fuses stride 8/16/32 features to 1/4 resolution, then dot-product between per-query mask embeddings (3-layer MLP) and shared mask features produces per-instance masks
- New losses: box-cropped BCE + Dice mask losses computed only inside GT boxes and normalized by ROI area
- Mask-aware denoising: contrastive denoising training extended with mask supervision for faster convergence (adds no inference cost)
- Mask-aware matching: Hungarian matcher augmented with Dice overlap cost and sigmoid focal mask cost alongside classification, L1, and GIoU costs
- 5 model sizes — Nano, Small, Medium, Large, Extra-Large — with HGNetv2 backbones
- Production-ready: export to ONNX / TensorRT / OpenVINO, optimized inference with Torch / TRT / OV / ONNX backends
git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txtFor larger models (L, X), download pretrained backbone weights from Google Drive and place them in the pretrained/ folder.
Organize your dataset in the following structure with YOLO style annotations:
data/dataset/
├── images/ # all images: .jpg, .png, etc.
└── labels/ # all labels: one .txt per image (same filename stem)
Detection labels: class_id xc yc w h (normalized)
Segmentation labels: class_id x1 y1 x2 y2 ... xN yN (normalized polygon coordinates)
Edit config.yaml — key settings:
task: detect # "detect" or "segment"
exp_name: my_exp # experiment name (used in output paths)
model_name: s # n / s / m / l / x
train:
root: /path/to/project # project root, will be used for outputs
data_path: /path/to/dataset # folder with images/ and labels/
label_to_name:
0: class_a
1: class_b
epochs: 75
batch_size: 8
img_size: [640, 640] # (h, w)make split # create train/val CSV splits (test split if configured)
make train # train the model
make export # export to ONNX, TensorRT, OpenVINO
make bench # benchmark all exported models on the val set
make infer # run on test folder, save visualizations + YOLO txt predictions
make check_errors # compare predictions against GT, save only mismatches (FP/FN)
make test_batching # find optimal batch size for your GPU
make ov_int8 # INT8 accuracy-aware quantization for OpenVINO (can take hours)Notes:
make trainrequirestrain.csvandval.csvintrain.data_path(generated bymake split).make inferruns Torch inference ontrain.path_to_test_dataand writes totrain.infer_path.
Or run in sequence:
make # train -> export -> bench (does not run split)Or run overwriting configs from CLI
python -m src.dl.train exp_name=my_expEnable DDP (multi-GPU) by setting train.ddp.enabled: True and train.ddp.n_gpus: N in config. Then just run make train — it auto-launches with torchrun.
| Feature | Description |
|---|---|
| DDP | Multi-GPU distributed training with SyncBatchNorm |
| AMP | Automatic mixed precision (~40% less VRAM, ~15% faster) |
| EMA | Exponential moving average of weights |
| Gradient accumulation | Effective batch size = batch_size x b_accum_steps |
| Gradient clipping | Configurable max norm |
| Mosaic augmentation | 4-image mosaic with affine transforms (recommended for detection) |
| Albumentations | Rotation, flip, blur, noise, gamma, grayscale, coarse dropout, multiscale |
| OneCycleLR scheduler | Separate learning rates for backbone and head |
| Early stopping | Configurable patience |
| WandB integration | Automatic experiment tracking |
| Optimal threshold search | Auto-finds best confidence threshold after training |
| Background warm-up | Ignore background-only images for N initial epochs |
| Format | Half Precision | Notes |
|---|---|---|
| ONNX | — | With optional fused postprocessor |
| TensorRT | FP16 | Must be exported on the target GPU |
| OpenVINO | FP16, INT8 | Single export for FP32 or FP16 (pick during inference) and separate INT8 quantization script |
Tip: FP16 is the best latency/accuracy trade-off. For GPU use TensorRT, for CPU - OpenVINO.
Four inference backends in src/infer/:
| Backend | Format | Devices |
|---|---|---|
| Torch | .pt |
CUDA, MPS, CPU |
| TensorRT | .engine |
CUDA |
| OpenVINO | .xml |
CPU, iGPU |
| ONNX Runtime | .onnx |
CUDA, CPU |
python -m demo.demoA web UI for uploading images and running inference interactively.
VisDrone dataset - a large-scale drone-captured benchmark with 10 categories across diverse urban and rural scenes (~6500 train / ~550 val / ~1600 test-dev images). YOLO26 trained for 100 epochs, D-FINE for 75. YOLO26 confidence threshold - 0.25, D-FINE - 0.5. F1-score measured with IoU threshold 0.5. Preserved original dataset split (VisDrone2019-DET-train, VisDrone2019-DET-val, VisDrone2019-DET-test-dev). Metrics are reported on test-dev set. Latency measured end-to-end (preprocessing + forward pass + postprocessing) on RTX 5070 Ti with TensorRT FP16 at 640x640, batch size 1.
| Model | F1-score | IoU | Precision | Recall | Latency (ms) |
|---|---|---|---|---|---|
| D-FINE N | 0.513 | 0.275 | 0.665 | 0.417 | 2.7 |
| YOLO26 N | 0.455 | 0.226 | 0.631 | 0.356 | 2.8 |
| D-FINE S | 0.563 | 0.315 | 0.681 | 0.479 | 3.2 |
| YOLO26 S | 0.510 | 0.264 | 0.652 | 0.419 | 3.1 |
| D-FINE M | 0.587 | 0.335 | 0.684 | 0.514 | 3.8 |
| YOLO26 M | 0.562 | 0.301 | 0.667 | 0.485 | 3.6 |
| D-FINE L | 0.584 | 0.333 | 0.669 | 0.518 | 4.4 |
| YOLO26 L | 0.568 | 0.308 | 0.676 | 0.490 | 4.1 |
| D-FINE X | 0.592 | 0.338 | 0.672 | 0.530 | 5.7 |
| YOLO26 X | 0.584 | 0.319 | 0.682 | 0.510 | 5.3 |
D-FINE outperforms YOLO26 in fine-tuning setting on VisDrone dataset in F1-score across every model size. D-FINE achieves ~6% higher mean relative F1-score with ~4% latency overhead. Notably, IoU is ~13% higher (mean relative improvement across all models).
TACO dataset (1500 images, 59 effective classes of waste in diverse environments, 86/14 train/val split by batch ID). The benchmarking environment is the same as for VisDrone.
| Model | Params (M) | F1-score | IoU | Precision | Recall | Latency (ms) |
|---|---|---|---|---|---|---|
| D-FINE-seg N | 5.1 | 0.213 | 0.095 | 0.272 | 0.175 | 4.3 |
| YOLO26-seg N | 2.7 | 0.062 | 0.027 | 0.272 | 0.035 | 3.8 |
| D-FINE-seg S | 11.9 | 0.263 | 0.125 | 0.339 | 0.215 | 5.0 |
| YOLO26-seg S | 10.4 | 0.177 | 0.080 | 0.278 | 0.130 | 4.3 |
| D-FINE-seg M | 21.2 | 0.284 | 0.134 | 0.316 | 0.258 | 5.8 |
| YOLO26-seg M | 23.6 | 0.267 | 0.128 | 0.365 | 0.210 | 5.3 |
| D-FINE-seg L | 32.8 | 0.317 | 0.152 | 0.369 | 0.278 | 6.4 |
| YOLO26-seg L | 28.0 | 0.287 | 0.137 | 0.394 | 0.226 | 5.8 |
| D-FINE-seg X | 64.3 | 0.350 | 0.172 | 0.391 | 0.318 | 7.8 |
| YOLO26-seg X | 62.8 | 0.300 | 0.146 | 0.408 | 0.238 | 7.6 |
| Model | Params (M) | F1-score | IoU | Precision | Recall | Latency (ms) |
|---|---|---|---|---|---|---|
| D-FINE N | 3.8 | 0.223 | 0.108 | 0.295 | 0.180 | 3.2 |
| YOLO26 N | 2.4 | 0.072 | 0.033 | 0.274 | 0.042 | 3.4 |
| D-FINE S | 10.3 | 0.274 | 0.140 | 0.327 | 0.240 | 3.6 |
| YOLO26 S | 9.5 | 0.170 | 0.081 | 0.279 | 0.122 | 3.5 |
| D-FINE M | 19.6 | 0.282 | 0.147 | 0.342 | 0.239 | 4.3 |
| YOLO26 M | 20.4 | 0.232 | 0.115 | 0.303 | 0.188 | 4.2 |
| D-FINE L | 31.2 | 0.342 | 0.180 | 0.409 | 0.294 | 4.9 |
| YOLO26 L | 24.8 | 0.250 | 0.128 | 0.356 | 0.193 | 4.7 |
| D-FINE X | 62.6 | 0.364 | 0.195 | 0.394 | 0.339 | 6.2 |
| YOLO26 X | 55.7 | 0.303 | 0.158 | 0.412 | 0.239 | 6.1 |
D-FINE-seg outperforms YOLO26 in fine-tuning setting on TACO dataset in F1-score across every model size (N/S/M/L/X). In segmentation task - ~65% higher mean relative F1-score and 10% latency overhead. In detection task - ~70% higher F1-score and 1% latency overhead.
Mask AP (Segmentation)
| Model | Mask mAP@50-95 | Mask mAP@50 |
|---|---|---|
| D-FINE-seg N | 0.094 | 0.141 |
| YOLO26-seg N | 0.041 | 0.058 |
| D-FINE-seg S | 0.177 | 0.250 |
| YOLO26-seg S | 0.111 | 0.165 |
| D-FINE-seg M | 0.157 | 0.229 |
| YOLO26-seg M | 0.195 | 0.270 |
| D-FINE-seg L | 0.212 | 0.310 |
| YOLO26-seg L | 0.174 | 0.242 |
| D-FINE-seg X | 0.242 | 0.340 |
| YOLO26-seg X | 0.210 | 0.291 |
Box AP (Detection)
| Model | Box mAP@50-95 | Box mAP@50 |
|---|---|---|
| D-FINE N | 0.123 | 0.169 |
| YOLO26 N | 0.060 | 0.075 |
| D-FINE S | 0.202 | 0.244 |
| YOLO26 S | 0.098 | 0.124 |
| D-FINE M | 0.204 | 0.246 |
| YOLO26 M | 0.172 | 0.214 |
| D-FINE L | 0.256 | 0.314 |
| YOLO26 L | 0.230 | 0.272 |
| D-FINE X | 0.269 | 0.336 |
| YOLO26 X | 0.256 | 0.300 |
AP computed with confidence threshold 0.01, max 100 detections per image. D-FINE-seg wins on 4 of 5 mask AP sizes (YOLO26 leads at M) and all 5 box AP sizes.
Measured on TACO with D-FINE-seg S / D-FINE S at 640x640. Latency = preprocessing + inference + postprocessing.
Desktop: Intel i5-12400F + RTX 5070 Ti
| Model | Format | F1-score | Latency (ms) |
|---|---|---|---|
| D-FINE-seg S | Torch FP32 | 0.263 | 20.4 |
| D-FINE-seg S | TensorRT FP32 | 0.264 | 6.5 |
| D-FINE-seg S | TensorRT FP16 | 0.263 | 5.0 |
| D-FINE S | Torch FP32 | 0.276 | 18.0 |
| D-FINE S | TensorRT FP32 | 0.272 | 4.5 |
| D-FINE S | TensorRT FP16 | 0.274 | 3.6 |
TensorRT FP16 -> ~4x faster than Torch FP32, no F1 drop
Edge: Intel N150 (OpenVINO)
| Model | Format | F1-score | Latency (ms) |
|---|---|---|---|
| D-FINE-seg S | FP32 | 0.264 | 431.2 |
| D-FINE-seg S | FP16 | 0.264 | 272.2 |
| D-FINE-seg S | INT8 | 0.243 | 205.0 |
| D-FINE S | FP32 | 0.272 | 188.4 |
| D-FINE S | FP16 | 0.271 | 120.8 |
| D-FINE S | INT8 | 0.250 | 76.3 |
FP16 -> ~60% faster than FP32, no F1 drop. INT8 -> ~2x faster than FP32 but noticeable F1 drop
| Output | Location | Description |
|---|---|---|
| Models + logs | output/models/{exp_name}_{date}/ |
Weights, training metrics, confusion matrix, F1 vs threshold plots, per-class metrics, bench metrics |
| Debug images | output/debug_images/ |
Preprocessed training images (with augmentations) |
| Eval predictions | output/eval_preds/ |
Val set predictions with GT (green) and preds (blue) |
| Bench images | output/bench_imgs/ |
Predictions from all exported models |
| Infer | output/infer/ |
Visualizations + YOLO txt annotations |
| Check errors | output/check_errors/ |
FP and FN only — for finding mislabeled samples |
Training
Benchmarking
WandB dashboard
Inference
If you use D-FINE-seg in your research, please cite:
D-FINE-seg - comming soon
And the original D-FINE paper:
@misc{peng2024dfine,
title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
year={2024},
eprint={2410.13842},
archivePrefix={arXiv},
primaryClass={cs.CV}
}This project is licensed under the Apache 2.0 License.
The detection core is based on the D-FINE paper and architecture. The mask head design follows the Mask DINO paradigm. Thank you to both teams for their excellent work.
Benchmarks in this project use the VisDrone and TACO datasets. We thank the authors for making these datasets publicly available.







