D-FINE-seg

Real-Time Object Detection and Instance Segmentation

Quick Start • Usage • Export • Inference • Benchmarks • Video Tutorial • Colab

D-FINE-seg extends the D-FINE real-time transformer based object detector with instance segmentation. It adds a lightweight mask head, segmentation-aware training (box-cropped BCE and dice mask losses, auxiliary and denoising mask supervision), and mask-aware Hungarian matching. On the TACO and VisDrone datasets, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency.

The framework covers the full workflow — from data preparation and training (with DDP, EMA, AMP, mosaic) through export (ONNX, TensorRT, OpenVINO) to optimized multi-backend inference for both object detection and instance segmentation tasks.

This is not a fork. The detection core is based on the original D-FINE paper; everything else — segmentation head, training pipeline, export, inference, augmentations — was reimplemented from scratch.

Paper: D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment (coming soon)

Highlights

Instance segmentation via a lightweight mask head on top of D-FINE's HybridEncoder PAN outputs — fuses stride 8/16/32 features to 1/4 resolution, then dot-product between per-query mask embeddings (3-layer MLP) and shared mask features produces per-instance masks
New losses: box-cropped BCE + Dice mask losses computed only inside GT boxes and normalized by ROI area
Mask-aware denoising: contrastive denoising training extended with mask supervision for faster convergence (adds no inference cost)
Mask-aware matching: Hungarian matcher augmented with Dice overlap cost and sigmoid focal mask cost alongside classification, L1, and GIoU costs
5 model sizes — Nano, Small, Medium, Large, Extra-Large — with HGNetv2 backbones
Production-ready: export to ONNX / TensorRT / OpenVINO, optimized inference with Torch / TRT / OV / ONNX backends

Quick Start

Installation

git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txt

For larger models (L, X), download pretrained backbone weights from Google Drive and place them in the pretrained/ folder.

Prepare Your Data

Organize your dataset in the following structure with YOLO style annotations:

data/dataset/
├── images/    # all images: .jpg, .png, etc.
└── labels/    # all labels: one .txt per image (same filename stem)

Detection labels: class_id xc yc w h (normalized)

Segmentation labels: class_id x1 y1 x2 y2 ... xN yN (normalized polygon coordinates)

Configure

Edit config.yaml — key settings:

task: detect  # "detect" or "segment"
exp_name: my_exp  # experiment name (used in output paths)
model_name: s  # n / s / m / l / x

train:
  root: /path/to/project  # project root, will be used for outputs
  data_path: /path/to/dataset  # folder with images/ and labels/
  label_to_name:
    0: class_a
    1: class_b
  epochs: 75
  batch_size: 8
  img_size: [640, 640]  # (h, w)

Usage

make split           # create train/val CSV splits (test split if configured)
make train           # train the model
make export          # export to ONNX, TensorRT, OpenVINO
make bench           # benchmark all exported models on the val set

make infer           # run on test folder, save visualizations + YOLO txt predictions
make check_errors    # compare predictions against GT, save only mismatches (FP/FN)
make test_batching   # find optimal batch size for your GPU

make ov_int8         # INT8 accuracy-aware quantization for OpenVINO (can take hours)

Notes:

make train requires train.csv and val.csv in train.data_path (generated by make split).
make infer runs Torch inference on train.path_to_test_data and writes to train.infer_path.

Or run in sequence:

make                 # train -> export -> bench (does not run split)

Or run overwriting configs from CLI

python -m src.dl.train exp_name=my_exp

Enable DDP (multi-GPU) by setting train.ddp.enabled: True and train.ddp.n_gpus: N in config. Then just run make train — it auto-launches with torchrun.

Training Features

Feature	Description
DDP	Multi-GPU distributed training with SyncBatchNorm
AMP	Automatic mixed precision (~40% less VRAM, ~15% faster)
EMA	Exponential moving average of weights
Gradient accumulation	Effective batch size = `batch_size x b_accum_steps`
Gradient clipping	Configurable max norm
Mosaic augmentation	4-image mosaic with affine transforms (recommended for detection)
Albumentations	Rotation, flip, blur, noise, gamma, grayscale, coarse dropout, multiscale
OneCycleLR scheduler	Separate learning rates for backbone and head
Early stopping	Configurable patience
WandB integration	Automatic experiment tracking
Optimal threshold search	Auto-finds best confidence threshold after training
Background warm-up	Ignore background-only images for N initial epochs

Export

Format	Half Precision	Notes
ONNX	—	With optional fused postprocessor
TensorRT	FP16	Must be exported on the target GPU
OpenVINO	FP16, INT8	Single export for FP32 or FP16 (pick during inference) and separate INT8 quantization script

Tip: FP16 is the best latency/accuracy trade-off. For GPU use TensorRT, for CPU - OpenVINO.

Inference

Backends

Four inference backends in src/infer/:

Backend	Format	Devices
Torch	`.pt`	CUDA, MPS, CPU
TensorRT	`.engine`	CUDA
OpenVINO	`.xml`	CPU, iGPU
ONNX Runtime	`.onnx`	CUDA, CPU

Gradio Demo

python -m demo.demo

A web UI for uploading images and running inference interactively.

Benchmarks

VisDrone - object detection

VisDrone dataset - a large-scale drone-captured benchmark with 10 categories across diverse urban and rural scenes (~6500 train / ~550 val / ~1600 test-dev images). YOLO26 trained for 100 epochs, D-FINE for 75. YOLO26 confidence threshold - 0.25, D-FINE - 0.5. F1-score measured with IoU threshold 0.5. Preserved original dataset split (VisDrone2019-DET-train, VisDrone2019-DET-val, VisDrone2019-DET-test-dev). Metrics are reported on test-dev set. Latency measured end-to-end (preprocessing + forward pass + postprocessing) on RTX 5070 Ti with TensorRT FP16 at 640x640, batch size 1.

Model	F1-score	IoU	Precision	Recall	Latency (ms)
D-FINE N	0.513	0.275	0.665	0.417	2.7
YOLO26 N	0.455	0.226	0.631	0.356	2.8
D-FINE S	0.563	0.315	0.681	0.479	3.2
YOLO26 S	0.510	0.264	0.652	0.419	3.1
D-FINE M	0.587	0.335	0.684	0.514	3.8
YOLO26 M	0.562	0.301	0.667	0.485	3.6
D-FINE L	0.584	0.333	0.669	0.518	4.4
YOLO26 L	0.568	0.308	0.676	0.490	4.1
D-FINE X	0.592	0.338	0.672	0.530	5.7
YOLO26 X	0.584	0.319	0.682	0.510	5.3

D-FINE outperforms YOLO26 in fine-tuning setting on VisDrone dataset in F1-score across every model size. D-FINE achieves ~6% higher mean relative F1-score with ~4% latency overhead. Notably, IoU is ~13% higher (mean relative improvement across all models).

Details

TACO - object detection and instance segmentation

TACO dataset (1500 images, 59 effective classes of waste in diverse environments, 86/14 train/val split by batch ID). The benchmarking environment is the same as for VisDrone.

Instance Segmentation

Model	Params (M)	F1-score	IoU	Precision	Recall	Latency (ms)
D-FINE-seg N	5.1	0.213	0.095	0.272	0.175	4.3
YOLO26-seg N	2.7	0.062	0.027	0.272	0.035	3.8
D-FINE-seg S	11.9	0.263	0.125	0.339	0.215	5.0
YOLO26-seg S	10.4	0.177	0.080	0.278	0.130	4.3
D-FINE-seg M	21.2	0.284	0.134	0.316	0.258	5.8
YOLO26-seg M	23.6	0.267	0.128	0.365	0.210	5.3
D-FINE-seg L	32.8	0.317	0.152	0.369	0.278	6.4
YOLO26-seg L	28.0	0.287	0.137	0.394	0.226	5.8
D-FINE-seg X	64.3	0.350	0.172	0.391	0.318	7.8
YOLO26-seg X	62.8	0.300	0.146	0.408	0.238	7.6

Object Detection

Model	Params (M)	F1-score	IoU	Precision	Recall	Latency (ms)
D-FINE N	3.8	0.223	0.108	0.295	0.180	3.2
YOLO26 N	2.4	0.072	0.033	0.274	0.042	3.4
D-FINE S	10.3	0.274	0.140	0.327	0.240	3.6
YOLO26 S	9.5	0.170	0.081	0.279	0.122	3.5
D-FINE M	19.6	0.282	0.147	0.342	0.239	4.3
YOLO26 M	20.4	0.232	0.115	0.303	0.188	4.2
D-FINE L	31.2	0.342	0.180	0.409	0.294	4.9
YOLO26 L	24.8	0.250	0.128	0.356	0.193	4.7
D-FINE X	62.6	0.364	0.195	0.394	0.339	6.2
YOLO26 X	55.7	0.303	0.158	0.412	0.239	6.1

D-FINE-seg outperforms YOLO26 in fine-tuning setting on TACO dataset in F1-score across every model size (N/S/M/L/X). In segmentation task - ~65% higher mean relative F1-score and 10% latency overhead. In detection task - ~70% higher F1-score and 1% latency overhead.

COCO-style APs

Mask AP (Segmentation)

Model	Mask mAP@50-95	Mask mAP@50
D-FINE-seg N	0.094	0.141
YOLO26-seg N	0.041	0.058
D-FINE-seg S	0.177	0.250
YOLO26-seg S	0.111	0.165
D-FINE-seg M	0.157	0.229
YOLO26-seg M	0.195	0.270
D-FINE-seg L	0.212	0.310
YOLO26-seg L	0.174	0.242
D-FINE-seg X	0.242	0.340
YOLO26-seg X	0.210	0.291

Box AP (Detection)

Model	Box mAP@50-95	Box mAP@50
D-FINE N	0.123	0.169
YOLO26 N	0.060	0.075
D-FINE S	0.202	0.244
YOLO26 S	0.098	0.124
D-FINE M	0.204	0.246
YOLO26 M	0.172	0.214
D-FINE L	0.256	0.314
YOLO26 L	0.230	0.272
D-FINE X	0.269	0.336
YOLO26 X	0.256	0.300

AP computed with confidence threshold 0.01, max 100 detections per image. D-FINE-seg wins on 4 of 5 mask AP sizes (YOLO26 leads at M) and all 5 box AP sizes.

Format Comparisons

Measured on TACO with D-FINE-seg S / D-FINE S at 640x640. Latency = preprocessing + inference + postprocessing.

Desktop: Intel i5-12400F + RTX 5070 Ti

Model	Format	F1-score	Latency (ms)
D-FINE-seg S	Torch FP32	0.263	20.4
D-FINE-seg S	TensorRT FP32	0.264	6.5
D-FINE-seg S	TensorRT FP16	0.263	5.0
D-FINE S	Torch FP32	0.276	18.0
D-FINE S	TensorRT FP32	0.272	4.5
D-FINE S	TensorRT FP16	0.274	3.6

TensorRT FP16 -> ~4x faster than Torch FP32, no F1 drop

Edge: Intel N150 (OpenVINO)

Model	Format	F1-score	Latency (ms)
D-FINE-seg S	FP32	0.264	431.2
D-FINE-seg S	FP16	0.264	272.2
D-FINE-seg S	INT8	0.243	205.0
D-FINE S	FP32	0.272	188.4
D-FINE S	FP16	0.271	120.8
D-FINE S	INT8	0.250	76.3

FP16 -> ~60% faster than FP32, no F1 drop. INT8 -> ~2x faster than FP32 but noticeable F1 drop

Outputs

Output	Location	Description
Models + logs	`output/models/{exp_name}_{date}/`	Weights, training metrics, confusion matrix, F1 vs threshold plots, per-class metrics, bench metrics
Debug images	`output/debug_images/`	Preprocessed training images (with augmentations)
Eval predictions	`output/eval_preds/`	Val set predictions with GT (green) and preds (blue)
Bench images	`output/bench_imgs/`	Predictions from all exported models
Infer	`output/infer/`	Visualizations + YOLO txt annotations
Check errors	`output/check_errors/`	FP and FN only — for finding mislabeled samples

Result examples

Training

Benchmarking

WandB dashboard

Inference

Citation

If you use D-FINE-seg in your research, please cite:

D-FINE-seg - comming soon

And the original D-FINE paper:

@misc{peng2024dfine,
      title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
      author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
      year={2024},
      eprint={2410.13842},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is licensed under the Apache 2.0 License.

Acknowledgement

The detection core is based on the D-FINE paper and architecture. The mask head design follows the Mask DINO paradigm. Thank you to both teams for their excellent work.

Benchmarks in this project use the VisDrone and TACO datasets. We thank the authors for making these datasets publicly available.

Name		Name	Last commit message	Last commit date
Latest commit History 244 Commits
assets		assets
demo		demo
pretrained		pretrained
src		src
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.yaml		config.yaml
config_.yaml		config_.yaml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

D-FINE-seg

Highlights

Quick Start

Installation

Prepare Your Data

Configure

Usage

Training Features

Export

Inference

Backends

Gradio Demo

Benchmarks

VisDrone - object detection

TACO - object detection and instance segmentation

Instance Segmentation

Object Detection

COCO-style APs

Format Comparisons

Outputs

Result examples

Citation

License

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

ArgoHA/D-FINE-seg

Folders and files

Latest commit

History

Repository files navigation

D-FINE-seg

Highlights

Quick Start

Installation

Prepare Your Data

Configure

Usage

Training Features

Export

Inference

Backends

Gradio Demo

Benchmarks

VisDrone - object detection

TACO - object detection and instance segmentation

Instance Segmentation

Object Detection

COCO-style APs

Format Comparisons

Outputs

Result examples

Citation

License

Acknowledgement

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages