
Human Instance Segmentation

Ask DeepWiki

English / 日本語

❗Note❗

The README was written entirely by Claude Code, so it contains non-existent command line options. Please verify the options yourself and correct them. Pull requests are welcome.


A lightweight ROI-based hierarchical instance segmentation model for human detection, trained with knowledge distillation from EfficientNet-based teacher models. It achieves real-time performance through a two-stage hierarchical architecture and temperature-progression distillation.

  • Instance Segmentation Mode (example image in repository)
  • 160x120 Instance Segmentation Mode (example image: 000000229659_segmented)
  • Binary Mask Mode (example image: 000000229849_binary)

Architecture Overview

The Human Instance Segmentation model employs a sophisticated hierarchical segmentation approach that combines:

  • Two-Stage Architecture: Coarse binary segmentation followed by ROI-based instance refinement
  • Multi-Architecture Support: B0 (lightweight), B1 (balanced), and B7 (high-accuracy) variants
  • Knowledge Distillation: Temperature progression (10→1) for efficient knowledge transfer
  • Real-time Processing: Optimized for edge devices with ONNX/TensorRT deployment

Key Features

  • Direct RGB input processing without separate feature extraction
  • Pre-trained UNet for robust binary foreground/background segmentation
  • ROI-based refinement for precise instance separation
  • 3-class output system (background, target instance, non-target instances)
  • Optional post-processing with dilation and edge smoothing

Architecture Details

Model Hierarchy

B0 Architecture (Lightweight)

  • Encoder: EfficientNet-B0 based (timm-efficientnet-b0)
  • Parameters: ~5.3M
  • ONNX Size: ~71MB
  • ROI Size: 64×48 (standard), 80×60 (enhanced)
  • Mask Size: 128×96 (standard), 160×120 (enhanced)
  • Use Case: Real-time edge deployment, mobile devices

B1 Architecture (Balanced)

  • Encoder: EfficientNet-B1 based (timm-efficientnet-b1)
  • Parameters: ~7.8M
  • ONNX Size: ~81MB
  • ROI Size: 64×48 (standard), 80×60 (enhanced)
  • Mask Size: 128×96 (standard), 160×120 (enhanced)
  • Use Case: Balanced performance/accuracy trade-off

B7 Architecture (High-Accuracy)

  • Encoder: EfficientNet-B7 based (timm-efficientnet-b7)
  • Parameters: ~66M
  • ONNX Size: ~90MB
  • ROI Size: 64×48 (standard), 80×60 (enhanced), 128×96 (ultra)
  • Mask Size: 128×96 (standard), 160×120 (enhanced), 256×192 (ultra)
  • Use Case: Maximum accuracy, server deployment

Core Components

1. Pretrained UNet Module

  • Architecture: Enhanced UNet with residual blocks
  • Normalization: LayerNorm2D for stable training
  • Activation: ReLU/SiLU configurable
  • Output: Binary foreground/background mask
  • Training: Frozen during instance segmentation training

2. ROI Extraction Module

  • Input: COCO bounding boxes
  • Normalization: Coordinates normalized to [0, 1]
  • Pooling: Dynamic RoI Align with configurable output sizes
  • Batch Processing: Efficient multi-instance handling
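
The sketch below illustrates this ROI format and the pooling step using torchvision's roi_align; the repository's actual module may differ. Because ROI coordinates are normalized to [0, 1], they must be scaled to the feature map's pixel units before pooling.

import torch
from torchvision.ops import roi_align

# Encoder feature map: [B, C, H, W]
features = torch.randn(1, 64, 120, 160)

# One ROI: [batch_idx, x1, y1, x2, y2] with 0-1 normalized coordinates
rois = torch.tensor([[0.0, 0.25, 0.10, 0.75, 0.90]])

# Scale normalized coordinates to feature-map pixels before pooling
H, W = features.shape[2:]
rois_px = rois.clone()
rois_px[:, [1, 3]] *= W  # x1, x2
rois_px[:, [2, 4]] *= H  # y1, y2

# Pool each ROI to the standard 64x48 grid
pooled = roi_align(features, rois_px, output_size=(64, 48), aligned=True)
print(pooled.shape)  # torch.Size([1, 64, 64, 48])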

3. Instance Segmentation Head

  • Architecture: Hierarchical UNet V2 with attention modules
  • Classes: 3-class segmentation (background, target, non-target)
  • Features:
    • Residual blocks for feature refinement
    • Attention gating for focus on person boundaries
    • Distance-aware loss for better instance separation
    • Contour detection auxiliary task

4. Loss Functions

  • Primary Loss: Weighted CrossEntropy + Dice Loss
  • Class Weights:
    • Background: 0.538
    • Target: 0.750
    • Non-target: 1.712 (1.2× boosted)
  • Auxiliary Losses:
    • Distance transform loss for boundary awareness
    • Contour detection loss for edge refinement
    • Separation-aware weighting for instance distinction
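
As a rough illustration of how these pieces could combine, here is a minimal weighted CrossEntropy + soft Dice loss using the class weights above; the repository's exact formulation (including the auxiliary terms) may differ:

import torch
import torch.nn.functional as F

CLASS_WEIGHTS = torch.tensor([0.538, 0.750, 1.712])  # background, target, non-target

def ce_dice_loss(logits, targets, dice_weight=1.0, eps=1e-6):
    """logits: [N, 3, H, W]; targets: [N, H, W] with values in {0, 1, 2}."""
    ce = F.cross_entropy(logits, targets, weight=CLASS_WEIGHTS.to(logits.device))

    # Soft Dice over the 3 classes
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=3).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()

    return ce + dice_weight * dice

# Example
logits = torch.randn(2, 3, 128, 96)
targets = torch.randint(0, 3, (2, 128, 96))
print(ce_dice_loss(logits, targets))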

Architecture Diagram

             ┌─────────────────────────────┐           ┌──────────────────────────────┐
             │       Input RGB Image       │           │             ROIs             │
             │        [B, 3, H, W]         │           │            [N, 5]            │
             └──────────────┬──────────────┘           │ [batch_idx, x1, y1, x2, y2]  │
                            │                          │ (0-1 normalized coordinates) │
                            │                          └──────────────┬───────────────┘
                            │                                         │
             ┌──────────────▼──────────────┐                          │
             │   Pretrained UNet Module    │                          │
             │    (Frozen during training) │                          │
             │   Output: Binary FG/BG      │                          │
             └──────────────┬──────────────┘                          │
                            │                                         │
              ┌─────────────┴─────────────┐                           │
              │                           │                           │
  ┌───────────▼───────────┐   ┌───────────▼──────────┐                │
  │  Binary Mask Output   │   │   Feature Maps       │                │
  │   [B, 1, H, W]        │   │   for ROI Pooling    │                │
  └───────────┬───────────┘   └───────────┬──────────┘                │
              │                           │                           │
              └─────────────┬─────────────┘                           │
                            │◀────────────────────────────────────────┘
            ┌───────────────▼───────────────┐
            │   Dynamic RoI Align           │
            │  Output: [N, C, H_roi, W_roi] │
            └───────────────┬───────────────┘
                            │
              ┌─────────────┴─────────────┐
              │                           │
 ┌────────────▼───────────┐   ┌───────────▼────────────┐
 │      EfficientNet      │   │  Pretrained UNet Mask  │
 │      Encoder           │   │  (for each ROI)        │
 │      (B0/B1/B7)        │   │  [N, 1, H_roi, W_roi]  │
 └────────────┬───────────┘   └───────────┬────────────┘
              │                           │
              └─────────────┬─────────────┘
                            │
              ┌─────────────▼─────────────┐
              │  Instance Segmentation    │
              │  Head (UNet V2)           │
              │  - Attention Modules      │
              │  - Residual Blocks        │
              │  - Distance-Aware Loss    │
              └─────────────┬─────────────┘
                            │
              ┌─────────────▼─────────────┐
              │   3-Class Output Logits   │
              │   [N, 3, mask_h, mask_w]  │
              │   Classes:                │
              │   0: Background           │
              │   1: Target Instance      │
              │   2: Non-target Instances │
              └─────────────┬─────────────┘
                            │
              ┌─────────────▼─────────────┐
              │   Post-Processing         │
              │   (Optional)              │
              │   - Mask Dilation         │
              │   - Edge Smoothing        │
              └───────────────────────────┘

Model Architecture

(Model architecture image: best_model_b1_80x60_0.8551)

Training Pipeline

Knowledge Distillation Pipeline

  1. Teacher Model Training: Train B7 architecture to high accuracy
  2. Temperature Progression: Gradual temperature reduction (10→1)
  3. Student Training: Distill to B0/B1 with feature and logit matching
  4. Fine-tuning: Optional direct training on target dataset
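
A minimal sketch of the logit distillation in steps 2 and 3 with a linear 10→1 temperature schedule; the function names and schedule shape here are assumptions, not the repository's implementation:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T):
    """Soft-target KL divergence, scaled by T^2 as in Hinton et al. (2015)."""
    log_p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * (T * T)

def temperature_at(epoch, total_epochs, t_start=10.0, t_end=1.0):
    """Linear temperature progression 10 -> 1 over training (assumed schedule)."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return t_start + (t_end - t_start) * frac

# Example: per-pixel logits [N, 3, H, W] flattened to [N*H*W, 3]
student = torch.randn(4, 3, 128, 96)
teacher = torch.randn(4, 3, 128, 96)
s = student.permute(0, 2, 3, 1).reshape(-1, 3)
t = teacher.permute(0, 2, 3, 1).reshape(-1, 3)
T = temperature_at(epoch=0, total_epochs=100)  # T = 10.0 at the start
print(kd_loss(s, t, T))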

Training Stages

  1. Stage 1: UNet Pre-training

    • Binary person segmentation on COCO dataset
    • Frozen after pre-training for all subsequent stages
  2. Stage 2: Knowledge Distillation

    • Teacher model (B7) provides soft targets
    • Temperature progression for smooth knowledge transfer
    • Feature matching at multiple decoder levels
  3. Stage 3: Instance Segmentation Training

    • ROI-based training with 3-class outputs
    • Distance-aware loss for instance separation
    • Auxiliary tasks for boundary refinement

Refinement Mechanism

Hierarchical Refinement Process

  1. Coarse Segmentation: Pretrained UNet provides initial binary mask
  2. ROI Extraction: Extract regions around detected persons
  3. Feature Enhancement: Process ROIs through EfficientNet encoder
  4. Instance Refinement:
    • Apply attention-gated refinement
    • Use binary mask as prior for background suppression
    • Separate overlapping instances via distance transform

Key Refinement Techniques

  • Attention Gating: Focus processing on person boundaries
  • Distance Transform: Encode spatial relationships for better separation
  • Contour Detection: Auxiliary task for edge preservation
  • Separation-Aware Weighting: Boost non-target class for clearer boundaries
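
As a hedged illustration of the distance-transform idea, the sketch below builds a per-pixel weight map that up-weights pixels near instance boundaries; the repository's distance-aware loss may be formulated differently:

import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_weight_map(mask, sigma=5.0):
    """mask: [H, W] binary instance mask. Returns [H, W] float weights >= 1.

    Illustrative weighting scheme, not the repository's exact loss.
    """
    # Distance of every pixel to the nearest pixel of the opposite class
    dist_inside = distance_transform_edt(mask)
    dist_outside = distance_transform_edt(1 - mask)
    dist = dist_inside + dist_outside  # distance to the instance boundary
    return 1.0 + np.exp(-(dist ** 2) / (2 * sigma ** 2))

mask = np.zeros((128, 96), dtype=np.uint8)
mask[32:96, 24:72] = 1
weights = boundary_weight_map(mask)
print(weights.min(), weights.max())  # ~1.0 far from edges, ~2.0 on them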

Dataset Structure

Directory Layout

data/
├── annotations/
│   ├── instances_train2017_person_only_no_crowd.json  # Full training set
│   ├── instances_val2017_person_only_no_crowd.json    # Full validation set
│   ├── instances_train2017_person_only_no_crowd_100imgs.json  # Dev subset
│   └── instances_val2017_person_only_no_crowd_100imgs.json    # Dev subset
├── images/
│   ├── train2017/  # COCO training images
│   └── val2017/    # COCO validation images
└── pretrained/
    ├── best_model_b0_*.pth  # Pretrained B0 models
    ├── best_model_b1_*.pth  # Pretrained B1 models
    └── best_model_b7_*.pth  # Pretrained B7 models

Annotation Format

  • Format: COCO JSON format
  • Categories: Person only (no crowd annotations)
  • Content: Bounding boxes and segmentation polygons
  • Filtering: Crowd instances removed for cleaner training
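
The sketch below shows how a person-only, no-crowd annotation file like those above could be derived from a standard COCO instances file; the repository may ship its own preparation script:

import json

with open("data/annotations/instances_val2017.json") as f:
    coco = json.load(f)

PERSON_ID = 1  # COCO category id for "person"
coco["annotations"] = [
    a for a in coco["annotations"]
    if a["category_id"] == PERSON_ID and a.get("iscrowd", 0) == 0
]
coco["categories"] = [c for c in coco["categories"] if c["id"] == PERSON_ID]

# Drop images that no longer have any annotations
kept = {a["image_id"] for a in coco["annotations"]}
coco["images"] = [im for im in coco["images"] if im["id"] in kept]

with open("data/annotations/instances_val2017_person_only_no_crowd.json", "w") as f:
    json.dump(coco, f)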

Dataset Statistics

  • Full Dataset: ~64K training, ~2.7K validation images
  • Development Subsets: 100, 500 image versions
  • Class Distribution:
    • Background: ~53.8% pixels
    • Target instances: ~33.3% pixels
    • Non-target instances: ~12.9% pixels

Environment Setup

Prerequisites

  • Python 3.10
  • CUDA 11.8+ (for GPU support)
  • uv package manager

Installation with uv

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv

# Activate environment
source .venv/bin/activate  # On Linux/Mac
# or
.venv\Scripts\activate  # On Windows

# Install dependencies
uv pip install -r pyproject.toml

# Install development dependencies (optional)
uv pip install -e ".[dev]"

Verify Installation

# Check PyTorch and CUDA
uv run python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

# Check ONNX Runtime
uv run python -c "import onnxruntime as ort; print(f'ONNX Runtime: {ort.__version__}')"

UNet Distillation Commands

Distillation Configuration Files

  • rgb_hierarchical_unet_v2_distillation_b0_from_b7_temp_prog: B7→B0 distillation
  • rgb_hierarchical_unet_v2_distillation_b1_from_b7_temp_prog: B7→B1 distillation
  • rgb_hierarchical_unet_v2_distillation_b7_from_b7_temp_prog: B7 self-distillation

Basic Distillation Training

# B7 to B0 distillation with temperature progression
uv run python train_distillation_staged.py \
--config rgb_hierarchical_unet_v2_distillation_b0_from_b7_temp_prog \
--epochs 100 \
--batch_size 16

# B7 to B1 distillation
uv run python train_distillation_staged.py \
--config rgb_hierarchical_unet_v2_distillation_b1_from_b7_temp_prog \
--epochs 100 \
--batch_size 12

Advanced Distillation Options

# Resume from checkpoint
uv run python train_distillation_staged.py \
--config rgb_hierarchical_unet_v2_distillation_b7_from_b7_temp_prog \
--resume checkpoints/distillation_epoch_050.pth \
--epochs 100

ROI-Based Hierarchical Training

Standard Configuration Files

  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B0
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B0
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B1
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B1
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B7
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B7

Enhanced Configuration Files

  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B0_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B0_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B1_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B1_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B7_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B7_enhanced
  • rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r128x96m256x192_disttrans_contdet_baware_from_B7_enhanced

Basic Training Commands

# Train B0 model with standard ROI size (development dataset)
uv run python train_advanced.py \
--config rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B0 \
--epochs 10 \
--batch_size 8

# Train B1 model with enhanced ROI size (full dataset)
uv run python train_advanced.py \
--config rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r80x60m160x120_disttrans_contdet_baware_from_B1_enhanced \
--train_ann data/annotations/instances_train2017_person_only_no_crowd.json \
--val_ann data/annotations/instances_val2017_person_only_no_crowd.json \
--epochs 100 \
--batch_size 6

# Train B7 model with ultra ROI size
uv run python train_advanced.py \
--config rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r128x96m256x192_disttrans_contdet_baware_from_B7_enhanced \
--train_ann data/annotations/instances_train2017_person_only_no_crowd.json \
--val_ann data/annotations/instances_val2017_person_only_no_crowd.json \
--epochs 100 \
--batch_size 4

Advanced Training Options

# Resume training from checkpoint
uv run python train_advanced.py \
--config rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B0 \
--resume experiments/*/checkpoints/checkpoint_epoch_0050_640x640_0750.pth \
--epochs 100

# Fine-tuning with smaller learning rate
uv run python train_advanced.py \
--config rgb_hierarchical_unet_v2_fullimage_pretrained_peopleseg_r64x48m128x96_disttrans_contdet_baware_from_B0 \
--pretrained_checkpoint experiments/*/checkpoints/best_model_*.pth \
--learning_rate 1e-5 \
--epochs 20

Validation Commands

# Validate single checkpoint
uv run python validate_advanced.py \
experiments/*/checkpoints/best_model_epoch_*_640x640_*.pth \
--val_ann data/annotations/instances_val2017_person_only_no_crowd.json \
--batch_size 16

ONNX Export

Export Scripts

  • export_peopleseg_onnx.py: Export pretrained UNet models
  • export_hierarchical_instance_peopleseg_onnx.py: Export full hierarchical models
  • export_bilateral_filter.py: Export bilateral filter post-processing
  • export_edge_smoothing_onnx.py: Export edge smoothing modules

Pre-trained weights

https://github.com/PINTO0309/human-instance-segmentation/releases/tag/weights


Basic Export Commands

# Export B0 model to ONNX
uv run python export_hierarchical_instance_peopleseg_onnx.py \
experiments/*/checkpoints/best_model_b0_*.pth \
--output models/b0_model.onnx \
--image_size 640,640

# Export B1 model with 1-pixel dilation
uv run python export_hierarchical_instance_peopleseg_onnx.py \
experiments/*/checkpoints/best_model_b1_*.pth \
--output models/b1_model_dil1.onnx \
--image_size 640,640 \
--dilation_pixels 1

# Export B7 model with custom ROI size
uv run python export_hierarchical_instance_peopleseg_onnx.py \
experiments/*/checkpoints/best_model_b7_*.pth \
--output models/b7_model_ultra.onnx \
--image_size 1024,1024
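
For reference, the --dilation_pixels option (used in the B1 export above) bakes a small mask dilation into the exported graph. A rough post-hoc equivalent of 1-pixel dilation on a binary mask, sketched with OpenCV rather than the exported operator:

import cv2
import numpy as np

mask = np.zeros((128, 96), dtype=np.uint8)
mask[32:96, 24:72] = 255

# Dilate by ~1 pixel: one iteration with a 3x3 structuring element
kernel = np.ones((3, 3), dtype=np.uint8)
dilated = cv2.dilate(mask, kernel, iterations=1)
print(int(dilated.sum() / 255 - mask.sum() / 255), "pixels added at the border")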

Export Post-Processing Modules

# Export edge smoothing module
uv run python export_edge_smoothing_onnx.py

# Export bilateral filter
uv run python export_bilateral_filter.py

ONNX Optimization

# Optimize ONNX model with onnxsim
uv run python -m onnxsim models/b0_model.onnx models/b0_model_opt.onnx

# Verify optimized model
uv run python -c "import onnx; model = onnx.load('models/b0_model_opt.onnx'); onnx.checker.check_model(model); print('Model is valid')"

Test Inference

Test Script: test_hierarchical_instance_peopleseg_onnx.py

Basic Testing

# Test ONNX model on validation images
uv run python test_hierarchical_instance_peopleseg_onnx.py \
--onnx best_model_b1_80x60_0.8551_dil1.onnx \
--annotations data/annotations/instances_val2017_person_only_no_crowd_100imgs.json \
--images_dir data/images/val2017 \
--num_images 5 \
--output_dir test_outputs

# Test with CUDA provider
uv run python test_hierarchical_instance_peopleseg_onnx.py \
--onnx best_model_b1_80x60_0.8551_dil1.onnx \
--annotations data/annotations/instances_val2017_person_only_no_crowd.json \
--provider cuda \
--num_images 10 \
--output_dir test_outputs_cuda

Advanced Testing Options

# Test with binary mask visualization (green overlay)
uv run python test_hierarchical_instance_peopleseg_onnx.py \
--onnx best_model_b1_80x60_0.8551_dil1.onnx \
--annotations data/annotations/instances_val2017_person_only_no_crowd.json \
--num_images 20 \
--binary_mode \
--alpha 0.7 \
--output_dir test_binary_masks

# Test with custom score threshold
uv run python test_hierarchical_instance_peopleseg_onnx.py \
--onnx best_model_b1_80x60_0.8551_dil1.onnx \
--annotations data/annotations/instances_val2017_person_only_no_crowd.json \
--num_images 15 \
--score_threshold 0.5 \
--save_masks \
--output_dir test_high_confidence
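
For a standalone run without the test script, here is a minimal onnxruntime sketch using the input and output names reported by sit4onnx in the benchmarks below; image preprocessing (normalization, resizing) is omitted and should be taken from the test script:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "best_model_b1_80x60_0.8551_dil1.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# images: [1, 3, H, W] float32; rois: [N, 5] = [batch_idx, x1, y1, x2, y2] in [0, 1]
images = np.random.rand(1, 3, 640, 640).astype(np.float32)
rois = np.array([[0, 0.25, 0.10, 0.75, 0.90]], dtype=np.float32)

masks, binary_masks = sess.run(
    ["masks", "binary_masks"], {"images": images, "rois": rois}
)
print(masks.shape)         # (1, 3, 160, 120) -> 3-class logits per ROI
print(binary_masks.shape)  # (1, 1, 640, 640) -> full-image binary mask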

Performance Benchmarking

pip install sit4onnx

# CUDA
sit4onnx -if best_model_b0_64x48_0.8545_dil1.onnx -oep cuda

INFO: file: best_model_b0_64x48_0.8545_dil1.onnx
INFO: providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 480, 640] dtype: float32
INFO: input_name.2: rois shape: [1, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  177.2298812866211 ms
INFO: avg elapsed time per pred:  17.72298812866211 ms
INFO: output_name.1: masks shape: [1, 3, 128, 96] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 480, 640] dtype: float32

sit4onnx -if best_model_b1_80x60_0.8551_dil1.onnx -oep cuda

INFO: file: best_model_b1_80x60_0.8551_dil1.onnx
INFO: providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: input_name.2: rois shape: [1, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  251.79290771484375 ms
INFO: avg elapsed time per pred:  25.179290771484375 ms
INFO: output_name.1: masks shape: [1, 3, 160, 120] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 640, 640] dtype: float32

# TensorRT
sit4onnx -if best_model_b0_64x48_0.8545_dil1.onnx -oep tensorrt

INFO: file: best_model_b0_64x48_0.8545_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 480, 640] dtype: float32
INFO: input_name.2: rois shape: [1, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  47.41835594177246 ms
INFO: avg elapsed time per pred:  4.741835594177246 ms
INFO: output_name.1: masks shape: [1, 3, 128, 96] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 480, 640] dtype: float32

sit4onnx -if best_model_b1_80x60_0.8551_dil1.onnx -oep tensorrt

INFO: file: best_model_b1_80x60_0.8551_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: input_name.2: rois shape: [1, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  68.60971450805664 ms
INFO: avg elapsed time per pred:  6.860971450805664 ms
INFO: output_name.1: masks shape: [1, 3, 160, 120] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 640, 640] dtype: float32

# TensorRT + Multi-ROI
sit4onnx -if best_model_b0_64x48_0.8545_dil1.onnx -oep tensorrt -fs 1 3 480 640 -fs 3 5

INFO: file: best_model_b0_64x48_0.8545_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 480, 640] dtype: float32
INFO: input_name.2: rois shape: [3, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  65.09065628051758 ms
INFO: avg elapsed time per pred:  6.509065628051758 ms
INFO: output_name.1: masks shape: [3, 3, 128, 96] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 480, 640] dtype: float32

sit4onnx -if best_model_b1_80x60_0.8551_dil1.onnx -oep tensorrt -fs 1 3 640 640 -fs 3 5

INFO: file: best_model_b1_80x60_0.8551_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: input_name.2: rois shape: [3, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  97.52345085144043 ms
INFO: avg elapsed time per pred:  9.752345085144043 ms
INFO: output_name.1: masks shape: [3, 3, 160, 120] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 640, 640] dtype: float32

sit4onnx -if best_model_b0_64x48_0.8545_dil1.onnx -oep tensorrt -fs 1 3 480 640 -fs 10 5

INFO: file: best_model_b0_64x48_0.8545_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 480, 640] dtype: float32
INFO: input_name.2: rois shape: [10, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  126.00469589233398 ms
INFO: avg elapsed time per pred:  12.600469589233398 ms
INFO: output_name.1: masks shape: [10, 3, 128, 96] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 480, 640] dtype: float32

sit4onnx -if best_model_b1_80x60_0.8551_dil1.onnx -oep tensorrt -fs 1 3 640 640 -fs 10 5

INFO: file: best_model_b1_80x60_0.8551_dil1.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: input_name.2: rois shape: [10, 5] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time:  196.87914848327637 ms
INFO: avg elapsed time per pred:  19.687914848327637 ms
INFO: output_name.1: masks shape: [10, 3, 160, 120] dtype: float32
INFO: output_name.2: binary_masks shape: [1, 1, 640, 640] dtype: float32

License

This project is licensed under the MIT License - see below for details:

MIT License

Copyright (c) 2025 Katsuya Hyodo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Citations and Acknowledgments

This project builds upon several excellent works in the computer vision community:

People Segmentation

We gratefully acknowledge the work by Vladimir Iglovikov (Ternaus) on people segmentation:

People Segmentation Custom

EfficientNet

@article{tan2019efficientnet,
  title={EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks},
  author={Tan, Mingxing and Le, Quoc V},
  journal={arXiv preprint arXiv:1905.11946},
  year={2019}
}

COCO Dataset

@inproceedings{lin2014microsoft,
  title={Microsoft COCO: Common Objects in Context},
  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Dollár, Piotr and Zitnick, C Lawrence},
  booktitle={European Conference on Computer Vision},
  pages={740--755},
  year={2014},
  organization={Springer}
}

U-Net Architecture

@inproceedings{ronneberger2015u,
  title={U-Net: Convolutional Networks for Biomedical Image Segmentation},
  author={Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={234--241},
  year={2015},
  organization={Springer}
}

Knowledge Distillation

@article{hinton2015distilling,
  title={Distilling the Knowledge in a Neural Network},
  author={Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
  journal={arXiv preprint arXiv:1503.02531},
  year={2015}
}

Special Thanks

  • The PyTorch team for the excellent deep learning framework
  • The ONNX community for cross-platform model deployment tools
  • The Albumentations team for powerful augmentation pipelines
  • The Segmentation Models PyTorch contributors for pre-trained encoders