MVPavan/data-miner

Data Miner

Python 3.12+ · PostgreSQL · License: MIT · Documentation

A PostgreSQL-backed, supervisor-managed video processing pipeline for generating large-scale computer vision datasets from YouTube videos.

✨ Features

  • 🔍 YouTube Search - Find videos by keywords and hashtags
  • 📥 Smart Downloads - Rate-limited downloading with hashtag blocklists
  • 🎬 Frame Extraction - Configurable sampling strategies (interval, time, keyframe)
  • 🎯 ML Filtering - SigLIP2-based image-text similarity filtering
  • 🔄 Deduplication - DINOv3/FAISS-based cross-video deduplication
  • 🎯 Object Detection - Open-set detection (GroundingDINO, OWLv2)
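
To make the interval and time sampling strategies from the feature list concrete, here is a minimal sketch of how frame indices could be chosen for each. The function names and parameters (`interval_indices`, `time_indices`, `every_n`, `every_seconds`) are hypothetical and are not taken from the project's code; keyframe sampling depends on the video decoder and is not shown.

```python
def interval_indices(total_frames: int, every_n: int) -> list[int]:
    """Interval sampling: keep every `every_n`-th frame, starting at frame 0."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return list(range(0, total_frames, every_n))


def time_indices(total_frames: int, fps: float, every_seconds: float) -> list[int]:
    """Time sampling: keep the frame nearest to each multiple of `every_seconds`."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    indices: list[int] = []
    k = 0
    while True:
        idx = round(k * every_seconds * fps)
        if idx >= total_frames:
            break
        # Skip repeats, which occur when the step is shorter than one frame.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        k += 1
    return indices


if __name__ == "__main__":
    print(interval_indices(10, 3))        # [0, 3, 6, 9]
    print(time_indices(100, 25.0, 1.0))   # [0, 25, 50, 75]
```

Interval sampling is independent of frame rate, while time sampling keeps the spacing in seconds constant across videos with different frame rates.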

🏗️ Architecture

flowchart LR
    subgraph Central["Central Pipeline"]
        D[Download] --> E[Extract]
    end
    
    subgraph Project["Per-Project Pipeline"]
        F[Filter] --> DU[Cross-Dedup] --> DT[Detect]
    end
    
    E --> F
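
The flowchart above can be read as two stage groups: central stages that run once per video, and per-project stages that run for each project on the extracted frames. The sketch below encodes that reading; the `Stage` enum, the `plan` helper, and the once-per-video versus once-per-project interpretation are assumptions for illustration, not the project's actual scheduler.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DOWNLOAD = "download"
    EXTRACT = "extract"
    FILTER = "filter"
    CROSS_DEDUP = "cross_dedup"
    DETECT = "detect"


# Stage order taken from the flowchart; the grouping names are hypothetical.
CENTRAL = [Stage.DOWNLOAD, Stage.EXTRACT]
PER_PROJECT = [Stage.FILTER, Stage.CROSS_DEDUP, Stage.DETECT]


def plan(projects: list[str]) -> list[tuple[Optional[str], Stage]]:
    """Return (project, stage) steps: central stages once, then each project's stages."""
    steps: list[tuple[Optional[str], Stage]] = [(None, s) for s in CENTRAL]
    for project in projects:
        steps.extend((project, s) for s in PER_PROJECT)
    return steps


if __name__ == "__main__":
    for project, stage in plan(["cars", "birds"]):
        print(project or "central", stage.value)
```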

The pipeline uses:

  • PostgreSQL for state management with row-level locking
  • Supervisor for worker process management
  • Heartbeat-based locking for concurrent safety
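
As a rough illustration of heartbeat-based locking, the in-memory sketch below treats a task as claimable when it is unlocked or when its owner's last heartbeat is older than a timeout. The `Task` fields, the 60-second timeout, and the function names are hypothetical; the project's real schema is not shown here. In PostgreSQL, a claim like this would typically run inside a transaction using `SELECT ... FOR UPDATE SKIP LOCKED`, so that two workers cannot pick the same row.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_TIMEOUT = 60.0  # seconds; hypothetical value


@dataclass
class Task:
    id: int
    locked_by: Optional[str] = None
    heartbeat_at: Optional[float] = None


def is_claimable(task: Task, now: float, timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """A task is free if nobody holds it or the holder's heartbeat went stale."""
    if task.locked_by is None:
        return True
    return task.heartbeat_at is None or now - task.heartbeat_at > timeout


def claim_next(tasks: list[Task], worker_id: str, now: float) -> Optional[Task]:
    """Claim the first claimable task for `worker_id`, or return None."""
    for task in tasks:
        if is_claimable(task, now):
            task.locked_by = worker_id
            task.heartbeat_at = now
            return task
    return None


def heartbeat(task: Task, worker_id: str, now: float) -> bool:
    """Refresh the lock; fails if another worker has taken the task over."""
    if task.locked_by != worker_id:
        return False
    task.heartbeat_at = now
    return True
```

A worker that crashes stops sending heartbeats, so its tasks become claimable again after the timeout without any manual cleanup.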

🚀 Quick Start

# Create a new .venv and install the project in editable mode
uv sync

# Or install in editable mode into an existing virtual environment
uv pip install -e .

# Initialize database
data-miner init-db

# Add videos and run pipeline
data-miner populate --config config.yaml
data-miner workers setup --config config.yaml
data-miner workers start

📁 Project Structure

data_miner/
├── cli.py              # CLI commands
├── config/             # Configuration system
├── db/                 # Database layer
├── workers/            # Supervisor-managed workers
├── modules/            # Core processing logic
├── models/             # ML model wrappers
└── utils/              # Utilities

📚 Documentation

Full documentation is available at mvpavan.github.io/data-miner

| User Guide    | Developer Docs        |
|---------------|-----------------------|
| Installation  | Architecture Overview |
| Configuration | Database Models       |
| CLI Reference | Worker System         |
| Quickstart    | Contributing          |

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
