High-performance data profiler for Data Scientists. Profile 50GB+ datasets in seconds. Rust core, Python API, SIMD-accelerated. Leverages Apache Arrow & DataFusion SQL Engine.

dataprof logo

dataprof

The High-Performance Profiler for Large Datasets



Profile 50GB datasets in seconds on your laptop.

DataProf is built for Data Scientists and Engineers who need to understand their data fast. No more MemoryError when trying to profile a CSV larger than your RAM.

Pandas-Profiling vs DataProf on a 10GB CSV:

Feature          Pandas-Profiling / YData   DataProf
Memory Usage     12GB+ (crashes)            < 100MB (streaming)
Speed            15+ minutes                45 seconds
Implementation   Python                     Rust (SIMD-accelerated)
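The memory gap in the table comes from streaming: instead of loading the whole file, per-column statistics are accumulated row by row, so memory stays proportional to the number of columns rather than the file size. A minimal pure-Python sketch of the idea (illustrative only, not DataProf's Rust implementation):

```python
import csv
import io

def stream_profile(lines):
    """Accumulate per-column stats one row at a time; memory stays O(columns)."""
    reader = csv.DictReader(lines)
    stats = {}  # column -> {"count", "nulls", "sum", "numeric"}
    for row in reader:
        for col, val in row.items():
            s = stats.setdefault(col, {"count": 0, "nulls": 0, "sum": 0.0, "numeric": True})
            s["count"] += 1
            if val == "":
                s["nulls"] += 1
                continue
            try:
                s["sum"] += float(val)
            except ValueError:
                s["numeric"] = False  # column has at least one non-numeric value
    return stats

# Works the same whether `lines` is a small buffer or a 50GB file handle
data = io.StringIO("price,name\n1.5,a\n2.5,b\n,c\n")
profile = stream_profile(data)
```

The same single-pass pattern extends to min/max, distinct counts (via sketches), and histograms, which is what makes sub-100MB profiling of arbitrarily large files possible.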

Quick Start

Installation

The easiest way to get started is via pip:

pip install dataprof

Python Usage

Forget complex configurations. Just point to your file:

import dataprof

# Analyze a huge file without exhausting memory
# Generates a report.html with quality metrics and distributions
dataprof.profile("huge_dataset.csv").save("report.html")

CLI & Rust Usage (Advanced)

If you prefer the command line or are a Rust developer:

# Install via cargo
cargo install dataprof

# Generate report from CLI
dataprof-cli report huge_data.csv -o report.html

More options: dataprof-cli --help | Full CLI Guide

πŸ’‘ Key Features

  • No Size Limits: Profiles files larger than RAM using streaming and memory mapping.
  • Blazing Fast: Written in Rust with SIMD acceleration.
  • Privacy Guaranteed: Data never leaves your machine.
  • Format Support: CSV, Parquet, JSON/JSONL, and databases (Postgres, MySQL, etc.).
  • Smart Detection: Automatically identifies Emails, IPs, IBANs, Credit Cards, and more.
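The "Smart Detection" idea can be approximated by regex-sampling a column: if most non-null values match a pattern, tag the column with that type. A toy sketch with deliberately simplified patterns of my own (not DataProf's actual detectors, which handle many more formats and validation rules):

```python
import re

# Simplified patterns for illustration only; real detectors are far stricter
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}

def detect_column_type(values, threshold=0.9):
    """Return the pattern name matching >= threshold of non-empty values, else None."""
    values = [v for v in values if v]
    if not values:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return name
    return None
```

For example, `detect_column_type(["a@b.com", "c@d.org"])` yields `"email"`. The threshold keeps a few dirty values from masking an otherwise uniform column.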

πŸ“Š Beautiful Reports

Interactive Demo
Animated walkthrough of dataprof features and dashboards

Single File Analysis
Interactive dashboards with quality scoring and distributions

Batch Processing Dashboard
Aggregate metrics from hundreds of files in one view

Advanced Examples

Batch Processing (Python)

# Process a whole directory of files in parallel
result = dataprof.batch_analyze_directory("/data_folder", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
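Conceptually, directory-level batching is just mapping a per-file profiler over a file list with a worker pool. A stdlib-only sketch of that shape (the function names here are mine, not dataprof's API):

```python
import concurrent.futures
import csv
import pathlib
import tempfile

def profile_file(path):
    """Toy per-file profile: just row and column counts."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    return {"file": pathlib.Path(path).name, "columns": len(header), "rows": rows}

def batch_profile(directory, pattern="*.csv", max_workers=4):
    """Profile every matching file in a directory tree, in parallel."""
    files = sorted(pathlib.Path(directory).rglob(pattern))
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(profile_file, files))

# Demo on a temporary directory with two small CSVs
tmp = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    pathlib.Path(tmp, name).write_text("x,y\n1,2\n3,4\n")
results = batch_profile(tmp)
```

DataProf does this in Rust with native threads, which is where the files-per-second throughput comes from.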

Database Integration (Python)

# Profile a SQL query directly
# (await must run in an async context, e.g. inside `async def` or asyncio.run)
result = await dataprof.analyze_database_async(
    "postgresql://user:pass@localhost/db",
    "SELECT * FROM sales_data_2024"
)
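The underlying idea works against any DB-API connection: run the query, then summarize the cursor's rows without materializing them all at once. A synchronous sqlite3 sketch to illustrate the concept (this is my own toy profiler, not DataProf's connector code):

```python
import sqlite3

def profile_query(conn, sql):
    """Run a query and return the row total plus per-column null counts."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    nulls = {c: 0 for c in columns}
    total = 0
    for row in cur:  # iterate the cursor; rows are never all held in memory
        total += 1
        for col, val in zip(columns, row):
            if val is None:
                nulls[col] += 1
    return {"rows": total, "nulls": nulls}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL, region TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(10.0, "eu"), (None, "us"), (5.0, None)])
summary = profile_query(conn, "SELECT * FROM sales")
```

Profiling the query result rather than the whole table means you only pay for the rows you actually select.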

Rust Library Usage

use dataprof::*;

// Call from a function that returns a Result, so `?` can propagate errors
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;
println!("Quality Score: {}", report.quality_score());

Development

# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release

# Test databases (optional)
docker-compose -f .devcontainer/compose.yml up -d

# Common tasks
cargo test          # Run tests
cargo bench         # Benchmarks
cargo clippy        # Linting

Feature Flags

# Minimal (CSV/JSON only)
cargo build --release

# With Apache Arrow (large files >100MB)
cargo build --release --features arrow

# With Parquet support
cargo build --release --features parquet

# With databases
cargo build --release --features postgres,mysql,sqlite

# Python async support
maturin develop --features python-async,database,postgres

# All features
cargo build --release --all-features

When to use Arrow: large files (>100MB), many columns (>20), uniform types.
When to use Parquet: analytics, data lakes, Spark/Pandas integration.

Documentation

User Guides: CLI Reference | Database Connectors

🀝 Contributing

We welcome contributions from everyone! Whether you want to:

  • Fix a bug πŸ›
  • Add a feature ✨
  • Improve documentation πŸ“š
  • Report an issue πŸ“

Quick Start for Contributors

  1. Fork & clone:

    git clone https://github.com/YOUR-USERNAME/dataprof.git
    cd dataprof
  2. Build & test:

    cargo build
    cargo test
  3. Create a feature branch:

    git checkout -b feature/your-feature-name
  4. Before submitting PR:

    cargo fmt --all
    cargo clippy --all --all-targets
    cargo test --all
  5. Submit a Pull Request with a clear description

Please read CONTRIBUTING.md for guidelines and our Code of Conduct.

License

Dual-licensed under either:

  • MIT License (see LICENSE)
  • Apache License 2.0 (see LICENSE-APACHE)

You may use this project under the terms of either license.
