DataMax

A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.

✨ Key Features

🔁 Full QA Pipeline: Single-script automation chains parsing, QA generation, and quality evaluation, so datasets are curated end-to-end without manual orchestration.
🔄 Multi-format Support: Unified loaders handle PDF, DOC/DOCX, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, and mainstream image formats without extra plugins.
🧹 Intelligent Cleaning: Built-in anomaly detection, privacy-aware redaction, and customizable filters normalize noisy enterprise documents.
🤖 AI Annotation: LLM-powered workflows auto-generate Q&A pairs, summaries, and structured labels for downstream model training.
⚡ High Performance: Streaming chunkers, caching, and parallel execution keep large batch jobs fast and resource-efficient.
🎯 Developer Friendly: Type-hinted SDK with declarative configs, pluggable pipelines, and rich error handling simplifies integration.
☁️ Cloud Ready: Native connectors for OSS, MinIO, and S3-compatible storage make hybrid or fully managed deployments straightforward.

🚀 Quick Start

Install

pip install pydatamax
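
After installing, it can help to confirm that the distribution is visible to the interpreter you plan to use. The snippet below is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of DataMax, and it assumes the distribution is registered under the name used in the install command above (`pydatamax`).

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "pydatamax") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("pydatamax is not installed in this environment")
    else:
        print(f"pydatamax {version} is installed")
```

If the helper returns `None`, the package was most likely installed into a different environment than the one running the script.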

Examples

from datamax import DataMax

# prepare info
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# init client
client = DataMax(file_path=FILE_PATHS)

# get data
data = client.get_data()

# get content
content = data.get("content")

# generate pre-labels; returns a trainable QA list
qa = client.get_pre_label(
    content=content,
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=50,  # number of questions to generate per chunk
    max_qps=100.0,
    debug=False,
    structured_data=True,  # enable structured output
    auto_self_review_mode=True,  # auto-review QA pairs: keep scores 4 and 5, drop scores 1, 2, and 3
    review_max_qps=100.0,
)


# save label data
client.save_label_data(qa, LLM_TRAIN_OUTPUT_FILE_NAME)
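
The `auto_self_review_mode` comment above describes a score threshold: pairs scored 4 or 5 pass, and pairs scored 1, 2, or 3 are dropped. The sketch below illustrates that rule on its own, without calling DataMax. The record shape (dicts with `instruction`, `output`, and `score` keys) and the helper name `filter_reviewed` are assumptions made for illustration and may not match the library's internal format.

```python
from typing import Any, Dict, List

# Scores that pass review, per the comment in the example above.
PASS_SCORES = {4, 5}


def filter_reviewed(
    qa_pairs: List[Dict[str, Any]], score_key: str = "score"
) -> List[Dict[str, Any]]:
    """Keep only QA pairs whose review score is 4 or 5.

    Hypothetical helper: the "score" key is an assumed field name.
    Pairs with a missing or out-of-range score are dropped.
    """
    return [qa for qa in qa_pairs if qa.get(score_key) in PASS_SCORES]


# Made-up sample records, used only to exercise the filter.
sample = [
    {"instruction": "What formats are supported?", "output": "PDF, DOCX, ...", "score": 5},
    {"instruction": "Who wrote this?", "output": "Unknown", "score": 2},
    {"instruction": "How is data cleaned?", "output": "Filters and redaction", "score": 4},
    {"instruction": "Unscored pair", "output": "No review result"},
]

kept = filter_reviewed(sample)
print(f"kept {len(kept)} of {len(sample)} pairs")
```

Treating a missing score as a failure is a conservative choice for this sketch; a real pipeline might instead queue unscored pairs for another review pass.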

🤝 Contributing

Issues and Pull Requests are welcome!

📄 License

This project is licensed under the MIT License.

📞 Contact Us

📧 Email: [email protected], [email protected]
🐛 Issues: GitHub Issues
📚 Best Practice: How to generate qa
💬 WeChat Group: see wechat.jpg in the repository root

⭐ If this project helps you, please give us a star!
