A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.
- 🔁 Full QA Pipeline: Single-script automation chains parsing, QA generation, and quality evaluation, so datasets are curated end-to-end without manual orchestration.
- 🔄 Multi-format Support: Unified loaders handle PDF, DOC/DOCX, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, and mainstream image formats without extra plugins.
- 🧹 Intelligent Cleaning: Built-in anomaly detection, privacy-aware redaction, and customizable filters normalize noisy enterprise documents.
- 🤖 AI Annotation: LLM-powered workflows auto-generate Q&A pairs, summaries, and structured labels for downstream model training.
- ⚡ High Performance: Streaming chunkers, caching, and parallel execution keep large batch jobs fast and resource-efficient.
- 🎯 Developer Friendly: Type-hinted SDK with declarative configs, pluggable pipelines, and rich error handling simplifies integration.
- ☁️ Cloud Ready: Native connectors for OSS, MinIO, and S3-compatible storage make hybrid or fully managed deployments straightforward.
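
To make the quality-evaluation step of the QA pipeline more concrete: the usage example's comment on `auto_self_review_mode` says that pairs scored 4 or 5 pass and pairs scored 1 to 3 are dropped. The sketch below illustrates that filtering rule in plain Python. It is not DataMax's implementation; `ReviewedQA` and `filter_by_review_score` are hypothetical names introduced here, and the sample records are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewedQA:
    """Hypothetical record: a generated QA pair plus a 1-5 review score."""
    question: str
    answer: str
    score: int


def filter_by_review_score(pairs, min_score=4):
    """Keep pairs whose review score is at least min_score (4 and 5 by default)."""
    return [p for p in pairs if p.score >= min_score]


# Made-up sample data, for illustration only.
candidates = [
    ReviewedQA("What formats are supported?", "PDF, DOCX, XLSX and others.", 5),
    ReviewedQA("What is it?", "A thing.", 2),
    ReviewedQA("Which license applies?", "The MIT License.", 4),
]

kept = filter_by_review_score(candidates)
print([p.score for p in kept])
```

Keeping the threshold as a parameter makes it easy to tighten the filter (for example, `min_score=5`) when a smaller, stricter training set is preferred.
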
pip install pydatamax

from datamax import DataMax
# prepare info
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"
# init client
client = DataMax(file_path=FILE_PATHS)
# get data
data = client.get_data()
# get content
content = data.get("content")
# generate pre-labels; returns a trainable QA list
qa = client.get_pre_label(
    content=content,
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=50,  # number of questions per chunk
    max_qps=100.0,
    debug=False,
    structured_data=True,  # enable structured output
    auto_self_review_mode=True,  # auto-review QA pairs: scores 4 and 5 pass, scores 1 to 3 are dropped
    review_max_qps=100.0,
)
# save label data
client.save_label_data(qa, LLM_TRAIN_OUTPUT_FILE_NAME)

Issues and Pull Requests are welcome!
This project is licensed under the MIT License.
- 📧 Email: [email protected], [email protected]
- 🐛 Issues: GitHub Issues
- 📚 Best Practice: How to generate qa

⭐ If this project helps you, please give us a star!
