A comprehensive data preparation and processing pipeline built with Python and Docker, designed to generate synthetic Question-Answer pairs from RBI Circulars using Google's Gemini 2.0 Flash model. The project covers the entire pipeline, from web scraping RBI circulars to pushing the generated dataset to the Hugging Face Hub.
This project aims to create a high-quality Question-Answer dataset from RBI Circulars for fine-tuning Large Language Models. The dataset is available on Hugging Face Hub at Vishva007/RBI-Circular-QA-Dataset.
- Web Scraping: Automated extraction of RBI Circulars from the official website
- PDF Processing: Tools for extracting text from RBI circular PDFs into high-quality Markdown files
- AI Integration: Uses Google's Gemini 2.0 Flash model to generate synthetic QA pairs
- Dataset Generation: Automated pipeline for creating high-quality QA pairs
- Hugging Face Integration: Direct upload capability to Hugging Face Hub
- Docker Support: Containerized environment with GPU support
- Test Scripts: Scripts for validating each pipeline stage
- Utility Functions: Reusable utility modules for common tasks
- Docker and Docker Compose
- NVIDIA GPU with CUDA support (for GPU acceleration)
- Python 3.x
- Clone the repository:
git clone https://github.com/vishvaRam/Data-Prep-for-LLM-fine-tuning.git
cd Data_prep
- Build and start the Docker container:
docker-compose up --build
.
├── Code/
│ ├── AI-tasks/ # AI-related processing tasks
│ ├── Data/ # Data storage directory
│ ├── Utils/ # Utility functions
│ ├── Test-scripts/ # Testing and validation scripts
│ ├── prepare-dataset/ # Dataset preparation tools
│ ├── process-md/ # Markdown processing tools
│ ├── convert-markdown/# Markdown conversion utilities
│ ├── fetch-data/ # Data fetching utilities
│ ├── main.py # Prints the configuration of the Docker image
│ ├── Dockerfile # Docker configuration
│ └── requirements.txt # Python dependencies
├── .devcontainer/ # Development container configuration
└── docker-compose.yml # Docker Compose configuration
The project uses Docker Compose for configuration. Key settings include:
- GPU support enabled
- Volume mounting for code persistence
- Network configuration for external connectivity (if Langfuse is used)
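The GPU and volume settings above correspond to a Compose file roughly like the following. This is an illustrative sketch, not the repository's actual docker-compose.yml; the service name and paths are assumptions.

```yaml
# Illustrative docker-compose.yml fragment (names are assumptions):
services:
  data-prep:
    build: ./Code
    volumes:
      - ./Code:/app              # mount code for persistence across rebuilds
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia     # requires the NVIDIA Container Toolkit
              count: all
              capabilities: [gpu]
```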
Key Python packages include:
- google-generativeai (for Gemini 2.0 Flash model integration)
- langchain (for LLM operations)
- langfuse (for monitoring LLM calls)
- marker-pdf (for PDF to Markdown processing)
- selenium (for web scraping)
- and more (see requirements.txt for complete list)
- Start the container:
docker-compose up
- The project workflow:
- Web scrape RBI circulars using the fetch-data module
- Process PDFs using the process-md module
- Split the Markdown files into chunks
- Generate QA pairs from the chunks using the Gemini 2.0 Flash model
- Validate and prepare the dataset (if necessary)
- Push to Hugging Face Hub
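The chunking and QA-generation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, prompt wording, and chunking parameters are assumptions (the repository may use langchain's text splitters instead).

```python
# Sketch of chunking a Markdown file and building a QA-generation prompt.
# All names and parameters here are illustrative assumptions.

def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split Markdown text into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

def build_qa_prompt(chunk: str, num_pairs: int = 3) -> str:
    """Build a QA-generation prompt for one chunk (hypothetical wording)."""
    return (
        f"Generate {num_pairs} question-answer pairs from the following "
        f"excerpt of an RBI circular. Return JSON objects with 'question' "
        f"and 'answer' keys.\n\n{chunk}"
    )

# The actual Gemini call (requires an API key) would look roughly like:
#   import os
#   import google.generativeai as genai
#   genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
#   model = genai.GenerativeModel("gemini-2.0-flash")
#   response = model.generate_content(build_qa_prompt(chunk))
#   qa_text = response.text
```

The overlap between consecutive chunks helps keep a circular's context intact when a sentence or table straddles a chunk boundary.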
The generated dataset contains synthetic Question-Answer pairs created from RBI Circulars with the Gemini 2.0 Flash model. It is structured to facilitate fine-tuning Large Language Models for better understanding and processing of RBI circulars.
Dataset Location: Vishva007/RBI-Circular-QA-Dataset
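Assembling the QA pairs into rows and pushing them to the Hub might look like the sketch below. The record schema (`question`/`answer`/`source_circular`) and the sample row are illustrative assumptions, not the dataset's documented schema.

```python
# Illustrative: flatten generated QA pairs into dataset rows.
# The schema and sample values are assumptions, not the real dataset's.
import json

def to_records(qa_pairs: list[tuple[str, str]], source: str) -> list[dict]:
    """Turn (question, answer) tuples into flat dataset rows."""
    return [
        {"question": q, "answer": a, "source_circular": source}
        for q, a in qa_pairs
    ]

records = to_records(
    [("What does the circular define?", "It defines ... (placeholder answer)")],
    source="RBI/2024-25/01",  # hypothetical circular identifier
)
print(json.dumps(records[0], indent=2))

# Uploading with the `datasets` library would look roughly like:
#   from datasets import Dataset
#   Dataset.from_list(records).push_to_hub("Vishva007/RBI-Circular-QA-Dataset")
```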
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch
- OpenCV
- Langfuse
- marker-pdf
- All other open-source contributors