This repository contains the code and resources for building a Retrieval-Augmented Generation (RAG) system for a Manufacturing Execution System (MES) in the chemical industry.
Here are some images for the system:
- 1. Features
- 2. Technologies Used
- 3. Prerequisites
- 4. Setup Instructions
- 5. Data Preprocessing
- 6. Embedding Generation
- 7. Vector Indexing
- 8. RAG Workflow
- 9. Handling Multimedia
- 10. Scaling
- 11. Code Structure
- 12. Testing and Evaluation
- 13. Contributions
- 14. License
- File Upload 📤: Supports uploading multiple formats like text files, PDFs, DOCX, CSV, JSON, audio, video, and images.
- Text Extraction 📄: Extracts text from PDFs, DOCX, CSV, JSON, images (via OCR), and transcribes audio and video.
- Embedding Storage 💾: Generates embeddings for text data and stores them in a FAISS vector index for fast retrieval.
- Conversational Interface 🗣️: Allows users to ask questions and receive context-aware answers through a Streamlit-based interface.
- Multimedia Handling 🎥🎙️: Transcribes audio and video, and extracts text from images using OCR techniques.
- Streamlit: For building an interactive and user-friendly interface.
- LangChain: For integration with OpenAI’s models, vector stores, and retrieval-based generation.
- OpenAI API: For embedding generation and natural language processing.
- Whisper: For transcribing speech from audio and video.
- FAISS: For storing and efficiently retrieving embeddings.
- PyPDF2: To extract text from PDF files.
- python-docx: To handle DOCX files.
- pytesseract: For OCR in images.
- moviepy: For extracting audio from video.
- pandas: For handling structured CSV and JSON files.
Ensure that the following are installed before setting up the system:
- Python 3.x
- Required Python libraries (refer to the Requirements.txt file)
Follow the steps below to set up the system:
- Clone the repository:
git clone <repository-url>
- Navigate to the project directory: cd
- Install the dependencies: pip install -r requirements.txt
The system requires preprocessing of various types of data:
- Textual Data: Extract text from PDFs, DOCX, CSV, and JSON files using the respective libraries (PyPDF2, python-docx, pandas).
- OCR for Images: Extract text from images using OCR technology (pytesseract).
- Audio and Video Transcription: Convert audio and video files to text using Whisper API.
- Generate embeddings for all textual content, including transcriptions from audio and video files.
- Use the OpenAI API to generate embeddings that capture the context of the extracted text data.
- Store the generated embeddings in a FAISS index for efficient and fast retrieval.
- The FAISS index is queried to fetch the most relevant information based on user queries.
- Input: User submits a question or query through the Streamlit interface.
- Retrieval: Relevant context is retrieved from the FAISS index by matching the question with the stored embeddings.
- Generation: The OpenAI model generates a response based on the retrieved context.
- Output: The generated response is returned to the user.
- Images: OCR technology (pytesseract) is used to extract text from images.
- Audio/Video: Whisper transcribes speech from audio and video files into text, which is then processed by the system for context-aware generation.
- The system is designed to scale as data grows.
- FAISS ensures that the retrieval process remains fast and efficient, even with large datasets.
- The system can be deployed on cloud infrastructure for handling large-scale data storage and computational needs.
The project is organized as follows:
- /app: Contains the Streamlit interface for interacting with the system.
- /modules: Contains scripts for file handling, embedding generation, and FAISS integration.
- /data: Stores preprocessed data and embeddings.
- /tests: Contains test scripts to evaluate system performance.
The system has been tested for:
- Accuracy of Transcriptions: Ensuring that audio and video files are transcribed correctly.
- Embedding Quality: Verifying that the generated embeddings accurately reflect the content’s context.
- Search Performance: Testing the retrieval speed and relevance of the results from the FAISS index.
We welcome contributions! If you'd like to help improve the system:
- Fork the repository.
- Submit a pull request with your proposed changes.
This project is licensed under the MIT License - see the LICENSE file for details.