Skip to content

Sumitkumar005/RAG-System-Knowledge-Base-Integration

Folders and files

NameName
Last commit message
Last commit date

Latest commit

17e3205 Β· Mar 13, 2025

History

31 Commits
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024
Mar 13, 2025
Mar 11, 2025
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024
Dec 24, 2024

Repository files navigation

βš™οΈπŸ§¬πŸ” RAG System for MES in Chemical Industry πŸ­πŸ”¬πŸ“Š

This repository contains the code and resources for building a Retrieval-Augmented Generation (RAG) system for a Manufacturing Execution System (MES) in the chemical industry.

Table of Contents πŸ“– πŸ”¬

RAG System Knowledge Base Integration

Here are some images for the system:

1. Features βš™οΈ

  • File Upload πŸ“€: Supports uploading multiple formats like text files, PDFs, DOCX, CSV, JSON, audio, video, and images.
  • Text Extraction πŸ“„: Extracts text from PDFs, DOCX, CSV, JSON, images (via OCR), and transcribes audio and video.
  • Embedding Storage πŸ’Ύ: Generates embeddings for text data and stores them in a FAISS vector index for fast retrieval.
  • Conversational Interface πŸ—£οΈ: Allows users to ask questions and receive context-aware answers through a Streamlit-based interface.
  • Multimedia Handling πŸŽ₯πŸŽ™οΈ: Transcribes audio and video, and extracts text from images using OCR techniques.

Image Alt Text.

2. Technologies Used πŸ› οΈ

  • Streamlit: For building an interactive and user-friendly interface.
  • LangChain: For integration with OpenAI’s models, vector stores, and retrieval-based generation.
  • OpenAI API: For embedding generation and natural language processing.
  • Whisper: For transcribing speech from audio and video.
  • FAISS: For storing and efficiently retrieving embeddings.
  • PyPDF2: To extract text from PDF files.
  • python-docx: To handle DOCX files.
  • pytesseract: For OCR in images.
  • moviepy: For extracting audio from video.
  • pandas: For handling structured CSV and JSON files.

3. Prerequisites πŸ“¦

Ensure that the following are installed before setting up the system:

  • Python 3.x
  • Required Python libraries (refer to the Requirements.txt file)

4. Setup Instructions βš™οΈ

Follow the steps below to set up the system:

  1. Clone the repository:
    git clone <repository-url>
  2. Navigate to the project directory: cd
  3. Install the dependencies: pip install -r requirements.txt

5. Data Preprocessing 🧹

Image Alt Text The system requires preprocessing of various types of data:

  • Textual Data: Extract text from PDFs, DOCX, CSV, and JSON files using the respective libraries (PyPDF2, python-docx, pandas).
  • OCR for Images: Extract text from images using OCR technology (pytesseract).
  • Audio and Video Transcription: Convert audio and video files to text using Whisper API.

6. Embedding Generation 🧠

  • Generate embeddings for all textual content, including transcriptions from audio and video files.
  • Use the OpenAI API to generate embeddings that capture the context of the extracted text data.

7. Vector Indexing πŸ”

  • Store the generated embeddings in a FAISS index for efficient and fast retrieval.
  • The FAISS index is queried to fetch the most relevant information based on user queries.

8. RAG Workflow πŸ”„

  1. Input: User submits a question or query through the Streamlit interface.
  2. Retrieval: Relevant context is retrieved from the FAISS index by matching the question with the stored embeddings.
  3. Generation: The OpenAI model generates a response based on the retrieved context.
  4. Output: The generated response is returned to the user.

9. Handling Multimedia 🎬🎧

  • Images: OCR technology (pytesseract) is used to extract text from images.
  • Audio/Video: Whisper transcribes speech from audio and video files into text, which is then processed by the system for context-aware generation.

10. Scaling πŸš€

  • The system is designed to scale as data grows.
  • FAISS ensures that the retrieval process remains fast and efficient, even with large datasets.
  • The system can be deployed on cloud infrastructure for handling large-scale data storage and computational needs.

11. Code Structure πŸ—οΈ

The project is organized as follows:

  • /app: Contains the Streamlit interface for interacting with the system.
  • /modules: Contains scripts for file handling, embedding generation, and FAISS integration.
  • /data: Stores preprocessed data and embeddings.
  • /tests: Contains test scripts to evaluate system performance.

12. Testing and Evaluation πŸ§ͺ

Image Alt Text

The system has been tested for:

  • Accuracy of Transcriptions: Ensuring that audio and video files are transcribed correctly.
  • Embedding Quality: Verifying that the generated embeddings accurately reflect the content’s context.
  • Search Performance: Testing the retrieval speed and relevance of the results from the FAISS index.

13. Contributions 🀝

We welcome contributions! If you'd like to help improve the system:

  • Fork the repository.
  • Submit a pull request with your proposed changes.

14. License πŸ“œ

This project is licensed under the MIT License - see the LICENSE file for details.