SimpleRAGHuggingFace is designed to implement retrieval-augmented generation (RAG) systems. It loads datasets from Hugging Face, vectorizes them, and allows fast queries based on cosine similarity.
pip install SimpleRAGHuggingFace
During the first execution, the dataset is loaded, vectorized, and embeddings are stored:
from rag import Rag
RAG_HF_DATASET = "JulianVelandia/unal-repository-dataset-alternative-format"
rag = Rag(hf_dataset=RAG_HF_DATASET)
query = "What is the lighting design, control, and beautification of the field at Alfonso López Stadium?"
response = rag.retrieval_augmented_generation(query)
print(response)
Once run for the first time, the dataset can be queried using cosine similarity with the following parameters:
Parameters:
- query (str): The input question or statement to be processed.
- max_sections (int): Maximum number of context sections to retrieve (range: 1 to 10).
Higher values provide more context but may dilute relevance.
- threshold (float): Minimum similarity score for a section to be included (range: 0.0 to 1.0).
Higher values ensure stricter relevance.
- max_words (int, optional): Maximum number of words in the combined context (default: 1000).
Longer limits provide more detail but may reduce conciseness.
Returns:
- str: The combined query and relevant context, or just the query if no context is found.
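For example, a stricter call might look like the following sketch; it assumes the parameters above are accepted as keyword arguments of retrieval_augmented_generation, which is an assumption about the call signature rather than something shown explicitly here:

response = rag.retrieval_augmented_generation(
    query,
    max_sections=3,   # retrieve at most 3 context sections
    threshold=0.2,    # keep only sections scoring at least 0.2 in cosine similarity
    max_words=500,    # cap the combined context at 500 words
)
print(response)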
The first execution generates:
- Original Database: Stored in memory as a list of documents.
- Vectorized Database: Saved as a .npy file in the embeddings/ folder.
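If you want to inspect the stored vectors directly, they can be read back with NumPy. The filename below is hypothetical, since the exact name written inside the embeddings/ folder is not documented here:

import numpy as np

# Hypothetical path: check the embeddings/ folder for the actual .npy filename.
embeddings = np.load("embeddings/embeddings.npy")
print(embeddings.shape)  # (number_of_documents, vector_dimension)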
Once the setup is complete, you can perform queries:
query = "What is the lighting design, control, and beautification of the field at Alfonso López Stadium?"
response = rag.retrieval_augmented_generation(query)
print(response)
The result will be the initial prompt combined with the most relevant sections of context:
What is the lighting design, control, and beautification of the field at Alfonso López Stadium?
Keep in mind this context:
Lighting design ... Alfonso López Stadium, as well as the results obtained, understanding that a soccer team ...
...
- Setup (Preprocessing):
  - Load the dataset from Hugging Face.
  - Vectorize the documents using TF-IDF.
  - Save the embeddings in .npy format.
HF Dataset -> Load -> Vectorization -> Embeddings (.npy)
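A rough, standalone sketch of this preprocessing stage using the datasets and scikit-learn libraries is shown below. It is not the package's internal code; the split name and the way a text is extracted from each row are assumptions:

import os
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset from Hugging Face (the split name is an assumption).
dataset = load_dataset("JulianVelandia/unal-repository-dataset-alternative-format", split="train")
documents = [str(row) for row in dataset]  # in practice, a specific text field would be selected

# Vectorize the documents using TF-IDF.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Save the embeddings in .npy format.
os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/embeddings.npy", tfidf_matrix.toarray())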
- Querying:
  - Vectorize the prompt.
  - Calculate cosine similarity between the prompt and the vectorized documents.
  - Retrieve the most relevant sections.
  - Combine the prompt with the retrieved context.
Prompt -> Vectorization -> Cosine Similarity -> Retrieval -> Combined Context
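Continuing the setup sketch above (reusing vectorizer, tfidf_matrix, and documents), the querying stage can be approximated as follows; the threshold and max_sections values are placeholders, not the package's defaults:

from sklearn.metrics.pairwise import cosine_similarity

query = "What is the lighting design, control, and beautification of the field at Alfonso López Stadium?"

# Vectorize the prompt with the same fitted TF-IDF vectorizer.
query_vector = vectorizer.transform([query])

# Cosine similarity between the prompt and every vectorized document.
scores = cosine_similarity(query_vector, tfidf_matrix)[0]

# Retrieve the most relevant sections above the threshold.
threshold = 0.2
max_sections = 3
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
selected = [documents[i] for i in ranked[:max_sections] if scores[i] >= threshold]

# Combine the prompt with the retrieved context, or fall back to the prompt alone.
if selected:
    combined = query + "\n\nKeep in mind this context:\n" + "\n".join(selected)
else:
    combined = query
print(combined)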