RAGLLM.PlusPlus

Local RAG pipeline with dual embedding models and LLM inference. Scripts come in two backend variants: vLLM (Safetensors weights, GPU inference in float16 or bfloat16) or llama-cpp-python + ggml-python (GGUF weights).

Both vLLM and llama-cpp-python are inference engines: vLLM is optimized for multi-user high-throughput production, while llama-cpp-python prioritizes portability for single-user workloads.

Scripts

| Script | Backend | Embedder 1 | Embedder 2 | LLM | Focus |
|---|---|---|---|---|---|
| `RAGLLM_Code_Reasoning-Nemotron-1.1-7B_LLAMA-CPP.py` | llama-cpp-python | `nomic-ai/CodeRankEmbed` | `BAAI/bge-code-v1` | `OpenCodeReasoning-Nemotron-1.1-7B-F16.gguf` | Code / technical |
| `RAGLLM_Code_Reasoning-Nemotron-1.1-7B_VLLM.py` | vLLM | `nomic-ai/CodeRankEmbed` | `BAAI/bge-code-v1` | `nvidia/OpenCodeReasoning-Nemotron-1.1-7B` | Code / technical |
| `RAGLLM_English_GLM-4.7-Flash-Q4_1_LLAMA-CPP.py` | llama-cpp-python | `BAAI/bge-m3` | `nomic-ai/nomic-embed-text-v2-moe` | `zai-org_GLM-4.7-Flash-Q4_1.gguf` | English / general |
| `RAGLLM_English_Llama-3.1-Nemotron-Nano-8B-v1_LLAMA-CPP.py` | llama-cpp-python | `BAAI/bge-m3` | `nomic-ai/nomic-embed-text-v2-moe` | `nvidia_Llama-3.1-Nemotron-Nano-8B-v1-bf16.gguf` | English / general |
| `RAGLLM_English_Llama-3.1-Nemotron-Nano-8B-v1_VLLM.py` | vLLM | `BAAI/bge-m3` | `nomic-ai/nomic-embed-text-v2-moe` | `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` | English / general |

All scripts use dual-embed retrieval: both embedding models are loaded through SentenceTransformer, and for each text the two embeddings are concatenated and the combined vector is L2-normalized (concat_l2norm). The goal is higher retrieval recall than a single embedder provides.
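The concat_l2norm step can be sketched in a few lines. This is a minimal illustration using plain Python lists, not the scripts' actual code: the function signature is an assumption, and the scripts most likely operate on batched arrays. Whether each embedding is normalized on its own before concatenation is not stated here, so the sketch normalizes only the joined vector.

```python
import math
from typing import Sequence


def concat_l2norm(vec_a: Sequence[float], vec_b: Sequence[float]) -> list[float]:
    """Concatenate two embedding vectors and scale the result to unit L2 norm."""
    joined = list(vec_a) + list(vec_b)
    norm = math.sqrt(sum(x * x for x in joined))
    if norm == 0.0:
        # An all-zero vector cannot be normalized; return it unchanged.
        return joined
    return [x / norm for x in joined]


# Toy 2-d and 1-d "embeddings" standing in for the two model outputs.
combined = concat_l2norm([3.0, 0.0], [4.0])
print(combined)  # [0.6, 0.0, 0.8]
```

Because the combined vector has unit length, an inner-product search over such vectors ranks results the same way cosine similarity would.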

FlashAttention-2 is enabled for embedding models where the underlying architecture and Transformers backend support it, with automatic fallback to standard attention where FA2 is unsupported.
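The fallback behaviour described above could look roughly like the following. This is a hedged sketch, not the scripts' implementation: `load_with_attention_fallback` and `fake_loader` are hypothetical names, the choice of `"eager"` as the fallback is an assumption, and catching `ImportError` and `ValueError` reflects how Transformers typically reports a missing `flash_attn` package or an unsupported architecture.

```python
from typing import Any, Callable


def load_with_attention_fallback(
    loader: Callable[[str], Any],
    preferred: str = "flash_attention_2",
    fallback: str = "eager",
) -> tuple[Any, str]:
    """Call loader(preferred); if that fails, call loader(fallback).

    Returns the loaded object and the attention implementation that worked.
    """
    try:
        return loader(preferred), preferred
    except (ImportError, ValueError) as exc:
        print(f"{preferred} unavailable ({exc}); falling back to {fallback}")
        return loader(fallback), fallback


# Stand-in loader (hypothetical) that simulates a machine without flash_attn.
def fake_loader(impl: str) -> str:
    if impl == "flash_attention_2":
        raise ImportError("flash_attn is not installed")
    return f"model[{impl}]"


model, used = load_with_attention_fallback(fake_loader)
print(used)  # eager
```

With the real library, the loader would presumably be something like `lambda impl: SentenceTransformer("BAAI/bge-m3", model_kwargs={"attn_implementation": impl})`, since SentenceTransformer forwards `model_kwargs` to the underlying Transformers model.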

Requirements

Hardware

NVIDIA GPU with CUDA 12.x, Ampere or newer
16+ GB RAM

OS

Linux only (Ubuntu or WSL2)

Python

3.12+
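The OS and Python requirements above can be checked before installing anything. The snippet below is an optional preflight sketch using only the standard library; `check_environment` is a hypothetical helper, not part of this repository, and it does not check the GPU, CUDA version, or RAM.

```python
import platform
import sys


def check_environment(system: str, python_version: tuple[int, int]) -> list[str]:
    """Return a list of unmet OS/Python requirements (empty if all are met)."""
    problems = []
    if system != "Linux":
        # WSL2 also reports "Linux" here, so it passes this check.
        problems.append(f"Linux is required, found {system}")
    if python_version < (3, 12):
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python 3.12+ is required, found {found}")
    return problems


problems = check_environment(platform.system(), sys.version_info[:2])
for problem in problems:
    print(problem)
if not problems:
    print("OS and Python version look fine")
```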

Installation

1. Install RAPIDS dependencies (required for faiss-gpu-cu12-cuvs)

pip install libcuvs-cu12==25.10.0 librmm-cu12==25.10.0 libraft-cu12==25.10.0 \
    rapids-logger "nvidia-nvjitlink-cu12>=12.9" --extra-index-url https://pypi.nvidia.com

2. Install system dependency

sudo apt-get install -y libopenblas-dev

3. Install Python dependencies

pip install -r requirements.txt

4. Install LLM backend

For vLLM scripts (_VLLM.py):

pip install vllm==0.10.1 --extra-index-url https://download.pytorch.org/whl/cu128

vLLM is not listed in requirements.txt because of its build complexity. It requires transformers>=4.46,<5.0, which is already pinned in requirements.txt.
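After installing, it can be useful to confirm that the installed transformers version still falls inside the range vLLM needs. The following is a simplified sketch using only the standard library; `release_tuple` and `satisfies_pin` are hypothetical helpers, and the version parsing ignores pre-release suffixes rather than implementing full PEP 440 ordering.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release part, e.g. '4.46.3' -> (4, 46, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfies_pin(version: str) -> bool:
    """True if version is within transformers>=4.46,<5.0."""
    return (4, 46) <= release_tuple(version) < (5, 0)


try:
    installed = metadata.version("transformers")
    status = "OK" if satisfies_pin(installed) else "outside >=4.46,<5.0"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```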
