Master’s Thesis by Anubhuti Singh, April 2025
Paderborn University — Data Science Research Group
- Overview
- Repository Structure
- Data & Knowledge Graph
- Prerequisites
- Installation & Setup
- Usage
- Python Scripts Overview
- Generating Requirements File
- Acknowledgments
## Overview

This project investigates how integrating Knowledge Graphs (KGs) with Large Language Models (LLMs) can improve accuracy and reasoning in organizational-domain question answering. We evaluate:
- Traditional Retrieval (TF-IDF, BM25) vs. Dense RAG (DPR)
- KG-Enhanced Retrieval (subgraph vs. triple chunking + fusion)
- SPARQL-Driven Querying (LLM→SPARQL→KG)
## Data & Knowledge Graph

- Source: four CSV/Excel tables (employees, organizations, applications, processes).
- Format: original, un-anonymized tables that preserve full fidelity and foreign-key relationships.
- Location: stored off-repo (internal).
- Construction: `kgCreation/CompleteFINKG.ipynb` uses `rdflib` and `ontology_schema.jsonld`.
- Output: `kgCreation/ExtendedFinKG_anonymized.ttl` (Turtle).
- Anonymization: node IDs and predicates are pseudonymized; the overall structure (the entity types and how they relate) keeps clear, readable names.
- Anonymized Q&A: `anonymize/groundTruth_anonymized.xlsx` contains expert-verified queries and answers.
- Regeneration: run `anonymize/anonymize.ipynb` to adjust or re-anonymize.
## Prerequisites

- Python 3.8+
- Conda (recommended) or `virtualenv` + `pip`
- Git
- (Optional) Java 8+ for Blazegraph
## Installation & Setup

```bash
git clone https://github.com/your-username/anubhuti_master_thesis-1-main.git
cd anubhuti_master_thesis-1-main
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```
### Blazegraph Setup

1. **Download Blazegraph**
   - Visit the Blazegraph Releases page.
   - Download the `blazegraph.jar` file.

2. **Run Blazegraph**
   - Place the `blazegraph.jar` file in a folder (e.g., `C:\Blazegraph`).
   - Open a command prompt and navigate to that folder:

     ```bash
     cd C:\Blazegraph
     ```

   - Start Blazegraph:

     ```bash
     java -server -Xmx4g -jar blazegraph.jar
     ```

   - Once Blazegraph is running, access the web UI at `http://localhost:9999/blazegraph`.

3. **Create a namespace**
   - Navigate to "Namespace" in the UI and create a new namespace with:
     - Name: `myGraph`
     - Mode: `triples`
   - The default namespace is `kb`.
   - Update the Blazegraph endpoint URL in every file that queries the graph database so it points at the new namespace:
     `http://localhost:9999/blazegraph/namespace/myGraph/sparql`
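Once the namespace exists, the endpoint can be queried with the standard SPARQL 1.1 protocol over HTTP. The sketch below uses only the Python standard library; it assumes a Blazegraph instance is running locally as described above.

```python
# Minimal sketch: querying the Blazegraph SPARQL endpoint over HTTP.
# Uses only the standard SPARQL 1.1 protocol (POST with a "query" field,
# JSON results); assumes the "myGraph" namespace created above.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:9999/blazegraph/namespace/myGraph/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""

def run_select(endpoint, query):
    """POST a SELECT query and return the result bindings as dicts."""
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return payload["results"]["bindings"]

# Example (requires a running Blazegraph instance):
# for row in run_select(ENDPOINT, QUERY):
#     print({k: v["value"] for k, v in row.items()})
```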
## Python Scripts Overview

Below is an overview of the key Python files and their functionality.

### Baseline retrieval

Implements baseline retrieval approaches.

- `BM25.ipynb` – Lexical retrieval with BM25.
- `DPR.ipynb` – Dense retrieval using transformers.
- `tf-idf.ipynb` – TF-IDF-based baseline.
- `simpleRAG.py` – A simple Retrieval-Augmented Generation (RAG) prototype using CSV data.
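To illustrate the lexical baseline, here is a small dependency-free BM25 scorer. `BM25.ipynb` may well rely on a library such as `rank_bm25`; this standalone version just shows the scoring formula (with the common defaults k1 = 1.5, b = 0.75) on toy documents.

```python
# Standalone BM25 sketch; documents and query are illustrative only.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency per term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "the finance department approves travel requests",
    "employees submit expense reports to finance",
    "the cafeteria menu changes weekly",
]
scores = bm25_scores("finance expense reports", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 1 (the document matching all three query terms)
```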
### SPARQL query generation and evaluation

SPARQL query generation and evaluation using Blazegraph.

- `eval.ipynb` – Approach evaluation using Precision, Recall, F1, MRR, hits@k, ROUGE, BLEU, and chrF.
- `LLMApache.ipynb` – Query generation and validation with an LLM against an Apache Jena Fuseki graph database.
- `LLMBlazegraphVal.ipynb` – Batch SPARQL generation and LLM answer generation from Excel queries using the Blazegraph graph database.
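Two of the ranking metrics listed above, MRR and hits@k, can be sketched as follows. The input format here is an assumption: one ranked list of retrieved IDs per query, plus a single gold ID.

```python
# Mean Reciprocal Rank: average of 1/rank of the first relevant hit.
def mrr(ranked_lists, golds):
    total = 0.0
    for ranked, gold in zip(ranked_lists, golds):
        for i, doc_id in enumerate(ranked, start=1):
            if doc_id == gold:
                total += 1.0 / i
                break
    return total / len(golds)

# hits@k: fraction of queries whose gold answer appears in the top k.
def hits_at_k(ranked_lists, golds, k):
    hits = sum(1 for ranked, gold in zip(ranked_lists, golds)
               if gold in ranked[:k])
    return hits / len(golds)

ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = ["b", "x", "q"]
print(mrr(ranked, gold))           # (1/2 + 1 + 0) / 3 = 0.5
print(hits_at_k(ranked, gold, 1))  # 1/3
```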
### Knowledge graph construction (`kgCreation/`)

Knowledge graph construction and schema management.

- `CompleteFINKG.ipynb` – Builds the complete financial knowledge graph from CSVs and RDF triples.
- `ExtendedFinKG_anonymized.ttl` – An anonymized RDF Turtle file representing an extended version of the financial knowledge graph.
- `ontology_schema.jsonld` – JSON-LD representation of the ontology used to define the schema and classes of the knowledge graph.
### LangGraph pipelines

LangGraph-based RAG pipelines.

- `LG_hybrid_subgraph.ipynb` – Retrieves and processes hybrid subgraphs using LLM-assisted query parsing and subgraph ranking.
- `LG_hybrid_triple.ipynb` – Focuses on triple-level retrieval and evaluation within the LangGraph pipeline.
- `LG_LLMblazegraph.ipynb` – Combines LangGraph and Blazegraph to perform end-to-end retrieval and answer generation via SPARQL.
### Hybrid RDF-RAG pipeline

Final and hybridized implementation of the RDF-RAG pipeline.

- `hybrid.ipynb` – Main hybrid retrieval pipeline that combines lexical and dense retrieval (fused via RRF), LLM parsing, subgraph linking, and natural language generation.
- `hybrid_triple.ipynb` – Variant of the hybrid model working at the triple level rather than the subgraph level.
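The Reciprocal Rank Fusion step that merges the lexical and dense ranked lists can be sketched in a few lines. The constant k = 60 is the value commonly used in the literature; the exact constant and tie-breaking in `hybrid.ipynb` are assumptions here.

```python
# Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) per
# document; documents are re-ranked by their summed score.
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]   # e.g., BM25 ranking
dense = ["d3", "d1", "d4"]     # e.g., DPR ranking
fused = rrf_fuse([lexical, dense])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents ranked highly by both retrievers (here `d1` and `d3`) rise to the top even when neither retriever ranks them first.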
## Generating Requirements File

To generate a `requirements.txt` file listing the libraries used in the project, make sure all dependencies are installed in your current Python environment, then run:

```bash
pip freeze > requirements.txt
```

This creates (or overwrites) a `requirements.txt` file pinning the exact versions of the installed packages.
## Acknowledgments

- Siemens Energy for providing computational resources and domain expertise.
- Paderborn University for academic support and guidance.

For any questions or requests, please contact Anubhuti Singh at [email protected].