rename to sparql_llm
vemonet committed Sep 17, 2024
1 parent 5358178 commit c1e41f9
Showing 29 changed files with 232 additions and 347 deletions.
51 changes: 51 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,51 @@
name: Tests
on: [push, pull_request, workflow_call, workflow_dispatch]

jobs:

  tests:
    name: ✅ Run tests
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        # os: ["ubuntu-latest", "windows-latest", "macos-latest"]
        os: ["ubuntu-latest"]
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install hatch

      - name: Test with coverage
        run: |
          hatch run test

  codeql:
    name: 🔎 CodeQL analysis
    runs-on: ubuntu-latest
    permissions:
      security-events: write
      contents: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: python

      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v3
        with:
          category: "/language:python"
4 changes: 2 additions & 2 deletions Dockerfile
@@ -13,8 +13,8 @@ RUN pip install --upgrade pip
COPY . /app/
# COPY ./scripts/prestart.sh /app/

RUN pip install -e ".[cpu]"
RUN pip install -e "."

ENV PYTHONPATH=/app
ENV MODULE_NAME=src.expasy_chat.api
ENV MODULE_NAME=src.sparql_llm.api
# ENV VARIABLE_NAME=app
56 changes: 32 additions & 24 deletions README.md
@@ -1,11 +1,15 @@
# 🦜✨ LLM for SPARQL query generation

Reusable components and a complete web app to improve the capabilities of Large Language Models (LLMs) when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and query validation based on the endpoint schema.

The different components of the system can be used separately, or the whole chat system web app can be deployed for a set of endpoints. It relies on the endpoints containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and an endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can be generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).

This repository contains:

* Utilities and functions to improve LLMs capabilities when working with [SPARQL](https://www.w3.org/TR/sparql11-overview/) endpoints and [RDF](https://www.w3.org/RDF/) knowledge graph. In particular improving SPARQL query generation.
* Loaders are compatible with [LangChain](https://python.langchain.com), but they can also be used outside of LangChain as they just return a list of documents with metadata as JSON, which can then be loaded how you want in your vectorstore.
* A complete reusable system to deploy a LLM chat system for multiple SPARQL endpoints (WIP)
* The deployment for **[chat.expasy.org](https://chat.expasy.org)** the LLM chat system to help users accessing the endpoints maintained at the SIB
* Functions to extract and load relevant metadata from SPARQL endpoints. Loaders are compatible with [LangChain](https://python.langchain.com), but they can also be used outside of LangChain, as they just return a list of documents with metadata as JSON, which can then be loaded into the vector store of your choice (a minimal sketch follows this list).
* Functions to automatically parse and validate SPARQL queries based on an endpoint's VoID description.
* A complete reusable system to deploy an LLM chat service with web UI, API, and vector database, designed to help users write SPARQL queries for a given set of endpoints by exploiting the metadata uploaded to those endpoints (WIP).
* The deployment configuration for **[chat.expasy.org](https://chat.expasy.org)**, the LLM chat system that helps users access the endpoints maintained at the [SIB](https://www.sib.swiss/).
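
A minimal sketch (not part of the original README) of how the documents returned by a loader can be indexed for retrieval. The vector store and embedding model below are arbitrary example choices, and the `SparqlExamplesLoader` usage is assumed from the loader API described in this README rather than copied from the original file:

```python
# Hedged sketch: index the documents returned by a loader into a LangChain vector store.
# FAISS and FastEmbed are arbitrary choices (pip install langchain-community fastembed faiss-cpu).
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import FAISS

from sparql_llm import SparqlExamplesLoader

docs = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/").load()
vectordb = FAISS.from_documents(docs, FastEmbedEmbeddings())

hits = vectordb.similarity_search("Rat proteins expressed in the brain", k=3)
print(hits[0].metadata)
```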

## 🪄 Reusable components

@@ -34,9 +38,11 @@ print(docs[0].metadata)
### SPARQL endpoint schema loader

Generate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint based on its [VoID description](https://www.w3.org/TR/void/) present in your endpoint. Ideally the endpoint should also contain the ontology describing the class, so the `rdfs:label` and `rdfs:comment` of the class can be used to generate embeddings and improve semantic matching.
Generate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint based on the [VoID description](https://www.w3.org/TR/void/) present in the endpoint. Ideally the endpoint should also contain the ontology describing the classes, so the `rdfs:label` and `rdfs:comment` of the classes can be used to generate embeddings and improve semantic matching.

Checkout the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate VoID description for your endpoint.
> [!TIP]
>
> Check out the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate a VoID description for your endpoint.

```python
from sparql_llm import SparqlVoidShapesLoader
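
# The rest of this example is collapsed in the diff view. The lines below are a
# hedged sketch of typical usage; the endpoint URL and the printed fields are
# illustrative assumptions, not taken from the original file.
loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```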
@@ -64,25 +70,25 @@ This takes a SPARQL query and validates the predicates/types used are compliant

This function supports:

* federated queries (VoID description will be retrieved for each SERVICE call),
* federated queries (VoID description will be automatically retrieved for each SERVICE call in the query),
* path patterns (e.g. `orth:organism/obo:RO_0002162/up:scientificName`)

The function requires that at least one type is defined for each endpoint, but it will be able to infer types of subjects that are connected to the subject for which the type is defined.
This function requires that at least one type is defined for each endpoint, but it can infer the types of subjects that are connected to a subject whose type is defined.

It will return a list of issues described in natural language, with hints on how to fix them (by listing the available classes or predicates in the context), which can be passed to an LLM to help for fixing the query.
It will return a list of issues described in natural language, with hints on how to fix them (by listing the available classes/predicates), which can be passed to an LLM as context to help it figure out how to fix the query.

```python
from sparql_llm import validate_sparql_with_void

sparql_query = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth:<http://purl.org/net/orth#>
PREFIX dcterms:<http://purl.org/dc/terms/>
PREFIX obo:<http://purl.obolibrary.org/obo/>
PREFIX lscr:<http://purl.org/lscr#>
PREFIX genex:<http://purl.org/genex#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX genex: <http://purl.org/genex#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?diseaseLabel ?humanProtein ?hgncSymbol ?orthologRatProtein ?orthologRatGene
WHERE {
@@ -122,39 +128,41 @@ WHERE {
?anatEntity rdfs:label 'brain' .
?ratOrganism obo:RO_0002162 taxon:10116 .
}
}
"""
}"""

issues = validate_sparql_with_void(sparql_query, "https://sparql.uniprot.org/sparql/")
print("\n".join(issues))
```
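
Below is a hedged sketch (not part of the original README) of passing the returned issues back to an LLM so it can propose a fixed query. The OpenAI client, model name, and prompt are illustrative assumptions (`pip install openai`):

```python
# Hedged sketch: ask an LLM to repair the query, using the validation issues as context.
from openai import OpenAI

if issues:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": "Fix the SPARQL query using the reported issues. Return only the corrected query."},
            {"role": "user", "content": f"Query:\n{sparql_query}\n\nIssues:\n" + "\n".join(issues)},
        ],
    )
    print(response.choices[0].message.content)
```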

## 🚀 Deploy chat system
## 🚀 Complete chat system

> [!WARNING]
>
> To deploy the complete chat system right now you will need to fork this repository, change the configuration in `src/sparql_llm/config.py` and `compose.yml`, then deploy with docker/podman compose.
>
> We plan to make configuration and deployment of complete SPARQL LLM chat system easier in the future, let us know if you are interested in the GitHub issues!
> It can easily be adapted to use any LLM served through an OpenAI-compatible API. We plan to make the configuration and deployment of the complete SPARQL LLM chat system easier in the future; let us know in the GitHub issues if you are interested!

Create a `.env` file at the root of the repository to provide the API keys used by the system:

```bash
OPENAI_API_KEY=sk-proj-YYY
GLHF_API_KEY=APIKEY_FOR_glhf.chat_USED_FOR_OPEN_SOURCE_MODELS
EXPASY_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=PASSWORD_TO_ACCESS_LOGS_THROUGH_THE_API
LOGS_API_KEY=PASSWORD_TO_EASILY_ACCESS_LOGS_THROUGH_THE_API
```
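
As noted in the warning above, the system talks to LLMs through an OpenAI-compatible API, so another provider can be plugged in by changing the base URL, API key, and model name (the actual wiring lives in `src/sparql_llm/config.py`). A minimal, hypothetical sketch with the OpenAI Python client, using placeholder values:

```python
# Hedged sketch: point the OpenAI client at any OpenAI-compatible server.
# The base URL and model name are placeholders, not values used by this project.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://your-openai-compatible-server.example/v1",
    api_key=os.environ["GLHF_API_KEY"],
)
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a SPARQL query to list all human proteins in UniProt."}],
)
print(response.choices[0].message.content)
```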

Start the web UI, API, and similarity search engine in production (you might need to make some changes to the `compose.yml` file to adapt it to your server setup):
Start the web UI, API, and similarity search engine in production (you might need to make some changes to the `compose.yml` file to adapt it to your server/proxy setup):

```bash
docker compose up
```

Start the stack locally for development:
Start the stack locally for development, with the code from the `src` folder mounted in the container and automatic API reload on code changes:

```bash
docker compose -f compose.dev.yml up
```

* Chat web UI available at http://localhost:8000
* OpenAPI Swagger UI available at http://localhost:8000/docs
* Vector database dashboard UI available at http://localhost:6333/dashboard
1 change: 1 addition & 0 deletions compose.dev.yml
@@ -21,6 +21,7 @@ services:
- ./prestart.sh:/app/prestart.sh
entrypoint: /start-reload.sh

# In case you need a GPU-enabled workspace
# workspace:
# image: ghcr.io/vemonet/gpu-workspace:main
# # Enable GPUs in this container:
20 changes: 3 additions & 17 deletions compose.yml
@@ -2,7 +2,6 @@ services:

vectordb:
# https://hub.docker.com/r/qdrant/qdrant/tags
# image: docker.io/qdrant/qdrant:v1.9.5
image: docker.io/qdrant/qdrant:v1.11.3
# image: qdrant/qdrant:v1.9.2-unprivileged # Unprivileged don't work when mounting a volume
container_name: vectordb
@@ -12,11 +11,6 @@
# - ./qdrant_config.yml:/qdrant/config/production.yaml
environment:
- QDRANT_ALLOW_RECOVERY_MODE=true
# networks:
# - default
# ports:
# - 6333:6333
# - 6334:6334
# command:
# - ./qdrant --config-path /qdrant/config/production.yaml

@@ -34,21 +28,13 @@
- ./data/fastembed_cache:/tmp/fastembed_cache
- ./data/logs:/logs
- ./src:/app/src
# entrypoint: uvicorn src.expasy_chat.api:app --host 0.0.0.0 --port 80
# entrypoint: uvicorn src.sparql_llm.api:app --host 0.0.0.0 --port 80
env_file:
- .env
# networks:
# - default

# TODO: add ollama

# networks:
# default:
# driver: pasta
# # driver: bridge


# podman-compose down && podman network prune -f
# podman exec -it expasy-chat_api_1 bash -c "apt-get update && apt-get install -y telnet && telnet vectordb 6334"
# podman exec -it sparql-llm_api_1 bash -c "apt-get update && apt-get install -y telnet && telnet vectordb 6334"
# < /dev/tcp/vectordb/6334
# podman exec -it api bash -c "< /dev/tcp/vectordb/6334"
12 changes: 6 additions & 6 deletions deploy.sh
@@ -1,13 +1,13 @@
if [ "$1" = "--build" ]; then
echo "📦️ Re-building"
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/expasy-chat ; git pull ; podman-compose up --force-recreate --build -d"'
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/sparql-llm ; git pull ; podman-compose up --force-recreate --build -d"'
elif [ "$1" = "--logs" ]; then
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/expasy-chat ; podman-compose logs api"'
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/sparql-llm ; podman-compose logs api"'
elif [ "$1" = "--likes" ]; then
mkdir -p data/prod
scp expasychat:/var/containers/podman/expasy-chat/data/logs/likes.jsonl ./data/prod/
scp expasychat:/var/containers/podman/expasy-chat/data/logs/dislikes.jsonl ./data/prod/
scp expasychat:/var/containers/podman/expasy-chat/data/logs/user_questions.log ./data/prod/
scp expasychat:/var/containers/podman/sparql-llm/data/logs/likes.jsonl ./data/prod/
scp expasychat:/var/containers/podman/sparql-llm/data/logs/dislikes.jsonl ./data/prod/
scp expasychat:/var/containers/podman/sparql-llm/data/logs/user_questions.log ./data/prod/
else
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/expasy-chat ; git pull ; podman-compose up --force-recreate -d"'
ssh expasychat 'sudo -u podman bash -c "cd /var/containers/podman/sparql-llm ; git pull ; podman-compose up --force-recreate -d"'
fi
6 changes: 3 additions & 3 deletions notebooks/compare_queries_examples_to_void.ipynb
@@ -356,9 +356,9 @@
"source": [
"from qdrant_client.models import FieldCondition, Filter, MatchValue\n",
"\n",
"from expasy_chat.config import settings\n",
"from expasy_chat.embed import get_vectordb\n",
"from expasy_chat.validate_sparql import get_void_dict, sparql_query_to_dict\n",
"from sparql_llm.config import settings\n",
"from sparql_llm.embed import get_vectordb\n",
"from sparql_llm.validate_sparql import get_void_dict, sparql_query_to_dict\n",
"\n",
"check_endpoints = {\n",
" \"UniProt\": \"https://sparql.uniprot.org/sparql/\",\n",
2 changes: 1 addition & 1 deletion notebooks/compute_stats_on_example_queries.ipynb
@@ -45,7 +45,7 @@
"import pandas as pd\n",
"from rdflib import Graph\n",
"\n",
"from expasy_chat.validate_sparql import sparql_query_to_dict\n",
"from sparql_llm.validate_sparql import sparql_query_to_dict\n",
"\n",
"GET_EXAMPLE_QUERY = \"\"\"PREFIX sh: <http://www.w3.org/ns/shacl#>\n",
"PREFIX schema: <https://schema.org/>\n",
2 changes: 1 addition & 1 deletion notebooks/get_shex_from_void.ipynb
@@ -1504,7 +1504,7 @@
}
],
"source": [
"from expasy_chat.void_to_shex import get_shex_dict_from_void, get_shex_from_void\n",
"from sparql_llm.void_to_shex import get_shex_dict_from_void, get_shex_from_void\n",
"\n",
"shex_dict = get_shex_dict_from_void(\"https://sparql.uniprot.org/sparql/\")\n",
"print(len(shex_dict))\n",
