update readme and start rename
vemonet committed Sep 17, 2024
1 parent bf1288b commit 5358178
Showing 8 changed files with 326 additions and 158 deletions.
76 changes: 76 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,76 @@
# 🧑‍💻 Development setup

This page describes how to run the package and its reusable components in development, and how to get involved by making a code contribution.

## 📥️ Clone

Clone the repository:

```bash
git clone https://github.com/sib-swiss/sparql-llm
cd sparql-llm
```

## 🐣 Install dependencies

> This repository uses [`hatch`](https://hatch.pypa.io/latest/) to handle scripts and virtual environments. Check out the `pyproject.toml` file for more details on the available scripts. You can also just install dependencies with `pip install .` and run the Python scripts in `src`.

Install [Hatch](https://hatch.pypa.io); it automatically handles virtual environments and makes sure all dependencies are installed when you run a script in the project:

```bash
pipx install hatch
```
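The `hatch run` scripts used below are defined in the `pyproject.toml` file; a minimal sketch of what such a scripts section can look like (these exact commands are an assumption, check `pyproject.toml` for the real definitions):

```toml
[tool.hatch.envs.default.scripts]
# Illustrative definitions, not necessarily the ones in this repository:
test = "pytest {args}"
fmt = "ruff check --fix . && ruff format ."
```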

## ☑️ Run tests

Make sure the existing tests still pass by running the test suite and linting checks. Note that any pull request to this repository on GitHub will automatically trigger the test suite:

```bash
hatch run test
```

To display all logs when debugging:

```bash
hatch run test -s
```

## 🧹 Format code

```bash
hatch run fmt
```

## ♻️ Reset the environment

In case you are facing issues with dependencies not updating properly, you can easily reset the virtual environment with:

```bash
hatch env prune
```

Manually trigger installing the dependencies in a local virtual environment:

```bash
hatch -v env create
```

## 🏷️ New release process

The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:

1. Make sure the `PYPI_TOKEN` secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from PyPI at [pypi.org/manage/account](https://pypi.org/manage/account).
2. Increment the `version` number in the `pyproject.toml` file at the root of the repository, e.g. with:

```bash
hatch version fix
```

3. Create a new release on GitHub, which will automatically trigger the publish workflow, and publish the new release to PyPI.

You can also build and publish from your computer:

```bash
hatch build
hatch publish
```
114 changes: 0 additions & 114 deletions README-components.md

This file was deleted.

156 changes: 141 additions & 15 deletions README.md
@@ -1,34 +1,160 @@
# 🦜✨ LLM for SPARQL query generation

This repository contains:

* Utilities and functions to improve LLM capabilities when working with [SPARQL](https://www.w3.org/TR/sparql11-overview/) endpoints and [RDF](https://www.w3.org/RDF/) knowledge graphs, in particular for SPARQL query generation.
* Loaders compatible with [LangChain](https://python.langchain.com), but also usable outside of it: they just return a list of documents with metadata as JSON, which can be loaded into the vector store of your choice.
* A complete reusable system to deploy an LLM chat system for multiple SPARQL endpoints (work in progress).
* The deployment for **[chat.expasy.org](https://chat.expasy.org)**, the LLM chat system that helps users access the SPARQL endpoints maintained at the SIB.
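Since the loaders simply return documents (a text to embed plus a metadata dictionary), they can be serialized and ingested without LangChain; a minimal stdlib-only sketch (the field values are illustrative, not actual loader output):

```python
import json

# Illustrative shape of a document returned by the loaders:
# the text to embed plus a metadata dictionary.
docs = [
    {
        "page_content": "Retrieve all proteins annotated with a disease",
        "metadata": {"endpoint_url": "https://sparql.uniprot.org/sparql/"},
    }
]

# Serialize to JSON so any vector store ingestion pipeline can consume it.
serialized = json.dumps(docs, indent=2)
restored = json.loads(serialized)
print(restored[0]["metadata"]["endpoint_url"])
```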

## 🪄 Reusable components

### Installation

This package requires Python >=3.9. Install it from the Git repository with:

```bash
pip install git+https://github.com/sib-swiss/sparql-llm.git
```

### SPARQL query examples loader

Load SPARQL query examples defined using the SHACL ontology from a SPARQL endpoint. See **[github.com/sib-swiss/sparql-examples](https://github.com/sib-swiss/sparql-examples)** for more details on how to define the examples.

```python
from sparql_llm import SparqlExamplesLoader

loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```

> Refer to the [LangChain documentation](https://python.langchain.com/v0.2/docs/) to figure out how to best integrate document loaders into your stack.

### SPARQL endpoint schema loader

Generate a human-readable schema in the ShEx format describing all classes of a SPARQL endpoint, based on the [VoID description](https://www.w3.org/TR/void/) present in the endpoint. Ideally the endpoint should also contain the ontology describing the classes, so that the `rdfs:label` and `rdfs:comment` of each class can be used to generate embeddings and improve semantic matching.

Check out the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate a VoID description for your endpoint.

```python
from sparql_llm import SparqlVoidShapesLoader

loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```

### Generate complete ShEx shapes from VoID description

You can also generate the complete ShEx shapes for a SPARQL endpoint with:

```python
from sparql_llm import get_shex_from_void

shex_str = get_shex_from_void("https://sparql.uniprot.org/sparql/")
print(shex_str)
```

### Validate a SPARQL query based on VoID description

This takes a SPARQL query and validates that the predicates and types used comply with the VoID description present in the SPARQL endpoint the query is executed on.

This function supports:

* federated queries (the VoID description is retrieved for each `SERVICE` clause),
* property path patterns (e.g. `orth:organism/obo:RO_0002162/up:scientificName`).

The function requires that at least one type is defined per endpoint, but it can infer the types of subjects connected to a subject whose type is defined.

It returns a list of issues described in natural language, with hints on how to fix them (such as listing the available classes or predicates in the context), which can be passed to an LLM to help fix the query.

```python
from sparql_llm import validate_sparql_with_void

sparql_query = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth:<http://purl.org/net/orth#>
PREFIX dcterms:<http://purl.org/dc/terms/>
PREFIX obo:<http://purl.obolibrary.org/obo/>
PREFIX lscr:<http://purl.org/lscr#>
PREFIX genex:<http://purl.org/genex#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?diseaseLabel ?humanProtein ?hgncSymbol ?orthologRatProtein ?orthologRatGene
WHERE {
SERVICE <https://sparql.uniprot.org/sparql> {
SELECT DISTINCT * WHERE {
?humanProtein a up:Protein ;
up:organism/up:scientificName 'Homo sapiens' ;
up:annotation ?annotation ;
rdfs:seeAlso ?hgnc .
?hgnc up:database <http://purl.uniprot.org/database/HGNC> ;
rdfs:label ?hgncSymbol . # comment
?annotation a up:Disease_Annotation ;
up:disease ?disease .
?disease a up:Disease ;
rdfs:label ?diseaseLabel . # skos:prefLabel
FILTER CONTAINS(?diseaseLabel, "cancer")
}
}
SERVICE <https://sparql.omabrowser.org/sparql/> {
SELECT ?humanProtein ?orthologRatProtein ?orthologRatGene WHERE {
?humanProteinOma a orth:Protein ;
lscr:xrefUniprot ?humanProtein .
?orthologRatProtein a orth:Protein ;
sio:SIO_010078 ?orthologRatGene ; # 79
orth:organism/obo:RO_0002162/up:scientificNam 'Rattus norvegicus' .
?cluster a orth:OrthologsCluster .
?cluster orth:hasHomologousMember ?node1 .
?cluster orth:hasHomologousMember ?node2 .
?node1 orth:hasHomologousMember* ?humanProteinOma .
?node2 orth:hasHomologousMember* ?orthologRatProtein .
FILTER(?node1 != ?node2)
}
}
SERVICE <https://www.bgee.org/sparql/> {
?orthologRatGene genex:isExpressedIn ?anatEntity ;
orth:organism ?ratOrganism .
?anatEntity rdfs:label 'brain' .
?ratOrganism obo:RO_0002162 taxon:10116 .
}
}
"""

issues = validate_sparql_with_void(sparql_query, "https://sparql.uniprot.org/sparql/")
print("\n".join(issues))
```
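The returned issues can be wrapped into a repair prompt for an LLM; a minimal sketch (the prompt wording and helper name are illustrative, not part of the library):

```python
def build_repair_prompt(query: str, issues: list[str]) -> str:
    """Turn validation issues into a prompt asking an LLM to fix the query."""
    issues_block = "\n".join(f"- {issue}" for issue in issues)
    return (
        "Fix the following SPARQL query so it complies with the endpoint schema.\n"
        f"Validation issues:\n{issues_block}\n\n"
        f"Query:\n{query}"
    )

prompt = build_repair_prompt(
    "SELECT * WHERE { ?s ?p ?o }",
    ["Predicate up:scientificNam not found, did you mean up:scientificName?"],
)
print(prompt)
```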

## 🚀 Deploy chat system

> [!WARNING]
>
> To deploy the complete chat system right now you will need to fork this repository, change the configuration in `src/sparql_llm/config.py` and `compose.yml`, then deploy with docker/podman compose.
>
> We plan to make configuration and deployment of complete SPARQL LLM chat system easier in the future, let us know if you are interested in the GitHub issues!

Create a `.env` file at the root of the repository to provide the required API keys:

```bash
OPENAI_API_KEY=sk-proj-YYY
GLHF_API_KEY=APIKEY_FOR_glhf.chat_USED_FOR_OPEN_SOURCE_MODELS
EXPASY_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=PASSWORD_TO_ACCESS_LOGS_THROUGH_THE_API
```

Start the web UI, API, and similarity search engine in production (you might need to adapt the `compose.yml` file to your server setup):

```bash
docker compose up
```

Start the stack locally for development:

```bash
docker compose -f compose.dev.yml up
```

8 changes: 7 additions & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
[project]
requires-python = ">=3.9"
version = "0.0.1"
name = "expasy-chat"
name = "sparql-llm"
description = "Scripts for Expasy 4, prepare data, train models, etc."
license = "MIT"
authors = [
@@ -104,6 +104,7 @@ target-version = "py39"
line-length = 120
exclude = [
"notebooks",
"**/__init__.py",
]

[tool.ruff.lint]
@@ -137,3 +138,8 @@ ignore = [
"T201", # do not use print
"B008", # do not perform function calls in argument defaults
]

[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["I", "F401"] # module imported but unused
# Tests can use magic values, assertions, and relative imports:
"tests/**/*" = ["PLR2004", "S101", "S105", "TID252"]