Commit 1d95afe

Authored by a-klos and Copilot
fix: add missing information for the pyproject toml files and add dedicated documentation for each lib (#151)
This pull request introduces major improvements to documentation, metadata, and configuration for the three main Python libraries in the STACKIT RAG template: `admin-api-lib`, `extractor-api-lib`, and `rag-core-api`. The changes focus on adding comprehensive README files for each library, updating package metadata in `pyproject.toml` for clarity and compliance, and refining dependency and configuration management. These updates make the libraries easier to understand, install, and extend, and improve maintainability for both operators and developers.

**Documentation enhancements:**

* Added detailed `README.md` files for `libs/admin-api-lib`, `libs/extractor-api-lib`, and `libs/rag-core-api`, describing module responsibilities, features, endpoints, configuration, usage, extension, and contribution guidelines. [[1]](diffhunk://#diff-0064014deac3d21031c406697c008f92f0bb2783aa7eaaaf264a2345eea2cc9eR1-R96) [[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aR1-R94) [[3]](diffhunk://#diff-eb80132f5f4660c40ce8a60f375daec36d19a5e070d120a478f60d74384183d9R1-R96)

**Package metadata and configuration improvements:**

* Updated `pyproject.toml` for all three libraries to include new version numbers (`v3.2.1`), expanded author and maintainer information, license, repository, homepage, and readme fields for better package distribution and compliance. [[1]](diffhunk://#diff-9c5aeb0db77c2eec077d07ddc3b3810ae1a4a1e50ee7061fba37a46706c513fbL7-R19) [[2]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L7-R19) [[3]](diffhunk://#diff-9c4162cc1c16dd4c7ec5e95e79df285e8c0882a1db7ff2892c746a0537d26c36L7-R19)
* Improved dependency specification in `libs/extractor-api-lib/pyproject.toml` by switching `fasttext` to a stable PyPI version and adjusting other package versions.
* Refined pytest and flake8 configuration for consistency and clarity, such as changing `log_cli` to boolean and updating exclusions. [[1]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L139-R148) [[2]](diffhunk://#diff-9c5aeb0db77c2eec077d07ddc3b3810ae1a4a1e50ee7061fba37a46706c513fbL7-R19)

These changes collectively strengthen the documentation, usability, and maintainability of the STACKIT RAG template libraries, making them more accessible for new users and contributors.

---------

Co-authored-by: Copilot <[email protected]>
1 parent 66570f9 commit 1d95afe

19 files changed, +532 -164 lines changed

libs/admin-api-lib/README.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# admin-api-lib

Document lifecycle orchestration for the STACKIT RAG template. This library exposes a FastAPI-compatible admin surface that receives raw user content, coordinates extraction, summarisation, chunking, and storage, and finally hands normalized information pieces to the core RAG API.

It powers the [`services/admin-backend`](https://github.com/stackitcloud/rag-template/tree/main/services/admin-backend) deployment and is the primary integration point for operators managing their document corpus.

## Responsibilities

1. **Ingestion** – Accept files or external sources from the admin UI or API clients.
2. **Extraction** – Call `extractor-api-lib` to obtain normalized information pieces.
3. **Enhancement** – Summarize and enrich content using LLMs and tracing hooks from `rag-core-lib`.
4. **Chunking** – Split content via recursive or semantic strategies before vectorization.
5. **Persistence** – Store raw assets in S3-compatible storage and push processed chunks to `rag-core-api`.
6. **Status tracking** – Keep track of upload progress and expose document status endpoints backed by KeyDB/Redis.

## Feature highlights

- Ready-to-wire dependency-injector container with sensible defaults for S3 storage, KeyDB status tracking, and background tasks.
- Pluggable chunkers (`recursive` vs `semantic`) and summariser implementations with shared retry/backoff controls.
- Rich Pydantic request/response models covering uploads, non-file sources, and document status queries.
- Thin endpoint implementations that can be swapped or extended while keeping the public API stable.
- Structured tracing (Langfuse) and logging that mirror the behaviour of the chat backend.

## Installation

```bash
pip install admin-api-lib
```

Requires Python 3.13 and `rag-core-lib`.

## Module tour

- `dependency_container.py` – Configures and wires dependency-injection providers. Override registrations here to customise behaviour.
- `api_endpoints/` & `impl/api_endpoints/` – Endpoint abstractions and implementations for file uploads, source uploads, deletions, document status, and reference retrieval.
- `apis/` – Admin API abstractions and implementations.
- `chunker/` & `impl/chunker/` – Abstractions, default text and semantic chunkers, and the chunker type selection class.
- `extractor_api_client/` & `rag_backend_client/` – Generated OpenAPI clients for the extractor and RAG core API services.
- `file_services/` & `impl/file_services/` – Abstract file service interface and its default S3 implementation.
- `summarizer/` & `impl/summarizer/` – Interfaces and a LangChain-based summariser that uses shared retry logic.
- `information_enhancer/` & `impl/information_enhancer/` – Abstractions plus page and summary enhancers, which are combined through a general enhancer.
- `impl/key_db/` – KeyDB/Redis client implementation for document status tracking.
- `impl/mapper/` – Mapper between extractor documents and LangChain documents.
- `impl/settings/` – Configuration settings for dependency injection container components.
- `prompt_templates/` – Default summarisation prompt shipped with the template.
- `utils/` – Utility functions and classes.

## Endpoints provided

- `POST /upload_file` – Uploads user-selected files.
- `POST /upload_source` – Uploads user-selected sources.
- `DELETE /documents/{identification}` – Deletes a document from the system.
- `GET /document_reference/{identification}` – Retrieves a document reference.
- `GET /all_documents_status` – Retrieves the status of all documents.

Refer to [`libs/README.md`](../README.md#2-admin-api-lib) for in-depth API documentation.
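
As a rough illustration of how a client might address two of the routes listed above, the sketch below builds (but does not send) requests with Python's standard library. The base URL `http://localhost:8080` and the identifier `file:report.pdf` are assumptions made for the example, not values taken from the library.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: where the admin backend is reachable; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def delete_document_request(identification: str) -> Request:
    """Build a DELETE request for /documents/{identification} (not sent)."""
    # Percent-encode the identifier so characters like ':' or '/' stay in one path segment.
    path = f"/documents/{quote(identification, safe='')}"
    return Request(BASE_URL + path, method="DELETE")


def all_documents_status_request() -> Request:
    """Build a GET request for /all_documents_status (not sent)."""
    return Request(BASE_URL + "/all_documents_status", method="GET")


# Hypothetical identifier, used only to show the URL encoding.
req = delete_document_request("file:report.pdf")
print(req.get_method(), req.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the call against a running admin backend.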

## Configuration overview

All settings are powered by `pydantic-settings`, so you can use environment variables or instantiate classes manually:

- `S3_*` (`S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, `S3_BUCKET`) – configure storage for raw uploads.
- `DOCUMENT_EXTRACTOR_HOST` – base URL of the extractor service.
- `RAG_API_HOST` – base URL of the rag-core API.
- `CHUNKER_CLASS_TYPE_CHUNKER_TYPE` – choose `recursive` (default) or `semantic` chunking.
- `CHUNKER_*` (`CHUNKER_MAX_SIZE`, `CHUNKER_OVERLAP`, `CHUNKER_BREAKPOINT_THRESHOLD_TYPE`, …) – fine-tune chunking behaviour.
- `SUMMARIZER_MAXIMUM_INPUT_SIZE`, `SUMMARIZER_MAXIMUM_CONCURRENCY`, `SUMMARIZER_MAX_RETRIES`, etc. – tune summariser limits and retry behaviour.
- `SOURCE_UPLOADER_TIMEOUT` – adjust how long non-file source ingestions wait before timing out.
- `USECASE_KEYVALUE_HOST` / `USECASE_KEYVALUE_PORT` – configure the KeyDB/Redis instance that persists document status.

The Helm chart forwards these values through `adminBackend.envs.*`, keeping deployments declarative. Local development can rely on `.env` as described in the repository root README.
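
To make the environment-variable mapping above concrete, here is a minimal standard-library sketch. It deliberately swaps `pydantic-settings` for a plain dataclass so it runs without extra dependencies; the class name `ChunkerEnv` and the numeric fallbacks are assumptions for illustration and do not reflect the library's actual settings classes or defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkerEnv:
    """Hypothetical stand-in for the library's pydantic-settings classes."""

    chunker_type: str
    max_size: int
    overlap: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ChunkerEnv":
        # `recursive` is the documented default chunker type.
        chunker_type = env.get("CHUNKER_CLASS_TYPE_CHUNKER_TYPE", "recursive")
        if chunker_type not in ("recursive", "semantic"):
            raise ValueError(f"unsupported chunker type: {chunker_type!r}")
        return cls(
            chunker_type=chunker_type,
            # Numeric fallbacks are placeholders, not the library defaults.
            max_size=int(env.get("CHUNKER_MAX_SIZE", "1000")),
            overlap=int(env.get("CHUNKER_OVERLAP", "100")),
        )


# Pass a dict instead of os.environ to keep the example deterministic.
settings = ChunkerEnv.from_env(
    {"CHUNKER_CLASS_TYPE_CHUNKER_TYPE": "semantic", "CHUNKER_MAX_SIZE": "512"}
)
print(settings)
```

Accepting any mapping rather than reading `os.environ` directly keeps the parsing testable, which is roughly the convenience the real settings classes provide when instantiated manually.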

## Typical usage

```python
from admin_api_lib.main import app as perfect_admin_app
```

The admin frontend (`services/frontend` → Admin app) and automation scripts talk to these endpoints to manage the corpus. Downstream, `rag-core-api` receives the processed information pieces and stores them in the vector database.

## Extending the library

1. Implement a new interface (e.g., `Chunker`, `Summarizer`, `FileService`).
2. Register it in `dependency_container.py` or override via dependency-injector in your service.
3. Update or add API endpoints if you expose new capabilities.
4. Cover the new behaviour with pytest-based unit tests under `libs/admin-api-lib/tests`.

Because components depend on interfaces defined here, downstream services can swap behavior without modifying the public API surface.
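
The interface-swapping idea can be sketched as follows. The real `Chunker` signature is not shown in this README, so the `chunk` method, the `FixedSizeChunker` class, and the `ingest` helper are all hypothetical; registration in the dependency-injector container is left out.

```python
from abc import ABC, abstractmethod


class Chunker(ABC):
    """Assumed shape of a chunker interface; the library's real signature may differ."""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into chunks."""


class FixedSizeChunker(Chunker):
    """Hypothetical custom implementation: fixed-size windows with overlap."""

    def __init__(self, max_size: int, overlap: int = 0) -> None:
        if max_size <= 0 or not 0 <= overlap < max_size:
            raise ValueError("require max_size > 0 and 0 <= overlap < max_size")
        self._max_size = max_size
        self._step = max_size - overlap  # distance between window starts

    def chunk(self, text: str) -> list[str]:
        return [text[i : i + self._max_size] for i in range(0, len(text), self._step)]


def ingest(text: str, chunker: Chunker) -> list[str]:
    """The caller depends only on the interface, so implementations can be swapped."""
    return chunker.chunk(text)


chunks = ingest("abcdefghij", FixedSizeChunker(max_size=4, overlap=1))
print(chunks)
```

Any other `Chunker` subclass could be passed to `ingest` without touching the calling code, which is the property step 2 relies on.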

## Contributing

Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).

## License

Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.

libs/admin-api-lib/poetry.lock

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

libs/admin-api-lib/pyproject.toml

Lines changed: 11 additions & 2 deletions
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry]
 name = "admin-api-lib"
-version = "1.0.1"
+version = "v3.2.1"
 description = "The admin backend is responsible for the document management. This includes deletion, upload and returning the source document."
-authors = ["STACKIT Data and AI Consulting <[email protected]>"]
+authors = [
+    "STACKIT GmbH & Co. KG <[email protected]>",
+]
+maintainers = [
+    "Andreas Klos <[email protected]>",
+]
 packages = [{ include = "admin_api_lib", from = "src" }]
+readme = "README.md"
+license = "Apache-2.0"
+repository = "https://github.com/stackitcloud/rag-template"
+homepage = "https://pypi.org/project/admin-api-lib"
 
 [tool.flake8]
 exclude= [".eggs", "./libs/*", "./src/admin_api_lib/models/*", "./src/admin_api_lib/rag_backend_client/*", "./src/admin_api_lib/extractor_api_client/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py"]

libs/extractor-api-lib/README.md

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
# extractor-api-lib

Content ingestion layer for the STACKIT RAG template. This library exposes a FastAPI extraction service that ingests raw documents (files or remote sources), extracts the information and converts it to internal representations, and hands the output to [`admin-api-lib`](../admin-api-lib/).

## Responsibilities

- Receive binary uploads and remote source descriptors from the admin backend.
- Route each request through the appropriate extractor (file, sitemap, Confluence, etc.).
- Convert extracted fragments into the shared `InformationPiece` schema expected by downstream services.

## Feature highlights

- **Broad format coverage** – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
- **Consistent output schema** – Information pieces are returned in a unified structure with content type (`TEXT`, `TABLE`, `IMAGE`) and metadata.
- **Swappable extractors** – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
- **Production-grade plumbing** – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.

## Installation

```bash
pip install extractor-api-lib
```

Python 3.13 is required. OCR and computer-vision features expect system packages such as `ffmpeg`, `poppler-utils`, and `tesseract` (see `services/document-extractor/README.md` for the full list).

## Module tour

- `dependency_container.py` – Central dependency-injector wiring. Override providers here to plug in custom extractors, endpoints, etc.
- `api_endpoints/` & `impl/api_endpoints/` – Thin FastAPI endpoint abstractions and implementations for file and source extractors (such as Confluence and sitemaps).
- `apis/` – Extractor API abstractions and implementations.
- `extractors/` & `impl/extractors/` – Format-specific logic (PDF, DOCX, PPTX, XML, EPUB, Confluence, sitemap) packaged behind the `InformationExtractor`/`InformationFileExtractor` interfaces.
- `mapper/` & `impl/mapper/` – Abstractions and implementations that map between LangChain documents and the internal and external information piece representations.
- `file_services/` – Default S3-compatible storage adapter; replace it if you store files elsewhere.
- `impl/settings/` – Configuration settings for dependency injection container components.
- `table_converter/` & `impl/table_converter/` – Abstractions and implementations to convert `pandas.DataFrame` to markdown and vice versa.
- `impl/types/` – Enums for content, extractor, and file types.
- `impl/utils/` – Helper functions for hashed datetimes, sitemap crawling, header injection, and custom metadata parsing.

## Endpoints provided

- `POST /extract_from_file` – Downloads the file from S3, extracts its contents, and returns normalized `InformationPiece` records.
- `POST /extract_from_source` – Pulls from remote sources (Confluence, sitemap) using credentials and further optional kwargs.

Both endpoints stream their results back to `admin-api-lib`, which takes care of enrichment and persistence.

## How the file extraction endpoint works

1. Download the file from S3
2. Choose a suitable file extractor based on the filename ending
3. Extract the content from the file
4. Map the internal representation to the external schema
5. Return the final output
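
The file extraction steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the extractor functions, the `FILE_EXTRACTORS` registry, and the output field names `type` and `page_content` are hypothetical, and the S3 download is replaced by passing the bytes in directly.

```python
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical extractors: each takes raw bytes and returns text pieces.
Extractor = Callable[[bytes], list[str]]


def extract_pdf(data: bytes) -> list[str]:
    return [f"pdf:{len(data)} bytes"]  # placeholder for real PDF parsing


def extract_docx(data: bytes) -> list[str]:
    return [f"docx:{len(data)} bytes"]  # placeholder for real DOCX parsing


FILE_EXTRACTORS: dict[str, Extractor] = {".pdf": extract_pdf, ".docx": extract_docx}


def choose_file_extractor(filename: str) -> Extractor:
    """Step 2: pick an extractor from the filename ending (case-insensitive)."""
    suffix = PurePosixPath(filename).suffix.lower()
    try:
        return FILE_EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix or filename!r}") from None


def extract_from_file(filename: str, data: bytes) -> list[dict[str, str]]:
    """Steps 2-5; the S3 download (step 1) is replaced by the `data` argument."""
    pieces = choose_file_extractor(filename)(data)  # steps 2 and 3
    # Steps 4 and 5: map to an assumed external shape and return it.
    return [{"type": "TEXT", "page_content": piece} for piece in pieces]


result = extract_from_file("reports/Q3.PDF", b"%PDF-1.7")
print(result)
```

Keying the registry on the lower-cased suffix means `Q3.PDF` and `q3.pdf` resolve to the same extractor, and an unknown ending fails early with a clear error.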

## How the source extraction endpoint works

1. Choose a suitable source extractor based on the source type
2. Pull the source content using the provided credentials and parameters
3. Extract the content from the source
4. Map the internal representation to the external schema
5. Return the final output
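
For the source extraction steps above, selection is by source type rather than by filename. The sketch below assumes a dictionary registry keyed by source type; the fake extractors, the parameter keys `url` and `space_key`, and the output field names are hypothetical, and no network access takes place.

```python
from typing import Callable

# Hypothetical source extractors keyed by source type. Each receives the
# request's key/value parameters (credentials, URLs, ...) and returns raw texts.
SourceExtractor = Callable[[dict[str, str]], list[str]]


def fake_sitemap(params: dict[str, str]) -> list[str]:
    return [f"page listed in {params['url']}"]  # stands in for a real crawl


def fake_confluence(params: dict[str, str]) -> list[str]:
    return [f"page from space {params['space_key']}"]  # stands in for an API call


SOURCE_EXTRACTORS: dict[str, SourceExtractor] = {
    "sitemap": fake_sitemap,
    "confluence": fake_confluence,
}


def extract_from_source(source_type: str, params: dict[str, str]) -> list[dict[str, str]]:
    extractor = SOURCE_EXTRACTORS.get(source_type.lower())  # step 1
    if extractor is None:
        raise ValueError(f"unknown source type: {source_type!r}")
    raw_texts = extractor(params)  # steps 2 and 3
    # Steps 4 and 5: map to an assumed external shape and return it.
    return [{"type": "TEXT", "page_content": text} for text in raw_texts]


pieces = extract_from_source("sitemap", {"url": "https://example.com/sitemap.xml"})
print(pieces)
```

Because all source-specific values travel in `params`, adding a new source type only requires a new registry entry.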

## Configuration overview

Two `pydantic-settings` models ship with this package:

- **S3 storage** (`S3Settings`) – configure the built-in file service with `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, and `S3_BUCKET`.
- **PDF extraction** (`PDFExtractorSettings`) – adjust footer trimming or diagram export via `PDF_EXTRACTOR_FOOTER_HEIGHT` and `PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME`.

Other extractors accept their parameters at runtime through the request payload (`ExtractionParameters`). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls `/extract_from_source`. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.

The Helm chart exposes the environment variables mentioned above under `documentExtractor.envs.*` so production deployments remain declarative.

## Typical usage

```python
from extractor_api_lib.main import app as perfect_extractor_app
```

`admin-api-lib` calls `/extract_from_file` and `/extract_from_source` to populate the ingestion pipeline.

## Extending the library

1. Implement `InformationFileExtractor` or `InformationExtractor` for your new format/source.
2. Register it in `dependency_container.py` (append to `file_extractors` list or `source_extractors` dict).
3. Update mapper or metadata handling if additional fields are required.
4. Add unit tests under `libs/extractor-api-lib/tests` using fixtures and fake storage providers.

## Contributing

Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).

## License

Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.

libs/extractor-api-lib/poetry.lock

Lines changed: 6 additions & 10 deletions
Some generated files are not rendered by default.

libs/extractor-api-lib/pyproject.toml

Lines changed: 13 additions & 4 deletions
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry]
 name = "extractor_api_lib"
-version = "1.0.1"
+version = "v3.2.1"
 description = "Extracts the content of documents, websites, etc and maps it to a common format."
-authors = ["STACKIT Data and AI Consulting <[email protected]>"]
+authors = [
+    "STACKIT GmbH & Co. KG <[email protected]>",
+]
+maintainers = [
+    "Andreas Klos <[email protected]>",
+]
 packages = [{ include = "extractor_api_lib", from = "src" }]
+readme = "README.md"
+license = "Apache-2.0"
+repository = "https://github.com/stackitcloud/rag-template"
+homepage = "https://pypi.org/project/extractor-api-lib"
 
 [[tool.poetry.source]]
 name = "pytorch_cpu"
@@ -70,7 +79,7 @@ max-line-length = 120
 python = "^3.13"
 wheel = "^0.45.1"
 botocore = "^1.38.10"
-fasttext = {git = "https://github.com/cfculhane/fastText", rev = "4a4451337ae6b476b9c584b97776c8c3eb4b27c5"}
+fasttext = "^0.9.3"
 pytesseract = "^0.3.10"
 fastapi = "^0.118.0"
 uvicorn = "^0.37.0"
@@ -136,7 +145,7 @@ black = "^25.1.0"
 httpx = "^0.28.1"
 
 [tool.pytest.ini_options]
-log_cli = 1
+log_cli = true
 log_cli_level = "DEBUG"
 pythonpath = "src"
 testpaths = "src/tests"
