diff --git a/libs/admin-api-lib/README.md b/libs/admin-api-lib/README.md
Lines changed: 96 additions & 0 deletions
diff --git a/libs/admin-api-lib/poetry.lock b/libs/admin-api-lib/poetry.lock
Lines changed: 2 additions & 2 deletions
diff --git a/libs/admin-api-lib/pyproject.toml b/libs/admin-api-lib/pyproject.toml
Lines changed: 11 additions & 2 deletions
diff --git a/libs/extractor-api-lib/README.md b/libs/extractor-api-lib/README.md
Lines changed: 94 additions & 0 deletions
diff --git a/libs/extractor-api-lib/poetry.lock b/libs/extractor-api-lib/poetry.lock
Lines changed: 6 additions & 10 deletions
diff --git a/libs/extractor-api-lib/pyproject.toml b/libs/extractor-api-lib/pyproject.toml
Lines changed: 13 additions & 4 deletions
diff --git a/libs/rag-core-lib/src/rag_core_lib/secret_provider/__init__.py renamed to libs/extractor-api-lib/src/extractor_api_lib/impl/file_services/__init__.py
@@ -0,0 +1,96 @@
+# admin-api-lib
+
+Document lifecycle orchestration for the STACKIT RAG template. This library exposes a FastAPI-compatible admin surface that receives raw user content, coordinates extraction, summarisation, chunking, and storage, and finally hands normalized information pieces to the core RAG API.
+
+It powers the [`services/admin-backend`](https://github.com/stackitcloud/rag-template/tree/main/services/admin-backend) deployment and is the primary integration point for operators managing their document corpus.
+
+## Responsibilities
+
+1. **Ingestion** – Accept files or external sources from the admin UI or API clients.
+2. **Extraction** – Call `extractor-api-lib` to obtain normalized information pieces.
+3. **Enhancement** – Summarize and enrich content using LLMs and tracing hooks from `rag-core-lib`.
+4. **Chunking** – Split content via recursive or semantic strategies before vectorization.
+5. **Persistence** – Store raw assets in S3-compatible storage and push processed chunks to `rag-core-api`.
+6. **Status tracking** – Keep track of upload progress and expose document status endpoints backed by KeyDB/Redis.
+
+## Feature highlights
+
+- Ready-to-wire dependency-injector container with sensible defaults for S3 storage, KeyDB status tracking, and background tasks.
+- Pluggable chunkers (`recursive` vs `semantic`) and summariser implementations with shared retry/backoff controls.
+- Rich Pydantic request/response models covering uploads, non-file sources, and document status queries.
+- Thin endpoint implementations that can be swapped or extended while keeping the public API stable.
+- Structured tracing (Langfuse) and logging that mirror the behaviour of the chat backend.
+
+## Installation
+
+```bash
+pip install admin-api-lib
+```
+
+Requires Python 3.13 and `rag-core-lib`.
+
+## Module tour
+
+- `dependency_container.py` – Configures and wires dependency-injection providers. Override registrations here to customise behaviour.
+- `api_endpoints/` & `impl/api_endpoints/` – Endpoints + abstractions for file uploads, source uploads, deletions, document status, and reference retrieval.
+- `apis/` – Admin API abstractions and implementations.
+- `chunker/` & `impl/chunker/` – Abstractions + default text/semantic chunkers and a chunker type selection class.
+- `extractor_api_client/` & `rag_backend_client/` – Generated OpenAPI clients to talk to the extractor and rag-core API services.
+- `file_services/` & `impl/file_services/` – Abstract and default S3 interface.
+- `summarizer/` & `impl/summarizer/` – Interfaces and LangChain-based summariser that leverage shared retry logic.
+- `information_enhancer/` & `impl/information_enhancer/` – Abstractions + page and summary enhancers. A general enhancer acts as the central entry point for the individual enhancers.
+- `impl/key_db/` – KeyDB/Redis client implementation for document status tracking.
+- `impl/mapper/` – Mapper between extractor documents and LangChain documents.
+- `impl/settings/` – Configuration settings for dependency injection container components.
+- `prompt_templates/` – Default summarisation prompt shipped with the template.
+- `utils/` – Utility functions and classes.
+
+## Endpoints provided
+
+- `POST /upload_file` – Uploads user-selected files.
+- `POST /upload_source` – Uploads user-selected sources.
+- `DELETE /documents/{identification}` – Deletes a document from the system.
+- `GET /document_reference/{identification}` – Retrieves a document reference.
+- `GET /all_documents_status` – Retrieves the status of all documents.
+
+Refer to [`libs/README.md`](../README.md#2-admin-api-lib) for in-depth API documentation.
+
+## Configuration overview
+
+All settings are powered by `pydantic-settings`, so you can use environment variables or instantiate classes manually:
+
+- `S3_*` (`S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, `S3_BUCKET`) – configure storage for raw uploads.
+- `DOCUMENT_EXTRACTOR_HOST` – base URL of the extractor service.
+- `RAG_API_HOST` – base URL of the rag-core API.
+- `CHUNKER_CLASS_TYPE_CHUNKER_TYPE` – choose `recursive` (default) or `semantic` chunking.
+- `CHUNKER_*` (`CHUNKER_MAX_SIZE`, `CHUNKER_OVERLAP`, `CHUNKER_BREAKPOINT_THRESHOLD_TYPE`, …) – fine-tune chunking behaviour.
+- `SUMMARIZER_MAXIMUM_INPUT_SIZE`, `SUMMARIZER_MAXIMUM_CONCURRENCY`, `SUMMARIZER_MAX_RETRIES`, etc. – tune summariser limits and retry behaviour.
+- `SOURCE_UPLOADER_TIMEOUT` – adjust how long non-file source ingestions wait before timing out.
+- `USECASE_KEYVALUE_HOST` / `USECASE_KEYVALUE_PORT` – configure the KeyDB/Redis instance that persists document status.
+
+The Helm chart forwards these values through `adminBackend.envs.*`, keeping deployments declarative. Local development can rely on `.env` as described in the repository root README.
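
How prefixed environment variables map onto typed settings fields can be illustrated with a standard-library sketch. `ChunkerEnvSketch` and its default values are hypothetical stand-ins; the library's real settings classes are built on `pydantic-settings` and may use different field names, defaults, and validation.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkerEnvSketch:
    """Hypothetical stand-in for a settings class; not the library's own model."""

    max_size: int
    overlap: int

    @classmethod
    def from_env(cls, env=None):
        # Fall back to the process environment when no mapping is passed in.
        env = os.environ if env is None else env
        # The defaults below are illustrative assumptions, not the library's defaults.
        return cls(
            max_size=int(env.get("CHUNKER_MAX_SIZE", "1000")),
            overlap=int(env.get("CHUNKER_OVERLAP", "100")),
        )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = ChunkerEnvSketch.from_env({"CHUNKER_MAX_SIZE": "500", "CHUNKER_OVERLAP": "50"})
```

Reading from an injectable mapping rather than directly from `os.environ` makes this kind of configuration easy to exercise in tests.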
+
+## Typical usage
+
+```python
+from admin_api_lib.main import app as perfect_admin_app
+```
+
+The admin frontend (`services/frontend` → Admin app) and automation scripts talk to these endpoints to manage the corpus. Downstream, `rag-core-api` receives the processed information pieces and stores them in the vector database.
+
+## Extending the library
+
+1. Provide a new implementation of an existing interface (e.g., `Chunker`, `Summarizer`, `FileService`).
+2. Register it in `dependency_container.py` or override via dependency-injector in your service.
+3. Update or add API endpoints if you expose new capabilities.
+4. Cover the new behaviour with pytest-based unit tests under `libs/admin-api-lib/tests`.
+
+Because components depend on interfaces defined here, downstream services can swap behavior without modifying the public API surface.
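
The interface-based swap described above might look roughly like the following. This is a minimal sketch under stated assumptions: the `Chunker` base class shown here, its `chunk` method, and `FixedWindowChunker` are hypothetical and only mirror the idea; the library's actual interface signatures may differ.

```python
from abc import ABC, abstractmethod


class Chunker(ABC):
    """Hypothetical interface; the library's real `Chunker` signature may differ."""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into chunks."""


class FixedWindowChunker(Chunker):
    """Illustrative implementation: fixed-size character windows with overlap."""

    def __init__(self, max_size: int, overlap: int) -> None:
        if not 0 <= overlap < max_size:
            raise ValueError("overlap must be non-negative and smaller than max_size")
        self._max_size = max_size
        # Each window starts this many characters after the previous one.
        self._step = max_size - overlap

    def chunk(self, text: str) -> list[str]:
        return [text[i:i + self._max_size] for i in range(0, len(text), self._step)]


def build_chunks(chunker: Chunker, text: str) -> list[str]:
    # The caller depends only on the interface, so implementations are swappable.
    return chunker.chunk(text)


chunks = build_chunks(FixedWindowChunker(max_size=4, overlap=1), "abcdefghij")
```

In the real library the replacement class would be registered through the dependency-injector container instead of being passed in by hand.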
+
+## Contributing
+
+Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).
+
+## License
+
+Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry]
 name = "admin-api-lib"
-version = "1.0.1"
+version = "v3.2.1"
 description = "The admin backend is responsible for the document management. This includes deletion, upload and returning the source document."
-authors = ["STACKIT Data and AI Consulting <[email protected]>"]
+authors = [
+    "STACKIT GmbH & Co. KG <[email protected]>",
+]
+maintainers = [
+    "Andreas Klos <[email protected]>",
+]
 packages = [{ include = "admin_api_lib", from = "src" }]
+readme = "README.md"
+license = "Apache-2.0"
+repository = "https://github.com/stackitcloud/rag-template"
+homepage = "https://pypi.org/project/admin-api-lib"
 
 [tool.flake8]
 exclude= [".eggs", "./libs/*", "./src/admin_api_lib/models/*", "./src/admin_api_lib/rag_backend_client/*", "./src/admin_api_lib/extractor_api_client/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py"]
 
@@ -0,0 +1,94 @@
+# extractor-api-lib
+
+Content ingestion layer for the STACKIT RAG template. This library exposes a FastAPI extraction service that ingests raw documents (files or remote sources), extracts the information, converts it to internal representations, and hands the output to [`admin-api-lib`](../admin-api-lib/).
+
+## Responsibilities
+
+- Receive binary uploads and remote source descriptors from the admin backend.
+- Route each request through the appropriate extractor (file, sitemap, Confluence, etc.).
+- Convert extracted fragments into the shared `InformationPiece` schema expected by downstream services.
+
+## Feature highlights
+
+- **Broad format coverage** – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
+- **Consistent output schema** – Information pieces are returned in a unified structure with content type (`TEXT`, `TABLE`, `IMAGE`) and metadata.
+- **Swappable extractors** – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
+- **Production-grade plumbing** – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.
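
As a rough picture of the unified output structure, the sketch below models an information piece with a content type and metadata. Only the three content type names come from this README; the class name `InformationPieceSketch` and the field names `page_content`, `type`, and `metadata` are assumptions for illustration and may not match the real `InformationPiece` schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class ContentType(str, Enum):
    """Content types named in this README."""

    TEXT = "TEXT"
    TABLE = "TABLE"
    IMAGE = "IMAGE"


@dataclass
class InformationPieceSketch:
    """Hypothetical shape; field names are assumptions, not the library schema."""

    page_content: str
    type: ContentType
    metadata: dict[str, str] = field(default_factory=dict)


pieces = [
    InformationPieceSketch("Quarterly revenue grew.", ContentType.TEXT, {"page": "1"}),
    InformationPieceSketch("| q | revenue |", ContentType.TABLE, {"page": "2"}),
]

# Downstream consumers can filter on the content type without knowing the source format.
tables = [piece for piece in pieces if piece.type is ContentType.TABLE]
```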
+
+## Installation
+
+```bash
+pip install extractor-api-lib
+```
+
+Python 3.13 is required. OCR and computer-vision features expect system packages such as `ffmpeg`, `poppler-utils`, and `tesseract` (see `services/document-extractor/README.md` for the full list).
+
+## Module tour
+
+- `dependency_container.py` – Central dependency-injector wiring. Override providers here to plug in custom extractors, endpoints etc.
+- `api_endpoints/` & `impl/api_endpoints/` – Thin FastAPI endpoint abstractions and implementations for file and source (such as Confluence and sitemap) extractors.
+- `apis/` – Extractor API abstractions and implementations.
+- `extractors/` & `impl/extractors/` – Format-specific logic (PDF, DOCX, PPTX, XML, EPUB, Confluence, sitemap) packaged behind the `InformationExtractor`/`InformationFileExtractor` interfaces.
+- `mapper/` & `impl/mapper/` – Abstractions and implementations that map between LangChain documents and the internal and external information piece representations.
+- `file_services/` – Default S3-compatible storage adapter; replace it if you store files elsewhere.
+- `impl/settings/` – Configuration settings for dependency injection container components.
+- `table_converter/` & `impl/table_converter/` – Abstractions and implementations to convert `pandas.DataFrame` to markdown and vice versa.
+- `impl/types/` – Enums for content, extractor, and file types.
+- `impl/utils/` – Helper functions for hashed datetimes, sitemap crawling, header injection, and custom metadata parsing.
+
+## Endpoints provided
+
+- `POST /extract_from_file` – Downloads the file from S3, extracts its contents, and returns normalized `InformationPiece` records.
+- `POST /extract_from_source` – Pulls from remote sources (Confluence, sitemap) using credentials and further optional kwargs.
+
+Both endpoints stream their results back to `admin-api-lib`, which takes care of enrichment and persistence.
+
+## How the file extraction endpoint works
+
+1. Download the file from S3
+2. Choose a suitable file extractor based on the file extension
+3. Extract the content from the file
+4. Map the internal representation to the external schema
+5. Return the final output
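
Steps 2 and 3 above can be sketched as a scan over registered extractors. This is an illustration only: `TextFileExtractorSketch`, its `compatible_suffixes` attribute, and `choose_file_extractor` are hypothetical names, and the S3 download and schema mapping steps are left out.

```python
from pathlib import Path


class TextFileExtractorSketch:
    """Hypothetical extractor; the real `InformationFileExtractor` API may differ."""

    compatible_suffixes = {".txt", ".md"}

    def extract(self, content: bytes) -> list[str]:
        # Return one entry per non-empty line of the decoded file content.
        return [line for line in content.decode("utf-8").splitlines() if line.strip()]


def choose_file_extractor(filename, extractors):
    """Return the first extractor that declares support for the file extension."""
    suffix = Path(filename).suffix.lower()
    for extractor in extractors:
        if suffix in extractor.compatible_suffixes:
            return extractor
    raise ValueError(f"No extractor registered for '{suffix}' files")


extractors = [TextFileExtractorSketch()]
selected = choose_file_extractor("Notes.MD", extractors)
lines = selected.extract(b"one\n\ntwo\n")
```

Because the first match wins, the order of the extractor list matters when two extractors claim the same extension.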
+
+## How the source extraction endpoint works
+
+1. Choose a suitable source extractor based on the source type
+2. Pull the source content using the provided credentials and parameters
+3. Extract the content from the source
+4. Map the internal representation to the external schema
+5. Return the final output
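
The source-type dispatch in step 1 can be sketched as a dictionary lookup. Everything named here is hypothetical: the extractor functions, the `SOURCE_EXTRACTORS` mapping, and the parameter keys (`web_path`, `space_key`, `url`) are placeholders rather than the library's real identifiers, and no network access takes place.

```python
from typing import Callable


def extract_sitemap(parameters: dict[str, str]) -> list[str]:
    # Placeholder: a real extractor would crawl the sitemap here.
    return [f"page from {parameters['web_path']}"]


def extract_confluence(parameters: dict[str, str]) -> list[str]:
    # Placeholder: a real extractor would call the Confluence API here.
    return [f"space {parameters['space_key']} at {parameters['url']}"]


# Hypothetical registry keyed by source type.
SOURCE_EXTRACTORS: dict[str, Callable[[dict[str, str]], list[str]]] = {
    "sitemap": extract_sitemap,
    "confluence": extract_confluence,
}


def extract_from_source(source_type: str, parameters: dict[str, str]) -> list[str]:
    """Look up the extractor for the source type and run it with the request parameters."""
    try:
        extractor = SOURCE_EXTRACTORS[source_type]
    except KeyError:
        raise ValueError(f"Unsupported source type: {source_type}") from None
    return extractor(parameters)


result = extract_from_source("sitemap", {"web_path": "https://example.com/sitemap.xml"})
```

Adding a new source then amounts to registering one more entry in the mapping, which matches the stateless, request-driven design described in this README.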
+
+## Configuration overview
+
+Two `pydantic-settings` models ship with this package:
+
+- **S3 storage** (`S3Settings`) – configure the built-in file service with `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, and `S3_BUCKET`.
+- **PDF extraction** (`PDFExtractorSettings`) – adjust footer trimming or diagram export via `PDF_EXTRACTOR_FOOTER_HEIGHT` and `PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME`.
+
+Other extractors accept their parameters at runtime through the request payload (`ExtractionParameters`). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls `/extract_from_source`. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.
+
+The Helm chart exposes the environment variables mentioned above under `documentExtractor.envs.*` so production deployments remain declarative.
+
+## Typical usage
+
+```python
+from extractor_api_lib.main import app as perfect_extractor_app
+```
+
+`admin-api-lib` calls `/extract_from_file` and `/extract_from_source` to populate the ingestion pipeline.
+
+## Extending the library
+
+1. Implement `InformationFileExtractor` or `InformationExtractor` for your new format/source.
+2. Register it in `dependency_container.py` (append it to the `file_extractors` list or add it to the `source_extractors` dict).
+3. Update mapper or metadata handling if additional fields are required.
+4. Add unit tests under `libs/extractor-api-lib/tests` using fixtures and fake storage providers.
+
+## Contributing
+
+Ensure new endpoints or adapters remain thin and defer to [`rag-core-lib`](../rag-core-lib/) for shared logic. Run `poetry run pytest` and the configured linters before opening a PR. For further instructions see the [Contributing Guide](https://github.com/stackitcloud/rag-template/blob/main/CONTRIBUTING.md).
+
+## License
+
+Licensed under the project license. See the root [`LICENSE`](https://github.com/stackitcloud/rag-template/blob/main/LICENSE) file.
@@ -4,10 +4,19 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry]
 name = "extractor_api_lib"
-version = "1.0.1"
+version = "v3.2.1"
 description = "Extracts the content of documents, websites, etc and maps it to a common format."
-authors = ["STACKIT Data and AI Consulting <[email protected]>"]
+authors = [
+    "STACKIT GmbH & Co. KG <[email protected]>",
+]
+maintainers = [
+    "Andreas Klos <[email protected]>",
+]
 packages = [{ include = "extractor_api_lib", from = "src" }]
+readme = "README.md"
+license = "Apache-2.0"
+repository = "https://github.com/stackitcloud/rag-template"
+homepage = "https://pypi.org/project/extractor-api-lib"
 
 [[tool.poetry.source]]
 name = "pytorch_cpu"
@@ -70,7 +79,7 @@ max-line-length = 120
 python = "^3.13"
 wheel = "^0.45.1"
 botocore = "^1.38.10"
-fasttext = {git = "https://github.com/cfculhane/fastText", rev = "4a4451337ae6b476b9c584b97776c8c3eb4b27c5"}
+fasttext = "^0.9.3"
 pytesseract = "^0.3.10"
 fastapi = "^0.118.0"
 uvicorn = "^0.37.0"
@@ -136,7 +145,7 @@ black = "^25.1.0"
 httpx = "^0.28.1"
 
 [tool.pytest.ini_options]
-log_cli = 1
+log_cli = true
 log_cli_level = "DEBUG"
 pythonpath = "src"
 testpaths = "src/tests"