diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..f79c27c
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,33 @@

name: Deploy Documentation

on:
  push:
    branches:
      - main
    paths:
      - 'backend/docs/**'
      - 'backend/mkdocs.yml'
      - '.github/workflows/docs.yml'

jobs:
  build-deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.10'

      - name: Install Dependencies
        run: |
          cd backend
          pip install mkdocs mkdocs-material "mkdocstrings[python]" pymdown-extensions

      - name: Build and Deploy
        run: |
          cd backend
          mkdocs gh-deploy --force
\ No newline at end of file

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 9a20846..6f9058c 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -48,4 +48,4 @@ jobs:

  - name: Run unit tests with pytest
    run: |
-      cd backend && pytest --color=yes tests
\ No newline at end of file
+      cd backend && pytest --color=yes tests

diff --git a/README.md b/README.md
index a2e1fdc..3fc514e 100644
--- a/README.md
+++ b/README.md
@@ -10,10 +10,8 @@ Our goal is to provide a familiar, spreadsheet-like interface for business users

For a limited demo, check out the [Knowledge Table Demo](https://knowledge-table-demo.whyhow.ai/).

-https://github.com/user-attachments/assets/8e0e5cc6-6468-4bb5-888c-6b552e15b58a

To learn more about WhyHow and our projects, visit our [website](https://whyhow.ai/).

## Table of Contents

@@ -102,11 +100,13 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be

4. **Install the dependencies:**

   For basic installation:

   ```sh
   pip install .
   ```

   For installation with development tools:

   ```sh
   pip install .[dev]
   ```

@@ -180,6 +180,7 @@ To set up the project for development:

   black .
   isort .
   ```

---

## Features

@@ -189,12 +190,12 @@ To set up the project for development:

- **Chunk Linking** - Link raw source text chunks to the answers for traceability and provenance.
- **Extract with natural language** - Use natural language queries to extract structured data from unstructured documents.
- **Customizable extraction rules** - Define rules to guide the extraction process and ensure data quality.
-- **Custom formatting** - Control the output format of your extracted data.
+- **Custom formatting** - Control the output format of your extracted data. Knowledge Table currently supports text, list of text, number, list of numbers, and boolean formats.
- **Filtering** - Filter documents based on metadata or extracted data.
- **Exporting as CSV or Triples** - Download extracted data as CSV or graph triples.
- **Chained extraction** - Reference previous columns in your extraction questions using @ i.e. "What are the treatments for `@disease`?".
- **Split Cell Into Rows** - Turn outputs within a single cell from List of Numbers or List of Values and split it into individual rows to do more complex Chained Extraction

---

## Concepts

@@ -211,6 +212,15 @@ Each **document** is an unstructured data source (e.g., a contract, article, or

A **Question** is the core mechanism for guiding extraction. It defines what data you want to extract from a document.

### Rule

A **Rule** guides the extraction from the LLM. You can add rules on a column level or on a global level.
Currently, the following rule types are supported:

- **May Return** rules give the LLM examples of answers that can be used to guide the extraction. This is a great way to give the LLM more guidance on the kinds of things it should keep an eye out for.
- **Must Return** rules give the LLM an exhaustive list of answers that are allowed to be returned. This is a great way to give the LLM guardrails and ensure that only certain terms are returned.
- **Allowed # of Responses** rules provide guardrails for cases where there may be a range of potential 'grey-area' answers, restricting the output to a set number of the top responses.
- **Resolve Entity** rules allow you to resolve values to a specific entity. This is useful for ensuring output conforms to a specific entity type. For example, you can write rules that ensure "blackrock", "Blackrock, Inc.", and "Blackrock Corporation" all resolve to the same entity - "Blackrock".

---

## Practical Usage

@@ -225,12 +235,25 @@ Once you've set up your questions, rules, and documents, the Knowledge Table pro

- **Metadata Generation**: Classify and tag information about your documents and files by running targeted questions against the files (i.e. "What project is this email thread about?")

---

## Export to Triples

To create the Schema for the Triples, we use an LLM to consider the Entity Type of the Column, the question that was used to generate the cells, and the values themselves, to create the schema and the triples. The document name is inserted as a node property. The vector chunk ids are also included in the JSON file of the triples, and tied to the triples created.

---

## Rules

There are now 3 types of [Rules](https://medium.com/enterprise-rag/rules-extraction-guardrails-knowledge-table-studio-e84999ade353) you can incorporate within your processes:

- **Entity Resolution Rules**: Resolving discrepancies between Entities or imposing a common terminology on top of Entities

- **Entity Extraction Rules**: Imposing Guardrails and Context for the Entities that should be detected and returned across Documents

- **Entity Relationship Rules**: Imposing Guardrails on the types of Patterns that should be returned on the Relationships between the extracted Entities

---

## Extending the Project

Knowledge Table is built to be flexible and customizable, allowing you to extend it to fit your workflow:

@@ -263,15 +286,15 @@ To use the Unstructured API integration:

When the `UNSTRUCTURED_API_KEY` is set, Knowledge Table will automatically use the Unstructured API for document processing. If the key is not set or if there's an issue with the Unstructured API, the system will fall back to the default document loaders.

-Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
+**Note:** Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.

---
## Roadmap

-- [ ] Expansion of Rules System
-  - [ ] Upload Extraction Rules via CSV
-  - [ ] Entity Resolution Rules
-  - [ ] Rules Dashboard
+- [x] Expansion of Rules System
+  - [x] Upload Extraction Rules via CSV
+  - [x] Entity Resolution Rules
+  - [x] Rules Dashboard
+  - [x] Rules Log
- [ ] Support for more LLMs
  - [ ] Azure OpenAI
  - [ ] Llama3

diff --git a/backend/CHANGELOG.md b/backend/CHANGELOG.md
index 2eea371..d6c48bc 100644
--- a/backend/CHANGELOG.md
+++ b/backend/CHANGELOG.md
@@ -7,11 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

+### Added
+
+- Backend documentation
+
+## [v0.1.6] - 2024-11-04
+
### Added

- Added support for queries without source data in vector database
- Graceful failure of triple export when no chunks are found
- Tested Qdrant vector database service
+- Added resolve entity rule

### Changed

diff --git a/backend/docs/CONTRIBUTING.md b/backend/docs/CONTRIBUTING.md
new file mode 120000
index 0000000..f939e75
--- /dev/null
+++ b/backend/docs/CONTRIBUTING.md
@@ -0,0 +1 @@
../../CONTRIBUTING.md
\ No newline at end of file

diff --git a/backend/docs/api/overview.md b/backend/docs/api/overview.md
new file mode 100644
index 0000000..b71340e
--- /dev/null
+++ b/backend/docs/api/overview.md
@@ -0,0 +1,59 @@

# Knowledge Table API Overview

Welcome to the Knowledge Table API! This summary provides a quick overview of key endpoints, usage guidelines, and how to access the interactive API documentation.

---

**Base URL**

All API requests should be made to the following base URL for version 1:

```
https://api.example.com/v1
```

---

**Documentation**

Explore and test all API endpoints through the interactive docs provided by FastAPI:

- **Swagger UI**: [http://localhost:8000/docs](http://localhost:8000/docs) – A user-friendly interface for API exploration.
- **ReDoc**: [http://localhost:8000/redoc](http://localhost:8000/redoc) – A clean reference for detailed API information.

---

Knowledge Table currently offers the following backend endpoints for document management, graph export, and query processing:

**Document**
Upload and manage documents within the Knowledge Table system.

- **POST** `/document` – Uploads and processes a document.
- **DELETE** `/document/{document_id}` – Deletes a document by its ID.

For details, refer to [Document Endpoints](v1/endpoints/document.md).

**Graph**
Export structured data from processed documents in the form of triples.

- **POST** `/graph/export-triples` – Exports triples (subject, predicate, object) based on table data.

More information is available at [Graph Endpoints](v1/endpoints/graph.md).

**Query**
Run queries to interact with documents using natural language or structured queries.

- **POST** `/query` – Submits a query and receives a structured response with relevant document data.

See [Query Endpoints](v1/endpoints/query.md) for further details.
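To see how the endpoints compose, here is a minimal upload-then-query sketch against a local development instance. The port, file path, and exact response field names are assumptions based on the endpoint pages linked above, so verify them against the interactive docs.

```python
import requests

BASE = "http://localhost:8000/api/v1"  # assumed local dev instance

# 1. Upload a document (multipart/form-data).
with open("report.pdf", "rb") as f:  # placeholder file path
    doc = requests.post(f"{BASE}/document", files={"file": f}).json()

# 2. Query the uploaded document with a column prompt.
payload = {
    "document_id": doc["document_id"],  # field name per the upload example; confirm in /docs
    "prompt": {
        "id": "prompt1",
        "entity_type": "Disease",
        "query": "Which diseases are mentioned in this document?",
        "type": "str_array",
        "rules": [],
    },
}
answer = requests.post(f"{BASE}/query", json=payload).json()
print(answer["answer"])
```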
---

**Status Codes**

Standard HTTP status codes are used to indicate request success or failure:

| Status Code | Status | Description |
| ----------- | ----------------------- | -------------------------------- |
| `200` | `OK` | Successful request |
| `400` | `Bad Request` | Invalid request parameters |
| `401` | `Unauthorized` | Authentication failed or missing |
| `404` | `Not Found` | Resource not found |
| `500` | `Internal Server Error` | Server encountered an error |

diff --git a/backend/docs/api/v1/endpoints/document.md b/backend/docs/api/v1/endpoints/document.md
new file mode 100644
index 0000000..6ee9234
--- /dev/null
+++ b/backend/docs/api/v1/endpoints/document.md
@@ -0,0 +1,196 @@

# Document API Overview

This section covers endpoints for managing documents in the Knowledge Table backend. You can upload, delete, and interact with document data using these API calls.

---

## **Upload Document**

`POST /document`

**Description**
Upload a document to the system for processing. This endpoint parses the document, extracting its content for further use.

### Request

**Method**: `POST`

**URL**: `/document`

**Headers**:

- `Content-Type`: `multipart/form-data`

**Parameters**

| Name | In | Type | Description |
| ------ | -------- | ------ | --------------------------- |
| `file` | formData | `file` | The document file to upload |

**Example**

```python
import requests

url = "http://localhost:8000/api/v1/document"
files = {
    "file": open("/path/to/your/document.pdf", "rb")
}

response = requests.post(url, files=files)
print(response.json())
```

### Response

**Status Code**: `200 OK`

**Content-Type**: `application/json`

**Body**:

```json
{
  "document_id": "12345",
  "filename": "document.pdf",
  "status": "processed",
  "metadata": {
    "pages": 10,
    "title": "Sample Document",
    "author": "John Doe"
  }
}
```

### Error Responses

| Status Code | Error | Description |
| ----------- | ----------------------- | ----------------------------------------- |
| `400` | `Bad Request` | Malformed request or missing/invalid file |
| `500` | `Internal Server Error` | Error during document processing |

---

## **Delete Document**

`DELETE /document/{document_id}`

**Description**
Delete a document from the system by specifying its unique document ID.

### Request

**Method**: `DELETE`

**URL**: `/document/{document_id}`

**Path Parameters**

| Name | In | Type | Description |
| ------------- | ---- | -------- | ----------------------------- |
| `document_id` | path | `string` | The unique ID of the document |

**Example**

```python
import requests

url = "http://localhost:8000/api/v1/document/12345"

response = requests.delete(url)
print(response.json())
```

### Response

**Status Code**: `200 OK`

**Body**:

```json
{
  "message": "Document deleted successfully",
  "document_id": "12345"
}
```

### Error Responses

| Status Code | Error | Description |
| ----------- | ----------------------- | -------------------------------- |
| `400` | `Bad Request` | Invalid request parameters |
| `401` | `Unauthorized` | Authentication failed or missing |
| `404` | `Not Found` | Resource not found |
| `500` | `Internal Server Error` | Server encountered an error |

---

## Schemas

This file defines Pydantic schemas for API requests and responses related to documents.
**DocumentCreate**

Schema for creating a new document.

- **`name`** (str): The name of the document.
- **`author`** (str): The author of the document.
- **`tag`** (str): A tag associated with the document.
- **`page_count`** (Annotated[int, Field(strict=True, gt=0)]): The number of pages in the document. This field is strictly validated to ensure it's a positive integer.

**DocumentResponse**

Schema for document response, inheriting from the Document model.

Inherits all attributes from the `Document` model:

- **`id`** (str): The unique identifier for the document.
- **`name`** (str): The name of the document.
- **`author`** (str): The author of the document.
- **`tag`** (str): A tag associated with the document.
- **`page_count`** (int): The number of pages in the document.

**DeleteDocumentResponse**

Schema for delete document response.

- **`id`** (str): The ID of the deleted document.
- **`status`** (str): The status of the delete operation.
- **`message`** (str): A message describing the result of the delete operation.

---

**Usage**

```python
from app.schemas.document import DocumentCreate, DocumentResponse, DeleteDocumentResponse

# Creating a new document
new_doc = DocumentCreate(
    name="Sample Document",
    author="John Doe",
    tag="sample",
    page_count=10
)

# Document response
doc_response = DocumentResponse(
    id="123",
    name="Sample Document",
    author="John Doe",
    tag="sample",
    page_count=10
)

# Delete document response
delete_response = DeleteDocumentResponse(
    id="123",
    status="success",
    message="Document successfully deleted"
)
```

These schemas are used to validate and structure data for API requests and responses related to documents in the application.

diff --git a/backend/docs/api/v1/endpoints/graph.md b/backend/docs/api/v1/endpoints/graph.md
new file mode 100644
index 0000000..a8191eb
--- /dev/null
+++ b/backend/docs/api/v1/endpoints/graph.md
@@ -0,0 +1,285 @@

# Graph API Overview

This section covers the API endpoints related to exporting graph triples from tables in the Knowledge Table backend. These endpoints allow you to generate triples (subject, predicate, object) based on structured data in tables and export the results as a JSON file.

**How it works**

1. The backend receives the table structure (columns, rows, chunks) in the request body.
2. The service validates the table and ensures the data is correctly formatted.
3. Using a Language Model (LLM) service, the system generates a schema for the table.
4. Based on the schema, the service generates triples (subject, predicate, object) and extracts data chunks.
5. The response contains the generated triples and chunks in JSON format, downloadable as a file.

---

## **Export Triples**

`POST /graph/export-triples`

**Description**
Export triples (subject, predicate, object) from a table based on the provided data. The results are returned as a JSON file containing both the triples and additional chunks of extracted data.

### Request

**Method**: `POST`

**URL**: `/graph/export-triples`

**Headers**:

- `Content-Type`: `application/json`

**Body**: JSON object representing a table with columns, rows, and chunks.
**Parameters**

| Name | In | Type | Description |
| --------- | ---- | ------ | -------------------------------------------------- |
| `columns` | body | `list` | A list of columns |
| `rows` | body | `list` | A list of rows |
| `chunks` | body | `dict` | A dictionary of chunk objects associated with rows |

**Example**

```python
import requests

url = "http://localhost:8000/api/v1/graph/export-triples"
headers = {
    "Content-Type": "application/json"
}
# Booleans use Python literals (True/False); requests serializes them to JSON
# true/false. String values such as "true" are also accepted, since fields
# like `hidden` are typed Union[bool, str] (see the schemas below).
data = {
    "columns": [
        {
            "id": "column1",
            "hidden": False,
            "entityType": "Person",
            "type": "text",
            "generate": True,
            "query": "Who is mentioned in the document?",
            "rules": [
                {
                    "type": "may_return",
                    "options": ["Jill", "Jane"]
                }
            ]
        },
        {
            "id": "column2",
            "hidden": False,
            "entityType": "Location",
            "type": "text",
            "generate": True,
            "query": "Where is @[Person](column1) from?",
            "rules": [
                {
                    "type": "must_return",
                    "options": ["New York", "Los Angeles"]
                }
            ]
        }
    ],
    "rows": [
        {
            "id": "row1",
            "sourceData": {
                "documentId": "doc123",
                "metadata": {
                    "author": "John Doe",
                    "title": "Sample Document"
                }
            },
            "hidden": False,
            "cells": {
                "column1": "Jill",
                "column2": "New York"
            }
        },
        {
            "id": "row2",
            "sourceData": "raw_text",
            "hidden": "true",
            "cells": {
                "column1": "Jane",
                "column2": "Los Angeles"
            }
        }
    ],
    "chunks": {
        "row1-column1": [
            {
                "content": "This is some content from page 1.",
                "page": 1
            },
            {
                "content": "Additional content from page 2.",
                "page": 2
            }
        ]
    }
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
```

### Response

**Status Code**: `200 OK`

**Content-Type**: `application/json`

**Body**:

```json
{
  "triples": [
    {
      "triple_id": "1",
      "head": {
        "label": "Person",
        "name": "Jill"
      },
      "tail": {
        "label": "Location",
        "name": "New York"
      },
      "relation": {
        "name": "is_from"
      },
      "chunk_ids": ["c_1234", "c_5678"]
    }
  ],
  "chunks": [
    {
      "chunk_id": "c_1234",
      "content": "This is some content from page 1.",
      "page": "1",
      "triple_id": "1"
    },
    {
      "chunk_id": "c_5678",
      "content": "Additional content from page 2.",
      "page": "2",
      "triple_id": "1"
    }
  ]
}
```

### Error Responses

| Status Code | Error | Description |
| ----------- | ----------------------- | ------------------------------------------------------ |
| `400` | `Bad Request` | Invalid JSON in the request body |
| `422` | `Unprocessable Entity` | Validation error in the request data |
| `500` | `Internal Server Error` | An unexpected error occurred during the export process |

---

## Schemas

This file defines Pydantic schemas for the graph API, including structures for prompts, table components, and export requests/responses.

**Prompt**

Represents a prompt used to extract specific information.

- **`entityType`** (str): The type of entity the prompt is associated with.
- **`id`** (str): Unique identifier for the prompt.
- **`query`** (str): The query text of the prompt.
- **`rules`** (List[Any]): List of rules associated with the prompt.
- **`type`** (str): The type of the prompt.

**Cell**

Represents a cell in a table.

- **`answer`** (Dict[str, Any]): The answer content of the cell.
- **`columnId`** (str): The ID of the column this cell belongs to.
- **`dirty`** (Union[bool, str]): Indicates if the cell has been modified.
- **`rowId`** (str): The ID of the row this cell belongs to.
**Column**

Represents a column in a table.

- **`id`** (str): Unique identifier for the column.
- **`prompt`** (Prompt): The prompt associated with this column.
- **`width`** (Union[int, str]): The width of the column.
- **`hidden`** (Union[bool, str]): Indicates if the column is hidden.

**Document**

Represents a document.

- **`id`** (str): Unique identifier for the document.
- **`name`** (str): The name of the document.
- **`author`** (str): The author of the document.
- **`tag`** (str): A tag associated with the document.
- **`page_count`** (Union[int, str]): The number of pages in the document.

**Row**

Represents a row in a table.

- **`id`** (str): Unique identifier for the row.
- **`document`** (Document): The document associated with this row.
- **`hidden`** (Union[bool, str]): Indicates if the row is hidden.

**Table**

Represents a table.

- **`columns`** (List[Column]): List of columns in the table.
- **`rows`** (List[Row]): List of rows in the table.
- **`cells`** (List[Cell]): List of cells in the table.

**ExportTriplesRequest**

Schema for export triples request.

- **`columns`** (List[Column]): List of columns in the table.
- **`rows`** (List[Row]): List of rows in the table.
- **`cells`** (List[Cell]): List of cells in the table.

**ExportTriplesResponse**

Schema for export triples response.

- **`triples`** (List[Dict[str, Any]]): List of triples exported.
- **`chunks`** (List[Dict[str, Any]]): List of chunks associated with the triples.

---

**Usage**

```python
from app.schemas.graph import Prompt, Cell, Column, Document, Row, Table, ExportTriplesRequest, ExportTriplesResponse

# Creating a prompt
prompt = Prompt(entityType="Person", id="1", query="What is the person's name?", rules=[], type="text")

# Creating a cell
cell = Cell(answer={"text": "John Doe"}, columnId="1", dirty=False, rowId="1")

# Creating a column
column = Column(id="1", prompt=prompt, width=100, hidden=False)

# Creating a document
document = Document(id="1", name="Sample Doc", author="Jane Doe", tag="sample", page_count=10)

# Creating a row
row = Row(id="1", document=document, hidden=False)

# Creating a table
table = Table(columns=[column], rows=[row], cells=[cell])

# Creating an export request
export_request = ExportTriplesRequest(columns=[column], rows=[row], cells=[cell])

# Creating an export response
export_response = ExportTriplesResponse(triples=[{"subject": "John", "predicate": "is", "object": "Person"}], chunks=[{"id": "1", "text": "John is a person"}])
```

These schemas are used to validate and structure data for API requests and responses related to graph exports in the application.

diff --git a/backend/docs/api/v1/endpoints/query.md b/backend/docs/api/v1/endpoints/query.md
new file mode 100644
index 0000000..f06e5aa
--- /dev/null
+++ b/backend/docs/api/v1/endpoints/query.md
@@ -0,0 +1,203 @@

# Query API Overview

This section covers the API endpoints related to querying documents in the Knowledge Table backend. The query system allows you to ask questions about documents using different retrieval methods, including vector search, hybrid search, and decomposed search. These queries utilize a combination of keyword searches and vector-based methods to generate answers from the document data. There are currently three supported methods for retrieving data.
+ +- **Hybrid Search (Default)** - Combines both keyword and vector searches to retrieve relevant chunks from both methods, creating a more comprehensive response. Useful when the document contains both structured and unstructured data. + +- **Vector Search** - Performs a simple vector search on the document and retrieves the most relevant chunks of text based on the query. Ideal for finding similar passages in large documents. + +- **Decomposed Search** - Breaks the main query into smaller sub-queries, runs vector searches for each sub-query, and then compiles the results into a cohesive answer. Ideal for complex queries that require a step-by-step breakdown. + +--- + +## **Run Query** + +`POST /query` + +**Description** +Run a query against a document using one of three methods: Simple Vector Search, Hybrid Search, or Decomposed Search. + +### Request + +**Method**: `POST` + +**URL**: `/query` + +**Headers**: + +- `Content-Type`: `application/json` + +**Body**: JSON object containing the query details, including the retrieval type, document ID, and the prompt/query itself. + +**Parameters** + +| Name | In | Type | Description | +| ------------- | ---- | -------- | ------------------------------------------------- | +| `document_id` | body | `string` | The ID of the document to query | +| `prompt` | body | `object` | A column prompt in the `QueryPromptSchema` format | + +**QueryPromptSchema Structure** + +| Name | Type | Description | +| ------------- | --------- | ------------------------------------------------------------------------- | +| `id` | `string` | ID of the column | +| `entity_type` | `string` | The name of the entity in the Knowledge Table | +| `query` | `string` | The actual query or question | +| `type` | `Literal` | One of `"int"`, `"str"`, `"bool"`, `"int_array"`, `"str_array"` | +| `rules` | `list` | _(Optional) `"must_return"`, `"may_return"`, `"max_length"`, `"replace"`_ | + +**Example** + +```python +import requests + +url = "http://localhost:8000/api/v1/query" +headers = { + "Authorization": "Bearer YOUR_API_TOKEN", + "Content-Type": "application/json" +} +data = { + "document_id": "abc123", + "prompt": { + "id": "prompt1", + "entity_type": "Disease", + "query": "Which diseases are mentioned in this document?", + "type": "str_array", + "rules": [ + { + "type": "must_return", + "options": ["asthma", "diabetes","cancer"] + } + ] + } +} + +response = requests.post(url, headers=headers, json=data) +print(response.json()) +``` + +### Response + +**Status Code**: `200 OK` + +**Content-Type**: `application/json` + +**Body**: + +```json +{ + "answer": { + "id": "e7f4a6b8c5df4c099f39bdf0e2a1db8e", + "document_id": "abc123", + "prompt_id": "prompt1", + "answer": ["diabetes", "cancer"], + "type": "str_array" + }, + "chunks": [ + { + "content": "This is some content from page 1.", + "page": 1 + }, + { + "content": "Additional content from page 2.", + "page": 2 + } + ], + "resolved_entities": null +} +``` + +## Error Responses + +| Status Code | Error | Description | +| ----------- | ----------------------- | -------------------------------------------------------------------- | +| `400` | `Bad Request` | The request contains an invalid query type or is otherwise malformed | +| `500` | `Internal Server Error` | An error occurred while processing the query | + +--- + +## Schemas + +This file defines Pydantic schemas for API requests and responses related to queries. + +**QueryPrompt** + +Represents a query prompt. + +- **`id`** (str): Unique identifier for the prompt. 
+- **`query`** (str): The query text. +- **`type`** (Literal["int", "str", "bool", "int_array", "str_array"]): The expected type of the answer. +- **`entity_type`** (str): The type of entity the query is about. +- **`rules`** (Optional[List[Rule]]): Optional list of rules to apply to the query. + +**QueryRequest** + +Represents a query request. + +- **`document_id`** (str): The ID of the document to query. +- **`previous_answer`** (Optional[Union[int, str, bool, List[int], List[str]]]): The previous answer, if any. +- **`prompt`** (QueryPrompt): The query prompt. +- **`rag_type`** (Optional[Literal["vector", "hybrid", "decomposed"]]): The type of retrieval-augmented generation to use. Defaults to "hybrid". + +**VectorResponse** + +Represents a vector response. + +- **`message`** (str): A message associated with the response. +- **`chunks`** (List[Chunk]): List of relevant chunks from the document. +- **`keywords`** (Optional[List[str]]): Optional list of keywords extracted from the query. + +**QueryResponse** + +Represents a query response. + +- **`id`** (str): Unique identifier for the response. +- **`document_id`** (str): The ID of the document queried. +- **`prompt_id`** (str): The ID of the prompt used. +- **`answer`** (Optional[Union[int, str, bool, List[int], List[str]]]): The answer to the query. +- **`chunks`** (List[Chunk]): List of relevant chunks from the document. +- **`type`** (str): The type of the answer. + +--- + +**Usage** + +```python +from app.schemas.query import QueryPrompt, QueryRequest, VectorResponse, QueryResponse +from app.models.query import Rule, Chunk + +# Creating a query prompt +prompt = QueryPrompt( + id="1", + query="What is the capital of France?", + type="str", + entity_type="Location", + rules=[Rule(type="must_return", options=["Paris", "London", "Berlin"])] +) + +# Creating a query request +request = QueryRequest( + document_id="doc123", + prompt=prompt, + rag_type="hybrid" +) + +# Creating a vector response +vector_response = VectorResponse( + message="Retrieved relevant chunks", + chunks=[Chunk(content="Paris is the capital of France.", page=1)], + keywords=["capital", "France"] +) + +# Creating a query response +query_response = QueryResponse( + id="resp1", + document_id="doc123", + prompt_id="1", + answer="Paris", + chunks=[Chunk(content="Paris is the capital of France.", page=1)], + type="str" +) +``` + +These schemas are used to validate and structure data for API requests and responses related to queries in the application. diff --git a/backend/docs/architecture.md b/backend/docs/architecture.md new file mode 100644 index 0000000..c2efc8c --- /dev/null +++ b/backend/docs/architecture.md @@ -0,0 +1,122 @@ +# Architecture Overview + +This section provides an overview of the Knowledge Table backend architecture, covering key components and their interactions. Knowledge Table follows a modular, service-oriented architecture. + +```mermaid +graph TD + Client[Client Application] -->|HTTP Requests| API[API Layer] + API -->|Calls| Services[Service Layer] + Services -->|Uses| Models[Model Layer] + Services -->|Integrates| Ext[External Services] + Ext -->|LLM| OpenAI[OpenAI] + Ext -->|Vector DB| Milvus[Milvus] + Ext -->|Loaders| DocLoaders[Document Loaders] + DocLoaders -->|PDF| PyPDF[PyPDF] + DocLoaders -->|Unstructured| Unstructured[Unstructured] +``` + +## Components + +**API Layer** + +_Handles HTTP requests from clients using FastAPI_ + +- **`/documents/`**: Document upload and retrieval. +- **`/graphs/`**: Knowledge graph management. 
- **`/queries/`**: Natural language query processing.

**Service Layer**

_Contains core business logic_

- **Document Service**: Manages document processing and storage.
- **Graph Service**: Handles knowledge graph creation and querying.
- **LLM Service**: Interfaces with language models for text analysis.
- **Query Service**: Processes queries and returns structured responses.

**Model Layer**

_Defines database models for documents, graphs, and queries_

**External Integrations**

_Connects to Language Models (LLMs), vector databases, and document loaders_

- **LLM**: Supports OpenAI and is extensible to other providers.
- **Vector Database**: Manages embeddings for similarity search using Milvus.
- **Document Loaders**: Processes PDFs and unstructured documents.

## Project Structure

```plaintext
backend/
├── src/
│   └── app/
│       ├── api/
│       │   └── v1/
│       │       └── endpoints/
│       │           ├── document.py
│       │           ├── graph.py
│       │           └── query.py
│       ├── core/
│       │   ├── config.py
│       │   └── dependencies.py
│       ├── models/
│       │   ├── document.py
│       │   ├── graph.py
│       │   ├── llm.py
│       │   └── query.py
│       ├── schemas/
│       │   ├── document.py
│       │   ├── graph.py
│       │   └── query.py
│       └── services/
│           ├── document_service.py
│           ├── graph_service.py
│           ├── llm_service.py
│           ├── query_service.py
│           ├── llm/
│           │   ├── base.py
│           │   ├── factory.py
│           │   ├── openai_service.py
│           │   └── prompts.py
│           ├── loaders/
│           │   ├── base.py
│           │   ├── factory.py
│           │   ├── pypdf_service.py
│           │   └── unstructured_service.py
│           └── vector_db/
│               ├── base.py
│               ├── factory.py
│               └── milvus_service.py
├── tests/
└── docs/
```

## Data Flow

### Document Upload

```mermaid
sequenceDiagram
    Client->>API: Upload Document
    API->>DocumentService: Process Document
    DocumentService->>Loader: Parse Document
    Loader->>LLMService: Generate Embeddings
    LLMService->>VectorDB: Store Embeddings
    DocumentService->>Database: Store Metadata
    API->>Client: Confirmation
```

### Query Processing

```mermaid
sequenceDiagram
    Client->>API: Submit Query
    API->>QueryService: Process Query
    QueryService->>LLMService: Generate Query Embedding
    QueryService->>VectorDB: Search
    VectorDB->>QueryService: Return Documents
    QueryService->>LLMService: Generate Response
    API->>Client: Structured Response
```

diff --git a/backend/docs/extending/document_loaders.md b/backend/docs/extending/document_loaders.md
new file mode 100644
index 0000000..944f953
--- /dev/null
+++ b/backend/docs/extending/document_loaders.md
@@ -0,0 +1,112 @@

# Extending Document Loaders

This guide covers adding new document loaders to the Knowledge Table backend, detailing the setup and configuration.

---

## Steps

### **1. Create a Loader Service Class**

In `src/app/services/loaders/`, create a new file, e.g., `your_loader_service.py`.

```python
# your_loader_service.py
from typing import List, Optional
from langchain.schema import Document
from app.services.loaders.base import LoaderService

class YourLoader(LoaderService):
    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key

    async def load(self, file_path: str) -> List[Document]:
        # Implement loading logic here
        pass

    async def load_and_split(self, file_path: str) -> List[Document]:
        # Implement loading and splitting logic here
        pass
```

### **2. Update the Loader Factory**

In `src/app/services/loaders/factory.py`, add an import and update the factory method.
+ +```python +# factory.py +from app.services.loaders.your_loader_service import YourLoader + +class LoaderFactory: + @staticmethod + def create_loader() -> Optional[LoaderService]: + loader_type = settings.loader + if loader_type == "your_loader": + return YourLoader(api_key=settings.your_loader_api_key) + # existing loader conditions... +``` + +### **3. Configure the Service** + +In `src/app/core/config.py`, add configurations for the new loader: + +```python +# config.py +from pydantic import BaseSettings + +class Settings(BaseSettings): + loader: str = "unstructured" + your_loader_api_key: Optional[str] = None + +settings = Settings() +``` + +Update your environment variables or `.env` file: + +``` +LOADER=your_loader +YOUR_LOADER_API_KEY=your_api_key_here +``` + +### **4. Implement Loader Logic** + +Define `load` and `load_and_split` in `YourLoader` for your loader requirements. + +```python +async def load(self, file_path: str) -> List[Document]: + raw_document = YourLoaderLibrary(api_key=self.api_key).load_file(file_path) + return [Document(page_content=raw_document, metadata={"source": file_path})] + +async def load_and_split(self, file_path: str) -> List[Document]: + raw_document = YourLoaderLibrary(api_key=self.api_key).load_file(file_path) + splits = YourLoaderLibrary().split_document(raw_document) + return [Document(page_content=split, metadata={"source": file_path, "split": i}) for i, split in enumerate(splits)] +``` + +--- + +## Considerations + +- **Error Handling**: Ensure robust error handling. +- **Testing**: Write unit tests for your loader. +- **Performance**: Optimize for large documents. +- **Metadata**: Capture relevant document metadata. +- **Compatibility**: Return `langchain.schema.Document` objects for system compatibility. + +## Example + +```python +from typing import List +from langchain.schema import Document +from langchain.document_loaders import PyPDFLoader as LangchainPyPDFLoader +from app.services.loaders.base import LoaderService + +class PDFLoader(LoaderService): + async def load(self, file_path: str) -> List[Document]: + loader = LangchainPyPDFLoader(file_path) + return loader.load() + + async def load_and_split(self, file_path: str) -> List[Document]: + loader = LangchainPyPDFLoader(file_path) + return loader.load_and_split() +``` diff --git a/backend/docs/extending/llm_services.md b/backend/docs/extending/llm_services.md new file mode 100644 index 0000000..8632731 --- /dev/null +++ b/backend/docs/extending/llm_services.md @@ -0,0 +1,92 @@ +# Extending LLM Services + +This guide explains how to add support for new Language Model (LLM) services to the Knowledge Table backend. + +--- + +## Steps + +### **1. Create a New LLM Service Class** + +In `src/app/services/llm/`, create a new file, e.g., `your_llm_service.py`. + +```python +# your_llm_service.py +from .base import BaseLLMService + +class YourLLMService(BaseLLMService): + def __init__(self): + super().__init__() + # Initialize your LLM client here + + async def generate_response(self, prompt: str) -> str: + # Implement LLM interaction logic here + response = ... # Replace with actual implementation + return response +``` + +### **2. Update the LLM Factory** + +In `src/app/services/llm/factory.py`, import your service and update the factory method. 
+ +```python +# factory.py +from .your_llm_service import YourLLMService + +class LLMFactory: + @staticmethod + def get_llm_service(service_name: str): + if service_name == "openai": + return OpenAIService() + elif service_name == "your_llm": + return YourLLMService() + else: + raise ValueError(f"Unknown LLM service: {service_name}") +``` + +### **3. Configure the Service** + +In `src/app/core/config.py`, add configuration options for the new LLM. + +```python +# config.py +from pydantic import BaseSettings + +class Settings(BaseSettings): + LLM_SERVICE: str = "openai" + +settings = Settings() +``` + +Update your environment or `.env` file: + +``` +LLM_SERVICE=your_llm +``` + +--- + +## Considerations + +- **Authentication**: Handle any API keys or authentication required by the service. +- **Error Handling**: Ensure robust error handling in your service. +- **Testing**: Write unit tests for your new service. + +## Example + +Here's an example of how you might implement `generate_response` in `YourLLMService`: + +```python +async def generate_response(self, prompt: str) -> str: + # Call your LLM API client with the prompt + try: + response = await self.your_llm_client.generate( + prompt, + api_key=self.api_key, + max_tokens=50 + ) + return response["text"] + except Exception as e: + logger.error(f"Error generating response: {e}") + return "Error in LLM generation" +``` diff --git a/backend/docs/extending/overview.md b/backend/docs/extending/overview.md new file mode 100644 index 0000000..fe90571 --- /dev/null +++ b/backend/docs/extending/overview.md @@ -0,0 +1,57 @@ +# Extending the Knowledge Table Backend + +The Knowledge Table backend is designed for extensibility, allowing developers to add new capabilities and integrate services with ease. + +--- + +## Getting Started + +_For full instructions, please see the [Contributing Guide](../CONTRIBUTING.md)._ + +1. **Create a Branch**: Use Git to create a new branch for your extension. +2. **Develop Your Extension**: Follow guidelines, and add base classes where needed. +3. **Test Your Component**: Implement tests to verify functionality and integration. +4. **Update Configurations and Docs**: Add any new options to config files and provide detailed documentation. (`README.md`, `docs/`, `.env.sample`, etc.) +5. **Submit a Pull Request**: Submit your PR with tests, documentation, and configurations for review. + +--- + +## Areas for Extension + +Here are some key areas where we are looking to extend the Knowledge Table backend: + +**Document Loaders** + +- **New Formats**: Support additional file types (e.g., PDFs, DOCX, HTML). +- **Custom Parsing**: Use external libraries or custom logic for parsing. +- **Document Preprocessing**: Implement chunking, metadata extraction, or content filtering. + +**LLM Services** + +- **LLM Integrations**: Add support for different LLM providers (e.g., OpenAI, Anthropic). +- **Prompt Strategies**: Modify prompts for specific tasks and use cases. +- **Efficiency**: Optimize usage for cost-effectiveness and performance. + +**Vector Databases** + +- **Database Options**: Support for additional vector databases (e.g., Pinecone, Weaviate). +- **Indexing and Search**: Implement custom indexing and search strategies. +- **Performance Tuning**: Optimize for handling large datasets. + +--- + +## Principles + +1. **Modularity**: Design new components as self-contained units that integrate smoothly without impacting other areas. +2. 
**Consistency**: Follow established design patterns and conventions to maintain cohesion with existing architecture. +3. **Configuration**: Use environment variables or configuration files for settings, avoiding hardcoding. +4. **Error Handling**: Provide meaningful error messages and logging. +5. **Testing**: Write unit and integration tests covering edge cases and common inputs. +6. **Documentation**: Include usage instructions, configuration steps, and any dependencies. +7. **Performance**: Design for scalability, especially for data-intensive processes. + +--- + +## Contributions and Support + +We welcome community contributions! Check the [Contributing Guide](../CONTRIBUTING.md) for more on how to contribute. For help, join our [Discord community](https://discord.gg/PAgGMxfhKd) or contact us at team@whyhow.ai. diff --git a/backend/docs/extending/vector_databases.md b/backend/docs/extending/vector_databases.md new file mode 100644 index 0000000..e26e264 --- /dev/null +++ b/backend/docs/extending/vector_databases.md @@ -0,0 +1,140 @@ +# Extending Vector DB Services + +This guide explains how to add support for new Vector Database services. + +--- + +## Steps + +### **1. Create a Vector DB Service Class** + +In `src/app/services/vector_db/`, create a new file, e.g., `your_vector_db_service.py`. + +```python +# your_vector_db_service.py +from typing import Any, Dict, List +from langchain.schema import Document as LangchainDocument +from app.models.query import Rule +from app.schemas.query import VectorResponse +from app.services.vector_db.base import VectorDBService +from app.services.llm_service import LLMService + +class YourVectorDBService(VectorDBService): + def __init__(self, llm_service: LLMService): + self.llm_service = llm_service + + async def upsert_vectors(self, vectors: List[Dict[str, Any]]) -> Dict[str, str]: + # Implement vector upsert logic + pass + + async def vector_search(self, queries: List[str], document_id: str) -> VectorResponse: + # Implement vector search logic + pass + + async def keyword_search(self, query: str, document_id: str, keywords: List[str]) -> VectorResponse: + # Implement keyword search logic + pass + + async def hybrid_search(self, query: str, document_id: str, rules: List[Rule]) -> VectorResponse: + # Implement hybrid search logic + pass + + async def decomposed_search(self, query: str, document_id: str, rules: List[Rule]) -> Dict[str, Any]: + # Implement decomposed search logic + pass + + async def delete_document(self, document_id: str) -> Dict[str, str]: + # Implement document deletion logic + pass + + async def ensure_collection_exists(self) -> None: + # Implement collection creation logic + pass + + async def prepare_chunks(self, document_id: str, chunks: List[LangchainDocument]) -> List[Dict[str, Any]]: + # Implement chunk preparation logic + pass +``` + +### **2. Update the Vector DB Factory** + +In `src/app/services/vector_db/factory.py`, add an import and update the factory method. 
+ +```python +# factory.py +from app.services.vector_db.your_vector_db_service import YourVectorDBService + +class VectorDBFactory: + @staticmethod + def create_vector_db_service(provider: str, llm_service: LLMService) -> Optional[VectorDBService]: + logger.info(f"Creating vector database service with provider: {provider}") + if provider.lower() == "milvus-lite": + return MilvusService(llm_service) + elif provider.lower() == "your_vector_db": + return YourVectorDBService(llm_service) + # Add other vector database providers here + logger.warning(f"Unsupported vector database provider: {provider}") + return None +``` + +### **3. Configure the Service** + +In `src/app/core/config.py`, add a configuration option for your Vector DB. + +```python +# config.py +from pydantic import BaseSettings + +class Settings(BaseSettings): + VECTOR_DB_PROVIDER: str = "milvus-lite" # Default to Milvus-Lite + +settings = Settings() +``` + +Update your environment variables or `.env` file: + +``` +VECTOR_DB_PROVIDER=your_vector_db +``` + +--- + +## Considerations + +- **Authentication**: Ensure you handle any API keys or authentication required by your Vector DB service. +- **Error Handling**: Implement proper error handling in your service. +- **Testing**: Write unit tests for your new service. +- **Performance**: Optimize for large-scale vector operations. +- **Compatibility**: Ensure compatibility with the existing LLM service and document processing pipeline. + +## Example + +Here's an example of how you might implement the `vector_search` method for a hypothetical Vector DB: + +```python +async def vector_search(self, queries: List[str], document_id: str) -> VectorResponse: + results = [] + for query in queries: + # Get embeddings for the query + query_embedding = await self.llm_service.get_embeddings(query) + + # Perform the search in your Vector DB + search_results = self.vector_db_client.search( + collection_name="your_collection", + query_vector=query_embedding, + filter=f"document_id == '{document_id}'", + limit=10 + ) + + # Process and format the results + for result in search_results: + results.append(Chunk( + content=result.text, + page=result.metadata.get('page_number', 0) + )) + + return VectorResponse( + message="Query processed successfully.", + chunks=results + ) +``` diff --git a/backend/docs/getting-started/docker.md b/backend/docs/getting-started/docker.md new file mode 100644 index 0000000..c80aa1e --- /dev/null +++ b/backend/docs/getting-started/docker.md @@ -0,0 +1,65 @@ +# Docker Deployment + +This guide explains how to deploy Knowledge Table using Docker. + +> ## Prerequisites +> +> - Docker & Docker Compose installed + +## Steps + +> **Step 1:** Clone the Repository + +```sh +git clone https://github.com/yourusername/knowledge-table.git +cd knowledge-table +``` + +> **Step 2:** Set Up Environment + +```sh +cp .env.sample .env +``` + +Open `.env`, add your OpenAI API key: + +``` +OPENAI_API_KEY=your_api_key_here +``` + +> **Step 3:** Build and Start Containers + +```sh +docker-compose up -d --build +``` + +- Frontend: `http://localhost:3000` +- Backend: `http://localhost:8000` + +> **Step 4:** Stop the Application + +```sh +docker-compose down +``` + +## Troubleshooting + +- Check container logs with: + ```sh + docker-compose logs + ``` +- Verify all required variables in `.env`. 
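- To confirm Compose is picking up your `.env` values, render the resolved configuration (a quick sanity check, assuming the standard Compose setup in this repo):

  ```sh
  docker-compose config
  ```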
+ +## Updating + +> **Step 1:** Pull the latest changes + +```sh +git pull origin main +``` + +> **Step 2:** Rebuild and restart: + +```sh +docker-compose up -d --build +``` diff --git a/backend/docs/getting-started/installation.md b/backend/docs/getting-started/installation.md new file mode 100644 index 0000000..8b1df4d --- /dev/null +++ b/backend/docs/getting-started/installation.md @@ -0,0 +1,120 @@ +# Installation Guide + +This guide will walk you through setting up and running the Knowledge Table backend and frontend. + +> **Prerequisites** +> +> - Python 3.10+ +> - Git +> - [Bun](https://bun.sh/) (for frontend) + +--- + +## Setup and Run + +### Backend + +> **Step 1:** Clone the Knowledge Table repository + +```bash +git clone https://github.com/whyhow-ai/knowledge-table.git +cd knowledge-table/backend/ +``` + +> **Step 2:** Create and activate a virtual environment + +```bash +python3 -m venv venv +source venv/bin/activate # On Windows, use `venv\Scripts\activate` +``` + +> **Step 3:** Install the necessary packages + +```bash +pip install . +``` + +_To install additional development dependencies:_ + +```bash +pip install .[dev] +``` + +> **Step 4:** Launch the backend server + +```bash +uvicorn app.main:app --reload +``` + +**API** available at: [http://localhost:8000](http://localhost:8000) + +--- + +### Frontend + +> **Step 1:** Navigate to the frontend directory and install dependencies + +```bash +cd ../frontend +bun install +``` + +> **Step 2:** Start the frontend + +```bash +bun start +``` + +**Frontend** available at: [http://localhost:5173](http://localhost:5173) + +--- + +## Configuration + +The backend uses environment variables for configuration. + +> **Create a `.env` file** + +```bash +cp .env.sample .env +``` + +> **Add API keys** + +Open `.env` and set up your OpenAI API key: + +```dotenv +OPENAI_API_KEY=sk-yourkeyhere +``` + +_Optional: Add the Unstructured API key:_ + +```dotenv +UNSTRUCTURED_API_KEY=your-unstructured-api-key +``` + +| Variable | Description | +| ---------------------- | -------------------------------------- | +| `OPENAI_API_KEY` | Your OpenAI API key | +| `UNSTRUCTURED_API_KEY` | API key for Unstructured.io (optional) | + +--- + +## Explore the API + +Once the application is running, you can access the interactive API documentation: + +- **Swagger UI**: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs) +- **ReDoc**: [http://127.0.0.1:8000/redoc](http://127.0.0.1:8000/redoc) + +Use these interfaces to try out API requests directly in your browser. + +--- + +## Run Tests + +To ensure that everything is working correctly, you can run the unit tests: + +```bash +pytest +``` diff --git a/backend/docs/images/cover-image.png b/backend/docs/images/cover-image.png new file mode 100644 index 0000000..fbae50c Binary files /dev/null and b/backend/docs/images/cover-image.png differ diff --git a/backend/docs/images/wh-logo.png b/backend/docs/images/wh-logo.png new file mode 100644 index 0000000..49815e5 Binary files /dev/null and b/backend/docs/images/wh-logo.png differ diff --git a/backend/docs/index.md b/backend/docs/index.md new file mode 100644 index 0000000..00a7759 --- /dev/null +++ b/backend/docs/index.md @@ -0,0 +1,70 @@ +# Knowledge Table + +Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents. This site provides all the information you need to understand, use, and extend Knowledge Table. 
**Follow the [Installation Guide](getting-started/installation.md) to get started.**

![cover](images/cover-image.png)

## Features

- **Extract with natural language** - Use natural language queries to extract structured data from unstructured documents.
- **Chunk Linking** - Link raw source text chunks to the answers for traceability and provenance.
- **Customizable extraction rules** - Define rules to guide the extraction process and ensure data quality.
- **Custom formatting** - Control the output format of your extracted data. Knowledge Table currently supports text, list of text, number, list of numbers, and boolean formats.
- **Filtering** - Filter documents based on metadata or extracted data.
- **Exporting as CSV or Triples** - Download extracted data as CSV or graph triples.
- **Chained extraction** - Reference previous columns in your extraction questions using '@' i.e. "What are the treatments for @disease?".
- **Split Cell Into Rows** - Split a single cell containing a List of Numbers or List of Values into individual rows, enabling more complex chained extraction.

## Concepts

### Tables

Like a spreadsheet, a **table** is a collection of rows and columns that store structured data. Each row represents a **document**, and each column represents an **entity** that is extracted and formatted with a **question**.

### Documents

Each **document** is an unstructured data source (e.g., a contract, article, or report) uploaded to the Knowledge Table. When you upload a document, it is split into chunks, the chunks are embedded and tagged with metadata, and stored in a vector database.

### Question

A **Question** is the core mechanism for guiding extraction. It defines what data you want to extract from a document.

### Rule

A **Rule** guides the extraction from the LLM. You can add rules on a column level or on a global level. Currently, the following rule types are supported:

- **May Return** rules give the LLM examples of answers that can be used to guide the extraction. This is a great way to give the LLM more guidance on the kinds of things it should keep an eye out for.
- **Must Return** rules give the LLM an exhaustive list of answers that are allowed to be returned. This is a great way to give the LLM guardrails and ensure that only certain terms are returned.
- **Allowed # of Responses** rules provide guardrails for cases where there may be a range of potential 'grey-area' answers, restricting the output to a set number of the top responses.
- **Resolve Entity** rules allow you to resolve values to a specific entity. This is useful for ensuring output conforms to a specific entity type. For example, you can write rules that ensure "blackrock", "Blackrock, Inc.", and "Blackrock Corporation" all resolve to the same entity - "Blackrock".

## APIs

- [Document API](api/v1/endpoints/document.md)
- [Graph API](api/v1/endpoints/graph.md)
- [Query API](api/v1/endpoints/query.md)

## Services

- [Document Service](services/document_service.md)
- [Graph Service](services/graph_service.md)
- [Query Service](services/query_service.md)
- [LLM Service](services/llm_service.md)

## Extending the Backend

[Click here](extending/overview.md) to learn more about extending the backend.

## Testing

[Click here](testing/testing.md) to learn more about testing.

## Contributing

We welcome contributions!
Please see our [Contributing Guide](CONTRIBUTING.md) for more information on how to get involved. + +## Support + +For support, join our [Discord community](https://discord.gg/PAgGMxfhKd) or contact us at team@whyhow.ai. diff --git a/backend/docs/services/document_service.md b/backend/docs/services/document_service.md new file mode 100644 index 0000000..f82024f --- /dev/null +++ b/backend/docs/services/document_service.md @@ -0,0 +1,138 @@ +# Document Service + +The `DocumentService` handles document uploads, processing, and storage in the Knowledge Table backend. + +The service uses: + +- **VectorDBService**: For storing document vectors. +- **LLMService**: For language model operations. +- **LoaderFactory** and **RecursiveCharacterTextSplitter**: For document loading and splitting. + +--- + +## Core Functions + +**upload_document** + Uploads and processes a document. + +```python +async def upload_document(self, filename: str, file_content: bytes) -> Optional[str] +``` + +Parameters: + +- `filename`: The name of the file. +- `file_content`: Binary file content. + +Returns: + +- Document ID (`str`) if successful, otherwise `None`. + +Process: + +1. Generates a document ID. +2. Saves file content temporarily. +3. Splits the document into chunks. +4. Uses LLM service (if available) to process and store chunks. +5. Deletes temporary file. + +**\_process_document** + Loads and splits a document into chunks. + +```python +async def _process_document(self, file_path: str) -> List[LangchainDocument] +``` + +**\_load_document** + Loads the document using the designated loader. + +```python +async def _load_document(self, file_path: str) -> List[LangchainDocument] +``` + +_Raises `ValueError` if no loader is found._ + +**\_generate_document_id** + Generates a unique document ID. + +```python +@staticmethod +def _generate_document_id() -> str +``` + +--- + +## Usage + +```python +vector_db_service = VectorDBFactory.create_vector_db_service(...) +llm_service = LLMFactory.get_llm_service(...) +document_service = DocumentService(vector_db_service, llm_service) + +document_id = await document_service.upload_document("example.pdf", file_content) +if document_id: + print(f"Document uploaded successfully with ID: {document_id}") +else: + print("Failed to upload document") +``` + +--- + +## Configuration + +Settings used by `DocumentService`: + +- `chunk_size`: Size for document chunks. +- `chunk_overlap`: Overlap between chunks. +- `loader`: Document loader type. + +Configure these in the application’s configuration file. + +--- + +## Error Handling + +Includes error handling and logging for: + +- Upload and processing failures +- LLM service unavailability +- Document loading issues + +Errors are logged with levels (`INFO`, `WARNING`, `ERROR`) to aid debugging. Exceptions are managed to ensure stability, with detailed messages in `upload_document` for specific failure scenarios. + +--- + +## Dependencies + +- **settings** from `app.core.config`: Manages configuration options like `chunk_size`, `chunk_overlap`, and loader type. +- **LoaderFactory** from `app.services.loaders.factory`: Creates loaders based on configuration. +- **VectorDBService** and **LLMService**: Provide vector embedding and language model functionality. +- **RecursiveCharacterTextSplitter** from `langchain.text_splitter`: Splits documents into manageable chunks. + +--- + +## Models + +### Document + +The `Document` model, built with Pydantic, validates and represents document data within the application. 
+ +```python +from app.models.document import Document + +doc = Document( + id="123", + name="Sample Document", + author="John Doe", + tag="sample", + page_count=10 +) +``` + +Attributes: + +- `id` (str): Unique identifier for the document. +- `name` (str): Document name. +- `author` (str): Document author. +- `tag` (str): Tag associated with the document. +- `page_count` (int): Total number of pages in the document. diff --git a/backend/docs/services/graph_service.md b/backend/docs/services/graph_service.md new file mode 100644 index 0000000..a7331b5 --- /dev/null +++ b/backend/docs/services/graph_service.md @@ -0,0 +1,249 @@ +# Graph Service + +The Graph Service processes table data, generates schemas, and creates triples (subject-predicate-object relationships) to represent data as a graph. + +--- + +## Core Functions + +**parse_table** + Prepares table data for schema generation. + +```python +async def parse_table(data: Table) -> Dict[str, Any] +``` + +**generate_triples** + Generates triples and chunks based on the schema and table data. + +```python +async def generate_triples(schema: SchemaResponseModel, table_data: Table) -> Dict[str, Any] +``` + +**generate_triples_for_relationship** + Creates triples for a specific relationship across all rows. + +```python +def generate_triples_for_relationship(relationship: SchemaRelationship, table_data: Table) -> List[Triple] +``` + +**create_triple_for_row** + Generates a single triple for a relationship and row. + +```python +def create_triple_for_row(relationship: SchemaRelationship, row: Row, table_data: Table) -> Optional[Triple] +``` + +**generate_chunks_for_triples** + Generates content chunks for each triple. + +```python +def generate_chunks_for_triples(triples: List[Triple], table_data: Table) -> List[Dict[str, Any]] +``` + +--- + +## Helper Functions + +**get_cell_value** + Retrieves the cell value for a specific entity type and row. + +```python +def get_cell_value(entity_type: str, row: Row, table_data: Table) -> Optional[str] +``` + +**get_label** + Provides a label for an entity type, returning 'Document' if applicable. + +```python +def get_label(entity_type: str) -> str +``` + +**generate_chunks_for_triple** + Generates chunks for a single triple, associating content snippets with it. + +```python +def generate_chunks_for_triple(triple: Triple, table_data: Table) -> List[Dict[str, Any]] +``` + +--- + +## Usage + +The primary function to use is `process_table_and_generate_triples`, which processes table data, generates a schema, and creates triples. + +```python +table_data = Table(...) # Your table data +export_data = await process_table_and_generate_triples(table_data) +``` + +--- + +## Configuration + +The Graph Service relies on configurations set for processing data, schema generation, and LLM integration. Adjust these settings in the application’s configuration file. + +--- + +## Error Handling + +The Graph Service includes logging and error handling for: + +- Data processing errors +- LLM service unavailability +- Table parsing and triple generation issues + +Errors are logged for debugging, and exceptions are managed to maintain stability. Logs are categorized by levels (`INFO`, `WARNING`, `ERROR`) to trace processing steps. + +--- + +## Dependencies + +- **whyhow**: Provides `Node`, `Relation`, and `Triple` classes. +- **app.core.dependencies**: Includes `get_llm_service` for LLM integration. +- **app.services.llm_service**: Uses `generate_schema` for schema generation. 
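+
+To make the service's output concrete, here is a minimal, hedged sketch that assembles a single triple by hand, using the models described below (all values are illustrative, not taken from a real run):
+
+```python
+from app.models.graph import Node, Relation, Triple
+
+# One subject-predicate-object relationship, of the kind generate_triples emits
+triple = Triple(
+    triple_id="t1",  # illustrative ID
+    head=Node(label="Company", name="Disney"),
+    tail=Node(label="Offerings", name="TV shows"),
+    relation=Relation(name="Offers"),
+    chunk_ids=[],  # IDs linking the triple to its supporting content chunks
+)
+```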
+ +--- + +## Models + +### Chunk + +Represents a chunk of content with associated metadata. + +```python +from app.models.graph import Chunk + +chunk = Chunk( + chunk_id="chunk1", + content="John lives in New York.", + page=1, + triple_id="123" +) +``` + +Attributes: + +- `chunk_id` (str): Unique identifier for the chunk. +- `content` (str): The content of the chunk. +- `page` (Union[int, str]): The page number or identifier. +- `triple_id` (str): The ID of the associated triple. + +--- + +### Document + +Represents a document in the system. + +```python +from app.models.graph import Document + +doc = Document( + id="doc1", + name="New York Travel Guide" +) +``` + +Attributes: + +- `id` (str): Unique identifier for the document. +- `name` (str): The name of the document. + +--- + +### Node + +Represents a node in the knowledge graph. + +```python +from app.models.graph import Node + +node = Node( + label="Person", + name="John" +) +``` + +Attributes: + +- `label` (str): The type of entity (e.g., "Person"). +- `name` (str): The name of the node. + +--- + +### Relation + +Represents a relationship between two nodes in the knowledge graph. + +```python +from app.models.graph import Relation + +relation = Relation( + name="lives_in" +) +``` + +Attributes: + +- `name` (str): The name of the relation (e.g., "lives_in"). + +--- + +### Triple + +Represents a triple in the knowledge graph. + +```python +from app.models.graph import Triple + +triple = Triple( + triple_id="123", + head=Node(label="Person", name="John"), + tail=Node(label="City", name="New York"), + relation=Relation(name="lives_in"), + chunk_ids=["chunk1"] +) +``` + +Attributes: + +- `triple_id` (str): Unique identifier for the triple. +- `head` (`Node`): The head node of the triple. +- `tail` (`Node`): The tail node of the triple. +- `relation` (`Relation`): The relation between the head and tail nodes. +- `chunk_ids` (List[str]): List of associated chunk IDs. + +--- + +### ExportData + +Represents the exported data containing triples and content chunks. + +```python +from app.models.graph import ExportData, Triple, Node, Relation, Chunk + +export_data = ExportData( + triples=[ + Triple( + triple_id="123", + head=Node(label="Person", name="John"), + tail=Node(label="City", name="New York"), + relation=Relation(name="lives_in"), + chunk_ids=["chunk1", "chunk2"] + ) + ], + chunks=[ + { + "chunk_id": "chunk1", + "content": "John lives in New York.", + "page": 1, + "triple_id": "123" + } + ] +) +``` + +Attributes: + +- `triples` (List[`Triple`]): List of triples in the exported data. +- `chunks` (List[Dict[str, Any]]): List of chunks in the exported data, where each chunk includes `chunk_id`, `content`, `page`, and `triple_id`. diff --git a/backend/docs/services/llm_service.md b/backend/docs/services/llm_service.md new file mode 100644 index 0000000..93d3c5a --- /dev/null +++ b/backend/docs/services/llm_service.md @@ -0,0 +1,257 @@ +# LLM Service + +The LLM Service generates responses from a language model (LLM) based on various prompts and input data. It supports query decomposition, schema generation, and keyword extraction. + +--- + +## Core Functions + +**generate_response** + Generates a response from the LLM based on a query and specified format. + +```python +async def generate_response(llm_service: LLMService, query: str, chunks: str, rules: list[Rule], format: Literal["int", "str", "bool", "int_array", "str_array"]) -> dict[str, Any] +``` + +**get_keywords** + Extracts keywords from a query using the LLM. 
+ +```python +async def get_keywords(llm_service: LLMService, query: str) -> dict[str, list[str] | None] +``` + +**get_similar_keywords** + Retrieves keywords similar to the provided list from text chunks. + +```python +async def get_similar_keywords(llm_service: LLMService, chunks: str, rule: list[str]) -> dict[str, Any] +``` + +**decompose_query** + Breaks down a complex query into simpler sub-queries. + +```python +async def decompose_query(llm_service: LLMService, query: str) -> dict[str, Any] +``` + +**generate_schema** + Generates a schema for a table based on column information and prompts. + +```python +async def generate_schema(llm_service: LLMService, data: Table) -> dict[str, Any] +``` + +--- + +## Helper Functions + +**\_get_str_rule_line** + Formats instructions for string-based rules, incorporating specific response rules into the prompt. + +```python +def _get_str_rule_line(str_rule: Rule | None, query: str) -> str +``` + +**\_get_int_rule_line** + Generates instructions for integer-based rules, specifying the item limit in responses. + +```python +def _get_int_rule_line(int_rule: Rule | None) -> str +``` + +--- + +## Usage + +These functions are used with an LLM service to handle queries, extract keywords, and generate schemas. + +```python +llm_service = get_llm_service() # Assume this function exists to get an LLM service +query = "What is the capital of France?" +chunks = "Paris is the capital and most populous city of France." +rules = [Rule(type="must_return", options=["Paris", "London", "Berlin"])] + +response = await generate_response(llm_service, query, chunks, rules, format="str") +print(response) # {'answer': 'Paris'} + +keywords = await get_keywords(llm_service, query) +print(keywords) # {'keywords': ['capital', 'France']} +``` + +--- + +## Configuration + +Settings for LLM-related operations, such as response formatting and rule handling, can be configured in the application's settings file. + +--- + +## Error Handling + +The LLM Service includes error handling and logging for: + +- Query processing errors +- Schema generation issues +- Keyword extraction failures + +Functions return either a dictionary with an error message or `None` in case of failure. Logging is enabled for key operations to aid in debugging. + +--- + +## Dependencies + +- **app.models.llm**: For response models like `BoolResponseModel`, `IntArrayResponseModel`, etc. +- **app.models.query**: For the `Rule` model. +- **app.schemas.graph**: For `Table` schema. +- **app.services.llm.base**: For `LLMService`. +- **app.services.llm.prompts**: For prompt templates like `BASE_PROMPT`, `SCHEMA_PROMPT`, etc. + +--- + +## Models + +### BoolResponseModel + +Validates boolean responses. + +```python +from app.models.llm import BoolResponseModel + +bool_response = BoolResponseModel(answer=True) +``` + +Attributes: + +- `answer` (Optional[bool]): The boolean answer to the query. + +--- + +### IntResponseModel + +Validates integer responses. + +```python +from app.models.llm import IntResponseModel + +int_response = IntResponseModel(answer=42) +``` + +Attributes: + +- `answer` (Optional[int]): The integer answer to the query. + +--- + +### IntArrayResponseModel + +Validates integer array responses. + +```python +from app.models.llm import IntArrayResponseModel + +int_array_response = IntArrayResponseModel(answer=[1, 2, 3]) +``` + +Attributes: + +- `answer` (Optional[List[int]]): The list of integer answers to the query. + +--- + +### StrArrayResponseModel + +Validates string array responses. 
+ +```python +from app.models.llm import StrArrayResponseModel + +str_array_response = StrArrayResponseModel(answer=["apple", "banana", "cherry"]) +``` + +Attributes: + +- `answer` (Optional[List[str]]): The list of string answers to the query. + +--- + +### StrResponseModel + +Validates string responses. + +```python +from app.models.llm import StrResponseModel + +str_response = StrResponseModel(answer="Hello, World!") +``` + +Attributes: + +- `answer` (Optional[str]): The string answer to the query. + +--- + +### KeywordsResponseModel + +Validates keyword responses. + +```python +from app.models.llm import KeywordsResponseModel + +keywords_response = KeywordsResponseModel(keywords=["AI", "machine learning", "data science"]) +``` + +Attributes: + +- `keywords` (Optional[List[str]]): The extracted keywords from the query. + +--- + +### SubQueriesResponseModel + +Validates sub-query responses. + +```python +from app.models.llm import SubQueriesResponseModel + +sub_queries_response = SubQueriesResponseModel(sub_queries=["What is AI?", "How does ML work?"]) +``` + +Attributes: + +- `sub_queries` (Optional[List[str]]): The decomposed sub-queries. + +--- + +### SchemaRelationship + +Represents a schema relationship. + +```python +from app.models.llm import SchemaRelationship + +schema_relationship = SchemaRelationship(head="Person", relation="works_at", tail="Company") +``` + +Attributes: + +- `head` (str): The head entity of the relationship. +- `relation` (str): The relation between the head and tail entities. +- `tail` (str): The tail entity of the relationship. + +--- + +### SchemaResponseModel + +Validates schema responses. + +```python +from app.models.llm import SchemaResponseModel, SchemaRelationship + +schema_response = SchemaResponseModel(relationships=[ + SchemaRelationship(head="Person", relation="lives_in", tail="City") +]) +``` + +Attributes: + +- `relationships` (Optional[List[`SchemaRelationship`]]): The relationships in the schema. diff --git a/backend/docs/services/query_service.md b/backend/docs/services/query_service.md new file mode 100644 index 0000000..b7de0fb --- /dev/null +++ b/backend/docs/services/query_service.md @@ -0,0 +1,184 @@ +# Query Service + +The `QueryService` processes various types of queries, leveraging vector database searches and language model responses. It supports `decomposition`, `hybrid`, and `simple_vector` query types. + +--- + +## Core Functions + +**get_vector_db_service** + Retrieves the vector database service instance. + +```python +async def get_vector_db_service() -> Any +``` + +**Returns**: + +- `Any`: An instance of the vector database service. + +**Raises**: + +- `ValueError`: If vector database service creation fails. + +**process_query** + Processes a query based on the specified type. + +```python +async def process_query( + query_type: Literal["decomposition", "hybrid", "simple_vector"], + query: str, + document_id: str, + rules: List[Rule], + format: Literal["int", "str", "bool", "int_array", "str_array"], + llm_service: LLMService, +) -> Dict[str, Any] +``` + +**Parameters**: + +- `query_type`: Type of query (`decomposition`, `hybrid`, or `simple_vector`). +- `query`: Query text to process. +- `document_id`: ID of the document to search. +- `rules`: Rules to apply during query processing. +- `format`: Desired answer format. +- `llm_service`: Language model service for generating responses. + +**Returns**: + +- `Dict[str, Any]`: Contains `answer` and relevant `chunks`. 
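+
+To make the signature concrete, a direct call might look like the following sketch (the service factory and document ID are placeholders, as in the Usage section below):
+
+```python
+llm_service = get_llm_service()  # assumed helper for obtaining an LLM service
+
+result = await process_query(
+    query_type="simple_vector",
+    query="What is the capital of France?",
+    document_id="doc123",  # placeholder document ID
+    rules=[],
+    format="str",
+    llm_service=llm_service,
+)
+# expected shape: {"answer": ..., "chunks": [...]}
+```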
+ +**decomposition_query** + Wrapper for processing a `decomposition` type query. + +```python +async def decomposition_query(...) -> Dict[str, Any] +``` + +**hybrid_query** + Wrapper for processing a `hybrid` type query. + +```python +async def hybrid_query(...) -> Dict[str, Any] +``` + +**simple_vector_query** + Wrapper for processing a `simple_vector` type query. + +```python +async def simple_vector_query(...) -> Dict[str, Any] +``` + +--- + +## Usage + +Each query type is implemented as a separate function but calls the main `process_query` function. + +Example: + +```python +llm_service = get_llm_service() # Assume this function exists +query = "What is the capital of France?" +document_id = "doc123" +rules = [Rule(type="must_return", options=["city name"])] + +result = await decomposition_query(query, document_id, rules, format="str", llm_service=llm_service) +print(result) # {'answer': 'Paris', 'chunks': [...]} +``` + +--- + +## Configuration + +Settings affecting the `QueryService` include: + +- **vector_db_provider**: Specifies the provider used by `VectorDBFactory` to create the vector database service. + +These configurations are accessed through `get_settings()` and can be customized in the application’s configuration file. + +--- + +## Error Handling + +The service includes error handling for: + +- Vector database service creation: Raises `ValueError` if creation fails. +- Relies on underlying services (`VectorDBService`, `LLMService`) for additional error handling during query processing and response generation. + +Errors are logged for debugging, and exceptions are managed to ensure service stability. + +--- + +## Dependencies + +- **app.core.dependencies**: Provides access to `get_llm_service` and `get_settings`. +- **app.models.query**: Defines `Rule`, used to apply constraints during query processing. +- **app.services.llm_service**: Provides `LLMService` and `generate_response` for response generation. +- **app.services.vector_db.factory**: Creates the vector database service with `VectorDBFactory`. + +--- + +## Models + +### Rule + +Represents a rule for query processing. + +```python +from app.models.query import Rule + +rule = Rule(type="must_return", options=["option1", "option2"]) +``` + +Attributes: + +- `type` (Literal["must_return", "may_return", "max_length"]): The type of rule. +- `options` (Optional[List[str]]): Possible options for the rule. +- `length` (Optional[int]): The length constraint for the rule. + +--- + +### Chunk + +Represents a chunk of content in a query response. + +```python +from app.models.query import Chunk + +chunk = Chunk(content="Sample content", page=1) +``` + +Attributes: + +- `content` (str): The content of the chunk. +- `page` (int): The page number where the chunk is found. + +--- + +### Answer + +Represents an answer to a query. + +```python +from app.models.query import Answer, Chunk + +chunk = Chunk(content="Sample content", page=1) +answer = Answer( + id="123", + document_id="doc1", + prompt_id="prompt1", + answer="Sample answer", + chunks=[chunk], + type="text" +) +``` + +Attributes: + +- `id` (str): Unique identifier for the answer. +- `document_id` (str): The ID of the document containing the answer. +- `prompt_id` (str): The ID of the prompt used to generate the answer. +- `answer` (Optional[Union[int, str, bool, List[int], List[str]]]): The actual answer content. +- `chunks` (List[`Chunk`]): List of chunks supporting the answer. +- `type` (str): The type of the answer. 
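+
+Because rules accept a length constraint, a `max_length` rule can cap list-style answers. A hedged sketch (the limit value is illustrative):
+
+```python
+from app.models.query import Rule
+
+# Restrict a list-formatted answer to at most three items (illustrative value)
+max_three = Rule(type="max_length", length=3)
+```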
diff --git a/backend/docs/testing/testing.md b/backend/docs/testing/testing.md
new file mode 100644
index 0000000..726b5d5
--- /dev/null
+++ b/backend/docs/testing/testing.md
@@ -0,0 +1,110 @@
+# Testing
+
+This guide covers how to run tests for the Knowledge Table project.
+
+## Running Tests
+
+We use pytest for our test suite. To run the tests, follow these steps:
+
+**1. Activate your virtual environment** (if not already activated):
+
+```sh
+source venv/bin/activate # On Windows use `venv\Scripts\activate`
+```
+
+**2. Navigate to the backend directory**:
+
+```sh
+cd path/to/knowledge-table/backend
+```
+
+**3. Run the tests**:
+
+```sh
+pytest
+```
+
+This command will discover and run all tests in the project.
+
+### Running Specific Tests
+
+To run tests in a specific file:
+
+```sh
+pytest tests/path/to/test_file.py
+```
+
+To run a specific test function:
+
+```sh
+pytest tests/path/to/test_file.py::test_function_name
+```
+
+## Coverage
+
+To run tests with coverage reporting:
+
+**1. Install pytest-cov** (if not already installed):
+
+```sh
+pip install pytest-cov
+```
+
+**2. Run tests with coverage**:
+
+```sh
+pytest --cov=app tests/
+```
+
+This will run the tests and display a coverage report in the terminal.
+
+**3. Generate an HTML coverage report**:
+
+```sh
+pytest --cov=app --cov-report=html tests/
+```
+
+This creates an `htmlcov` directory. Open `htmlcov/index.html` in a web browser to view the detailed coverage report.
+
+## Writing Tests
+
+When writing new tests:
+
+1. Place test files in the `tests` directory.
+2. Name test files with the prefix `test_`.
+3. Name test functions with the prefix `test_`.
+4. Use descriptive names for test functions to clearly indicate what they're testing.
+
+Example:
+
+```python
+# tests/test_document_service.py
+
+def test_upload_document_success():
+    # Test code here
+    pass
+
+def test_upload_document_invalid_file():
+    # Test code here
+    pass
+```
+
+## Continuous Integration
+
+We use GitHub Actions for continuous integration. The CI pipeline runs all tests automatically on every push and pull request.
+
+To view the CI results:
+
+1. Go to the GitHub repository.
+2. Click on the "Actions" tab.
+3. Select the workflow run you want to inspect.
+
+## Troubleshooting
+
+If you encounter any issues while running tests:
+
+1. Ensure your virtual environment is activated and all dependencies are installed.
+2. Check that your `.env` file is properly configured.
+3. Verify that you're in the correct directory when running the tests.
+
+If problems persist, please open an issue on the GitHub repository with details about the error you're encountering.
diff --git a/backend/docs/tutorials/extracting-data.md b/backend/docs/tutorials/extracting-data.md
new file mode 100644
index 0000000..7db2706
--- /dev/null
+++ b/backend/docs/tutorials/extracting-data.md
@@ -0,0 +1,124 @@
+# Tutorial
+
+This tutorial offers a brief walkthrough of Knowledge Table features.
+
+## Upload Documents
+
+You can drag and drop `.pdf` and `.txt` files into the Knowledge Table using the frontend. In this case, we'll upload several simple `.txt` files containing information about different public companies. When you upload them, they are automatically chunked and made available for querying.
+
+![upload](images/image-20.png)
+
+## Configure a Column
+
+Next, we'll configure a column to extract our first piece of data. We'll create a column called "Company" and write a prompt to extract the name of the company mentioned in each document. The column type is set to "text," and answer generation is enabled, instructing Knowledge Table to pull relevant information based on the prompt and format it as a string. Once done, we'll hit 'Save' to save the column.
+
+![configure](images/image-21.png)
+
+## Resolve Entities
+
+Before we run the column, we're going to add a few entity resolution rules to ensure that company names are resolved to the name we expect. To do this, we'll add a few global rules indicating that names like "Apple Inc." or "Google LLC" should be resolved to "Apple" and "Google."
+
+![resolve](images/image-7.png)
+
+## Run & Rerun Columns, Rows, and Cells
+
+Once the rules are set, we can run the column by right-clicking on the column header and selecting 'Rerun Column.' When you run a column, Knowledge Table will run the prompt for each row.
+
+You can also run a single row by right-clicking the document and selecting 'Rerun Row,' or run an individual cell by right-clicking the cell and selecting 'Rerun Cell.'
+
+**Rerun**
+![rerun-before](images/image-8.png)
+
+**In Progress**
+![rerun-during](images/image-9.png)
+
+**Result**
+![rerun-after](images/image-10.png)
+
+As we can see, the company names have been extracted from the documents and resolved according to the rules we set. If you would like to see the entities that have been resolved, click 'Resolved Entities' and a drawer will appear showing the resolved entities.
+
+![resolved](images/image-11.png)
+
+To undo the entity resolution, just click the red 'x' next to the resolved entity, and the answer will revert to the original text.
+
+## View Chunks
+
+You can view the chunks from which answers have been extracted by right-clicking on a cell and selecting 'View Chunks.' This will display the relevant chunks related to the generated answer.
+
+![chunks](images/image-19.png)
+
+## Chain Extraction
+
+You can chain extraction by creating a new column and writing a prompt that references another column. In this case, we'll create a new column called "Employees," make it type "number," and write a prompt that references the "Company" column by using '@'. This will automatically load the answer from the "Company" column into the prompt.
+
+![chain-company](images/image-13.png)
+
+Then, we're going to add one more column called "Offerings" and write a prompt that references the "Company" column. This time, we'll use the "Company" column to extract the products and services offered by each company. For this column, we'll specify List of Text as the column type; this way, our output will be a list of strings extracted from the document.
+
+![chain-offering](images/image-14.png)
+
+## Filter
+
+Let's say we want to see the companies that offer consumer electronics products. We can add a filter on the "Offerings" column by clicking on the column header and selecting "Filter." We can then write a prompt to filter the column to companies that offer consumer electronics products.
+
+![filter-before](images/image-15.png)
+
+![filter-after](images/image-16.png)
+
+## Split Cells Into Rows
+
+We can split the answer into rows to perform further analysis on each individual product or service. To do this, we can right-click the cell and select "Split into rows." This will split each of the cells into multiple rows, each containing a single product or service.
+
+![split-before](images/image-17.png)
+
+![split-after](images/image-18.png)
+
+Now, we can perform additional extraction and analysis on each product or service by chaining extraction like we did before.
+
+## Export
+
+Once we've extracted all the data we need, we can export it in a couple of different ways.
+
+**Export CSV**
+
+You can download the data as a CSV file by clicking "Download CSV," and you'll get a file that looks like this:
+
+```
+Document,Company,Employees,Offerings
+"Disney.txt",The Walt Disney Company,223000,TV shows
+"Disney.txt",The Walt Disney Company,223000,movies
+"Disney.txt",The Walt Disney Company,223000,Amusement parks
+"Netflix.txt","Netflix, Inc.",12800,TV shows
+"Netflix.txt","Netflix, Inc.",12800,movies
+"Netflix.txt","Netflix, Inc.",12800,documentaries
+"Tesla.txt","Tesla, Inc.",99000,Electric vehicles
+```
+
+**Export Triples**
+
+You can also export the data in triples format by clicking "Download Triples." When you do, the table data will be sent to the backend, where a schema will be generated, triples will be built according to that schema, and chunks will be linked to each triple accordingly. The output will look something like this:
+
+```json
+{
+  "triples": [
+    {
+      "triple_id": "t6ecedb7b-ef77-4b86-8e0d-37d928f1d475",
+      "head": {
+        "label": "Company",
+        "name": "The Walt Disney Company",
+        "properties": { "document": "Disney.txt" }
+      },
+      "tail": {
+        "label": "Offerings",
+        "name": "TV shows",
+        "properties": { "document": "Disney.txt" }
+      },
+      "relation": { "name": "Offers" },
+      "chunk_ids": []
+    },
+    ...],
+  "chunks": [...]
+}
+```
+
+You can use these triples to [build a knowledge graph](https://whyhow-ai.github.io/whyhow-sdk-docs/examples/create_graph_from_knowledge_table/) using the WhyHow SDK.
diff --git a/backend/docs/tutorials/images/image-10.png b/backend/docs/tutorials/images/image-10.png
new file mode 100644
index 0000000..7b6e38b
Binary files /dev/null and b/backend/docs/tutorials/images/image-10.png differ
diff --git a/backend/docs/tutorials/images/image-11.png b/backend/docs/tutorials/images/image-11.png
new file mode 100644
index 0000000..a271ff5
Binary files /dev/null and b/backend/docs/tutorials/images/image-11.png differ
diff --git a/backend/docs/tutorials/images/image-13.png b/backend/docs/tutorials/images/image-13.png
new file mode 100644
index 0000000..95339cb
Binary files /dev/null and b/backend/docs/tutorials/images/image-13.png differ
diff --git a/backend/docs/tutorials/images/image-14.png b/backend/docs/tutorials/images/image-14.png
new file mode 100644
index 0000000..0f25f08
Binary files /dev/null and b/backend/docs/tutorials/images/image-14.png differ
diff --git a/backend/docs/tutorials/images/image-15.png b/backend/docs/tutorials/images/image-15.png
new file mode 100644
index 0000000..0fec6a8
Binary files /dev/null and b/backend/docs/tutorials/images/image-15.png differ
diff --git a/backend/docs/tutorials/images/image-16.png b/backend/docs/tutorials/images/image-16.png
new file mode 100644
index 0000000..fbd2990
Binary files /dev/null and b/backend/docs/tutorials/images/image-16.png differ
diff --git a/backend/docs/tutorials/images/image-17.png b/backend/docs/tutorials/images/image-17.png
new file mode 100644
index 0000000..d981cb8
Binary files /dev/null and b/backend/docs/tutorials/images/image-17.png differ
diff --git a/backend/docs/tutorials/images/image-18.png b/backend/docs/tutorials/images/image-18.png
new file mode 100644
index 0000000..cb94fad
Binary files /dev/null and b/backend/docs/tutorials/images/image-18.png differ
diff
--git a/backend/docs/tutorials/images/image-19.png b/backend/docs/tutorials/images/image-19.png new file mode 100644 index 0000000..2603f7a Binary files /dev/null and b/backend/docs/tutorials/images/image-19.png differ diff --git a/backend/docs/tutorials/images/image-20.png b/backend/docs/tutorials/images/image-20.png new file mode 100644 index 0000000..3b6906b Binary files /dev/null and b/backend/docs/tutorials/images/image-20.png differ diff --git a/backend/docs/tutorials/images/image-21.png b/backend/docs/tutorials/images/image-21.png new file mode 100644 index 0000000..a508c19 Binary files /dev/null and b/backend/docs/tutorials/images/image-21.png differ diff --git a/backend/docs/tutorials/images/image-7.png b/backend/docs/tutorials/images/image-7.png new file mode 100644 index 0000000..68ef2ca Binary files /dev/null and b/backend/docs/tutorials/images/image-7.png differ diff --git a/backend/docs/tutorials/images/image-8.png b/backend/docs/tutorials/images/image-8.png new file mode 100644 index 0000000..3960440 Binary files /dev/null and b/backend/docs/tutorials/images/image-8.png differ diff --git a/backend/docs/tutorials/images/image-9.png b/backend/docs/tutorials/images/image-9.png new file mode 100644 index 0000000..9bea7da Binary files /dev/null and b/backend/docs/tutorials/images/image-9.png differ diff --git a/backend/mkdocs.yml b/backend/mkdocs.yml new file mode 100644 index 0000000..dde54ed --- /dev/null +++ b/backend/mkdocs.yml @@ -0,0 +1,90 @@ +site_name: Knowledge Table Documentation +site_url: https://whyhow-ai.github.io/knowledge-table/ +repo_url: https://github.com/whyhow-ai/knowledge-table +edit_uri: blob/main/backend/docs/ + +theme: + name: material + logo: images/wh-logo.png + features: + - navigation.tabs + - navigation.sections + - toc.integrate + - search.suggest + - search.highlight + - content.tabs.link + - content.code.annotation + - content.code.copy + palette: + - scheme: default + toggle: + icon: material/brightness-7 + name: Switch to dark mode + - scheme: slate + toggle: + icon: material/brightness-4 + name: Switch to light mode + +nav: + - Getting Started: + - Welcome to Knowledge Table: index.md + - Installation: getting-started/installation.md + - Docker: getting-started/docker.md + - Tutorial: tutorials/extracting-data.md + - API Reference: + - Overview: api/overview.md + - Endpoints: + - Document: api/v1/endpoints/document.md + - Graph: api/v1/endpoints/graph.md + - Query: api/v1/endpoints/query.md + - Developer Guide: + - Architecture Overview: architecture.md + - Services: + - Document Service: services/document_service.md + - Graph Service: services/graph_service.md + - LLM Service: services/llm_service.md + - Query Service: services/query_service.md + - Extending the Backend: + - Overview: extending/overview.md + - Add LLM Services: extending/llm_services.md + - Add Vector DB Services: extending/vector_databases.md + - Add Document Loaders: extending/document_loaders.md + - Testing: + - Run Tests: testing/testing.md + - Contributing: CONTRIBUTING.md + +plugins: + - search + - mkdocstrings: + handlers: + python: + rendering: + show_source: false + - git-revision-date-localized: + fallback_to_build_date: true + - minify: + minify_html: true + +markdown_extensions: + - admonition + - attr_list + - codehilite + - footnotes + - pymdownx.betterem: + - pymdownx.caret + - pymdownx.details + - pymdownx.highlight + - pymdownx.inlinehilite + - pymdownx.magiclink + - pymdownx.mark + - pymdownx.smartsymbols + - pymdownx.superfences: + custom_fences: + - name: 
mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format + - pymdownx.tasklist + - pymdownx.tilde + - pymdownx.emoji: + emoji_index: !!python/name:pymdownx.emoji.twemoji + emoji_generator: !!python/name:pymdownx.emoji.to_svg diff --git a/backend/pyproject.toml b/backend/pyproject.toml index 1d1fb89..9808a83 100644 --- a/backend/pyproject.toml +++ b/backend/pyproject.toml @@ -118,6 +118,17 @@ dev = [ "pytest-html", "pytest-mock", ] +docs = [ + "mkdocs", + "mkdocs-material", + "mkdocs-git-revision-date-localized-plugin", + "mkdocs-autorefs", + "mkdocs-get-deps", + "mkdocs-material-extensions", + "mkdocs-minify-plugin", + "mkdocstrings", + "mkdocstrings-python" +] [project.scripts] knowledge-table-locate = "app.main:locate"
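
With the `docs` extra defined above, one possible local workflow for previewing the documentation is the following sketch (assuming commands are run from the repository root, mirroring the CI steps):

```sh
cd backend
pip install .[docs]   # install MkDocs plus the plugins listed in the docs extra
mkdocs serve          # preview the site locally at http://127.0.0.1:8000
```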