Skip to content

Commit af5cdc7

Browse files
committed
Initial ingest logic
1 parent 029485e commit af5cdc7

21 files changed

+620
-0
lines changed

.gitignore

Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,108 @@
1+
# Byte-compiled / optimized / DLL files
2+
__pycache__/
3+
app/__pycache__/
4+
*.py[cod]
5+
*$py.class
6+
*.pyc
7+
8+
# C extensions
9+
*.so
10+
11+
# VS Code
12+
.vscode/
13+
14+
# Distribution / packaging
15+
.Python
16+
build/
17+
develop-eggs/
18+
dist/
19+
downloads/
20+
eggs/
21+
.eggs/
22+
lib64/
23+
parts/
24+
sdist/
25+
var/
26+
wheels/
27+
*.egg-info/
28+
.installed.cfg
29+
*.egg
30+
MANIFEST
31+
32+
# PyInstaller
33+
# Usually these files are written by a python script from a template
34+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
35+
*.manifest
36+
*.spec
37+
38+
# Installer logs
39+
pip-log.txt
40+
pip-delete-this-directory.txt
41+
42+
# Unit test / coverage reports
43+
htmlcov/
44+
.tox/
45+
.coverage
46+
.coverage.*
47+
.cache
48+
nosetests.xml
49+
coverage.xml
50+
*.cover
51+
.hypothesis/
52+
.pytest_cache/
53+
54+
# Translations
55+
*.mo
56+
*.pot
57+
58+
# Django stuff:
59+
*.log
60+
local_settings.py
61+
db.sqlite3
62+
63+
# Flask stuff:
64+
instance/
65+
.webassets-cache
66+
67+
# Scrapy stuff:
68+
.scrapy
69+
70+
# Sphinx documentation
71+
docs/_build/
72+
73+
# PyBuilder
74+
target/
75+
76+
# Jupyter Notebook
77+
.ipynb_checkpoints
78+
79+
# pyenv
80+
.python-version
81+
82+
# celery beat schedule file
83+
celerybeat-schedule
84+
85+
# SageMath parsed files
86+
*.sage.py
87+
88+
# Environments
89+
.env
90+
.venv
91+
env/
92+
venv/
93+
ENV/
94+
env.bak/
95+
venv.bak/
96+
97+
# Spyder project settings
98+
.spyderproject
99+
.spyproject
100+
101+
# Rope project settings
102+
.ropeproject
103+
104+
# mkdocs documentation
105+
/site
106+
107+
# mypy
108+
.mypy_cache/

README.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,17 @@ An open-source datalake to ingest, organize and efficiently store all data contr
55
### Architecture
66
The core datalake architecture is a simple HTTP API (written in FastAPI) that ingests JSON in a fixed schema, performs some integrity checking and stores it. This JSON is transformed into storage efficient Arrow/Parquet files and stored in a target filesystem. A light-weight index of the entire parquet filesystem is maintained with DuckDB.
77

8+
#### Data formats
9+
- Data is stored on disk in parquet files in subdirectories organized by day. These parquet files have a standardized schema allowing for easy manipulation in any programming language.
10+
-
11+
812
### Open sourcing the data.
913
Nomic AI will provide automatic snapshots of this raw parquet data.
1014
You will be able to interact with the snapshots:
1115
- In automatic [Atlas](https://atlas.nomic.ai/) maps over its raw, cleaned and curated form.
1216
- Through highly-processed downloads where the data has been curated, de-duplicated and cleaned for LLM training/finetuning.
1317

18+
1419
### Data Privacy
1520

1621

api/.gitignore

Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,108 @@
1+
# Byte-compiled / optimized / DLL files
2+
__pycache__/
3+
app/__pycache__/
4+
*.py[cod]
5+
*$py.class
6+
*.pyc
7+
8+
# C extensions
9+
*.so
10+
11+
# VS Code
12+
.vscode/
13+
14+
# Distribution / packaging
15+
.Python
16+
build/
17+
develop-eggs/
18+
dist/
19+
downloads/
20+
eggs/
21+
.eggs/
22+
lib64/
23+
parts/
24+
sdist/
25+
var/
26+
wheels/
27+
*.egg-info/
28+
.installed.cfg
29+
*.egg
30+
MANIFEST
31+
32+
# PyInstaller
33+
# Usually these files are written by a python script from a template
34+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
35+
*.manifest
36+
*.spec
37+
38+
# Installer logs
39+
pip-log.txt
40+
pip-delete-this-directory.txt
41+
42+
# Unit test / coverage reports
43+
htmlcov/
44+
.tox/
45+
.coverage
46+
.coverage.*
47+
.cache
48+
nosetests.xml
49+
coverage.xml
50+
*.cover
51+
.hypothesis/
52+
.pytest_cache/
53+
54+
# Translations
55+
*.mo
56+
*.pot
57+
58+
# Django stuff:
59+
*.log
60+
local_settings.py
61+
db.sqlite3
62+
63+
# Flask stuff:
64+
instance/
65+
.webassets-cache
66+
67+
# Scrapy stuff:
68+
.scrapy
69+
70+
# Sphinx documentation
71+
docs/_build/
72+
73+
# PyBuilder
74+
target/
75+
76+
# Jupyter Notebook
77+
.ipynb_checkpoints
78+
79+
# pyenv
80+
.python-version
81+
82+
# celery beat schedule file
83+
celerybeat-schedule
84+
85+
# SageMath parsed files
86+
*.sage.py
87+
88+
# Environments
89+
.env
90+
.venv
91+
env/
92+
venv/
93+
ENV/
94+
env.bak/
95+
venv.bak/
96+
97+
# Spyder project settings
98+
.spyderproject
99+
.spyproject
100+
101+
# Rope project settings
102+
.ropeproject
103+
104+
# mkdocs documentation
105+
/site
106+
107+
# mypy
108+
.mypy_cache/

api/Dockerfile.buildkit

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
# syntax=docker/dockerfile:1.0.0-experimental
2+
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8
3+
4+
ARG image_version=gpt4all_datalake-0.0.1
5+
ENV IMAGE_VERSION=$image_version
6+
7+
# Put first so anytime this file changes other cached layers are invalidated.
8+
COPY api/requirements.txt /requirements.txt
9+
10+
RUN pip install --upgrade pip
11+
12+
# Run various pip install commands.
13+
RUN --mount=type=ssh pip install -r /requirements.txt && \
14+
rm -Rf /root/.cache && rm -Rf /tmp/pip-install*
15+
16+
17+
# Finally, copy app and client.
18+
COPY api/app /app

api/app/__init__.py

Whitespace-only changes.

api/app/api_v1/api.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
from api_v1.routes import health, ingest
2+
from fastapi import APIRouter
3+
4+
router = APIRouter()
5+
6+
router.include_router(health.router)
7+
router.include_router(ingest.router)

api/app/api_v1/events.py

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
import logging
2+
3+
from fastapi import HTTPException
4+
from fastapi.responses import JSONResponse
5+
from starlette.requests import Request
6+
7+
from api_v1.settings import settings
8+
9+
log = logging.getLogger(__name__)
10+
11+
startup_msg_fmt = """
12+
GPT4All Datalake started with {settings.app_environment} with version {settings.image_version}.
13+
"""
14+
15+
16+
async def on_http_error(request: Request, exc: HTTPException):
17+
return JSONResponse({'detail': exc.detail}, status_code=exc.status_code)
18+
19+
20+
async def on_startup(app):
21+
startup_msg = startup_msg_fmt.format(settings=settings)
22+
log.info(startup_msg)
23+
24+
25+
def startup_event_handler(app):
26+
async def start_app() -> None:
27+
await on_startup(app)
28+
29+
return start_app
30+

api/app/api_v1/models/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
from .models import *

api/app/api_v1/models/models.py

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
from enum import Enum
2+
3+
from typing import List, Optional
4+
from pydantic import BaseModel, Field
5+
6+
7+
class SuccessResponse(BaseModel):
8+
result = 'ok'
9+
10+
11+
class Rating(str, Enum):
12+
positive = "positive"
13+
negative = "negative"
14+
15+
class IngestMetadata(BaseModel):
16+
source: Optional[str] = Field(None, description='The source contributing the ingest data.', example='gpt4all-chat')
17+
submitter_id: str = Field(None, description='An identifier for the entity that submitted the ingest', example='EliteHacker#42')
18+
19+
class ChatItem(BaseModel):
20+
content: str = Field(..., description='The textual contents of the chat turn', example='Hello, how can I assist you today?')
21+
role: Optional[str] = Field(None, description='The role of the entity that generated content.', example='assistant')
22+
rating: Optional[Rating] = Field(None, description='A rating of the chat item', example='negative')
23+
edited_content: Optional[str] = Field(None, description='An optional edited version of the content.', example='Hello, how may I assist you today?')
24+
25+
26+
class ChatIngestRequest(IngestMetadata):
27+
agent_id: str = Field(..., description='An identifier for the entity in the conversation', example='gpt4all-j-v1.2-jazzy')
28+
conversation: List[ChatItem] = Field(..., description='The conversation history.',
29+
example=[{'content': 'Hello, how can I assist you today?',
30+
'role': 'assistant',
31+
'rating': 'negative',
32+
'edited_content': 'Hello, how may I assist you today?'},
33+
{'content': 'Write me python code to contribute data to the GPT4All Datalake!', 'role': 'user'}])
34+
35+
class ChatIngestResponse(BaseModel):
36+
ingest_id: str = Field(..., description='The id of the ingest', example='d920b363-abab-4d19-a5b6-89d182115f82')

api/app/api_v1/routes/__init__.py

Whitespace-only changes.

0 commit comments

Comments
 (0)