feat: Add example python project #197
base: main
Conversation
Revises:
Create Date: 2024-11-04 11:47:57.345379

"""
This file was generated with `alembic revision --autogenerate`, which compares the database to the SQLAlchemy models and generates a naive diff. It works quite well for adding new tables, but of course it doesn't work for our views.
SELECT ai.create_vectorizer(
    'code_files'::regclass,
    destination => 'code_files_embeddings',
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(
        'contents',
        chunk_size => 1000,
        chunk_overlap => 200
    ),
    formatting => ai.formatting_python_template(
        'File: $file_name\n\nContents:\n$chunk'
    )
);
""")
This is how I am creating a vectorizer via migrations. It works, but it's not as pretty as the native Alembic operations. It should be possible to provide nice wrappers around these somehow, e.g. as a custom Alembic operation (sketched below).
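A minimal sketch of what such a wrapper could look like, using Alembic's custom-operation plugin API. To be clear, `create_vectorizer` as an operation name and the rendered SQL are assumptions here, not an existing pgai API:

```python
from alembic.operations import MigrateOperation, Operations


@Operations.register_operation("create_vectorizer")
class CreateVectorizerOp(MigrateOperation):
    """Hypothetical wrapper that renders ai.create_vectorizer() in a migration."""

    def __init__(self, source_table: str, **params: str):
        self.source_table = source_table
        self.params = params

    @classmethod
    def create_vectorizer(cls, operations, source_table: str, **params: str):
        return operations.invoke(cls(source_table, **params))


@Operations.implementation_for(CreateVectorizerOp)
def create_vectorizer(operations, operation: CreateVectorizerOp) -> None:
    # Render keyword params as Postgres named arguments; values are assumed
    # to be pre-rendered SQL fragments such as "ai.embedding_openai(...)".
    args = ", ".join(f"{key} => {value}" for key, value in operation.params.items())
    operations.execute(
        f"SELECT ai.create_vectorizer('{operation.source_table}'::regclass, {args})"
    )
```

A migration could then call `op.create_vectorizer(...)` instead of a raw `op.execute`, and a matching `drop_vectorizer` operation would give downgrades for free.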
examples/code-llm-sync/db/models.py
class CodeFileEmbedding(Base):
    """
    Model representing the view created by pgai vectorizer.
    This maps to the automatically created view 'code_files_embeddings'
    which joins the original code_files table with its embeddings.
    """

    __tablename__ = "code_files_embeddings"

    # We make this a view model by setting it as such
    __table_args__ = {"info": {"is_view": True}}

    # Original CodeFile columns
    id = Column(Integer, ForeignKey("code_files.id"), primary_key=True)
    file_name = Column(String(255), nullable=False)
    updated_at = Column(DateTime, nullable=True)
    contents = Column(Text, nullable=True)

    # Embedding specific columns added by pgai
    embedding_uuid = Column(String, primary_key=True)
    chunk = Column(Text, nullable=False)
    embedding = Column(
        Vector(768), nullable=False
    )  # 768 dimensions for text-embedding-3-small
    chunk_seq = Column(Integer, nullable=False)

    # Relationship back to original CodeFile
    code_file = relationship("CodeFile", foreign_keys=[id])

    @override
    def __repr__(self) -> str:
        return f"<CodeFileEmbedding(file_name='{self.file_name}', chunk_seq={self.chunk_seq})>"
It is nice that this works at all, but having to define all fields twice feels a bit ugly. We could maybe provide a helper decorator or similar here, like an `@embedded` on the original model that automatically injects the embedding and chunk fields, etc. (rough sketch below).
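A rough sketch of what such a helper could look like. None of this exists in pgai today; the decorator name, the simplified column copying (foreign keys are dropped), and attaching the view model as `Model.Embedding` are all assumptions:

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column, Integer, String, Text


def embedded(view_name: str, dimensions: int):
    """Hypothetical decorator: derive the pgai embeddings-view model from the
    source model, so the shared fields are only declared once."""

    def wrap(model_cls):
        attrs = {
            "__tablename__": view_name,
            "__table_args__": {"info": {"is_view": True}},
            # Columns pgai adds to the embeddings view.
            "embedding_uuid": Column(String, primary_key=True),
            "chunk": Column(Text, nullable=False),
            "chunk_seq": Column(Integer, nullable=False),
            "embedding": Column(Vector(dimensions), nullable=False),
        }
        # Re-declare the source table's columns on the view model; Column
        # objects can't be shared between mapped classes, so build new ones.
        for col in model_cls.__table__.columns:
            attrs[col.name] = Column(
                col.type, primary_key=col.primary_key, nullable=col.nullable
            )
        base = model_cls.__bases__[0]  # assumes the model derives directly from Base
        model_cls.Embedding = type(f"{model_cls.__name__}Embedding", (base,), attrs)
        return model_cls

    return wrap
```

Usage would then be `@embedded("code_files_embeddings", 768)` on `CodeFile`, with the view model available as `CodeFile.Embedding`.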
def parse_embedding_string(embedding_str: str) -> np.ndarray:
    """Convert a pgai embedding string to a numpy array"""
    # Remove brackets and split on commas
    values = embedding_str.strip("[]").split(",")
    # Convert to float array
    return np.array([float(x) for x in values], dtype=">f4")
This function is currently not in use. But when I originally called `openai_embed()`, it simply returned a string representation of the vector (float array), which is extremely unfortunate if you want to use the embedding in any way afterwards. Maybe I set up pgvector-python the wrong way, though.
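For reference, pgvector-python can register a codec on each asyncpg connection so vector values come back as numpy arrays instead of strings. A sketch of that setup, assuming an async engine named `engine`:

```python
from pgvector.asyncpg import register_vector
from sqlalchemy import event


# Register pgvector's asyncpg codec on every new connection so that vector
# columns (and vector-typed function results) are decoded into numpy arrays
# rather than returned as their string representation.
@event.listens_for(engine.sync_engine, "connect")
def register_vector_codec(dbapi_connection, connection_record):
    dbapi_connection.run_async(register_vector)
```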
examples/code-llm-sync/main.py
class OpenAIEmbed(FunctionElement):
    inherit_cache = True


class PGAIFunction(expression.FunctionElement):
    def __init__(self, model: str, text: str, dimensions: int):
        self.model = model
        self.text = literal(text)
        self.dimensions = dimensions
        super().__init__()


@compiles(PGAIFunction)
def _compile_pgai_embed(element, compiler, **kw):
    return "ai.openai_embed('%s', %s, dimensions => %d)" % (
        element.model,
        compiler.process(element.text),
        element.dimensions,
    )
This is already part of a proposed solution: it was quite hard to pass named arguments through SQLAlchemy. It does allow basically arbitrary function calls, but I couldn't define the `dimensions` parameter somehow (maybe this is some SQL magic I don't understand).
This implementation also has another benefit: you can define return value and parameter types (sketch below). I think this is low-hanging fruit, mostly a bit of busywork to build, but with some nice benefits.
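A sketch of the return-type half, building on the `PGAIFunction` from the diff above (so this inherits its compiler hook): declaring a pgvector type on the function element makes SQLAlchemy run the result through pgvector's result processor instead of handing back a string.

```python
from pgvector.sqlalchemy import Vector


class TypedPGAIFunction(PGAIFunction):
    """PGAIFunction with a declared return type, so query results come back
    as vectors rather than raw strings. Sketch only; not part of pgai."""

    inherit_cache = True

    def __init__(self, model: str, text: str, dimensions: int):
        super().__init__(model, text, dimensions)
        # FunctionElement consults .type to pick the result processor.
        self.type = Vector(dimensions)
```

Something like `await session.scalar(select(TypedPGAIFunction("text-embedding-3-small", "hello", 768)))` would then return a parsed vector directly.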
results = await session.execute(
    select(
        CodeFileEmbedding.file_name,
        CodeFileEmbedding.chunk,
        CodeFileEmbedding.chunk_seq,
        similarity_score,
    )
    .order_by(similarity_score.desc())
    .limit(limit)
)
Btw, does putting the function in the query twice like here make it execute twice? Or is the optimizer smart enough to understand that the result is the same, since it's an idempotent function, and execute it only once?
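One way to sidestep the question, assuming `similarity_score` is a plain expression and with `query_embedding` standing in for the query vector, is to give it a label; SQLAlchemy renders the label name in `ORDER BY`, so the expression appears only once in the generated SQL:

```python
from sqlalchemy import select

# Label the expression once; ORDER BY then references the label instead of
# repeating (and potentially re-evaluating) the full expression.
similarity_score = (
    1 - CodeFileEmbedding.embedding.cosine_distance(query_embedding)
).label("similarity")

stmt = (
    select(CodeFileEmbedding.file_name, CodeFileEmbedding.chunk, similarity_score)
    .order_by(similarity_score.desc())
    .limit(limit)
)
```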
# Test database configuration
TEST_DB_URL = "postgresql+asyncpg://postgres:postgres@localhost/postgres"
project_root = Path(__file__).parent.parent
Testing in general was a bit of a pain, mostly because I struggled with setting up migrations and the database. On the other hand, I am not trying to mock anything here, which makes it a little easier.
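For anyone attempting the same, a minimal sketch of the kind of session fixture that runs the Alembic migrations against the test database before the tests. The fixture name and config handling are assumptions about this example's layout; `TEST_DB_URL` and `project_root` come from the conftest snippet above, and the async driver URL assumes an async-aware `env.py` (Alembic's async template):

```python
import pytest
from alembic import command
from alembic.config import Config


@pytest.fixture(scope="session", autouse=True)
def migrated_database():
    """Bring the test database to the latest migration before any tests run."""
    config = Config(str(project_root / "alembic.ini"))
    config.set_main_option("sqlalchemy.url", TEST_DB_URL)
    command.upgrade(config, "head")
    yield
    command.downgrade(config, "base")
```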
    docker_client: DockerClient, load_dotenv
) -> Generator[Container, None, None]:
    """Start vectorizer worker after database is ready"""
    # Configure container
    container_config = {
        "image": "timescale/pgai-vectorizer-worker:0.1.0",
        "environment": {
            "PGAI_VECTORIZER_WORKER_DB_URL": "postgres://postgres:postgres@host.docker.internal:5432/postgres",
            "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
        },
        "command": ["--poll-interval", "5s"],
        "extra_hosts": {
            "host.docker.internal": "host-gateway"
        },  # Allow container to connect to host postgres
    }

    # Start container
    container = docker_client.containers.run(**container_config, detach=True)

    # Wait for container to be running
    container.reload()
    assert container.status == "running"

    yield container

    # Cleanup
    container.stop()
    container.remove()
This is probably not how you'd actually want to test this; instead, you could depend on pgai and call the vectorizer worker from there. But Claude came up with this (like most of the code in this PR) and it worked, so I'm keeping it for now.
I built a small demo application to figure out how well pgai integrates with existing Python tooling. It's not done, but the first endpoint works. It's a FastAPI service using SQLAlchemy that enables semantic code search via pgai features like automatic embeddings.
The idea is to keep track of a code base through file watchers, feed changes into Postgres, and immediately embed the files. You can then use these embeddings to find relevant code files for any LLM queries about improving that code base, without having to manually copy all the related code each time. Changes you make based on those results then immediately propagate into the store -> repeat (a sketch of the watcher loop is below).
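A rough sketch of that watcher loop using the watchdog library; `upsert_code_file` and the watched path are stand-ins for this example's actual DB layer and layout:

```python
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


def upsert_code_file(file_name: str, contents: str) -> None:
    """Stand-in for the real DB layer: writing the file into code_files
    triggers the pgai vectorizer, which (re-)embeds the changed chunks."""
    ...


class CodeFileHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.is_directory:
            return
        path = Path(event.src_path)
        upsert_code_file(path.name, path.read_text())


observer = Observer()
observer.schedule(CodeFileHandler(), path="src/", recursive=True)
observer.start()
```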
Might not be my greatest startup idea but I needed something to start working 😄
Status:
Currently there is a single API endpoint that lets you send a query and retrieve relevant code files based on it (see the tests for how it works).
Example usage:
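A hypothetical request against the search endpoint; the route and payload shape here are assumptions rather than the PR's actual API, so check the tests for the real contract:

```python
import httpx

# Hypothetical route and payload; see the tests for the actual contract.
response = httpx.post(
    "http://localhost:8000/search",
    json={"query": "where is the database session configured?", "limit": 5},
)
for match in response.json():
    print(match["file_name"], match["chunk_seq"])
```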
I left a review on this with some thoughts, but I would continue building for a bit longer before prioritizing work items.