feat: Add example python project #197

Draft: Askir wants to merge 3 commits into base: main

Conversation

@Askir (Contributor) commented Nov 5, 2024

I built a small demo application to figure out how well pgai integrates with existing Python tooling. It's not done, but the first endpoint works. It's a FastAPI service using SQLAlchemy that enables semantic code search via pgai features such as automatic embeddings.

The idea is to track a code base through file watchers, feed changes into Postgres, and immediately embed the files. You can then use these embeddings to find the code files relevant to any LLM query about improving that code base, without having to manually copy all the related code each time.

Changes you make based on those results then immediately propagate back into the store, and the cycle repeats.
It might not be my greatest startup idea, but I needed something to start working 😄

Status:
Currently there is a single API endpoint that lets you send a query and retrieve relevant code files based on it (see the tests for how it works).

Example usage:

GET /search?query="how to authenticate users"&limit=5
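Conceptually, the endpoint embeds the query and ranks the stored chunks by cosine similarity. A minimal numpy sketch of that ranking step (the function name and toy data are made up for illustration; the real service does this in SQL against the pgai view):

```python
import numpy as np

def rank_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, limit: int = 5) -> np.ndarray:
    """Return indices of the `limit` chunks most similar to the query.

    Cosine similarity = dot(a, b) / (|a| * |b|); higher is more similar.
    """
    norms = np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = chunk_vecs @ query_vec / norms
    return np.argsort(scores)[::-1][:limit]

# Toy example with 3-dimensional "embeddings"
chunks = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
print(rank_chunks(query, chunks, limit=2))  # most similar chunk indices first
```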

I left a review on this with some thoughts, but I'd continue building for a bit longer before prioritizing work items.

Revises:
Create Date: 2024-11-04 11:47:57.345379

"""

This file was generated with `alembic revision --autogenerate`, which compares the database to the SQLAlchemy models and generates a naive diff. It works quite well for adding new tables, but of course it doesn't work for our views.

Comment on lines +28 to +41
SELECT ai.create_vectorizer(
    'code_files'::regclass,
    destination => 'code_files_embeddings',
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(
        'contents',
        chunk_size => 1000,
        chunk_overlap => 200
    ),
    formatting => ai.formatting_python_template(
        'File: $file_name\n\nContents:\n$chunk'
    )
);
""")

This is how I'm creating a vectorizer via migrations. It works, but it's not as pretty as the native Alembic operations. It should be possible to build nice wrappers around this somehow.
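One possible shape for such a wrapper (entirely hypothetical; `create_vectorizer_sql` and its parameters are my invention, not a pgai or Alembic API) would render the SQL string and hand it to `op.execute()`:

```python
def create_vectorizer_sql(
    source_table: str,
    destination: str,
    model: str,
    dimensions: int,
    chunk_column: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> str:
    """Render the ai.create_vectorizer() call from the migration as a SQL string.

    A real Alembic wrapper would pass this to op.execute() and would also need
    a matching downgrade step to be reversible.
    """
    return f"""
SELECT ai.create_vectorizer(
    '{source_table}'::regclass,
    destination => '{destination}',
    embedding => ai.embedding_openai('{model}', {dimensions}),
    chunking => ai.chunking_recursive_character_text_splitter(
        '{chunk_column}',
        chunk_size => {chunk_size},
        chunk_overlap => {chunk_overlap}
    )
);
"""

# In a migration's upgrade():
#     op.execute(create_vectorizer_sql("code_files", "code_files_embeddings",
#                                      "text-embedding-3-small", 768, "contents"))
```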

Comment on lines 22 to 54

class CodeFileEmbedding(Base):
    """
    Model representing the view created by the pgai vectorizer.
    This maps to the automatically created view 'code_files_embeddings',
    which joins the original code_files table with its embeddings.
    """

    __tablename__ = "code_files_embeddings"

    # We make this a view model by setting it as such
    __table_args__ = {"info": {"is_view": True}}

    # Original CodeFile columns
    id = Column(Integer, ForeignKey("code_files.id"), primary_key=True)
    file_name = Column(String(255), nullable=False)
    updated_at = Column(DateTime, nullable=True)
    contents = Column(Text, nullable=True)

    # Embedding-specific columns added by pgai
    embedding_uuid = Column(String, primary_key=True)
    chunk = Column(Text, nullable=False)
    embedding = Column(
        Vector(768), nullable=False
    )  # 768 dimensions for text-embedding-3-small
    chunk_seq = Column(Integer, nullable=False)

    # Relationship back to the original CodeFile
    code_file = relationship("CodeFile", foreign_keys=[id])

    @override
    def __repr__(self) -> str:
        return f"<CodeFileEmbedding(file_name='{self.file_name}', chunk_seq={self.chunk_seq})>"

It's nice that this works at all, but having to define all the fields twice feels a bit ugly. Maybe we can provide a helper annotation or similar here, e.g. @embedded on the original model that automatically injects the embedding, chunk, and related fields.
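To illustrate the idea, here is a plain-Python sketch of such a decorator (it does not exist; the `columns` tuples stand in for real SQLAlchemy Column objects, which a real version would copy instead):

```python
# Fields the pgai vectorizer adds to its embeddings view
EMBEDDING_FIELDS = ("embedding_uuid", "chunk", "embedding", "chunk_seq")

def embedded(view_name: str):
    """Hypothetical decorator: derive a view model from the source model.

    Generates a companion class whose columns are the source model's columns
    plus the pgai-specific embedding fields, and attaches it to the source.
    """
    def wrap(cls):
        view = type(
            cls.__name__ + "Embedding",
            (),
            {
                "__tablename__": view_name,
                "columns": tuple(cls.columns) + EMBEDDING_FIELDS,
            },
        )
        cls.embedding_view = view  # generated model available on the source
        return cls
    return wrap

@embedded("code_files_embeddings")
class CodeFile:
    columns = ("id", "file_name", "updated_at", "contents")
```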

Comment on lines +74 to +79
def parse_embedding_string(embedding_str: str) -> np.ndarray:
    """Convert a pgai embedding string to a numpy array"""
    # Remove brackets and split on commas
    values = embedding_str.strip("[]").split(",")
    # Convert to float array
    return np.array([float(x) for x in values], dtype=">f4")

This function is currently unused. But when I originally called openai_embed(), it simply returned a string representation of the vector (a float array), which is extremely unfortunate if you want to use the embedding in any way afterwards. Maybe I set up pgvector-python the wrong way, though.
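For reference, this is what the round-trip through the string representation looks like (toy values; the helper is the same as in the diff above, and dtype ">f4" is big-endian float32):

```python
import numpy as np

def parse_embedding_string(embedding_str: str) -> np.ndarray:
    """Same helper as in the diff: '[0.25,0.5]' -> array of big-endian float32."""
    values = embedding_str.strip("[]").split(",")
    return np.array([float(x) for x in values], dtype=">f4")

vec = parse_embedding_string("[0.25,0.5,0.75]")
print(vec.shape, float(vec[1]))  # (3,) 0.5
```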

Comment on lines 82 to 100
class OpenAIEmbed(FunctionElement):
    inherit_cache = True


class PGAIFunction(expression.FunctionElement):
    def __init__(self, model: str, text: str, dimensions: int):
        self.model = model
        self.text = literal(text)
        self.dimensions = dimensions
        super().__init__()


@compiles(PGAIFunction)
def _compile_pgai_embed(element, compiler, **kw):
    return "ai.openai_embed('%s', %s, dimensions => %d)" % (
        element.model,
        compiler.process(element.text),
        element.dimensions,
    )

This is already part of a proposed solution; it was quite hard to pass named arguments through SQLAlchemy. It allows basically arbitrary function calls, but I couldn't get the dimensions parameter defined somehow (maybe this is some SQL magic I don't understand).

This implementation also has another benefit: you can define return value and parameter types. I think this is quite a low-hanging fruit, mostly a bit of busy work to build, but with some nice benefits.
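The compile hook's output can be sanity-checked without a database. A pure-Python sketch (the stub classes below stand in for SQLAlchemy's literal and compiler, which I'm sidestepping here; the format string matches the @compiles hook in the diff above):

```python
class _Literal:
    """Stand-in for sqlalchemy.literal: just wraps a value."""
    def __init__(self, value):
        self.value = value

class _StubCompiler:
    """Stand-in for SQLAlchemy's compiler: renders the literal inline."""
    def process(self, element):
        return "'%s'" % element.value

class PGAIFunction:
    def __init__(self, model: str, text: str, dimensions: int):
        self.model = model
        self.text = _Literal(text)
        self.dimensions = dimensions

def compile_pgai_embed(element, compiler):
    # Same format string as the @compiles hook above
    return "ai.openai_embed('%s', %s, dimensions => %d)" % (
        element.model,
        compiler.process(element.text),
        element.dimensions,
    )

sql = compile_pgai_embed(
    PGAIFunction("text-embedding-3-small", "hello", 768), _StubCompiler()
)
print(sql)  # ai.openai_embed('text-embedding-3-small', 'hello', dimensions => 768)
```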

Comment on lines +122 to +131
results = await session.execute(
    select(
        CodeFileEmbedding.file_name,
        CodeFileEmbedding.chunk,
        CodeFileEmbedding.chunk_seq,
        similarity_score,
    )
    .order_by(similarity_score.desc())
    .limit(limit)
)

btw, does putting the function in the query twice like this make it execute twice? Or is the optimizer smart enough to recognize that the result is the same, since it's an idempotent function, and execute it only once?
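I can't speak for the planner, but one way to sidestep the question on the SQLAlchemy side is to .label() the expression: when the labeled column is in the select list, the rendered SQL orders by the alias, so the expression appears only once. A minimal sketch with a made-up similarity() function and bare columns (no real model needed):

```python
from sqlalchemy import select, func, column

# Label the similarity expression once and reuse the label in ORDER BY
score = func.similarity(column("embedding"), column("query_vec")).label("score")
stmt = select(column("file_name"), score).order_by(score.desc())
print(stmt)  # the ORDER BY clause references the alias, not the expression
```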

# Test database configuration
TEST_DB_URL = "postgresql+asyncpg://postgres:postgres@localhost/postgres"
project_root = Path(__file__).parent.parent


Testing in general was a bit of a pain, mostly because I struggled with setting up migrations and the database. However, I'm also not trying to mock anything here, which makes it a little easier.

Comment on lines +50 to +77
    docker_client: DockerClient, load_dotenv
) -> Generator[Container, None, None]:
    """Start vectorizer worker after database is ready"""
    # Configure container
    container_config = {
        "image": "timescale/pgai-vectorizer-worker:0.1.0",
        "environment": {
            "PGAI_VECTORIZER_WORKER_DB_URL": "postgres://postgres:[email protected]:5432/postgres",
            "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
        },
        "command": ["--poll-interval", "5s"],
        "extra_hosts": {
            "host.docker.internal": "host-gateway"
        },  # Allow container to connect to host postgres
    }

    # Start container
    container = docker_client.containers.run(**container_config, detach=True)

    # Wait for container to be running
    container.reload()
    assert container.status == "running"

    yield container

    # Cleanup
    container.stop()
    container.remove()

This is probably not how you'd actually want to test this; instead you could depend on pgai and invoke the vectorizer worker from there. But Claude came up with this (like most of the code in this PR) and it worked, so I'm keeping it for now.
