
Document and validate search scoring implementation using chunk embeddings #35

Open
Copilot wants to merge 3 commits into main from copilot/fix-a6acafd1-e01c-4f8b-a8bb-c4ce81807437

Conversation

Contributor

Copilot AI commented Sep 24, 2025

This PR addresses the requirement that "search scoring should use the distance between the query embedding and the returned chunk's embedding" by documenting and validating that the current implementation already behaves this way.

Analysis

Upon investigation, the search implementation in SearchWithOptions already correctly uses chunk-level embeddings for distance calculation:

```sql
vec_distance_cosine(e.embedding, ?) as distance
FROM chunks c
JOIN embeddings e ON c.chunk_id = e.chunk_id
```

The system:

  • Calculates the cosine distance between the query embedding and each chunk's embedding
  • Converts distance to a score via `score = 1.0 - distance`
  • Operates at the chunk level (not the document level) for precise matching
  • Sorts results by score in descending order

Changes Made

Enhanced Code Documentation

  • Added detailed comments explaining the chunk embedding distance calculation in the SQL query
  • Clarified the score conversion formula with examples of distance-to-score mapping

Comprehensive Testing

  • Added TestSQLiteStorage_SearchScoring_ChunkEmbeddingDistance test that validates:
    • Identical embeddings achieve score ≈ 1.0
    • Similar embeddings get high scores (0.9+)
    • Orthogonal embeddings get score ≈ 0.0
    • Opposite embeddings get negative scores
    • Results are properly sorted by semantic similarity

Documentation

  • Created SEARCH_SCORING.md explaining the implementation details, score interpretation, and performance characteristics
  • Documented the chunk-level vs document-level approach

Validation

The implementation was tested with known embedding vectors to verify mathematical correctness:

```text
Query: [1.0, 0.0, 0.0]
Results:
1. exact_match [1.0, 0.0, 0.0]:  score=1.000000 ✓
2. close_match [0.9, 0.1, 0.0]:  score=0.993884 ✓
3. orthogonal  [0.0, 1.0, 0.0]:  score=0.000000 ✓
4. opposite    [-1.0, 0.0, 0.0]: score=-1.000000 ✓
```

All existing tests continue to pass, confirming no regressions were introduced.

Conclusion

The search scoring implementation was already correct and met the specified requirement. This PR adds explicit documentation and validation to make the chunk-embedding distance calculation transparent and well-tested.



@streed streed marked this pull request as ready for review September 24, 2025 17:38
Copilot AI review requested due to automatic review settings September 24, 2025 17:38
Contributor

Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



Copilot AI and others added 2 commits September 24, 2025 17:38
Co-authored-by: streed <805140+streed@users.noreply.github.com>
…dings

Co-authored-by: streed <805140+streed@users.noreply.github.com>
Copilot AI changed the title from "[WIP] The search scoring should use the distance between the query embedding and the returned chunk's embedding." to "Document and validate search scoring implementation using chunk embeddings" Sep 24, 2025
Copilot AI requested a review from streed September 24, 2025 17:45