Distribution Distortion in Hybrid Retrieval #1

@chakka-guna-sekhar-venkata-chennaiah

Description

Hi @ALucek,
I observed something I would call distribution distortion, which occurs when we combine semantic search and BM25 using an alpha parameter as the weighting. The BM25 index is fixed during preprocessing, whereas at runtime the user query is split into tokens. These tokens are checked against the pre-built BM25 index to identify which documents contain them. The documents are scored based on token presence and then re-ranked. These BM25 scores are then mixed with cosine similarity scores from semantic search. This means that, irrespective of the semantic needs of the question, documents show up in the BM25 results simply because most of the user query tokens appear in the pre-built index. When those BM25 scores are then mixed with the semantic scores, the two score distributions distort each other. I may be wrong, or I may be heading in a different direction, but I wanted to share this with you. Could you check out the attached doc below and share your thoughts and feedback on it?

Doc:- https://docs.google.com/document/d/1uxdz5dtJfNUW5TvzbxZWRRGX1Kid79_p7OtWmap9mfU/edit?usp=sharing
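
To make the concern concrete, here is a minimal sketch with made-up scores for three documents. It is not taken from the repository: the function names, the sample numbers, and the convention that alpha weights the semantic side are all assumptions for illustration. It compares mixing raw BM25 and cosine scores against mixing them after min-max normalization, which is one common way to put both on the same scale.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; returns all zeros if every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid(bm25: List[float], cosine: List[float],
           alpha: float, normalize: bool) -> List[float]:
    """Weighted mix of the two score lists.

    Assumption: alpha weights the semantic (cosine) score and
    (1 - alpha) weights BM25. Some libraries may define it the other way.
    """
    if normalize:
        bm25, cosine = min_max(bm25), min_max(cosine)
    return [alpha * c + (1 - alpha) * b for b, c in zip(bm25, cosine)]


def ranking(scores: List[float]) -> List[int]:
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores: doc 0 matches many query tokens but is semantically
# weak; doc 2 is semantically strong but shares few tokens with the query.
bm25_scores = [12.0, 3.0, 0.5]       # unbounded, corpus-dependent scale
cosine_scores = [0.30, 0.55, 0.80]   # bounded scale
alpha = 0.7                          # favor the semantic side

raw = hybrid(bm25_scores, cosine_scores, alpha, normalize=False)
norm = hybrid(bm25_scores, cosine_scores, alpha, normalize=True)

print("raw mix ranking:       ", ranking(raw))
print("normalized mix ranking:", ranking(norm))
```

With these sample numbers the raw mix follows the BM25 order even though alpha favors the semantic score, because the BM25 values sit on a much larger scale. Normalizing first lets alpha act as the intended weight. Min-max is only one option; rank-based fusion is another way to sidestep the scale mismatch.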
