Skip to content

Latest commit

 

History

History
181 lines (150 loc) · 5.09 KB

File metadata and controls

181 lines (150 loc) · 5.09 KB

The user is asking about fine-tuning the thresholds for the cosine similarity and other parameters in the semantic search implementation. Let me examine what parameters can be adjusted and how they can be fine-tuned.

Parameters that can be adjusted:

  1. Cosine similarity threshold
  2. Number of results returned
  3. Vector index configuration
  4. Text preprocessing

Fine-tuning Semantic Search Parameters

Fine-tune the similarity thresholds and other parameters. Here's how:

1. Adjusting Similarity Threshold

// In your API call
const { data } = await supabase.rpc('search_posts_by_embedding', {
  query_embedding: embedding,
  similarity_threshold: 0.75, // Increase for higher precision, decrease for better recall
  max_results: 10
});

You can adjust this threshold based on:

  • Higher threshold (e.g., 0.8-0.9): More precise results but fewer matches
  • Lower threshold (e.g., 0.5-0.6): More results but potentially less relevant
  • Typical range: 0.65-0.75 works well for most applications

2. Creating a Configurable Search Function

Make your search function configurable by passing parameters:

-- Enhanced search function with more parameters
CREATE OR REPLACE FUNCTION advanced_semantic_search(
  query_embedding VECTOR(1536),
  similarity_threshold FLOAT DEFAULT 0.7,
  max_results INTEGER DEFAULT 10,
  filter_subreddit TEXT DEFAULT NULL,
  min_score INTEGER DEFAULT NULL,
  include_nsfw BOOLEAN DEFAULT FALSE
)
RETURNS TABLE (
  id TEXT,
  subreddit TEXT,
  title TEXT,
  content TEXT,
  score INTEGER,
  similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    p.id,
    p.subreddit,
    p.title,
    p.content,
    p.score,
    1 - (p.embedding <=> query_embedding) AS similarity
  FROM reddit_posts p
  WHERE 1 - (p.embedding <=> query_embedding) > similarity_threshold
    AND (filter_subreddit IS NULL OR p.subreddit = filter_subreddit)
    AND (min_score IS NULL OR p.score >= min_score)
    AND (include_nsfw OR NOT p.is_nsfw)
  ORDER BY similarity DESC
  LIMIT max_results;
END;
$$;

3. Adjusting Vector Index Configuration

You can tune the IVF index for better performance:

-- You can adjust the 'lists' parameter based on your data size
-- Higher lists = faster search but potentially less accurate
CREATE INDEX idx_reddit_posts_embedding ON reddit_posts 
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- For a larger dataset (millions of records), increase lists:
-- WITH (lists = 1000)

4. Creating a Testing Endpoint

Build a testing endpoint to experiment with different thresholds:

// app/api/search/test/route.ts
export async function POST(request: Request) {
  const { query, thresholds = [0.6, 0.7, 0.8], maxResults = 10 } = await request.json();
  
  // Generate embedding for search query
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  
  const embedding = embeddingResponse.data[0].embedding;
  
  // Test different thresholds
  const results = {};
  
  for (const threshold of thresholds) {
    const { data } = await supabase.rpc('search_posts_by_embedding', {
      query_embedding: embedding,
      similarity_threshold: threshold,
      max_results: maxResults
    });
    
    results[threshold] = data;
  }
  
  return Response.json({ 
    query,
    results
  });
}

5. Implementing a Hybrid Search

Combine vector and keyword search for better results:

CREATE OR REPLACE FUNCTION hybrid_search(
  search_query TEXT,
  similarity_threshold FLOAT DEFAULT 0.7,
  max_results INTEGER DEFAULT 10
)
RETURNS TABLE (
  id TEXT,
  title TEXT,
  content TEXT,
  similarity FLOAT,
  rank FLOAT
)
LANGUAGE plpgsql
AS $$
DECLARE
  query_embedding VECTOR(1536);
  search_terms TSQUERY;
BEGIN
  -- Get embedding via a helper function or API call
  -- For this example, we assume generate_embeddings is a function
  SELECT embedding INTO query_embedding FROM generate_embeddings(search_query);
  
  -- Create text search query
  search_terms := to_tsquery('english', search_query);
  
  RETURN QUERY
  SELECT
    p.id,
    p.title,
    p.content,
    COALESCE(1 - (p.embedding <=> query_embedding), 0) AS similarity,
    -- Combined score using both vector similarity and text search rank
    (COALESCE(1 - (p.embedding <=> query_embedding), 0) * 0.7) + 
    (COALESCE(ts_rank(p.search_vector, search_terms), 0) * 0.3) AS rank
  FROM reddit_posts p
  WHERE (1 - (p.embedding <=> query_embedding) > similarity_threshold)
     OR (p.search_vector @@ search_terms)
  ORDER BY rank DESC
  LIMIT max_results;
END;
$$;

Best Practices for Tuning

  1. Start with a baseline (0.7 is a good default threshold)
  2. Test with representative queries from your expected user base
  3. Gradually adjust one parameter at a time and compare results
  4. Consider creating a feedback loop where user interactions help tune parameters
  5. Log search queries and results to analyze performance over time

By implementing these adjustments, you can fine-tune your semantic search to achieve the optimal balance between precision and recall for your specific Reddit data.