The user is asking about fine-tuning the thresholds for the cosine similarity and other parameters in the semantic search implementation. Let me examine what parameters can be adjusted and how they can be fine-tuned.

Parameters that can be adjusted:

Cosine similarity threshold
Number of results returned
Vector index configuration
Text preprocessing

Fine-tuning Semantic Search Parameters

Fine-tune the similarity thresholds and other parameters. Here's how:

1. Adjusting Similarity Threshold

// In your API call
const { data } = await supabase.rpc('search_posts_by_embedding', {
  query_embedding: embedding,
  similarity_threshold: 0.75, // Increase for higher precision, decrease for better recall
  max_results: 10
});

You can adjust this threshold based on:

Higher threshold (e.g., 0.8-0.9): More precise results but fewer matches
Lower threshold (e.g., 0.5-0.6): More results but potentially less relevant
Typical range: 0.65-0.75 works well for most applications

2. Creating a Configurable Search Function

Make your search function configurable by passing parameters:

-- Enhanced search function with more parameters
CREATE OR REPLACE FUNCTION advanced_semantic_search(
  query_embedding VECTOR(1536),
  similarity_threshold FLOAT DEFAULT 0.7,
  max_results INTEGER DEFAULT 10,
  filter_subreddit TEXT DEFAULT NULL,
  min_score INTEGER DEFAULT NULL,
  include_nsfw BOOLEAN DEFAULT FALSE
)
RETURNS TABLE (
  id TEXT,
  subreddit TEXT,
  title TEXT,
  content TEXT,
  score INTEGER,
  similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    p.id,
    p.subreddit,
    p.title,
    p.content,
    p.score,
    1 - (p.embedding <=> query_embedding) AS similarity
  FROM reddit_posts p
  WHERE 1 - (p.embedding <=> query_embedding) > similarity_threshold
    AND (filter_subreddit IS NULL OR p.subreddit = filter_subreddit)
    AND (min_score IS NULL OR p.score >= min_score)
    AND (include_nsfw OR NOT p.is_nsfw)
  ORDER BY similarity DESC
  LIMIT max_results;
END;
$$;

3. Adjusting Vector Index Configuration

You can tune the IVF index for better performance:

-- You can adjust the 'lists' parameter based on your data size
-- Higher lists = faster search but potentially less accurate
CREATE INDEX idx_reddit_posts_embedding ON reddit_posts 
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- For a larger dataset (millions of records), increase lists:
-- WITH (lists = 1000)

4. Creating a Testing Endpoint

Build a testing endpoint to experiment with different thresholds:

// app/api/search/test/route.ts
export async function POST(request: Request) {
  const { query, thresholds = [0.6, 0.7, 0.8], maxResults = 10 } = await request.json();
  
  // Generate embedding for search query
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  
  const embedding = embeddingResponse.data[0].embedding;
  
  // Test different thresholds
  const results = {};
  
  for (const threshold of thresholds) {
    const { data } = await supabase.rpc('search_posts_by_embedding', {
      query_embedding: embedding,
      similarity_threshold: threshold,
      max_results: maxResults
    });
    
    results[threshold] = data;
  }
  
  return Response.json({ 
    query,
    results
  });
}

5. Implementing a Hybrid Search

Combine vector and keyword search for better results:

CREATE OR REPLACE FUNCTION hybrid_search(
  search_query TEXT,
  similarity_threshold FLOAT DEFAULT 0.7,
  max_results INTEGER DEFAULT 10
)
RETURNS TABLE (
  id TEXT,
  title TEXT,
  content TEXT,
  similarity FLOAT,
  rank FLOAT
)
LANGUAGE plpgsql
AS $$
DECLARE
  query_embedding VECTOR(1536);
  search_terms TSQUERY;
BEGIN
  -- Get embedding via a helper function or API call
  -- For this example, we assume generate_embeddings is a function
  SELECT embedding INTO query_embedding FROM generate_embeddings(search_query);
  
  -- Create text search query
  search_terms := to_tsquery('english', search_query);
  
  RETURN QUERY
  SELECT
    p.id,
    p.title,
    p.content,
    COALESCE(1 - (p.embedding <=> query_embedding), 0) AS similarity,
    -- Combined score using both vector similarity and text search rank
    (COALESCE(1 - (p.embedding <=> query_embedding), 0) * 0.7) + 
    (COALESCE(ts_rank(p.search_vector, search_terms), 0) * 0.3) AS rank
  FROM reddit_posts p
  WHERE (1 - (p.embedding <=> query_embedding) > similarity_threshold)
     OR (p.search_vector @@ search_terms)
  ORDER BY rank DESC
  LIMIT max_results;
END;
$$;

Best Practices for Tuning

Start with a baseline (0.7 is a good default threshold)
Test with representative queries from your expected user base
Gradually adjust one parameter at a time and compare results
Consider creating a feedback loop where user interactions help tune parameters
Log search queries and results to analyze performance over time

By implementing these adjustments, you can fine-tune your semantic search to achieve the optimal balance between precision and recall for your specific Reddit data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fine-tuning Semantic Search Parameters

1. Adjusting Similarity Threshold

2. Creating a Configurable Search Function

3. Adjusting Vector Index Configuration

4. Creating a Testing Endpoint

5. Implementing a Hybrid Search

Best Practices for Tuning

FilesExpand file tree

finetuning-params.md

Latest commit

History

finetuning-params.md

File metadata and controls

Fine-tuning Semantic Search Parameters

1. Adjusting Similarity Threshold

2. Creating a Configurable Search Function

3. Adjusting Vector Index Configuration

4. Creating a Testing Endpoint

5. Implementing a Hybrid Search

Best Practices for Tuning