Use Case Demo: Build a content recommendation system using vector search

This repository contains code for an example content recommendation system using vector search in Tinybird.

Vector search is a great way to approach content matching and recommendations. You can calculate embeddings based on multi-modal analysis of text, images, and other media, then calculate vector distances between embeddings to recommend matching content.

Key Features

HuggingFace Embeddings: Uses HuggingFace all-MiniLM-L6-v2 (384 dimensions) - free and runs locally
Standalone Scripts: Python and Node.js scripts that work independently
Slug-based Queries: Queries are based on the post slug to ensure consistent results
Production-Ready: Multi-node Tinybird pipes for optimal performance

How It Works

Generate Embeddings: Scripts use HuggingFace models to generate embeddings locally
Store in Tinybird: Embeddings are sent to Tinybird via Events API along with post metadata
Query Similar Posts: The Tinybird pipe finds posts with similar embeddings using cosine similarity
Return Results: Related posts are returned sorted by similarity score
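
To make the last two steps concrete, here is a minimal pure-Python sketch of cosine-similarity ranking. It is an illustration only: the real ranking happens inside the Tinybird pipe, whose SQL is not shown here. The function names and toy 3-dimensional vectors are hypothetical, and treating min_similarity as an inclusive threshold is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_related(target_slug, embeddings, limit=10, min_similarity=0.1):
    """Return (slug, similarity) pairs for other posts, most similar first."""
    target = embeddings[target_slug]
    scored = [
        (slug, cosine_similarity(target, vector))
        for slug, vector in embeddings.items()
        if slug != target_slug
    ]
    # Assumption: the threshold is inclusive
    scored = [(slug, sim) for slug, sim in scored if sim >= min_similarity]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy 3-dimensional vectors; real embeddings from all-MiniLM-L6-v2 have 384
toy = {
    "my-post": [1.0, 0.0, 0.0],
    "close-post": [0.9, 0.1, 0.0],
    "far-post": [0.0, 1.0, 0.0],
}
print(rank_related("my-post", toy))
```

In this toy data, far-post is orthogonal to my-post, so it falls below the threshold and only close-post is returned.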

Architecture

┌─────────────┐
│   Posts     │
│  (Script)   │
└──────┬──────┘
       │
       ├──► Generate Embeddings
       │    (HuggingFace all-MiniLM-L6-v2)
       │
       ▼
┌─────────────────┐
│   Tinybird      │
│  Data Source    │
│  (posts)        │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  Tinybird Pipe  │
│ (similar_posts) │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  Tinybird API   │
│  Endpoint       │
└─────────────────┘

Setup

1. Install and Authenticate Tinybird CLI

First, install the Tinybird CLI (if not already installed):

cd tinybird
curl https://tinybird.co | sh

Then authenticate with your Tinybird account:

tb login

This opens your browser, where you can create a new workspace or select an existing one.

2. Build and Deploy Tinybird Resources

# Build the project (builds all datasources and pipes)
tb build

# Deploy to Tinybird Cloud
tb --cloud deploy

See the Tinybird Quick Start guide for more details.

3. Install Dependencies

For Python:

cd scripts/python
pip install -r requirements.txt

For Node.js:

cd scripts/node
npm install

4. Set Environment Variables

You'll need tokens with the appropriate scopes:

DATASOURCES:WRITE scope to send events to Tinybird
PIPES:READ scope to query the pipe endpoint

See the Tinybird Tokens documentation for instructions on creating tokens.

export TB_HOST=https://api.tinybird.co  # or your Tinybird host
export TB_TOKEN=your_tinybird_token_here
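
If you write your own script against these variables, a small loader like the following can fail early when the token is missing. This is a sketch, not the code the bundled scripts use; load_tinybird_config is a hypothetical helper, and the fallback host mirrors the default mentioned for the GitHub Actions workflow.

```python
import os

DEFAULT_TB_HOST = "https://api.tinybird.co"


def load_tinybird_config(env=None):
    """Read TB_HOST and TB_TOKEN from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    token = env.get("TB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("TB_TOKEN is not set; export it before running")
    # Drop any trailing slash so paths can be appended with a single "/"
    host = env.get("TB_HOST", "").strip().rstrip("/") or DEFAULT_TB_HOST
    return {"host": host, "token": token}


# Example with an explicit mapping instead of the real environment
print(load_tinybird_config({"TB_TOKEN": "your_tinybird_token_here"}))
```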

5. Generate Embeddings

The scripts automatically load posts from sample-data/posts.json by default. You can customize the source using the POSTS_SOURCE environment variable to point to a different file or URL.
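
One plausible way to support both files and URLs in a single setting is to branch on the URL scheme. The sketch below is an assumption about how such a loader could work, not the repository's load_posts_from_source; is_url and read_posts are hypothetical names.

```python
import json
import urllib.request
from urllib.parse import urlparse


def is_url(source):
    """True when the source looks like an http(s) URL rather than a file path."""
    return urlparse(source).scheme in ("http", "https")


def read_posts(source):
    """Load a JSON array of posts from a URL or a local file."""
    if is_url(source):
        with urllib.request.urlopen(source, timeout=30) as response:
            return json.load(response)
    with open(source, encoding="utf-8") as handle:
        return json.load(handle)
```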

Posts JSON file format:

The scripts expect a JSON file with a direct array of posts:

[
  {
    "slug": "my-post",
    "title": "My Post Title",
    "excerpt": "Post excerpt...",
    "content": "Full content...",
    "categories": ["tech"],
    "published_on": "2025-01-15",
    "status": "published",
    "updated_at": "2025-01-20"
  }
]
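
Before generating embeddings from your own data, it can help to check that the file matches this shape. The following is a minimal sketch; validate_posts is a hypothetical helper, and the set of fields treated as required is an assumption (the scripts may tolerate missing fields).

```python
import json

# Assumption: these are the fields the embedding step cannot do without
REQUIRED_FIELDS = ("slug", "title", "excerpt", "content", "status")


def validate_posts(raw_json):
    """Parse a posts file and raise ValueError if it has the wrong shape."""
    posts = json.loads(raw_json)
    if not isinstance(posts, list):
        raise ValueError("expected a top-level JSON array of posts")
    for index, post in enumerate(posts):
        if not isinstance(post, dict):
            raise ValueError(f"post #{index} is not a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in post]
        if missing:
            raise ValueError(f"post #{index} is missing: {', '.join(missing)}")
    return posts


sample = '[{"slug": "my-post", "title": "T", "excerpt": "E", "content": "C", "status": "published"}]'
print(len(validate_posts(sample)))
```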

Using Python

Run with default sample data

cd scripts/python
python generate_embeddings.py

Run with custom posts file or URL

# From a local file
POSTS_SOURCE=../custom/posts.json python generate_embeddings.py

# From a URL (e.g., your CMS API)
POSTS_SOURCE=https://your-cms.com/api/posts python generate_embeddings.py

Use as a library

import sys
sys.path.append('scripts/python')
from generate_embeddings import send_posts_to_tinybird, get_related_posts, load_posts_from_source
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load posts from file or URL
posts = load_posts_from_source('sample-data/posts.json')
# Or from URL:
# posts = load_posts_from_source('https://your-cms.com/api/posts')

# Generate embeddings and send to Tinybird
send_posts_to_tinybird(posts, model)

# Get related posts
related = get_related_posts("my-post", limit=10)
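
For orientation, this is roughly what sending embeddings through the Events API involves: a POST to /v0/events with newline-delimited JSON and a bearer token. The sketch below only builds the request and does not send it. build_event and build_events_request are hypothetical helpers rather than functions from generate_embeddings.py. The data source name posts comes from the architecture diagram, and the timestamp format is an assumption about the data source schema.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def build_event(post, embedding, now=None):
    """One event row: timestamp, slug, embedding, status."""
    now = now or datetime.now(timezone.utc)
    return {
        # Assumption: the data source stores timestamps as "YYYY-MM-DD HH:MM:SS"
        "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "slug": post["slug"],
        "embedding": [float(value) for value in embedding],
        "status": post["status"],
    }


def build_events_request(events, host, token, datasource="posts"):
    """Prepare (but do not send) an Events API request with an NDJSON body."""
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    query = urllib.parse.urlencode({"name": datasource})
    return urllib.request.Request(
        f"{host.rstrip('/')}/v0/events?{query}",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


event = build_event({"slug": "my-post", "status": "published"}, [0.1, 0.2, 0.3])
request = build_events_request([event], "https://api.tinybird.co", "your_tinybird_token_here")
print(request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```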

Using Node.js

Run with default sample data

cd scripts/node
node generate_embeddings.js

Run with custom posts file or URL

# From a local file
POSTS_SOURCE=../custom/posts.json node generate_embeddings.js

# From a URL (e.g., your CMS API)
POSTS_SOURCE=https://your-cms.com/api/posts node generate_embeddings.js

Use as a library

const { sendPostsToTinybird, getRelatedPosts, loadPostsFromSource } = require('./scripts/node/generate_embeddings');

// CommonJS has no top-level await, so the async calls run inside main()
async function main() {
  // Dynamic import() also works if the package ships as an ES module
  const { pipeline } = await import('@xenova/transformers');

  // Load model
  const model = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  // Load posts from file or URL
  const posts = await loadPostsFromSource('sample-data/posts.json');
  // Or from URL:
  // const posts = await loadPostsFromSource('https://your-cms.com/api/posts');

  // Generate embeddings and send to Tinybird
  await sendPostsToTinybird(posts, model);

  // Get related posts
  const related = await getRelatedPosts("my-post", 10);
  console.log(related);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

6. Query Related Posts

Once embeddings are in Tinybird, you can query the pipe endpoint:

curl --compressed \
  -H "Authorization: Bearer $TB_TOKEN" \
  "$TB_HOST/v0/pipes/similar_posts.json?slug=my-post&limit=10"

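The same request can be made from Python with only the standard library. This is a sketch: similar_posts_url and fetch_similar_posts are hypothetical helper names, and the host is expected to include the scheme, as TB_HOST does in step 4.

```python
import json
import urllib.parse
import urllib.request


def similar_posts_url(host, slug, limit=10, min_similarity=None):
    """Build the similar_posts endpoint URL with URL-encoded parameters."""
    params = {"slug": slug, "limit": limit}
    if min_similarity is not None:
        params["min_similarity"] = min_similarity
    query = urllib.parse.urlencode(params)
    return f"{host.rstrip('/')}/v0/pipes/similar_posts.json?{query}"


def fetch_similar_posts(host, token, slug, limit=10):
    """Call the endpoint and return the rows under the "data" key."""
    request = urllib.request.Request(
        similar_posts_url(host, slug, limit),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["data"]


print(similar_posts_url("https://api.tinybird.co", "my-post"))
```
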
7. Automated Workflows (GitHub Actions)

This repository includes a GitHub Actions workflow (.github/workflows/tinybird_recommendations.yml) as an example for automating the embedding generation process.

Important: The workflow is provided as a template. To use it with your actual blog content, you need to:

Configure the posts source: The scripts load from sample-data/posts.json by default. To use your own data source, set the POSTS_SOURCE environment variable in the workflow:
- For a local file: POSTS_SOURCE: sample-data/posts.json
- For a URL/API: POSTS_SOURCE: https://your-cms.com/api/posts
You can also create a script that fetches posts from your CMS and saves them to a JSON file before running the embedding script.
Configure GitHub Secrets:
- TB_TOKEN: Your Tinybird token with DATASOURCES:WRITE and PIPES:READ scopes
- TB_HOST (optional): Your Tinybird host (defaults to https://api.tinybird.co)
- POSTS_SOURCE (optional): Custom posts file path or URL (defaults to sample-data/posts.json)
Customize the schedule: The workflow runs twice daily (midnight and noon UTC) by default. Adjust the cron expression in the workflow file to match your needs.

The workflow will automatically:

Install Python dependencies
Run the embedding generation script (loading posts from the configured source)
Send embeddings to your Tinybird data source

Example workflow modification: To fetch posts from a CMS before generating embeddings, add a step:

- name: Fetch posts from CMS
  run: |
    curl -o posts.json https://your-cms.com/api/posts
- name: Generate recommendations
  env:
    TB_TOKEN: ${{ secrets.TB_TOKEN }}
    TB_HOST: ${{ secrets.TB_HOST }}
    POSTS_SOURCE: posts.json
  run: python scripts/python/generate_embeddings.py

Embedding Models

This example uses HuggingFace all-MiniLM-L6-v2 (384 dimensions), which:

Runs locally (no API keys required)
Is free to use
Provides good quality embeddings
Has 384 dimensions (fewer than the 1536 of OpenAI's text-embedding-3-small, but sufficient for most use cases)

Alternative Models

You can use any embedding model, from HuggingFace or another source, by updating the script to load and apply it. Adjust the processing code and the pipe to match the output of the model you choose.

Important: If you change models, update the pipe's dimension check (currently length(embedding) = 384) and regenerate all embeddings.
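
A client-side check before sending can catch a model and pipe mismatch early. The sketch below assumes embeddings are held in a dict keyed by slug; find_dimension_mismatches is a hypothetical helper, not part of the repository.

```python
EXPECTED_DIMENSIONS = 384  # keep in sync with the pipe's length(embedding) check


def find_dimension_mismatches(embeddings, expected=EXPECTED_DIMENSIONS):
    """Return {slug: actual_length} for every embedding of the wrong size."""
    return {
        slug: len(vector)
        for slug, vector in embeddings.items()
        if len(vector) != expected
    }


mismatches = find_dimension_mismatches({"my-post": [0.0] * 384, "old-post": [0.0] * 1536})
if mismatches:
    print(f"Regenerate these embeddings before sending: {mismatches}")
```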

Example: Using OpenAI Embeddings (Node.js)

You can also use OpenAI's API for higher-dimensional embeddings (e.g., text-embedding-3-small, 1536 dims). Note that this requires an OPENAI_API_KEY and may incur API costs.

Here's a basic example (JavaScript / Node.js, using the openai npm package). Run it as an ES module, since it uses import and top-level await:

import { OpenAI } from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getOpenAIEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  // Returns array of floats (length 1536)
  return response.data[0].embedding;
}

// Example usage:
const post = {
  slug: "my-post",
  title: "My Post Title",
  excerpt: "Post excerpt...",
  content: "Full content...",
  categories: ["tech"],
  published_on: "2025-01-15",
  status: "published",
};

const textToEmbed = `${post.title} ${post.excerpt} ${post.content}`;
const embedding = await getOpenAIEmbedding(textToEmbed);
// Now use `embedding` in the event you send to Tinybird
// Only send: { timestamp, slug, embedding, status }

If using OpenAI, ensure your Tinybird pipes and queries expect 1536 dimensions: update any checks from length(embedding) = 384 to length(embedding) = 1536, and regenerate all embeddings accordingly.

API Reference

Tinybird Pipe: `similar_posts`

Get related posts for a given slug.

Query Parameters:

slug (required): Post slug to find related posts for
limit (optional): Maximum number of results (default: 10)
min_similarity (optional): Minimum similarity threshold (default: 0.1)

Response:

{
  "data": [
    {
      "slug": "related-post",
      "status": "published",
      "similarity": 0.85
    }
  ]
}
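
On the consuming side, a response of this shape can be reduced to an ordered list of slugs. This is a sketch under assumptions: related_slugs is a hypothetical helper, and the client-side status filter and re-sort are defensive, since the pipe may already do both.

```python
import json


def related_slugs(response_text, only_status="published"):
    """Extract slugs from a similar_posts response, most similar first."""
    rows = json.loads(response_text).get("data", [])
    rows = [row for row in rows if row.get("status") == only_status]
    rows.sort(key=lambda row: row["similarity"], reverse=True)
    return [row["slug"] for row in rows]


sample_response = json.dumps({
    "data": [
        {"slug": "related-post", "status": "published", "similarity": 0.85},
        {"slug": "draft-post", "status": "draft", "similarity": 0.9},
        {"slug": "other-post", "status": "published", "similarity": 0.4},
    ]
})
print(related_slugs(sample_response))
```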

Real-World Example

See the Tinybird Blog for a production implementation of this system. Each blog post uses vector search to show related content at the bottom of the page.

Need help?

Read the Tinybird Vector Search Guide
Join the Tinybird Slack Community for support
Explore Tinybird Documentation for more use cases
