A high-performance control plane for self-hosted embeddings.
In production for flow-like.com.
Serverless gateway on Google Cloud Run, with isolated CPU and GPU embedding workers behind it.
any-embedding gives you one OpenAI-compatible endpoint in front of a fleet of sentence-transformers models, with per-model workers, baked model images, and a clean path from local Docker Compose to a production-grade Google Cloud Run deployment.
It is built for the reality of modern embedding infrastructure:
- the leaderboard changes constantly
- CPU and GPU models need different deployment shapes
- clients want one stable API, not a zoo of custom services
- One API, many models. Route by model name and keep client integration fixed.
- Config-driven fleet. Add or remove models in config.yaml, then regenerate local compose or deploy with Terraform.
- No cold-start downloads. Worker images bake model weights at build time.
- GPU where it matters. Heavy models can run on dedicated GPU workers; smaller models stay on cheaper CPU workers.
- OpenAI-compatible surface. Existing SDKs and integrations can usually point at this with minimal adaptation.
- Single-command cloud deploy. Build, push, and apply infrastructure with one `mise` task.
- Clean separation of concerns. Gateway handles auth and routing. Workers only load and serve one model.
```
Client
  |
  v
Gateway (FastAPI, Cloud Run or local)
  - API key auth
  - OpenAI-compatible /v1/embeddings
  - model registry loaded from config.yaml
  - routes request to the correct worker
  |
  +--> Worker: BAAI/bge-large-en-v1.5
  +--> Worker: Qwen/Qwen3-Embedding-0.6B
  +--> Worker: gte-multilingual-base
  +--> Worker: CLIP / multimodal model
```
Each worker loads exactly one model and exposes an internal /embed endpoint. The gateway stays thin, predictable, and cheap. The workers scale independently based on the actual model mix you serve.
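The routing convention can be sketched in a few lines of Python. This is an illustration of the pattern, not the gateway's actual code: in local development the worker URL for each model is resolved from a `WORKER_URL_*` environment variable derived from the model name.

```python
import os

def worker_url_for(model_name: str) -> str:
    """Resolve a model name to its worker's base URL.

    Illustrative sketch of the WORKER_URL_* convention (e.g. the model
    "bge-large-en-v1.5" maps to WORKER_URL_BGE_LARGE_EN_V1_5); the real
    gateway builds its registry from config.yaml.
    """
    env_var = "WORKER_URL_" + model_name.upper().replace("-", "_").replace(".", "_")
    url = os.environ.get(env_var)
    if url is None:
        raise KeyError(f"no worker configured for model {model_name!r}")
    return url
```

In production the same lookup is driven by the model registry rather than raw environment variables, but the contract is identical: one model name, one worker URL.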
This repo is no longer just a demo wrapper around sentence-transformers. The current codebase already includes meaningful production controls:
- Constant-time API key comparison in the gateway.
- In-memory rate limiting at the public edge.
- Structured audit logging for API requests.
- Cloud Run identity tokens for gateway-to-worker calls.
- Internal-only worker ingress on GCP.
- SSRF protection for remote image URLs in multimodal workers.
- Secret Manager and CMEK-backed secret storage in the GCP deployment.
- Monitoring, uptime checks, and alerting in the GCP Terraform stack.
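The constant-time comparison is worth showing concretely. A minimal sketch using Python's standard library; the gateway's actual implementation may differ in detail:

```python
import hmac

def api_key_matches(provided: str, expected: str) -> bool:
    """Compare API keys in constant time.

    hmac.compare_digest avoids the early-exit behavior of ==, which
    would leak how many leading characters matched via response timing.
    Sketch only, not the repo's exact code.
    """
    return hmac.compare_digest(provided.encode(), expected.encode())
```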
The entire stack is driven by config.yaml.
```yaml
models:
  - name: "bge-large-en-v1.5"
    model: "BAAI/bge-large-en-v1.5"
    type: "text"
    max_tokens: 512
    dimensions: 1024
  - name: "qwen3-embedding-0.6b"
    model: "Qwen/Qwen3-Embedding-0.6B"
    type: "text"
    max_tokens: 32768
    dimensions: 1024
    gpu: true
    cpu: "4"
    memory: "16Gi"
```

Supported per-model overrides: `cpu`, `memory`, `gpu`, `max_instances`, `min_instances`, `sentence_transformers_version`, `transformers_version`.
This is the core design choice in the repo: configuration defines the fleet, and both local development and cloud deployment consume the same source of truth.
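Adding a model is therefore a config change, not a code change. A hypothetical entry for the gte worker shown in the architecture diagram (the field values here are illustrative, not taken from the repo):

```yaml
models:
  # illustrative entry; regenerate docker-compose.yaml or re-run the
  # GCP deploy after adding it
  - name: "gte-multilingual-base"
    model: "Alibaba-NLP/gte-multilingual-base"
    type: "text"
    max_tokens: 8192
    dimensions: 768
```

After editing, both generate_compose.py and the Terraform deploy pick the new worker up from the same registry.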
This is the fastest way to validate the whole system end to end.
```bash
./test.sh
```

Useful variants:

```bash
./test.sh --up
./test.sh --down
```

The test script generates docker-compose.yaml, builds the images, starts the gateway and workers, runs integration checks, and tears the stack down unless you keep it running.
If you want gated models to build locally, copy .env.example to .env and set HF_TOKEN first.
With mise:
```bash
mise install
mise run install
```

Start a worker:

```bash
MODEL_NAME="BAAI/bge-large-en-v1.5" mise run worker
```

Start the gateway in another terminal:

```bash
API_KEY="test-key" \
WORKER_URL_BGE_LARGE_EN_V1_5="http://localhost:8081" \
mise run gateway
```

Without mise, you can install directly:

```bash
uv pip install -e '.[gateway,worker]'
```

Available task entry points live in mise.toml, including local run, test, and GCP deploy flows.
Text request:
```bash
curl https://<gateway-url>/v1/embeddings \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-en-v1.5",
    "input": "query: serverless GPU embeddings"
  }'
```

Image request for multimodal models:

```bash
curl https://<gateway-url>/v1/embeddings \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clip-vit-large-patch14",
    "input": {
      "type": "image",
      "image": {
        "type": "image_base64",
        "image_base64": "<base64-encoded-image>"
      }
    }
  }'
```

The API supports:
- single text inputs
- batch text inputs
- image inputs via `image_base64`
- image inputs via `image_url`
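Building the `image_base64` payload from a client is straightforward. A small Python sketch that mirrors the field names in the curl example above (the helper name is mine, not part of any SDK):

```python
import base64
import json

def image_embedding_payload(model: str, image_bytes: bytes) -> str:
    """Build the JSON body for an image_base64 embedding request.

    Mirrors the request shape shown in the curl example; sketch only.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "input": {
            "type": "image",
            "image": {"type": "image_base64", "image_base64": encoded},
        },
    })
```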
List all configured models through an OpenAI-style response.
```bash
curl https://<gateway-url>/v1/models \
  -H "Authorization: Bearer <API_KEY>"
```

The gateway also exposes a basic health check endpoint.
- The gateway is stateless. It only validates auth, loads config, and forwards requests.
- The workers are isolated. One bad model does not poison the rest of the fleet.
- The images are reproducible. You know exactly which model ships with which service.
- The deployment is boring in the best possible way. Local compose, test automation, deploy automation, and Terraform all follow the same model registry.
This is the kind of setup that scales attention well: it is easy to explain, easy to extend, and does not collapse into custom one-off infrastructure as the model set grows.
Google Cloud Run deployment with serverless GPU workers
Google Cloud is the right deployment target for this project because Cloud Run is the rare serverless platform that keeps the operational simplicity of a fully managed service while still letting you attach GPUs to the workers that need them.
This is also the deployment story behind flow-like.com: one public gateway, isolated model workers, and serverless GPU capacity when heavier embedding models need it.
Why that matters here:
- Serverless plus GPU is the hard problem. CPU-only embeddings are easy; serious embedding fleets usually are not.
- Cloud Run lets the gateway stay lightweight and public while model workers scale independently behind IAM.
- The Terraform in deployment/gcp/main.tf already maps `gpu: true` models to NVIDIA L4-backed Cloud Run services.
- You avoid running Kubernetes just to serve one or two heavier embedding models.
- The project architecture matches Cloud Run cleanly: small gateway, isolated workers, image-based deploys, explicit service-to-service auth.
- The deployment path now includes Secret Manager, CMEK, long-term audit logs, uptime checks, and alerting without changing the application shape.
In short: if you want managed infrastructure and GPU-backed model workers without operating a cluster, Google Cloud is the story.
- A GCP project with billing enabled
- Docker
- Terraform 1.5+
- Authenticated `gcloud` access to push container images
- An Artifact Registry repository for your gateway and worker images
- A local `terraform.tfvars` copied from deployment/gcp/terraform.tfvars.example
Copy the example .env file and add your Hugging Face token (required for gated models like google/embeddinggemma-300m):
```bash
cp .env.example .env
```

Edit .env and set your token:

```
HF_TOKEN=hf_your_token_here
```
The deploy script in deployment/gcp/deploy.py loads .env automatically, passes HF_TOKEN as a BuildKit secret for image builds, and avoids baking that secret into the final image layers. The .env file is git-ignored.
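The BuildKit secret pattern the deploy script relies on can be sketched as a Dockerfile fragment. This is illustrative only: the repo's actual Dockerfiles may differ, and the model name here is just an example of a gated download.

```dockerfile
# Illustrative fragment, not the repo's actual Dockerfile.
# The secret is mounted only for the duration of this RUN step,
# so the token never lands in an image layer.
RUN --mount=type=secret,id=HF_TOKEN \
    HF_TOKEN=$(cat /run/secrets/HF_TOKEN) \
    python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('google/embeddinggemma-300m')"
```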
A single command builds all Docker images (gateway + every worker), pushes them to your registry, and runs terraform apply:

```bash
mise run deploy:gcp
```

To preview changes without applying:

```bash
mise run deploy:gcp:plan
```

To re-apply Terraform without rebuilding images:

```bash
mise run deploy:gcp:tf-only
```

To test the deployed backend after rollout:

```bash
./deployment/gcp/test_backend.sh
```

Terraform will:
- create one Cloud Run service per model
- attach GPUs for models marked with `gpu: true`
- create the gateway service account
- create a dedicated worker service account
- wire worker invocation permissions
- store API keys and optional Hugging Face tokens in Secret Manager
- encrypt secrets and retained logs with Cloud KMS
- keep workers internal-only while exposing the gateway publicly
- configure audit-log retention, uptime checks, and alerting
- publish the gateway service and output its URL
- deployment/gcp/deploy.py builds and pushes images, then runs Terraform.
- deployment/gcp/main.tf defines Cloud Run services, IAM, secrets, logging, and monitoring.
- deployment/gcp/test_backend.sh validates the live deployment against every model in config.yaml.
- deployment/gcp/terraform.tfvars.example is the starting point for project-specific values.
- app/gateway.py - auth, model registry, request forwarding
- app/worker.py - model loading and embedding inference
- config.yaml - model fleet definition
- generate_compose.py - local stack generation
- deployment/gcp/deploy.py - build, push, and deploy orchestration for GCP
- deployment/gcp/main.tf - Cloud Run services, IAM, secrets, logging, monitoring, GPU configuration
- deployment/gcp/test_backend.sh - live backend validation after deploy
- .env.example - optional Hugging Face token setup for gated models
- mise.toml - local development, test, and deploy tasks
- test.sh - end-to-end local validation
- Clients must add any model-specific prefixes such as `query:` or `passage:` themselves.
- Some models are gated and require a valid `HF_TOKEN` with accepted license terms.
- Some Jina models pin an older `transformers` version for compatibility.
- The public gateway uses a preshared API key for authentication.
- Multimodal image URL inputs are restricted to avoid internal-network fetches.
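That restriction follows the usual SSRF pattern: resolve the host and refuse private or loopback ranges before fetching. A simplified Python sketch under that assumption; the workers' actual checks may be stricter:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_image_url(url: str) -> bool:
    """Reject image URLs that resolve to internal networks.

    Simplified SSRF guard: require http(s), resolve the hostname, and
    refuse private, loopback, link-local, and reserved addresses.
    Sketch only; a production check should also guard against DNS
    rebinding between this check and the actual fetch.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```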
If this project gets the attention it deserves, the obvious next layer is not a rewrite. It is observability, request accounting, and model-level traffic policy on top of the same gateway/worker split.