A high-performance control plane for self-hosted embeddings.
In production for flow-like.com.
Serverless gateway on Google Cloud Run, with isolated CPU and GPU embedding workers behind it.
any-embedding gives you one OpenAI-compatible endpoint in front of a fleet of sentence-transformers models, with per-model workers, baked model images, and a clean path from local Docker Compose to a production-grade Google Cloud Run deployment.
It is built for the reality of modern embedding infrastructure:
- the leaderboard changes constantly
- CPU and GPU models need different deployment shapes
- clients want one stable API, not a zoo of custom services
- One API, many models. Route by model name and keep client integration fixed.
- Config-driven fleet. Add or remove models in config.yaml, then regenerate local compose or deploy with Terraform.
- No cold-start downloads. Worker images bake model weights at build time.
- GPU where it matters. Heavy models can run on dedicated GPU workers; smaller models stay on cheaper CPU workers.
- OpenAI-compatible surface. Existing SDKs and integrations can usually point at this with minimal adaptation.
- Single-command cloud deploy. Build, push, and apply infrastructure with one `mise` task.
- Clean separation of concerns. Gateway handles auth and routing. Workers only load and serve one model.
```
Client
  |
  v
Gateway (FastAPI, Cloud Run or local)
  - API key auth
  - OpenAI-compatible /v1/embeddings
  - model registry loaded from config.yaml
  - routes request to the correct worker
  |
  +--> Worker: BAAI/bge-large-en-v1.5
  +--> Worker: Qwen/Qwen3-Embedding-0.6B
  +--> Worker: gte-multilingual-base
  +--> Worker: CLIP / multimodal model
```
Each worker loads exactly one model and exposes an internal /embed endpoint. The gateway stays thin, predictable, and cheap. The workers scale independently based on the actual model mix you serve.
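The routing convention can be sketched in a few lines of Python. This is an illustration of the pattern, not the gateway's actual code: in local development the worker URL for each model is resolved from a `WORKER_URL_*` environment variable derived from the model name.

```python
import os

def worker_url_for(model_name: str) -> str:
    """Resolve a model name to its worker's base URL.

    Illustrative sketch of the WORKER_URL_* convention (e.g. the model
    "bge-large-en-v1.5" maps to WORKER_URL_BGE_LARGE_EN_V1_5); the real
    gateway builds its registry from config.yaml.
    """
    env_var = "WORKER_URL_" + model_name.upper().replace("-", "_").replace(".", "_")
    url = os.environ.get(env_var)
    if url is None:
        raise KeyError(f"no worker configured for model {model_name!r}")
    return url
```

In production the same lookup is driven by the model registry rather than raw environment variables, but the contract is identical: one model name, one worker URL.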
This repo is no longer just a demo wrapper around sentence-transformers. The current codebase already includes meaningful production controls:
- Constant-time API key comparison in the gateway.
- In-memory rate limiting at the public edge.
- Structured audit logging for API requests.
- Cloud Run identity tokens for gateway-to-worker calls.
- Internal-only worker ingress on GCP.
- SSRF protection for remote image URLs in multimodal workers.
- Secret Manager and CMEK-backed secret storage in the GCP deployment.
- Monitoring, uptime checks, and alerting in the GCP Terraform stack.
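The constant-time comparison is worth showing concretely. A minimal sketch using Python's standard library; the gateway's actual implementation may differ in detail:

```python
import hmac

def api_key_matches(provided: str, expected: str) -> bool:
    """Compare API keys in constant time.

    hmac.compare_digest avoids the early-exit behavior of ==, which
    would leak how many leading characters matched via response timing.
    Sketch only, not the repo's exact code.
    """
    return hmac.compare_digest(provided.encode(), expected.encode())
```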
The entire stack is driven by config.yaml.
```yaml
models:
  - name: "bge-large-en-v1.5"
    model: "BAAI/bge-large-en-v1.5"
    type: "text"
    max_tokens: 512
    dimensions: 1024
  - name: "qwen3-embedding-0.6b"
    model: "Qwen/Qwen3-Embedding-0.6B"
    type: "text"
    max_tokens: 32768
    dimensions: 1024
    gpu: true
    cpu: "4"
    memory: "16Gi"
```

Supported per-model overrides: `cpu`, `memory`, `gpu`, `max_instances`, `min_instances`, `sentence_transformers_version`, `transformers_version`.
This is the core design choice in the repo: configuration defines the fleet, and both local development and cloud deployment consume the same source of truth.
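Adding a model is therefore a config change, not a code change. A hypothetical entry for the gte worker shown in the architecture diagram (the field values here are illustrative, not taken from the repo):

```yaml
models:
  # illustrative entry; regenerate docker-compose.yaml or re-run the
  # GCP deploy after adding it
  - name: "gte-multilingual-base"
    model: "Alibaba-NLP/gte-multilingual-base"
    type: "text"
    max_tokens: 8192
    dimensions: 768
```

After editing, both generate_compose.py and the Terraform deploy pick the new worker up from the same registry.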
This is the fastest way to validate the whole system end to end.
```bash
./test.sh
```

Useful variants:

```bash
./test.sh --up
./test.sh --down
```

The test script generates docker-compose.yaml, builds the images, starts the gateway and workers, runs integration checks, and tears the stack down unless you keep it running.
If you want gated models to build locally, copy .env.example to .env and set HF_TOKEN first.
With mise:
```bash
mise install
mise run install
```

Start a worker:

```bash
MODEL_NAME="BAAI/bge-large-en-v1.5" mise run worker
```

Start the gateway in another terminal:

```bash
API_KEY="test-key" \
WORKER_URL_BGE_LARGE_EN_V1_5="http://localhost:8081" \
mise run gateway
```

Without mise, you can install directly:

```bash
uv pip install -e '.[gateway,worker]'
```

Available task entry points live in mise.toml, including local run, test, and GCP deploy flows.
Text request:
```bash
curl https://<gateway-url>/v1/embeddings \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-en-v1.5",
    "input": "query: serverless GPU embeddings"
  }'
```

Image request for multimodal models:

```bash
curl https://<gateway-url>/v1/embeddings \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clip-vit-large-patch14",
    "input": {
      "type": "image",
      "image": {
        "type": "image_base64",
        "image_base64": "<base64-encoded-image>"
      }
    }
  }'
```

The API supports:
- single text inputs
- batch text inputs
- image inputs via `image_base64`
- image inputs via `image_url`
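Building the `image_base64` payload from a client is straightforward. A small Python sketch that mirrors the field names in the curl example above (the helper name is mine, not part of any SDK):

```python
import base64
import json

def image_embedding_payload(model: str, image_bytes: bytes) -> str:
    """Build the JSON body for an image_base64 embedding request.

    Mirrors the request shape shown in the curl example; sketch only.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "input": {
            "type": "image",
            "image": {"type": "image_base64", "image_base64": encoded},
        },
    })
```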
List all configured models through an OpenAI-style response.
```bash
curl https://<gateway-url>/v1/models \
  -H "Authorization: Bearer <API_KEY>"
```

The gateway also exposes a basic health check endpoint.
- The gateway is stateless. It only validates auth, loads config, and forwards requests.
- The workers are isolated. One bad model does not poison the rest of the fleet.
- The images are reproducible. You know exactly which model ships with which service.
- The deployment is boring in the best possible way. Local compose, test automation, deploy automation, and Terraform all follow the same model registry.
This is the kind of setup that scales attention well: it is easy to explain, easy to extend, and does not collapse into custom one-off infrastructure as the model set grows.
Google Cloud Run deployment with serverless GPU workers
Google Cloud is the right deployment target for this project because Cloud Run is the rare serverless platform that keeps the operational simplicity of a fully managed service while still letting you attach GPUs to the workers that need them.
This is also the deployment story behind flow-like.com: one public gateway, isolated model workers, and serverless GPU capacity when heavier embedding models need it.
Why that matters here:
- Serverless plus GPU is the hard problem. CPU-only embeddings are easy; serious embedding fleets usually are not.
- Cloud Run lets the gateway stay lightweight and public while model workers scale independently behind IAM.
- The Terraform in deployment/gcp/main.tf already maps `gpu: true` models to NVIDIA L4-backed Cloud Run services.
- You avoid running Kubernetes just to serve one or two heavier embedding models.
- The project architecture matches Cloud Run cleanly: small gateway, isolated workers, image-based deploys, explicit service-to-service auth.
- The deployment path now includes Secret Manager, CMEK, long-term audit logs, uptime checks, and alerting without changing the application shape.
In short: if you want managed infrastructure and GPU-backed model workers without operating a cluster, Google Cloud is the story.
- A GCP project with billing enabled
- Docker
- Terraform 1.5+
- Authenticated `gcloud` access to push container images
- An Artifact Registry repository for your gateway and worker images
- A local `terraform.tfvars` copied from deployment/gcp/terraform.tfvars.example
Copy the example .env file and add your Hugging Face token (required for gated models like google/embeddinggemma-300m):
```bash
cp .env.example .env
```

Edit .env and set your token:

```
HF_TOKEN=hf_your_token_here
```
The deploy script in deployment/gcp/deploy.py loads .env automatically, passes HF_TOKEN as a BuildKit secret for image builds, and avoids baking that secret into the final image layers. The .env file is git-ignored.
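The BuildKit secret pattern the deploy script relies on can be sketched as a Dockerfile fragment. This is illustrative only: the repo's actual Dockerfiles may differ, and the model name here is just an example of a gated download.

```dockerfile
# Illustrative fragment, not the repo's actual Dockerfile.
# The secret is mounted only for the duration of this RUN step,
# so the token never lands in an image layer.
RUN --mount=type=secret,id=HF_TOKEN \
    HF_TOKEN=$(cat /run/secrets/HF_TOKEN) \
    python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('google/embeddinggemma-300m')"
```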
A single command builds all Docker images (gateway + every worker), pushes them to your registry, and runs terraform apply:

```bash
mise run deploy:gcp
```

To preview changes without applying:

```bash
mise run deploy:gcp:plan
```

To re-apply Terraform without rebuilding images:

```bash
mise run deploy:gcp:tf-only
```

To test the deployed backend after rollout:

```bash
./deployment/gcp/test_backend.sh
```

Terraform will:
- create one Cloud Run service per model
- attach GPUs for models marked with `gpu: true`
- create the gateway service account
- create a dedicated worker service account
- wire worker invocation permissions
- store API keys and optional Hugging Face tokens in Secret Manager
- encrypt secrets and retained logs with Cloud KMS
- keep workers internal-only while exposing the gateway publicly
- configure audit-log retention, uptime checks, and alerting
- publish the gateway service and output its URL
- deployment/gcp/deploy.py builds and pushes images, then runs Terraform.
- deployment/gcp/main.tf defines Cloud Run services, IAM, secrets, logging, and monitoring.
- deployment/gcp/test_backend.sh validates the live deployment against every model in config.yaml.
- deployment/gcp/terraform.tfvars.example is the starting point for project-specific values.
- app/gateway.py - auth, model registry, request forwarding
- app/worker.py - model loading and embedding inference
- config.yaml - model fleet definition
- generate_compose.py - local stack generation
- deployment/gcp/deploy.py - build, push, and deploy orchestration for GCP
- deployment/gcp/main.tf - Cloud Run services, IAM, secrets, logging, monitoring, GPU configuration
- deployment/gcp/test_backend.sh - live backend validation after deploy
- .env.example - optional Hugging Face token setup for gated models
- mise.toml - local development, test, and deploy tasks
- test.sh - end-to-end local validation
- Clients must add any model-specific prefixes such as `query:` or `passage:` themselves.
- Some models are gated and require a valid `HF_TOKEN` with accepted license terms.
- Some Jina models pin an older `transformers` version for compatibility.
- The public gateway uses a preshared API key for authentication.
- Multimodal image URL inputs are restricted to avoid internal-network fetches.
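That restriction follows the usual SSRF pattern: resolve the host and refuse private or loopback ranges before fetching. A simplified Python sketch under that assumption; the workers' actual checks may be stricter:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_image_url(url: str) -> bool:
    """Reject image URLs that resolve to internal networks.

    Simplified SSRF guard: require http(s), resolve the hostname, and
    refuse private, loopback, link-local, and reserved addresses.
    Sketch only; a production check should also guard against DNS
    rebinding between this check and the actual fetch.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```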
If this project gets the attention it deserves, the obvious next layer is not a rewrite. It is observability, request accounting, and model-level traffic policy on top of the same gateway/worker split.