This document outlines the architecture for a highly available, serverless Model Context Protocol (MCP) server deployed on Cloudflare infrastructure. The system provides AI agents with read/write access to a personal knowledge base (e.g. a Git-backed Obsidian vault). Running on edge compute avoids container cold starts and local network ingress complexity, and keeps vector retrieval fast regardless of local hardware state.
The system runs entirely on Cloudflare's native edge services, replacing traditional containerized compute and dedicated database clusters.
- Compute / API Gateway: Cloudflare Workers (V8 Isolates)
- Relational State & Metadata: Cloudflare D1 (Serverless SQLite)
- Semantic Indexing: Cloudflare Vectorize
- Native Inference: Cloudflare Workers AI, `@cf/baai/bge-base-en-v1.5` (768-dim embeddings)
- Zero Trust Security: Cloudflare Access (service tokens for machine clients)
D1 is the authoritative source for document content, chunks, and metadata. Two tables: documents (one row per captured note) and chunks (one row per embedded segment, joined by document_id).
documents
| Column | Type | Description |
|---|---|---|
| id | TEXT (UUID) | Primary key, generated via Web Crypto API in the Worker. |
| content | TEXT | Raw markdown content. |
| content_hash | TEXT | SHA-256 of content. Indexed. Guarantees idempotency on re-capture. |
| source | TEXT | Free-form origin tag (e.g. obsidian:vault/note.md, voice-memo, web-clip). |
| metadata | TEXT (JSON) | Frontmatter, extracted entities, tags. |
| summary | TEXT | One-line auto-summary (≤140 chars) generated by Llama 3 at capture time. Nullable — pre-enrichment rows are NULL. |
| tags | TEXT (JSON) | JSON array of 3-7 kebab-case topic strings derived from content by Llama 3. Nullable — pre-enrichment rows are NULL. |
| created_at | INTEGER | Unix epoch ms, set on insert. |
| updated_at | INTEGER | Unix epoch ms, bumped on re-capture of the same hash. |
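
The `id` and `content_hash` columns above can both be produced with the Web Crypto API available in Workers. The following is a minimal sketch under that assumption; the function names `sha256Hex` and `newDocumentId` are illustrative and do not come from the implementation.

```typescript
// Hex-encode the SHA-256 digest of a note's raw markdown.
// The result is what would be stored in documents.content_hash.
async function sha256Hex(content: string): Promise<string> {
  const bytes = new TextEncoder().encode(content);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Random (version 4) UUID, usable for documents.id and chunks.id.
function newDocumentId(): string {
  return crypto.randomUUID();
}
```

Because the hash is computed over the exact content string, any whitespace or frontmatter change produces a new hash and therefore a new document row.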
chunks
| Column | Type | Description |
|---|---|---|
| id | TEXT (UUID) | Primary key. Also the Vectorize vector ID. |
| document_id | TEXT (UUID) | FK → documents.id. Indexed. Cascade on document delete. |
| ordinal | INTEGER | 0-based position of the chunk within the source document. |
| content | TEXT | Chunk text — the exact string that was embedded. |
Notes are chunked before embedding to preserve retrieval granularity:
- Target size: ~512 tokens per chunk (matches `bge-base-en-v1.5`'s 512-token input limit).
- Overlap: ~64 tokens between adjacent chunks to preserve cross-boundary context.
- Splitter: Paragraph-aware — split on blank lines first, then pack paragraphs into chunks until the target size is reached. Falls back to sentence/character splitting only when a single paragraph exceeds the budget.
- Short notes (≤ ~512 tokens) produce a single chunk and skip splitting entirely.
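
The chunking rules above can be sketched as a small packing function. This is an illustration, not the actual splitter: it approximates tokens by whitespace-separated words (the real tokenizer will count differently), and its fallback for oversized paragraphs is a plain word window rather than sentence-aware splitting. The name `chunkNote` is hypothetical.

```typescript
// Rough token proxy (assumption): one whitespace-separated word = one token.
const words = (s: string): string[] => s.split(/\s+/).filter((w) => w.length > 0);

function chunkNote(text: string, target = 512, overlap = 64): string[] {
  if (overlap >= target) throw new Error("overlap must be smaller than target");
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  // Short notes produce a single chunk and skip splitting entirely.
  if (words(trimmed).length <= target) return [trimmed];

  // Reserve room for the overlap so that tail + piece never exceeds target.
  const budget = target - overlap;
  const pieces: string[] = [];
  for (const para of trimmed.split(/\n\s*\n/)) {
    const w = words(para);
    if (w.length === 0) continue;
    if (w.length <= budget) {
      pieces.push(para.trim());
    } else {
      // Fallback: a single paragraph exceeds the budget, so window it by words.
      for (let i = 0; i < w.length; i += budget) {
        pieces.push(w.slice(i, i + budget).join(" "));
      }
    }
  }

  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;
  for (const piece of pieces) {
    const n = words(piece).length;
    if (current.length > 0 && size + n > target) {
      const chunk = current.join("\n\n");
      chunks.push(chunk);
      // Seed the next chunk with the last `overlap` words of this one.
      const tail = overlap > 0 ? words(chunk).slice(-overlap) : [];
      current = tail.length > 0 ? [tail.join(" ")] : [];
      size = tail.length;
    }
    current.push(piece);
    size += n;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

The array index of each returned chunk corresponds to the `ordinal` column in `chunks`.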
- Dimensions: 768 (matches `@cf/baai/bge-base-en-v1.5`).
- Metric: Cosine similarity.
- Vector ID: the `chunks.id` UUID, with read-through to D1 for hydration.
- Metadata payload: `{ document_id, ordinal, tags[] }` for filtered search and result grouping.
Switching to a 1024- or 1536-dim provider later requires a Vectorize re-index; the surrounding schema stays the same.
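
To make the index contract concrete, the sketch below builds one vector record per chunk and rejects embeddings of the wrong width. `VectorRecord` is a local type that mirrors the id/values/metadata shape described above; it is not imported from Cloudflare's type definitions. `toVectorRecord` is a hypothetical helper.

```typescript
const EMBEDDING_DIMS = 768; // bge-base-en-v1.5 output width

interface ChunkRow {
  id: string;
  document_id: string;
  ordinal: number;
  content: string;
}

// Local stand-in for the record sent to the vector index.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { document_id: string; ordinal: number; tags: string[] };
}

function toVectorRecord(chunk: ChunkRow, embedding: number[], tags: string[]): VectorRecord {
  // Failing here gives a clearer error than a failed upsert after a model swap.
  if (embedding.length !== EMBEDDING_DIMS) {
    throw new Error(`expected ${EMBEDDING_DIMS} dims, got ${embedding.length}`);
  }
  return {
    id: chunk.id, // the vector ID is the chunks.id UUID
    values: embedding,
    metadata: { document_id: chunk.document_id, ordinal: chunk.ordinal, tags },
  };
}
```

Keeping the dimension in one constant means a later move to a 1024- or 1536-dim model touches this value and the index, while the record shape stays the same.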
The Worker exposes its capabilities to agents via MCP over Streamable HTTP (the standard transport for remote MCP servers). The same Worker also exposes a small REST surface for non-agent ingest clients (CI/CD pipelines, mobile capture apps).
| Tool | Inputs | Behavior |
|---|---|---|
| `capture_thought` | `content`, `source?`, `metadata?` | Ingest a markdown note. Idempotent on content_hash. |
| `semantic_search` | `query`, `top_k?`, `tags?` | Embed query, retrieve nearest chunks, hydrate from D1. |
| `get_thought` | `id` | Fetch a single document and its chunks by ID. |
| `delete_thought` | `id` | Remove a document, its chunks, and all associated vectors. |
| `list_recent` | `limit?` | Return recent documents by created_at. |
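
As an illustration of how the optional inputs in the table might be normalized before use, here is a sketch for `semantic_search`. The default of 5 and the cap of 20 for `top_k` are assumptions chosen for the example; the document does not specify them. `parseSearchInput` is a hypothetical name.

```typescript
interface SearchInput {
  query: string;
  top_k: number;
  tags: string[];
}

const DEFAULT_TOP_K = 5; // assumption, not specified by the design
const MAX_TOP_K = 20; // assumption, not specified by the design

function parseSearchInput(raw: unknown): SearchInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("input must be an object");
  }
  const input = raw as Record<string, unknown>;

  const query = input["query"];
  if (typeof query !== "string" || query.trim() === "") {
    throw new Error("query is required");
  }

  let topK = DEFAULT_TOP_K;
  const rawTopK = input["top_k"];
  if (rawTopK !== undefined) {
    if (typeof rawTopK !== "number" || !Number.isInteger(rawTopK) || rawTopK < 1) {
      throw new Error("top_k must be a positive integer");
    }
    topK = Math.min(rawTopK, MAX_TOP_K);
  }

  let tags: string[] = [];
  const rawTags = input["tags"];
  if (rawTags !== undefined) {
    if (!Array.isArray(rawTags)) throw new Error("tags must be an array of strings");
    tags = rawTags.filter((t): t is string => typeof t === "string");
    if (tags.length !== rawTags.length) throw new Error("tags must be an array of strings");
  }

  return { query: query.trim(), top_k: topK, tags };
}
```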
- Client (local agent, CI/CD pipeline, or capture app) sends `content` to the Worker.
- Worker computes SHA-256 of `content` and looks up `content_hash` in D1.
  - Hit: bump `updated_at` on the existing row and return its ID. No re-embedding.
  - Miss: continue.
- Worker chunks the content per §3.2.
- Worker runs enrichment and embedding in parallel (`Promise.all`):
  - Enrichment: `@cf/meta/llama-3.1-8b-instruct` (temperature 0.2) extracts a `summary` (≤140 chars) and 3-7 `tags` (kebab-case). Non-blocking: any AI or parse error silently yields empty fields rather than failing capture.
  - Embedding: `@cf/baai/bge-base-en-v1.5` embeds all chunks. Sync: bounded latency for personal-scale notes.
- Worker writes the `documents` row (including `summary` and `tags`) and `chunks` rows to D1, then upserts vectors into Vectorize. Vectorize metadata includes `tags[]` for future filtered search. D1 and Vectorize are separate bindings with no cross-resource transaction; on Vectorize failure the Worker rolls back the D1 inserts.
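
The capture steps above can be sketched as one orchestration function. To keep the control flow readable without real bindings, D1, Vectorize, and Workers AI are hidden behind a hypothetical `CaptureDeps` interface; every method name on it is an assumption made for this example, as is `captureThought` itself.

```typescript
interface Enrichment {
  summary: string | null;
  tags: string[];
}

// Hypothetical seams around D1, Vectorize, and Workers AI.
interface CaptureDeps {
  hash(content: string): Promise<string>;
  findIdByHash(hash: string): Promise<string | null>;
  bumpUpdatedAt(id: string): Promise<void>;
  chunk(content: string): string[];
  enrich(content: string): Promise<Enrichment>;
  embed(chunks: string[]): Promise<number[][]>;
  insertRows(hash: string, content: string, e: Enrichment, chunks: string[]): Promise<string>;
  upsertVectors(documentId: string, vectors: number[][], tags: string[]): Promise<void>;
  deleteRows(documentId: string): Promise<void>;
}

async function captureThought(
  content: string,
  deps: CaptureDeps,
): Promise<{ id: string; created: boolean }> {
  const hash = await deps.hash(content);

  // Hit: identical content was captured before, so skip chunking and embedding.
  const existingId = await deps.findIdByHash(hash);
  if (existingId !== null) {
    await deps.bumpUpdatedAt(existingId);
    return { id: existingId, created: false };
  }

  const chunks = deps.chunk(content);

  // Enrichment failures degrade to empty fields; embedding failures still reject.
  const emptyEnrichment: Enrichment = { summary: null, tags: [] };
  const [enrichment, vectors] = await Promise.all([
    deps.enrich(content).catch(() => emptyEnrichment),
    deps.embed(chunks),
  ]);

  const id = await deps.insertRows(hash, content, enrichment, chunks);
  try {
    await deps.upsertVectors(id, vectors, enrichment.tags);
  } catch (err) {
    // No cross-resource transaction: undo the D1 write by hand, then rethrow.
    await deps.deleteRows(id);
    throw err;
  }
  return { id, created: true };
}
```

Only the enrichment promise gets a `.catch`, which is what makes enrichment optional while embedding stays mandatory. If the compensating delete itself fails, the D1 rows remain without vectors; the sketch does not handle that case.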
Note — Vectorize propagation lag. Newly upserted vectors are not immediately queryable; observed lag is 10–30s. v0 stance is accept and document: capture returns sub-second, but a search issued in the same window may miss the just-captured content. Revisit if read-after-write becomes a problem in practice.
- Agent sends a query string via MCP.
- Worker embeds the query via Workers AI.
- Worker queries Vectorize for top-K nearest neighbors, optionally pre-filtered by `tags`.
- Worker hydrates each hit by reading the corresponding `chunks` and `documents` rows in D1.
- Worker returns structured results: `[{ document_id, chunk_content, document_content, score, metadata }]`.
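
The hydration step in the flow above amounts to joining vector matches against rows already read from D1. The sketch below shows that join on in-memory maps. Skipping matches that have no D1 row is an assumption about how orphaned vectors would be treated, and `hydrateMatches` and the type names are hypothetical.

```typescript
interface VectorMatch {
  id: string; // equals chunks.id
  score: number;
}

interface ChunkRow {
  id: string;
  document_id: string;
  ordinal: number;
  content: string;
}

interface DocumentRow {
  id: string;
  content: string;
  metadata: string | null; // JSON text as stored in D1
}

interface SearchHit {
  document_id: string;
  chunk_content: string;
  document_content: string;
  score: number;
  metadata: unknown;
}

// Parse the JSON metadata column, tolerating NULL and malformed values.
function parseMetadata(raw: string | null): unknown {
  if (raw === null) return null;
  try {
    return JSON.parse(raw);
  } catch (e) {
    return null;
  }
}

function hydrateMatches(
  matches: VectorMatch[],
  chunks: Map<string, ChunkRow>,
  docs: Map<string, DocumentRow>,
): SearchHit[] {
  const hits: SearchHit[] = [];
  for (const match of matches) {
    const chunk = chunks.get(match.id);
    const doc = chunk ? docs.get(chunk.document_id) : undefined;
    if (!chunk || !doc) continue; // vector with no backing row: skip it
    hits.push({
      document_id: doc.id,
      chunk_content: chunk.content,
      document_content: doc.content,
      score: match.score,
      metadata: parseMetadata(doc.metadata),
    });
  }
  return hits; // order follows the match order, i.e. nearest first
}
```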
- Agent calls `delete_thought` with a document ID.
- Worker reads the document's chunk IDs, deletes the corresponding vectors from Vectorize, then deletes the chunks and the document row from D1 (cascade).
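
The ordering in the delete flow matters because the vector IDs are only discoverable through the `chunks` rows. A minimal sketch, with the storage calls hidden behind a hypothetical `DeleteDeps` interface (none of these method names come from the implementation):

```typescript
// Hypothetical seams around D1 and Vectorize.
interface DeleteDeps {
  chunkIdsFor(documentId: string): Promise<string[]>;
  deleteVectors(ids: string[]): Promise<void>;
  deleteDocument(documentId: string): Promise<void>; // chunks go via cascade
}

// Returns the number of vectors removed.
async function deleteThought(documentId: string, deps: DeleteDeps): Promise<number> {
  // 1. Read chunk IDs while the rows still exist.
  const ids = await deps.chunkIdsFor(documentId);
  // 2. Remove vectors first; if this throws, the D1 rows are untouched.
  if (ids.length > 0) await deps.deleteVectors(ids);
  // 3. Remove the document row last.
  await deps.deleteDocument(documentId);
  return ids.length;
}
```

If the vector delete fails, the call can simply be retried, since the chunk IDs are still in D1. Deleting the D1 rows first would lose the only record of which vectors to remove.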
The Worker is shielded behind Cloudflare Access. Automated clients authenticate via Service Tokens (CF-Access-Client-Id / CF-Access-Client-Secret). This prevents unauthorized endpoint probing without requiring open inbound ports on the local gateway.
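
For a machine client, service-token auth comes down to sending the two headers named above on every request. A small sketch; `accessHeaders` is a hypothetical helper, and the `Content-Type` header is added on the assumption that request bodies are JSON.

```typescript
// Build the request headers for a Cloudflare Access service token.
function accessHeaders(clientId: string, clientSecret: string): Record<string, string> {
  if (clientId === "" || clientSecret === "") {
    throw new Error("service token client ID and secret are both required");
  }
  return {
    "CF-Access-Client-Id": clientId,
    "CF-Access-Client-Secret": clientSecret,
    "Content-Type": "application/json", // assumption: JSON request bodies
  };
}

// Usage with fetch (the URL is a placeholder):
//   await fetch("https://worker.example.com/search", {
//     method: "POST",
//     headers: accessHeaders(id, secret),
//     body: JSON.stringify({ query: "vector databases" }),
//   });
```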
To minimize time to a working end-to-end system, v0 deliberately defers the more involved pieces:
In v0:
- A single Worker exposing REST routes: `POST /capture`, `POST /search`, `GET /thought/:id`, `DELETE /thought/:id`.
- Bound D1, Vectorize, and Workers AI.
- `bge-base-en-v1.5` at 768 dims, paragraph-aware chunking per §3.2.
- Bearer-token auth using a single shared secret (Wrangler secret). (removed in v1)
In v1:
- Cloudflare Access service tokens replacing the bearer secret. The Worker verifies the `Cf-Access-Jwt-Assertion` JWT against CF's JWKS endpoint for defense-in-depth. The bearer shim (`AUTH_SECRET`) has been removed entirely (hard cutover, no fallback).
- MCP Streamable HTTP server mounted at `/mcp` in the same Worker. Exposes 5 tools: `capture_thought`, `semantic_search`, `get_thought`, `delete_thought`, `list_recent`. Implemented with `McpServer` (`@modelcontextprotocol/sdk`) + `createMcpHandler` from the Cloudflare `agents` package (stateless, no Durable Objects required).
- `list_recent` REST route: `GET /thoughts?limit=N&before=<unix_ms>` returns recent documents by `created_at DESC`. The optional `before` cursor returns docs strictly older than the given Unix epoch ms, letting backup/export jobs walk the full corpus in `limit`-sized batches.
- Auth: same check as REST routes applies to the `/mcp` endpoint.
- Capture-time enrichment via Llama 3: `@cf/meta/llama-3.1-8b-instruct` runs in parallel with embedding at capture time, producing a `summary` (≤140 chars) and 3-7 kebab-case `tags`. Persisted on the `documents` row and surfaced in all read paths (`get_thought`, `list_recent`, `semantic_search`). Tags also stored in Vectorize metadata for future filtered search. Non-blocking: enrichment failure yields empty fields, never breaks capture.
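
The `before` cursor on `GET /thoughts` from the list above can be modelled with two small functions: one that mimics the route's semantics over an in-memory array, and one that walks the corpus the way a backup job would. Both names (`listRecent`, `walkCorpus`) and the `DocSummary` type are hypothetical, and the in-memory filter stands in for the real D1 query.

```typescript
interface DocSummary {
  id: string;
  created_at: number; // Unix epoch ms
}

// Mimics GET /thoughts?limit=N&before=<unix_ms>: newest first, strictly older than `before`.
function listRecent(all: DocSummary[], limit: number, before?: number): DocSummary[] {
  return all
    .filter((d) => before === undefined || d.created_at < before)
    .sort((a, b) => b.created_at - a.created_at)
    .slice(0, limit);
}

// Walks the whole corpus in limit-sized pages, feeding each page's oldest
// timestamp back in as the next cursor.
function walkCorpus(
  fetchPage: (limit: number, before?: number) => DocSummary[],
  limit: number,
): DocSummary[] {
  const out: DocSummary[] = [];
  let before: number | undefined = undefined;
  for (;;) {
    const page = fetchPage(limit, before);
    if (page.length === 0) break;
    out.push(...page);
    before = page[page.length - 1]!.created_at;
  }
  return out;
}
```

One consequence of a strictly-older cursor: if two documents share the same `created_at` millisecond and a page boundary falls between them, the second one is skipped. At personal scale this is unlikely, but an export job that needs completeness may want a tie-breaker.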
Deferred to v2+:
- Re-ranking pass on search results.
- CF Queues for async embedding — only if sync embedding latency becomes a problem at capture time.
A running register of decisions to validate during the prototype phase. Each entry records the current direction and what evidence would change it.
Direction: Vocabulary-aware enrichment (Option A). On capture, fetch the top-N most-frequent existing tags from D1 and pass them to the Llama prompt as preferred reuse candidates. New tags are still allowed, but the LLM is biased toward the existing vocabulary so the corpus converges on its own ontology rather than fragmenting (k8s vs kubernetes, vector-database vs vector-db).
To validate during prototype:
- Does drift actually decrease in practice, or does the LLM ignore the hint and invent synonyms anyway?
- Does the extra D1 read add meaningful capture latency? (Expected: cheap with an index on `tags`.)
- What's the right N? Too few → no anchor; too many → context bloat.
- Cold-start behavior: how messy are the first ~50 captures before vocabulary stabilizes?
Would reconsider if: drift persists despite the hint (fall back to Option B, post-hoc canonical map), or if capture latency regresses noticeably.
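
To make Option A concrete, the sketch below derives the top-N vocabulary from stored `tags` values and folds it into an enrichment prompt. Two things here are assumptions: counting happens in the Worker over raw JSON column values (the real query could aggregate in SQL instead), and the prompt wording is invented for the example. `topTags` and `buildEnrichmentPrompt` are hypothetical names.

```typescript
// Count tag frequency across documents.tags values (JSON arrays or NULL)
// and return the n most frequent, ties broken alphabetically.
function topTags(tagColumns: Array<string | null>, n: number): string[] {
  const counts = new Map<string, number>();
  for (const raw of tagColumns) {
    if (raw === null) continue; // pre-enrichment row
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch (e) {
      continue; // malformed JSON: ignore the row
    }
    if (!Array.isArray(parsed)) continue;
    for (const tag of parsed) {
      if (typeof tag === "string") counts.set(tag, (counts.get(tag) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, n)
    .map(([tag]) => tag);
}

// Illustrative prompt only; the actual wording is not specified by the design.
function buildEnrichmentPrompt(content: string, vocabulary: string[]): string {
  const hint =
    vocabulary.length > 0
      ? `Prefer these existing tags when they fit: ${vocabulary.join(", ")}.`
      : "No existing tags yet; choose freely.";
  return [
    "Summarize the note in at most 140 characters and propose 3-7 kebab-case tags.",
    hint,
    "New tags are allowed when none of the existing ones apply.",
    "---",
    content,
  ].join("\n");
}
```

The empty-vocabulary branch covers the cold-start case, where the first captures have no anchor to converge on.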
Direction: Two-pass eager backfill. Pass 1 enriches all pre-0002 rows with no vocabulary anchor (results discarded except as vocabulary seed). Pass 2 re-enriches with the vocabulary-aware prompt from §7.1; only pass 2's output is persisted. Self-bootstrapping.
To validate: quota cost on the current corpus, and whether two passes are actually better than one when the corpus is small enough that drift is bounded anyway.
Direction: Undecided. With AUTH_SECRET removed, a botched service-token rotation has no admin shim. Need a documented recovery path before this becomes load-bearing.
Candidates: dashboard-side service-token recreate (probably enough for a personal brain); a separate emergency token bound to a stricter Access policy; or a wrangler-only admin route gated on direct-invoke (no Access).
Direction: Captures-only. Memex is the capture-side surface (raw, in-the-moment thoughts from voice, CLI, agents); Obsidian is the canonical long-term vault (synthesized, edited, organized). The two are complementary, not parallel. The AI reads and writes only memex captures — it does not see vault content.
Why: capture is for raw; vault is for synthesized. Letting the AI reason over draft/synthesized vault notes mixed with raw captures pollutes both: search results become noisy, and enrichment vocabulary drifts toward whatever ad-hoc tagging exists in the vault rather than converging on the capture corpus's own ontology. Keeping the corpus to captures-only also keeps the trust boundary simple — one source of truth for the AI, with a clear human-in-the-loop step (manual promotion to vault) for anything load-bearing.
To validate during prototype:
- Does "captures-only" feel too narrow once the corpus grows? (i.e. do useful answers actually live in the vault that the AI can't reach?)
- Is there a lightweight promotion workflow that doesn't blur the boundary? Manual copy works today; build something only if rate-of-promotion justifies it.
Would reconsider if: repeated searches return shallow capture-only results when the answer demonstrably lives in a vault note, and a read-only vault-import path can be added without polluting the enrichment vocabulary (e.g. separate source: "vault" namespace + tag-isolated retrieval).