Design Document: Edge-Hosted "Second Brain" MCP Agent

1. Objective & Context

This document outlines the architecture for a highly available, serverless Model Context Protocol (MCP) server deployed on Cloudflare infrastructure. The system gives AI agents read/write access to a personal knowledge base (e.g. a Git-backed Obsidian vault). Running at the edge avoids container-style cold starts and local network ingress complexity, and keeps vector retrieval available regardless of local hardware state.

2. System Architecture

The system runs entirely on Cloudflare's native edge services, replacing traditional containerized compute and dedicated database clusters.

  • Compute / API Gateway: Cloudflare Workers (V8 Isolates)
  • Relational State & Metadata: Cloudflare D1 (Serverless SQLite)
  • Semantic Indexing: Cloudflare Vectorize
  • Native Inference: Cloudflare Workers AI — @cf/baai/bge-base-en-v1.5 (768-dim embeddings)
  • Zero Trust Security: Cloudflare Access (service tokens for machine clients)

3. Data Schema

3.1. D1 Relational Schema (SQLite)

D1 is the authoritative source for document content, chunks, and metadata. Two tables: documents (one row per captured note) and chunks (one row per embedded segment, joined by document_id).

documents

| Column | Type | Description |
| --- | --- | --- |
| id | TEXT (UUID) | Primary key, generated via Web Crypto API in the Worker. |
| content | TEXT | Raw markdown content. |
| content_hash | TEXT | SHA-256 of content. Indexed. Guarantees idempotency on re-capture. |
| source | TEXT | Free-form origin tag (e.g. obsidian:vault/note.md, voice-memo, web-clip). |
| metadata | TEXT (JSON) | Frontmatter, extracted entities, tags. |
| summary | TEXT | One-line auto-summary (≤140 chars) generated by Llama 3 at capture time. Nullable; pre-enrichment rows are NULL. |
| tags | TEXT (JSON) | JSON array of 3-7 kebab-case topic strings derived from content by Llama 3. Nullable; pre-enrichment rows are NULL. |
| created_at | INTEGER | Unix epoch ms, set on insert. |
| updated_at | INTEGER | Unix epoch ms, bumped on re-capture of the same hash. |

chunks

| Column | Type | Description |
| --- | --- | --- |
| id | TEXT (UUID) | Primary key. Also the Vectorize vector ID. |
| document_id | TEXT (UUID) | FK → documents.id. Indexed. Cascade on document delete. |
| ordinal | INTEGER | 0-based position of the chunk within the source document. |
| content | TEXT | Chunk text: the exact string that was embedded. |
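
The two tables above can be mirrored as row types in the Worker. The following is a minimal sketch, not the project's actual code: the `Thought` shape and the `decodeThought` and `parseJsonColumn` helpers are hypothetical names, and treating `source` as nullable is an assumption based on `source?` being optional on capture. It shows one way to decode the JSON-in-TEXT columns defensively so that NULL or malformed values never break a read path.

```typescript
// Row shapes as D1 would return them; JSON columns arrive as TEXT.
interface DocumentRow {
  id: string;
  content: string;
  content_hash: string;
  source: string | null; // assumed nullable because `source?` is optional on capture
  metadata: string | null; // JSON object as text
  summary: string | null; // NULL before enrichment
  tags: string | null; // JSON array as text, NULL before enrichment
  created_at: number; // Unix epoch ms
  updated_at: number; // Unix epoch ms
}

interface ChunkRow {
  id: string; // also the Vectorize vector ID
  document_id: string;
  ordinal: number;
  content: string;
}

// Decoded document handed to callers (hypothetical shape).
interface Thought {
  id: string;
  content: string;
  source: string | null;
  metadata: Record<string, unknown>;
  summary: string | null;
  tags: string[];
  chunks: string[]; // chunk text in ordinal order
}

// Parse a JSON column; NULL or malformed text yields undefined.
function parseJsonColumn(raw: string | null): unknown {
  if (raw === null) return undefined;
  try {
    return JSON.parse(raw);
  } catch {
    return undefined;
  }
}

function decodeThought(row: DocumentRow, chunkRows: ChunkRow[]): Thought {
  const meta = parseJsonColumn(row.metadata);
  const tags = parseJsonColumn(row.tags);
  return {
    id: row.id,
    content: row.content,
    source: row.source,
    metadata:
      typeof meta === "object" && meta !== null && !Array.isArray(meta)
        ? (meta as Record<string, unknown>)
        : {},
    summary: row.summary,
    tags: Array.isArray(tags)
      ? tags.filter((t): t is string => typeof t === "string")
      : [],
    // Keep only this document's chunks, ordered by their 0-based ordinal.
    chunks: chunkRows
      .filter((c) => c.document_id === row.id)
      .sort((a, b) => a.ordinal - b.ordinal)
      .map((c) => c.content),
  };
}
```

Decoding in one place keeps the "pre-enrichment rows are NULL" case out of every caller: a row with NULL tags simply decodes to an empty array.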

3.2. Chunking Strategy

Notes are chunked before embedding to preserve retrieval granularity:

  • Target size: ~512 tokens per chunk (matches bge-base-en-v1.5's 512-token input limit).
  • Overlap: ~64 tokens between adjacent chunks to preserve cross-boundary context.
  • Splitter: Paragraph-aware — split on blank lines first, then pack paragraphs into chunks until the target size is reached. Falls back to sentence/character splitting only when a single paragraph exceeds the budget.
  • Short notes (≤ ~512 tokens) produce a single chunk and skip splitting entirely.
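
The strategy above can be sketched as a small pure function. This is an illustration under stated assumptions, not the project's splitter: tokens are approximated by whitespace-separated words (the real tokenizer for bge-base-en-v1.5 will count differently), oversized paragraphs fall back to fixed word windows rather than sentence splitting, and the names `chunkNote` and `approxTokens` are hypothetical.

```typescript
// Rough token proxy: whitespace-separated words (an assumption, not the model's tokenizer).
function approxTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function chunkNote(content: string, target = 512, overlap = 64): string[] {
  if (overlap < 0 || overlap >= target) {
    throw new Error("overlap must be >= 0 and smaller than target");
  }
  const trimmed = content.trim();
  if (trimmed === "") return [];
  // Short notes produce a single chunk and skip splitting entirely.
  if (approxTokens(trimmed) <= target) return [trimmed];

  // Each chunk may start with up to `overlap` carried-over words,
  // so the fresh text per chunk is limited to `body` words.
  const body = target - overlap;

  // Split on blank lines first; break oversized paragraphs into word windows.
  const units: string[] = [];
  for (const para of trimmed.split(/\n\s*\n/)) {
    const words = para.split(/\s+/).filter(Boolean);
    if (words.length === 0) continue;
    if (words.length <= body) {
      units.push(para.trim());
      continue;
    }
    for (let i = 0; i < words.length; i += body) {
      units.push(words.slice(i, i + body).join(" "));
    }
  }

  const chunks: string[] = [];
  let packed: string[] = []; // units collected for the current chunk
  let size = 0; // word count of `packed`
  let tail: string[] = []; // last `overlap` words of the previous chunk

  const flush = (): void => {
    if (packed.length === 0) return;
    const text = packed.join("\n\n");
    chunks.push(tail.length > 0 ? tail.join(" ") + "\n\n" + text : text);
    const words = text.split(/\s+/).filter(Boolean);
    tail = overlap > 0 ? words.slice(-overlap) : [];
    packed = [];
    size = 0;
  };

  // Pack paragraphs until the next one would exceed the budget.
  for (const unit of units) {
    const n = approxTokens(unit);
    if (size + n > body) flush();
    packed.push(unit);
    size += n;
  }
  flush();
  return chunks;
}
```

Reserving the overlap inside the budget means no chunk exceeds the target even after the carried-over words are prepended, which matters when the target equals the model's input limit.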

3.3. Vectorize Schema

  • Dimensions: 768 (matches @cf/baai/bge-base-en-v1.5).
  • Metric: Cosine similarity.
  • Vector ID: the chunks.id UUID — read-through to D1 for hydration.
  • Metadata payload: { document_id, ordinal, tags[] } for filtered search and result grouping.

Switching to a 1024- or 1536-dim provider later requires a Vectorize re-index; the surrounding schema stays the same.
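
As an illustration of the schema above, the sketch below builds the per-chunk records that would be upserted. The `id` / `values` / `metadata` record shape follows the usual Vectorize vector format, but the types are declared locally here so the snippet stands alone; `toVectorRecords` and `VECTOR_DIMENSIONS` are hypothetical names. The dimension check is the part that would fail loudly if the embedding model were swapped without a re-index.

```typescript
interface ChunkRow {
  id: string;
  document_id: string;
  ordinal: number;
  content: string;
}

// Local stand-in for the vector record shape (id + values + metadata).
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { document_id: string; ordinal: number; tags: string[] };
}

const VECTOR_DIMENSIONS = 768; // matches @cf/baai/bge-base-en-v1.5

function toVectorRecords(
  chunks: ChunkRow[],
  embeddings: number[][],
  tags: string[]
): VectorRecord[] {
  if (chunks.length !== embeddings.length) {
    throw new Error("expected one embedding per chunk");
  }
  return chunks.map((chunk, i) => {
    const values = embeddings[i];
    if (values === undefined || values.length !== VECTOR_DIMENSIONS) {
      throw new Error(
        "chunk " + chunk.id + ": expected a " + VECTOR_DIMENSIONS + "-dim embedding"
      );
    }
    return {
      id: chunk.id, // the chunks.id UUID doubles as the vector ID
      values,
      metadata: {
        document_id: chunk.document_id,
        ordinal: chunk.ordinal,
        tags,
      },
    };
  });
}
```

Because the vector ID is the chunk ID, a search hit can be resolved with a single primary-key lookup in D1.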

4. Request Lifecycle & MCP Tools

The Worker exposes its capabilities to agents via MCP over Streamable HTTP (the standard transport for remote MCP servers). The same Worker also exposes a small REST surface for non-agent ingest clients (CI/CD pipelines, mobile capture apps).

4.1. MCP Tools

Tool Inputs Behavior
capture_thought content, source?, metadata? Ingest a markdown note. Idempotent on content_hash.
semantic_search query, top_k?, tags? Embed query, retrieve nearest chunks, hydrate from D1.
get_thought id Fetch a single document and its chunks by ID.
delete_thought id Remove a document, its chunks, and all associated vectors.
list_recent limit? Return recent documents by created_at.

4.2. The capture_thought Flow

  1. Client (local agent, CI/CD pipeline, or capture app) sends content to the Worker.
  2. Worker computes SHA-256 of content and looks up content_hash in D1.
    • Hit: bump updated_at on the existing row and return its ID. No re-embedding.
    • Miss: continue.
  3. Worker chunks the content per §3.2.
  4. Worker runs enrichment and embedding in parallel (Promise.all):
    • Enrichment: @cf/meta/llama-3.1-8b-instruct (temperature 0.2) extracts a summary (≤140 chars) and 3-7 kebab-case tags. Non-fatal: any AI or parse error silently yields empty fields rather than failing capture.
    • Embedding: @cf/baai/bge-base-en-v1.5 embeds all chunks. Sync — bounded latency for personal-scale notes.
  5. Worker writes the documents row (including summary and tags) and chunks rows to D1, then upserts vectors into Vectorize. Vectorize metadata includes tags[] for future filtered search. D1 and Vectorize are separate bindings with no cross-resource transaction, so on Vectorize failure the Worker compensates by deleting the D1 rows it just inserted.
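
The "any AI or parse error yields empty fields" rule in step 4 might look like the sketch below. It assumes the model was asked to reply with a JSON object of the form `{"summary": ..., "tags": [...]}` and that the raw reply text is passed in; the prompt format, the `parseEnrichment` name, and the choice to keep fewer than three tags rather than reject them are all assumptions, not documented behavior.

```typescript
interface Enrichment {
  summary: string;
  tags: string[];
}

const KEBAB_CASE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function parseEnrichment(raw: string): Enrichment {
  const empty: Enrichment = { summary: "", tags: [] };
  try {
    // Models often wrap JSON in prose or code fences; take the outermost braces.
    const start = raw.indexOf("{");
    const end = raw.lastIndexOf("}");
    if (start === -1 || end <= start) return empty;

    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed !== "object" || parsed === null) return empty;
    const obj = parsed as Record<string, unknown>;

    const rawSummary = obj["summary"];
    const summary =
      typeof rawSummary === "string" ? rawSummary.trim().slice(0, 140) : "";

    const rawTags = obj["tags"];
    const candidates: unknown[] = Array.isArray(rawTags) ? rawTags : [];
    const normalized = candidates
      .filter((t): t is string => typeof t === "string")
      .map((t) => t.trim().toLowerCase())
      .filter((t) => KEBAB_CASE.test(t));
    // De-duplicate and cap at 7; fewer than 3 valid tags are kept as-is here.
    const tags = Array.from(new Set(normalized)).slice(0, 7);

    return { summary, tags };
  } catch {
    // Any parse error yields empty fields rather than failing capture.
    return empty;
  }
}
```

Since the parser never throws, enrichment can share a `Promise.all` with embedding without being able to reject the capture on its own account, provided the model call itself is also wrapped to return an empty result on error.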

Note — Vectorize propagation lag. Newly upserted vectors are not immediately queryable; observed lag is 10–30s. v0 stance is accept and document: capture returns sub-second, but a search issued in the same window may miss the just-captured content. Revisit if read-after-write becomes a problem in practice.

4.3. The semantic_search Flow

  1. Agent sends a query string via MCP.
  2. Worker embeds the query via Workers AI.
  3. Worker queries Vectorize for top-K nearest neighbors, optionally pre-filtered by tags.
  4. Worker hydrates each hit by reading the corresponding chunks and documents rows in D1.
  5. Worker returns structured results: [{ document_id, chunk_content, document_content, score, metadata }].
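
Steps 4 and 5 can be sketched as a pure hydration function. This is a simplified illustration: the D1 reads are replaced by in-memory maps, the type and function names are hypothetical, and silently skipping a hit whose chunk or document row is missing (for example a vector that outlived its row) is an assumption about the desired behavior rather than something the design specifies.

```typescript
interface VectorMatch {
  id: string; // chunk ID, as stored in Vectorize
  score: number;
}

interface ChunkRecord {
  id: string;
  document_id: string;
  content: string;
}

interface DocumentRecord {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

interface SearchResult {
  document_id: string;
  chunk_content: string;
  document_content: string;
  score: number;
  metadata: Record<string, unknown>;
}

function hydrateMatches(
  matches: VectorMatch[],
  chunks: Map<string, ChunkRecord>,
  documents: Map<string, DocumentRecord>
): SearchResult[] {
  const results: SearchResult[] = [];
  for (const match of matches) {
    const chunk = chunks.get(match.id);
    if (chunk === undefined) continue; // orphaned vector: skip (assumed policy)
    const doc = documents.get(chunk.document_id);
    if (doc === undefined) continue;
    results.push({
      document_id: doc.id,
      chunk_content: chunk.content,
      document_content: doc.content,
      score: match.score,
      metadata: doc.metadata,
    });
  }
  // Highest similarity first.
  return results.sort((a, b) => b.score - a.score);
}
```

In the Worker, the two maps would be filled from D1 by querying chunks by the matched IDs and then documents by the distinct `document_id` values, so hydration costs two queries regardless of top-K.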

4.4. Delete Flow

  1. Agent calls delete_thought with a document ID.
  2. Worker reads the document's chunk IDs, deletes the corresponding vectors from Vectorize, then deletes the chunks and the document row from D1 (cascade).

5. Security & Access Control

The Worker is shielded behind Cloudflare Access. Automated clients authenticate via Service Tokens (CF-Access-Client-Id / CF-Access-Client-Secret). This prevents unauthorized endpoint probing without requiring open inbound ports on the local gateway.
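
From a machine client's side, authenticating with a service token amounts to attaching the two headers named above to each request. The sketch below only builds the request description and does not send it; the host `brain.example.com`, the function names, and the choice of the capture route as the example are hypothetical.

```typescript
// Build the Cloudflare Access service-token headers for a machine client.
function accessHeaders(
  clientId: string,
  clientSecret: string
): Record<string, string> {
  if (clientId === "" || clientSecret === "") {
    throw new Error("service token ID and secret are both required");
  }
  return {
    "CF-Access-Client-Id": clientId,
    "CF-Access-Client-Secret": clientSecret,
  };
}

interface RequestSpec {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Describe a capture call; the base URL is a placeholder, not a real endpoint.
function buildCaptureRequest(
  baseUrl: string,
  clientId: string,
  clientSecret: string,
  content: string
): RequestSpec {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/capture",
    method: "POST",
    headers: {
      ...accessHeaders(clientId, clientSecret),
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ content }),
  };
}
```

A caller would pass the resulting `method`, `headers`, and `body` to `fetch` along with `url`, for example `buildCaptureRequest("https://brain.example.com", id, secret, "note text")`. Keeping the secret out of the URL and in headers means it does not end up in access logs that record request paths.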

6. Prototype Scope (v0)

To minimize time to a working end-to-end system, v0 deliberately defers the more involved pieces:

In v0:

  • A single Worker exposing REST routes: POST /capture, POST /search, GET /thought/:id, DELETE /thought/:id.
  • Bound D1, Vectorize, and Workers AI.
  • bge-base-en-v1.5 at 768 dims, paragraph-aware chunking per §3.2.
  • Bearer-token auth using a single shared secret (Wrangler secret). Removed in v1.

In v1:

  • Cloudflare Access service tokens replacing the bearer secret. The Worker verifies the Cf-Access-Jwt-Assertion JWT against CF's JWKS endpoint for defense-in-depth. The bearer shim (AUTH_SECRET) has been removed entirely — hard cutover, no fallback.
  • MCP Streamable HTTP server mounted at /mcp in the same Worker. Exposes 5 tools: capture_thought, semantic_search, get_thought, delete_thought, list_recent. Implemented with McpServer (@modelcontextprotocol/sdk) + createMcpHandler from the Cloudflare agents package (stateless — no Durable Objects required).
  • list_recent REST route: GET /thoughts?limit=N&before=<unix_ms> — returns recent documents by created_at DESC. The optional before cursor returns docs strictly older than the given Unix epoch ms, letting backup/export jobs walk the full corpus in limit-sized batches.
  • Auth: same check as REST routes applies to the /mcp endpoint.
  • Capture-time enrichment via Llama 3: @cf/meta/llama-3.1-8b-instruct runs in parallel with embedding at capture time, producing a summary (≤140 chars) and 3-7 kebab-case tags. Persisted on the documents row and surfaced in all read paths (get_thought, list_recent, semantic_search). Tags also stored in Vectorize metadata for future filtered search. Non-fatal: enrichment failure yields empty fields and never breaks capture.
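
The `before` cursor on GET /thoughts described in the list above lets an export job walk the corpus in batches. The sketch below models that loop with a synchronous in-memory stand-in for the route, so the paging logic can be read without any HTTP plumbing; `listRecent`, `walkCorpus`, and `DocSummary` are hypothetical names, and a real client would make the page fetch asynchronous.

```typescript
interface DocSummary {
  id: string;
  created_at: number; // Unix epoch ms
}

// In-memory stand-in for GET /thoughts?limit=N&before=<unix_ms>:
// newest first, and only docs strictly older than `before` when it is given.
function listRecent(
  all: DocSummary[],
  limit: number,
  before?: number
): DocSummary[] {
  return all
    .filter((d) => before === undefined || d.created_at < before)
    .sort((a, b) => b.created_at - a.created_at)
    .slice(0, limit);
}

// Walk the full corpus in limit-sized batches using the cursor.
function walkCorpus(
  fetchPage: (limit: number, before?: number) => DocSummary[],
  limit: number
): DocSummary[] {
  const out: DocSummary[] = [];
  let before: number | undefined;
  for (;;) {
    const page = fetchPage(limit, before);
    out.push(...page);
    const last = page[page.length - 1];
    // A short (or empty) page means the corpus is exhausted.
    if (last === undefined || page.length < limit) break;
    before = last.created_at; // next page: strictly older than the oldest seen
  }
  return out;
}
```

One caveat follows directly from the "strictly older" semantics: if two documents share the same `created_at` millisecond and straddle a page boundary, the second would be skipped. At personal scale this is unlikely, but an export job that must be exhaustive may want a tie-breaker such as the document ID.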

Deferred to v2+:

  • Re-ranking pass on search results.
  • CF Queues for async embedding — only if sync embedding latency becomes a problem at capture time.

7. Open Design Questions

A running register of decisions to validate during the prototype phase. Each entry records the current direction and what evidence would change it.

7.1. Tag normalization

Direction: Vocabulary-aware enrichment (Option A). On capture, fetch the top-N most-frequent existing tags from D1 and pass them to the Llama prompt as preferred reuse candidates. New tags are still allowed, but the LLM is biased toward the existing vocabulary so the corpus converges on its own ontology rather than fragmenting (k8s vs kubernetes, vector-database vs vector-db).
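
A minimal sketch of Option A follows, assuming the `tags` column values have already been read from D1 as JSON text. Counting in the Worker is only one way to do this (the count could presumably also be pushed into SQL with SQLite's `json_each`), and the prompt wording, `topTags`, and `buildTagPrompt` are hypothetical.

```typescript
// Count tag frequency across rows and return the n most frequent tags.
// Ties are broken alphabetically so the vocabulary is deterministic.
function topTags(tagColumns: Array<string | null>, n: number): string[] {
  const counts = new Map<string, number>();
  for (const raw of tagColumns) {
    if (raw === null) continue; // pre-enrichment row
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // malformed JSON: ignore the row
    }
    if (!Array.isArray(parsed)) continue;
    for (const tag of parsed) {
      if (typeof tag === "string") {
        counts.set(tag, (counts.get(tag) ?? 0) + 1);
      }
    }
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, n)
    .map(([tag]) => tag);
}

// Build an enrichment prompt that biases the model toward existing tags.
function buildTagPrompt(content: string, vocabulary: string[]): string {
  const hint =
    vocabulary.length > 0
      ? "Prefer these existing tags when they fit: " +
        vocabulary.join(", ") +
        ". Introduce a new tag only if none applies.\n"
      : "";
  return (
    "Produce a summary (at most 140 characters) and 3-7 kebab-case tags.\n" +
    hint +
    "Note:\n" +
    content
  );
}
```

With an empty vocabulary the hint line is omitted entirely, which is the cold-start case raised in the validation questions.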

To validate during prototype:

  • Does drift actually decrease in practice, or does the LLM ignore the hint and invent synonyms anyway?
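  • Does the extra D1 read add meaningful capture latency? (Expected: cheap at personal scale. Note that tags is a JSON array in a TEXT column, so a plain index on tags would not speed up a per-tag frequency count; that needs a json_each scan or a normalized tag table.)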
  • What's the right N? Too few → no anchor; too many → context bloat.
  • Cold-start behavior: how messy are the first ~50 captures before vocabulary stabilizes?

Would reconsider if: drift persists despite the hint (fall back to Option B, post-hoc canonical map), or if capture latency regresses noticeably.

7.2. Backfill strategy for pre-enrichment rows

Direction: Two-pass eager backfill. Pass 1 enriches all pre-0002 rows with no vocabulary anchor (results discarded except as vocabulary seed). Pass 2 re-enriches with the vocabulary-aware prompt from §7.1; only pass 2's output is persisted. Self-bootstrapping.

To validate: quota cost on the current corpus, and whether two passes are actually better than one when the corpus is small enough that drift is bounded anyway.

7.3. Access break-glass recovery

Direction: Undecided. With AUTH_SECRET removed, a botched service-token rotation has no admin shim. Need a documented recovery path before this becomes load-bearing.

Candidates: dashboard-side service-token recreate (probably enough for a personal brain); a separate emergency token bound to a stricter Access policy; or a wrangler-only admin route gated on direct-invoke (no Access).

7.4. Memex vs Obsidian — scope of the AI corpus

Direction: Captures-only. Memex is the capture-side surface (raw, in-the-moment thoughts from voice, CLI, agents); Obsidian is the canonical long-term vault (synthesized, edited, organized). The two are complementary, not parallel. The AI reads and writes only memex captures — it does not see vault content.

Why: capture is for raw; vault is for synthesized. Letting the AI reason over draft/synthesized vault notes mixed with raw captures pollutes both: search results become noisy, and enrichment vocabulary drifts toward whatever ad-hoc tagging exists in the vault rather than converging on the capture corpus's own ontology. Keeping the corpus to captures-only also keeps the trust boundary simple — one source of truth for the AI, with a clear human-in-the-loop step (manual promotion to vault) for anything load-bearing.

To validate during prototype:

  • Does "captures-only" feel too narrow once the corpus grows? (i.e. do useful answers actually live in the vault that the AI can't reach?)
  • Is there a lightweight promotion workflow that doesn't blur the boundary? Manual copy works today; build something only if rate-of-promotion justifies it.

Would reconsider if: repeated searches return shallow capture-only results when the answer demonstrably lives in a vault note, and a read-only vault-import path can be added without polluting the enrichment vocabulary (e.g. separate source: "vault" namespace + tag-isolated retrieval).