CLAUDE.md — edgepilot

What this is

A live demo for the AI Engineer Summit. Two AI agents (recon + action) drive real cloud resources through the stackql MCP server, using SQL as the only interface to Cloudflare and Confluent Kafka. The talking point: infrastructure as data.

Target run time: ~3 minutes recorded; the agent loop itself completes in ~30-60 seconds.

Architecture

Demo flow

  1. loadgen.py drives synthetic + AI-crawler-UA traffic at stackql.xyz (a throwaway Cloudflare zone) so analytics have something real to report
  2. recon agent SELECTs against Cloudflare GraphQL Analytics + the active rate-limit rule + the active Kafka cluster
  3. action agent UPDATEs the rate-limit threshold (Cloudflare) and INSERTs a decision record (Confluent Kafka)
  4. Records are visible live in the Confluent Cloud UI Messages tab

Provisioning — two-stack split

Forced by Confluent's auth surface: control-plane keys can't hit data-plane endpoints, and data-plane keys can't exist until the cluster does.

Stack 1 — infrastructure/control-plane/ (provider: confluent)

  • kafka_cluster — managed BASIC cluster, single AZ
  • service_account — principal for data-plane writes
  • sa_cluster_admin — role binding (CloudClusterAdmin) — MUST run before cluster_api_key so the vended key inherits perms at mint time
  • cluster_api_key — vended cluster-scoped key, secret captured via return_vals.create + RETURNING * (then unpacked in iql exports via {{ this.kafka_api_key_spec_json }})
  • Stack-level exports surface cluster ids + the vended creds to .stackql-deploy-exports (auto-written by stackql-deploy)

Phase 1.5 (bootstrap.sh) — polls pkc-*.../kafka/v3/clusters/*/topics with the vended key until 200 (handles RBAC propagation delay, usually 1-2 attempts).
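The polling step can be sketched in Python as below. bootstrap.sh itself is bash, so this is only a restatement of the idea, not the script's code. The function name, attempt limit, and delay are hypothetical, and the probe is injected so that the retry logic runs without a live cluster.

```python
import time
from typing import Callable


def wait_for_data_plane(
    probe: Callable[[], int],
    max_attempts: int = 10,          # assumed limit, not taken from bootstrap.sh
    delay_s: float = 5.0,            # assumed delay, not taken from bootstrap.sh
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() until it returns HTTP 200 and return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if probe() == 200:
            return attempt
        if attempt < max_attempts:
            sleep(delay_s)
    raise TimeoutError(f"data plane not ready after {max_attempts} attempts")


# Stand-in probe: one 401 while RBAC propagates, then 200.
responses = iter([401, 200])
attempts = wait_for_data_plane(lambda: next(responses), sleep=lambda s: None)
print(attempts)  # 2
```

In a real run the probe would issue the GET against the topics endpoint with the vended key and return the status code.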

Stack 2 — infrastructure/data-plane/ (providers: kafka, cloudflare)

  • decision_log_topic — Kafka topic via kafka.kafka.topics
  • canary_record — non-empty bootstrap record via kafka.kafka.records
  • rate_limit_rule — Cloudflare rate-limit via cloudflare.rulesets.phases (modern rulesets, NOT legacy — see gotchas)

Bootstrap orchestration: infrastructure/bootstrap.sh

  • Runs stack 1, sources .stackql-deploy-exports, polls data plane, runs stack 2, all credentials passed via -e flags.
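A minimal sketch of the hand-off between stacks, assuming the exports file holds shell-style KEY=VALUE lines (it is sourced by bootstrap.sh, but the exact format is not shown here) and that each -e flag takes a KEY=VALUE argument. Both helper names are hypothetical and the sample values are placeholders.

```python
import shlex


def parse_exports(text: str) -> dict[str, str]:
    """Parse shell-style KEY=VALUE lines, tolerating 'export ' and quotes."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        parts = shlex.split(value)  # strips surrounding shell quotes
        values[key.strip()] = parts[0] if parts else ""
    return values


def env_flags(values: dict[str, str], names: list[str]) -> list[str]:
    """Build the -e arguments for the stack 2 run (flag shape is an assumption)."""
    flags: list[str] = []
    for name in names:
        flags += ["-e", f"{name}={values[name]}"]
    return flags


sample = 'export KAFKA_API_KEY="ABC123"\nKAFKA_API_SECRET=s3cr3t\n# comment\n'
exports = parse_exports(sample)
flags = env_flags(exports, ["KAFKA_API_KEY", "KAFKA_API_SECRET"])
print(flags)  # ['-e', 'KAFKA_API_KEY=ABC123', '-e', 'KAFKA_API_SECRET=s3cr3t']
```

A missing name raises KeyError, which is the desired behaviour: stack 2 should not start without the vended credentials.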

Teardown: infrastructure/teardown.sh — tears down stack 1 only; deleting the Kafka cluster cascades topic + records + ACLs on Confluent's side. Cloudflare rate-limit needs separate cleanup if you care.

.env shape

Shell-sourced (set -a; . .env; set +a). NOT loaded via python-dotenv — that dependency has been removed from requirements.txt. The .env uses shell variable expansion to alias the same Confluent key into the Kafka provider's expected env vars where applicable (though for this demo we ended up with separate keys per Confluent's design).

Required:

  • CONFLUENT_CLOUD_API_KEY / _SECRET — Cloud key bound to "My account" with Global scope (OrganizationAdmin). For control plane only.
  • CONFLUENT_ENVIRONMENT_ID
  • CLOUDFLARE_API_TOKEN — needs Account WAF Write, Account Rulesets Write, Account Firewall Access Rules Write, Zone Firewall Services Edit, Zone WAF Edit, plus broad reads. Bot Fight Mode is toggled in UI.
  • CLOUDFLARE_ZONE_ID
  • ANTHROPIC_API_KEY — for demo.py only

KAFKA_API_KEY / KAFKA_API_SECRET are MINTED by bootstrap.sh (not manually set) and propagated to stack 2 via the exports file.
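Because the env is shell-sourced rather than loaded by Python, a fail-fast check at the top of a script is a cheap guard. The sketch below is illustrative and not taken from demo.py: the helper name is hypothetical, and CONFLUENT_CLOUD_API_SECRET is the expansion of the "_SECRET" shorthand above.

```python
import os
from typing import Mapping

# Names from the "Required" list; the Kafka pair is minted by bootstrap.sh.
REQUIRED = [
    "CONFLUENT_CLOUD_API_KEY",
    "CONFLUENT_CLOUD_API_SECRET",
    "CONFLUENT_ENVIRONMENT_ID",
    "CLOUDFLARE_API_TOKEN",
    "CLOUDFLARE_ZONE_ID",
    "ANTHROPIC_API_KEY",
]


def missing_env(env: Mapping[str, str], required=REQUIRED) -> list[str]:
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


# Example with a partial environment; pass os.environ in real use.
partial = {"CLOUDFLARE_ZONE_ID": "zone-placeholder", "ANTHROPIC_API_KEY": ""}
print(missing_env(partial))
```

If the list is non-empty, the usual cause is a forgotten `set -a; . .env; set +a` in the current shell.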

Hard rules

These are real constraints learned the hard way. Don't relitigate:

  • Never monkey-patch provider specs in .stackql/src/ — those files are refreshed from the registry on pull. Fixes must go upstream in the provider repos.
  • Never delete files with secret-shaped content (*.env*, *.secret, *.key, *credential*) without explicit instruction. I deleted infrastructure/.env once thinking it was redundant; it wasn't. The user had to regenerate every secret. Don't.
  • Don't add python-dotenv — env is sourced via bash, not Python. Adding the dependency back will fragment env loading and the user will push back.
  • Never amend git commits — always new commits.
  • Don't push without explicit user instruction.
  • The Anthropic / Cloudflare / Confluent secrets in this conversation transcript should all be considered compromised — they were pasted in curl outputs and build logs. Treat them as rotation-pending. The user is aware and rotates manually after demo iterations.

Known gotchas — already discovered

stackql-deploy quirks

  • auth: blocks at resource level are silently ignored. Documented but non-functional. Don't try to use per-resource credential switching. Issue needs to be filed against stackql-deploy-rs.
  • protected: redacts in export_vars but RETURNING capture log line leaks the raw response. Look for RETURNING [spec] for [X] captured as [this.Y] = [<unredacted JSON>]. Issue needs to be filed.
  • /*+ createorupdate */ is the right anchor for resources where the underlying API verb is REPLACE/PUT (no separate create-vs-update). Don't use /*+ update */ — stackql-deploy errors with "iql file must include either 'create' or 'createorupdate' anchor."
  • Script resource run: blocks DO get templated. They run under sh -c. To export values from a script, print a JSON object to stdout with the keys matching exports: entries.
  • Process env cannot be mutated mid-stackql-deploy. A script resource cannot export to the parent's env. Persist values to a file and source externally (e.g. bootstrap.sh between stacks).
  • Stack exports only emit variables named in the manifest's top-level exports: list to .stackql-deploy-exports. They can reference any variable that was set by any resource — not just the last.
  • return_vals.create captures RETURNING into resource-scoped context (this.X), accessible in the SAME resource's iql exports block via {{ this.X }}. Cross-resource scope ({{ other_resource.X }}) works in iql but NOT in inline sql: on a type: query resource.
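To illustrate the script-resource export rule above: a script that shells out to Python could emit its exports as below. The helper and the sample key are hypothetical. Whether stackql-deploy tolerates extra stdout output is not established here, so the sketch writes exactly one JSON object and nothing else.

```python
import json
import sys


def emit_exports(values: dict, export_names: list[str], stream=sys.stdout) -> dict:
    """Write one JSON object whose keys match the resource's exports: entries."""
    missing = [name for name in export_names if name not in values]
    if missing:
        raise KeyError(f"no value for exports: {missing}")
    payload = {name: values[name] for name in export_names}
    stream.write(json.dumps(payload) + "\n")
    return payload


# Only keys named in exports: are emitted; 'debug' is dropped.
payload = emit_exports(
    {"cluster_id": "lkc-example", "debug": "not exported"},
    ["cluster_id"],
)
```

Diagnostics belong on stderr so that stdout stays a single parseable object.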

stackql parser / SQL semantics

  • REPLACE follows UPDATE shape: REPLACE <table> SET col = val WHERE ..., NOT REPLACE INTO <table>(cols) SELECT .... Confirmed in stackql-parser/go/vt/sqlparser/sql.y (update_or_replace: UPDATE | REPLACE).
  • JSON_EXTRACT is NOT supported in RETURNING projections. Capture the raw column (e.g. RETURNING id, spec) and JSON_EXTRACT in a downstream exports SELECT.
  • No subqueries in DELETE: DELETE FROM x WHERE id IN (SELECT ...) errors.
  • JSON('[...]') wrapper is REQUIRED for SET values whose schema type is array (and likely object too). Without it, stackql's naive translator sends "col": "[...]" (string-wrapped) instead of "col": [...] (parsed array). Cloudflare rejects with 400 invalid JSON: 'col' cannot be a string. Confirmed by test/robot/functional/stackql_mocked_from_cmd_line.robot fixtures using JSON('[ "SFTP" ]') for AWS data__Protocols.
  • vw_* views in some providers have a projection bug where multi-column SELECT combined with WHERE on the required-param column reports could not locate symbol <col>. Workaround: query the underlying raw table with JSON_EXTRACT(spec, '$.<col>') instead. Hit on confluent.managed_kafka_clusters.vw_clusters.
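The JSON('[...]') point is easiest to see by comparing the two request bodies. The sketch below only models the difference with the standard json module; it does not call stackql. The json_literal helper is hypothetical, and doubling single quotes is the usual SQL escaping, assumed rather than verified against the stackql parser.

```python
import json


def json_literal(value) -> str:
    """Render a list or dict as a JSON('...') SET value."""
    text = json.dumps(value).replace("'", "''")  # SQL-style quote escaping (assumed)
    return f"JSON('{text}')"


protocols = ["SFTP"]

# Without the wrapper the array travels as a string inside the body.
string_wrapped = json.dumps({"col": json.dumps(protocols)})
# With the wrapper the body carries a real array.
parsed_array = json.dumps({"col": protocols})

print(string_wrapped)            # {"col": "[\"SFTP\"]"}
print(parsed_array)              # {"col": ["SFTP"]}
print(json_literal(protocols))   # JSON('["SFTP"]')
```

The first body is the shape Cloudflare rejects with the 400 quoted above.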

Confluent gotchas

  • Cloud API keys (god key bound to My account with Global scope) can NOT hit the data plane. Cloudflare-style "one key" doesn't work — Confluent rejects them at the Kafka REST endpoint with 401. The data plane only accepts cluster-scoped vended API keys minted via confluent.iam.api_keys with spec.resource.id = <cluster_id>.
  • The vended key's permissions snapshot the principal's RBAC at MINT TIME. Grant sa_cluster_admin BEFORE creating cluster_api_key or the key is born unprivileged.
  • Cluster API key secret is only in the create response. Subsequent SELECTs against confluent.iam.api_keys won't include it. Use RETURNING * + return_vals.create: [{spec: <name>}] to capture, unpack later. If you skip the create (idempotent re-run), the secret is gone — delete the api key in Confluent UI to re-mint.
  • ksqlDB minimum is now 4 CSUs, ~$0.89/hour. Originally we planned to use it for the analysis story; we dropped it. Use Confluent Cloud UI Messages tab for live record inspection instead.
  • replication_factor defaults to 3 on Confluent Cloud topic create; don't try to override on BASIC clusters.
  • API key INSERT body uses spec (no data__ prefix) — provider has requestBodyTranslate: naive so columns become top-level body keys.
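Since the secret exists only in the create response, whatever consumes the captured spec should fail loudly when it is absent. A minimal sketch, assuming the captured value is the JSON text of the spec column and that the secret sits under a "secret" key; the helper name and sample values are hypothetical.

```python
import json


def secret_from_spec(spec_json: str) -> str:
    """Extract the one-time secret from a captured API key spec."""
    spec = json.loads(spec_json)
    secret = spec.get("secret")
    if not secret:
        # Typical on an idempotent re-run: the create was skipped, so there
        # was no RETURNING payload and the secret cannot be recovered.
        raise ValueError("spec has no secret; delete the API key and re-mint")
    return secret


captured = '{"display_name": "edgepilot", "secret": "example-secret"}'
secret = secret_from_spec(captured)
print(len(secret))  # print the length only, never the secret itself
```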

Cloudflare gotchas

  • Legacy /zones/{id}/rate_limits API is in MAINTENANCE MODE. Reads work, writes return HTTP 403 code 10037 ratelimit.api.maintenance_mode. Don't go down this rabbit hole — use modern rulesets (http_ratelimit phase) instead. Confirmed with live curl on 2026-05-30.
  • Modern rulesets PUT had a published provider bug where id, version, last_updated were marked required in the request body schema despite being readOnly: true. stackql forced them into both WHERE and body, Cloudflare rejected. Fix is upstream in stackql-provider-cloudflare (the user is patching it as of last conversation). After the fixed provider lands, REPLACE cloudflare.rulesets.phases SET rules = '...' WHERE zone_id = ... AND ruleset_phase = 'http_ratelimit' should work.
  • Free Cloudflare plan constraints on rate-limit rules:
    • period MUST be 10 (not 60 — that's paid-plan only)
    • characteristics must include BOTH ip.src AND cf.colo.id
    • expression must use starts_with() not matches (regex is paid-only)
  • Cloudflare GraphQL Analytics requires Account-scoped Analytics Read permission on the token (zone-scoped is insufficient). The token has this.
  • GraphQL provider operations land under top-10 namespaces:
    • cloudflare.zones.http_requests_adaptive_groups — main demo recon source
    • cloudflare.firewall.firewall_events_adaptive_groups — bot/threat flags
    • Plus 8 others wired by the user (httpRequests1hGroups, firewallEventsAdaptive, httpRequestsOverviewAdaptiveGroups, dnsAnalyticsAdaptiveGroups, workersInvocationsAdaptive, r2OperationsAdaptiveGroups, d1AnalyticsAdaptiveGroups, cdnNetworkAnalyticsAdaptiveGroups).
  • Bot Fight Mode is enabled via UI (Security → Settings → Bot fight mode). The token doesn't have perm to toggle it programmatically.
  • The beacon Worker (fancy-boat-8ddc) serves test.stackql.xyz/beacon.gif for organic traffic generation from microsite footers. Separate workstream from the main demo.
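The free-plan constraints listed above can be checked before a rule is sent. This is a sketch, not part of the repo: the function name is hypothetical, and the rule layout (an expression plus a ratelimit object holding period and characteristics) is assumed to follow the Cloudflare rulesets rule shape.

```python
import re

REQUIRED_CHARACTERISTICS = {"ip.src", "cf.colo.id"}


def free_plan_violations(rule: dict) -> list[str]:
    """Return the free-plan constraints that a rate-limit rule breaks."""
    ratelimit = rule.get("ratelimit", {})
    problems = []
    if ratelimit.get("period") != 10:
        problems.append("period must be 10")
    missing = REQUIRED_CHARACTERISTICS - set(ratelimit.get("characteristics", []))
    if missing:
        problems.append(f"characteristics missing: {sorted(missing)}")
    if re.search(r"\bmatches\b", rule.get("expression", "")):
        problems.append("expression uses matches (regex is paid-only)")
    return problems


ok_rule = {
    "expression": 'starts_with(http.request.uri.path, "/api/")',
    "ratelimit": {"period": 10, "characteristics": ["ip.src", "cf.colo.id"]},
}
bad_rule = {
    "expression": 'http.user_agent matches "bot"',
    "ratelimit": {"period": 60, "characteristics": ["ip.src"]},
}
print(free_plan_violations(ok_rule))   # []
print(free_plan_violations(bad_rule))
```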

File map

edgepilot/
├── .env                       # secrets, shell-source only (gitignored)
├── .env.example               # template
├── .stackql-deploy-exports    # auto-written by stack 1, sourced by bootstrap.sh
├── stackql                    # local stackql binary (linux/x86_64)
├── stackql-deploy             # local stackql-deploy binary
├── demo.py                    # the 2-agent loop
├── loadgen.py                 # traffic gen for cloudflare zone
├── requirements.txt           # anthropic, mcp, aiohttp (NO python-dotenv)
├── claude_desktop_config.json # MCP config for the Claude Desktop variant
├── README.md
├── SCRIPT.md                  # speaking script (needs rewrite — Gap 6)
├── CLAUDE.md                  # this file
├── infrastructure/
│   ├── bootstrap.sh           # provisions stack 1 then stack 2
│   ├── teardown.sh            # tears down stack 1 (cluster cascade)
│   ├── control-plane/
│   │   ├── stackql_manifest.yml
│   │   └── resources/
│   │       ├── kafka_cluster.iql
│   │       ├── service_account.iql
│   │       ├── sa_cluster_admin.iql
│   │       └── cluster_api_key.iql
│   ├── data-plane/
│   │   ├── stackql_manifest.yml
│   │   └── resources/
│   │       ├── decision_log_topic.iql
│   │       ├── canary_record.iql
│   │       └── rate_limit_rule.iql
│   └── assurance/
│       ├── 01_kafka_cluster.iql
│       ├── 02_service_account.iql
│       ├── 03_rate_limit.iql
│       └── README.md
└── .stackql/src/              # provider specs — DO NOT EDIT IN PLACE
    ├── confluent/v00.00.00000/   # local dev split (control plane only)
    ├── kafka/v00.00.00000/       # local dev split (data plane only)
    └── cloudflare/v26.05.00399/  # published, fix in progress upstream

Workflow

First-time setup

  1. Cloudflare zone (stackql.xyz) exists, on free plan, Bot Fight Mode on
  2. Confluent environment exists (env-...)
  3. .env populated (use .env.example as template), shell-sourced
  4. pip install -r requirements.txt in a venv

Provision

set -a; . .env; set +a
bash infrastructure/bootstrap.sh

Takes ~7-10 min (mostly cluster provision). Idempotent for cluster, SA, and role binding. NOT idempotent for cluster_api_key (re-run = key already exists = no RETURNING = bootstrap fatals because secret can't be recaptured). If you re-bootstrap, delete the api key in Confluent UI first.

Run the demo

python demo.py

Reads CLOUDFLARE_ZONE_ID + CONFLUENT_ENVIRONMENT_ID from env. The agents discover everything else at runtime via SELECT (query-before-mutate is the demo mantra).

Teardown

bash infrastructure/teardown.sh

Cluster delete cascades data plane. Cloudflare rate-limit rule persists — re-bootstrap will overwrite it idempotently, or curl-delete manually.

Assurance queries (on-stage state demonstration)

./stackql exec -i infrastructure/assurance/01_kafka_cluster.iql
./stackql exec -i infrastructure/assurance/02_service_account.iql
./stackql exec -i infrastructure/assurance/03_rate_limit.iql

Open todos

  • Gap 2 — demo.py validation. Hold until Cloudflare provider fix lands (in flight). Then run end-to-end, smooth log noise, validate the agent loop completes in <60s.
  • Gap 3 — Analysis queries (live SQL on stage to show the topic contents). Confluent UI is the primary surface; supplementary stackql queries optional.
  • Gap 4 — Claude Desktop variant. Tweak claude_desktop_config.json for Windows stackql.exe, document the swap in README.
  • Gap 5 — Teardown polish. Currently OK but cluster_api_key delete skips with "unresolved variables" message (id only available within same-run scope). Harmless because cluster delete cascades the key, but cosmetic improvement possible.
  • Gap 6 — Rewrite SCRIPT.md to match the actual demo flow after the whole chain works end-to-end. Current script references constructs that no longer exist.

Issues to file post-demo

  • stackql-deploy-rs: auth: block on resources silently ignored.
  • stackql-deploy-rs: protected: exports leak through RETURNING capture log line.
  • stackql-deploy-rs (cosmetic): cluster_api_key teardown logs "unresolved variables, assuming resource does not exist, skipping" when called outside the same-run scope. Cluster cascade handles it functionally.
  • stackql-provider-confluent: vw_clusters projection bug (multi-column SELECT + WHERE on required-param column fails).
  • stackql-provider-confluent: api_endpoint column returns empty string for new Basic-tier clusters; downstream code should use http_endpoint instead. Either fix the upstream or update docs to deprecate the column.
  • stackql-provider-cloudflare: rulesets PUT body schema marks readOnly fields as required (IN PROGRESS — user is patching).