Skip to content

Latest commit

 

History

History
188 lines (161 loc) · 11.1 KB

File metadata and controls

188 lines (161 loc) · 11.1 KB

Cloud execution and the web portal (Phase 2)

Phase 2 — not shipped in Phase 1. Everything in this document describes a planned cloud layer. None of it exists in the local-first Phase 1 product. It is documented now only to prove the architecture is designed for it, and to keep the Phase-1 surfaces from baking in assumptions that would block it. Do not read any behavior here as currently available.

Phase 2 adds an optional cloud execution layer and a web portal on top of the local-first product — it never replaces or breaks the Phase-1 surfaces. The core idea: the same packages/core engine that runs in the Tauri WebView, the VS Code extension host, and the CLI can also be invoked by a cloud API through a job queue, enabling 24/7 automation, team sharing, cloud triggers (webhooks, schedules), and a browser-based canvas. A user can link a Relavium Cloud account and choose, per workflow, whether to run locally or in the cloud — and can keep running entirely locally with no account forever.

Cloud execution is not the only Phase-2 mode, and not the first one. Phase 2 also adds managed inference — a separate, thinner capability where Relavium holds the provider key and meters usage, while the engine stays local and only LLM egress is proxied to a gateway. Managed inference is the first Phase-2 deliverable and ships ahead of the cloud-execution plane described in this document. The two must not be conflated: managed inference moves only the LLM call; cloud execution moves the whole engine run. See managed-inference.md for the full design; this document covers the cloud-execution plane (workers, queue, portal).

flowchart TB
    subgraph Local["Local surfaces (Phase 1 — unchanged)"]
        Desktop["Desktop app"]
        VSCode["VS Code extension"]
        CLI["CLI"]
    end
    subgraph Cloud["Relavium Cloud (Phase 2)"]
        Portal["apps/portal<br/>Vite + React SPA<br/>(control plane, in a browser)"]
        API["apps/api<br/>Hono on Bun + auth"]
        Queue["BullMQ on Redis 7<br/>orchestrator / node / system pools"]
        Workers["Workers<br/>run packages/core"]
        PG[("PostgreSQL 16")]
        Streams["Redis Streams<br/>(SSE delivery)"]
        Secrets[["Server-side key store<br/>AES-256-GCM"]]
    end
    Providers["LLM providers"]

    Desktop -. "opt-in: sync to cloud" .-> API
    CLI -. "opt-in" .-> API
    Portal --> API
    API --> Queue
    Queue --> Workers
    Workers --> PG
    Workers --> Streams
    Workers --> Secrets
    Streams -->|HTTP SSE| Portal
    Workers -->|HTTPS| Providers
Loading

Status: draft — to be expanded. This is a forward-looking sketch grounded in the architecture-decisions cloud rows and the pivot phase2Architecture source. Concrete cloud DDL, API routes, and the portal API are the canonical property of ../reference/ and are cited, not restated, here.

Context

The phasing is a hard product decision (ADR-0008 and ../product-constraints.md): Phase 1 is local-first with zero cloud dependency and no account, and cloud execution is Phase 2. The engine must support both modes behind a clean interface switch, but Phase 1 must never require the cloud. The dual-database story — SQLite local, PostgreSQL cloud — is designed for exactly this transition, with one Drizzle schema targeting both.

What stays the same

The whole point is that the engine does not change. packages/core and packages/llm run in cloud workers exactly as they run on a user's machine (see shared-core-engine.md and multi-llm-providers.md). Surfaces call WorkflowEngine.start(workflowId, input) and consume the same RunEvent objects regardless of mode — they do not branch on local-versus-cloud. The workflow YAML, the node-type catalog, the checkpoint shape, fallback chains, and cost accounting are all identical.

What the cloud layer adds

Component Role
apps/api A Hono REST API on Bun that wraps packages/core with job dispatch, multi-tenant auth, and cloud state. It does not re-implement execution — it enqueues it. (The engine's entry point is the same
WorkflowEngine.start(workflowId, input) every surface calls — see
shared-core-engine.md.)
apps/portal A Vite + React SPA that reuses Relavium's shared UI components (packages/ui) in a browser instead of a Tauri WebView, for viewing and managing runs. It is a control plane (usage, quota, team, runs, gates), not a second canvas or a new execution engine — it does not embed @relavium/core; it drives runs through the cloud API.
PostgreSQL 16 Replaces SQLite for cloud runs. The Drizzle schema is ~90% shared with the local SQLite schema; see ../reference/desktop/database-schema.md and the SQLite-vs-Postgres differences it records.
Redis 7 + BullMQ Job queues (orchestrator / node / system worker pools) plus Redis Streams for SSE log delivery and a sliding-window rate limiter.
Cloud workers Worker processes that pull jobs and run packages/core, one worker thread per agent node.
Server-side key store API keys for cloud runs are held in an AES-256-GCM-encrypted store instead of the OS keychain.
Object storage Large run-output artifacts (generated files) for cloud runs.

What changes between Phase 1 and Phase 2

These are the only substitutions; everything else is shared:

  • Execution location. In-process worker threads on the user's machine become BullMQ jobs dispatched by apps/api to a worker pool, enabling multi-server scaling and 24/7 runs.
  • State store. Local SQLite becomes PostgreSQL 16 with multi-tenancy (an org_id column and row-level security), so workflows and runs can be shared within a team.
  • Event transport. Run events that are produced in-process by the engine's WebView-side RunEventBus on the desktop (and consumed locally there — see ADR-0018) are, in cloud mode, delivered over HTTP SSE backed by Redis Streams, consumed by the portal's EventSource with Last-Event-ID resumption. The event shape is unchanged — the SSE event schema is the one canonical union for every surface and transport.
  • Key storage. OS keychain (local) becomes the server-side encrypted store (cloud). Local runs continue to use the OS keychain.
  • Triggers. Webhook and schedule triggers — which need an always-on listener — become functional in the cloud (a webhook endpoint enqueues a run; cron is BullMQ repeat jobs). These are out of scope locally; see ../ideas/scheduled-and-webhook-triggers.md.
  • Human-gate notifications. Gates gain email/Slack delivery for assignees who are not actively watching the portal.
  • Cost tracking. Per-node cost gains org-level aggregate views, budget alerts, and export.

The transparent local→cloud switch

The engine exposes an identical interface regardless of mode; surfaces never check the mode. executionMode is three-valued'local' | 'cloud' | 'managed' — and the value selects only where the engine runs and which LLMProvider implementation the factory hands back (a direct provider adapter for local/cloud, the ManagedGatewayProvider for managed); see managed-inference.md. Mode is resolved once, at engine creation time, in this order:

  1. An explicit executionMode config override (local / cloud / managed).
  2. The presence of a valid cloud auth token (implies a cloud-linked account; the stored preference then decides between cloud execution and managed inference — both require an account, but only cloud moves the engine off the user's machine).
  3. A stored user preference.
  4. Default: local.

managed and cloud are independent axes: a run can stay local while using managed inference, or run in the cloud while still using a BYOK key. The two are resolved separately and never inferred from each other.

The migration path is gradual and opt-in: a user starts local, signs up on the portal, gets a token, and the desktop app detects it and offers to run in the cloud for persistence, sharing, and no local key management. Opting in recreates the engine in cloud mode; the CLI and VS Code pick up the same preference on next init — no surface code changes. Two safety rules are non-negotiable, both flowing from local-first-and-security.md:

  • The engine never silently falls back from cloud to local. If cloud is unreachable in cloud mode it raises an explicit error suggesting a switch — silent fallback could leak credentials or bypass enterprise controls.
  • Full LLM transcripts are never synced, in any tier or mode. The cloud is a control and execution plane, not a transcript archive.

The portal is a control plane, not the execution plane

The web portal is where teams manage — usage, quota, run history, pending gates, team membership, audit, billing. It is explicitly not where local workflows run. Its API surface is canonical in ../reference/portal/api-reference.md. The browser never calls LLM providers directly; all cloud LLM calls go through the workers, so no API key ever appears in a browser network tab.

Key-security note for Phase 2

The single most dangerous data in the cloud layer is provider keys. The Phase-1 controls (keys never in payloads, never serialized into job/checkpoint/log data, stripped from exported YAML) carry forward, plus: cloud keys are encrypted at rest with AES-256-GCM, and a lint rule bans serializing keys into worker job payloads. A pre-release security audit must target the four leak surfaces — at-rest store, job payloads, exported YAML, and the browser network tab — before Phase 2 ships. See local-first-and-security.md for the cross-cutting secret-handling rules.

Related documents