core: detect/diagnose incompatible world versions instead of failing with opaque ZodError: invalid_union #2638

@pranaygp

Summary

When @workflow/core runs against a @workflow/world* package from an older major line, a durable run executes its first step, the runtime replays the event log, and the world's event-schema Zod discriminated union hits an event-type discriminant it doesn't know about — throwing ZodError: invalid_union deep inside the world's storage layer. The error points nowhere useful: there is no version handshake or compatibility check between core and the installed world, so a routine version mismatch surfaces as an opaque schema crash at replay time rather than an actionable "your world is too old" message. The SDK already has the building blocks for a clean diagnosis (spec-version.ts, requiresNewerWorld() + RunNotSupportedError, and the getRunCapabilities() version table) — they just don't cover this direction of mismatch.

Repro

  • @workflow/core@5.0.0-beta.24 (as bundled by vercel/eve@0.13.3)
  • @workflow/world-postgres@4.2.0 — this is what pnpm add @workflow/world-postgres installs today, because the npm latest dist-tag still points at the 4.x line:
$ npm view @workflow/world-postgres dist-tags
{ latest: '4.2.0', beta: '5.0.0-beta.19' }
$ npm view @workflow/world dist-tags
{ latest: '4.2.0', beta: '5.0.0-beta.13' }
$ npm view @workflow/core dist-tags
{ latest: '4.5.0', beta: '5.0.0-beta.24' }

Steps: start any durable workflow and let it run past its first step. The 5.x runtime writes a new-style event, the run replays its event log, and the 4.x world's EventSchema.parse(...) throws ZodError: invalid_union during replay.

Root cause

The world owns the event-schema discriminated union and validates every event it reads from storage through it:

  • The union is defined in packages/world/src/events.ts: EventTypeSchema (the z.enum of all event types, events.ts:57-80), AllEventsSchema (z.discriminatedUnion('eventType', [...]), events.ts:386-408), and the exported EventSchema = AllEventsSchema.and(...) (events.ts:412-420). The event-type vocabulary has grown over time: for example, attr_set was added in #2226 ("Add native v4 workflow attribute events"), and step_started carries lazy-start semantics added later in #2478 ("perf(core): lazy inline step start (save one world round-trip per step)"). A world pinned to an older @workflow/world has an older EventTypeSchema/AllEventsSchema whose union does not include the newer discriminants.

  • The world calls EventSchema.parse(...) unconditionally on every event read/return path. In world-postgres see packages/world-postgres/src/storage.ts:312, 630, 1454, 1675, 1691, 1726, 1757, 1791. When an event whose eventType isn't in the older union flows through any of these, z.discriminatedUnion rejects it with invalid_union — there is no eventType-specific branch to match, and the failure is a raw ZodError, not a workflow error.

The core↔world boundary itself carries no world-vocabulary version signal:

  • The World interface (packages/world/src/interfaces.ts:276-366) exposes specVersion?: number (interfaces.ts:286), but that is a forward marker: it's the spec version core writes new runs at (packages/core/src/runtime/start.ts:283-290). It is not a declaration of which event-type vocabulary / schema version the installed world can parse.

  • The only existing compatibility guard is requiresNewerWorld(run.specVersion) (packages/world/src/spec-version.ts:58-68), thrown as RunNotSupportedError (packages/errors/src/index.ts:826-846). But (a) it keys off the numeric run.specVersion, which is only reached after the event has already been parsed through the union, and (b) the guard itself only exists in 5.x worlds (packages/world-postgres/src/storage.ts:559-564, packages/world-local/src/storage/events-storage.ts:650-651). A 4.x world predates the guard and parses each event first, so it never gets the chance to report a clean version error — it dies on the Zod union.

  • A grep for coreApiVersion|worldVersion|schemaVersion|assertCompatible|EVENT_SCHEMA_VERSION across packages/**/src returns nothing: there is no handshake, capability negotiation, or schema-version marker exchanged between core and a world at setWorld/registration/start() time. Notably, the SDK already has a precedent for exactly this kind of negotiation on the core↔core (cross-deployment) boundary — getRunCapabilities() + the FORMAT_VERSION_TABLE/CAPABILITY_VERSION_TABLE keyed on @workflow/core version (packages/core/src/capabilities.ts:1-90, consumed in start.ts:267) — but nothing equivalent exists for the world's event vocabulary.

There is even prior art showing the maintainers already know unknown event types crash the runtime: world-vercel deliberately uses safeParse with an explicit "unknown/future event types" pass-through fallback (packages/world-vercel/src/events.ts:386-402, coerceEventDates), and the legacy postgres path throws an actionable Event type 'X' not supported ... Please upgrade @workflow packages. (packages/world-postgres/src/storage.ts:321-326). The hot replay path in world-postgres/world-local just doesn't get the same treatment.
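The distinction that prior art draws (unknown event type versus known type with a bad payload) can be sketched without Zod. This is a minimal illustration, not the world-vercel or world-postgres code: KNOWN_EVENT_TYPES, classifyEvent, readEvent, and UnsupportedEventTypeError are hypothetical names, and the vocabulary and runId check are stand-ins for whatever the real schemas validate.

```typescript
// Hypothetical sketch; none of these names exist in @workflow/world.
// Illustrative vocabulary of an older world that predates 'attr_set'.
const KNOWN_EVENT_TYPES = new Set<string>(['step_started', 'step_completed']);

class UnsupportedEventTypeError extends Error {
  eventType: string;
  constructor(eventType: string) {
    super(
      `Event type '${eventType}' is not supported by the installed world. ` +
        `Please upgrade the @workflow/world* packages to match @workflow/core.`
    );
    this.name = 'UnsupportedEventTypeError';
    this.eventType = eventType;
  }
}

type EventLike = { eventType: string; runId: string };

type ParseOutcome =
  | { kind: 'ok'; event: EventLike }
  | { kind: 'unknown_type'; eventType: string }
  | { kind: 'malformed'; reason: string };

function classifyEvent(raw: unknown): ParseOutcome {
  if (typeof raw !== 'object' || raw === null) {
    return { kind: 'malformed', reason: 'event is not an object' };
  }
  const record = raw as Record<string, unknown>;
  const eventType = record.eventType;
  if (typeof eventType !== 'string') {
    return { kind: 'malformed', reason: 'eventType is missing or not a string' };
  }
  // Check the discriminant before the payload, so a version mismatch is
  // never reported as a generic schema failure.
  if (!KNOWN_EVENT_TYPES.has(eventType)) {
    return { kind: 'unknown_type', eventType };
  }
  const runId = record.runId;
  if (typeof runId !== 'string') {
    return { kind: 'malformed', reason: `'${eventType}' event has no runId` };
  }
  return { kind: 'ok', event: { eventType, runId } };
}

function readEvent(raw: unknown): EventLike {
  const outcome = classifyEvent(raw);
  if (outcome.kind === 'ok') return outcome.event;
  if (outcome.kind === 'unknown_type') {
    throw new UnsupportedEventTypeError(outcome.eventType);
  }
  throw new Error(`Malformed event: ${outcome.reason}`);
}
```

With this shape, an attr_set event read by the older vocabulary yields an upgrade message, while a step_started event missing its runId is still reported as a malformed payload.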

Impact

  • Opaque failure with no version signal: a ZodError: invalid_union originating inside the world's storage layer gives no hint that the cause is a core/world version mismatch.
  • This is the default outcome of following install instructions, not an edge case: while @workflow/core ships on the 5.x beta line, the latest dist-tag for @workflow/world, @workflow/world-postgres, and @workflow/world-local is still 4.2.0. Anyone self-hosting who runs pnpm add @workflow/world-postgres against a 5.x core gets the broken combination automatically.
  • High debugging cost: the symptom is far from the cause, and it cost a self-hoster the majority of their debugging time before the mismatch was identified.
  • The dist-tag lag is itself worth fixing (so latest doesn't hand people an incompatible world), but core should fail safely regardless — dist-tag hygiene alone won't protect users who pin, use private registries, or otherwise end up with a mismatched world.

Proposed fix

Three grounded approaches, roughly in order of robustness. They are complementary, not mutually exclusive.

  1. Declared world schema/vocabulary version, validated by core at registration/start() (preferred). Add a World-level declaration of the event-schema vocabulary the world understands, e.g. a worldSpecVersion / supportedEventTypes exported alongside SPEC_VERSION_CURRENT and surfaced on the World interface (packages/world/src/interfaces.ts:276, next to the existing specVersion). When core sets/starts a world it compares its own SPEC_VERSION_CURRENT / event vocabulary (packages/world/src/spec-version.ts:39, EventTypeSchema in events.ts:57) against the world's declared value and fails fast with an actionable error if the world is too old, before any run executes. This mirrors the existing getRunCapabilities() negotiation (packages/core/src/capabilities.ts) but for the world boundary instead of the cross-deployment one. Tradeoff: requires worlds to publish the field, so it only fully protects against worlds new enough to declare it; combined with approach 3, it also covers the silent-old-world case.

  2. Validate the installed world package range at registration time. Since worlds depend on @workflow/world (workspace:* in-repo; a real semver range when published) but declare no relationship to @workflow/core, core can't currently reason about compatibility. Add a peer/declared compatibility range between @workflow/core and @workflow/world (or have core read the resolved @workflow/world version) and assert it at boot, emitting an explicit "core X requires world >= Y, found Z" error. Tradeoff: package-version checks are coarser than wire-level checks and can be fooled by hoisting/duplicate installs, but they catch the common case at the earliest possible point (install/boot) with a clear message.

  3. Make the event-schema union diagnose unknown discriminators instead of throwing invalid_union. At the world's parse sites (packages/world-postgres/src/storage.ts:312 et al., packages/world-local/src/storage/events-storage.ts), switch the hot read path to safeParse and, on an unknown-eventType failure, throw a version-aware error — e.g. "event type X requires @workflow/world >= Y; installed world is Z" — reusing/extending RunNotSupportedError (packages/errors/src/index.ts:826). This is the same shape world-vercel already implements (packages/world-vercel/src/events.ts:386-402) and the legacy-postgres path's actionable message (world-postgres/src/storage.ts:321-326); it's the only approach that helps when the world is the old one and predates any handshake field. Tradeoff: it's a per-world change rather than one central guard, and care is needed to distinguish "unknown future event type" (version mismatch) from "known type, malformed payload" (a real bug) — world-vercel's code already draws that line by re-checking EventTypeSchema.safeParse(raw.eventType).
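As a rough sketch of approach 1, the registration-time check could look something like the following. Everything here is an assumption made for illustration: supportedEventTypes is the hypothetical field proposed above, and WorldLike, REQUIRED_EVENT_TYPES, IncompatibleWorldError, and assertWorldCompatible are invented names. Treating a world that declares nothing as too old is one possible policy, not a settled design.

```typescript
// Hypothetical sketch of a core-side check; not an existing SDK API.
interface WorldLike {
  specVersion?: number;
  // Proposed field: the event types this world's schema can parse.
  supportedEventTypes?: readonly string[];
}

// Event types core may write. 'attr_set' and 'step_started' come from the
// issue text; the list as a whole is illustrative.
const REQUIRED_EVENT_TYPES: readonly string[] = ['step_started', 'attr_set'];

class IncompatibleWorldError extends Error {
  missing: readonly string[];
  constructor(missing: readonly string[], declared: boolean) {
    super(
      declared
        ? `Installed world cannot parse event types: ${missing.join(', ')}. ` +
            `Upgrade the @workflow/world* packages to match @workflow/core.`
        : `Installed world does not declare its event vocabulary and is ` +
            `likely too old for this @workflow/core. Upgrade the @workflow/world* packages.`
    );
    this.name = 'IncompatibleWorldError';
    this.missing = missing;
  }
}

// Intended to run when core sets or starts a world, before any run executes.
function assertWorldCompatible(
  world: WorldLike,
  required: readonly string[] = REQUIRED_EVENT_TYPES
): void {
  if (world.supportedEventTypes === undefined) {
    // Assumed policy: an undeclared vocabulary is treated as incompatible.
    throw new IncompatibleWorldError([], false);
  }
  const supported = new Set(world.supportedEventTypes);
  const missing = required.filter((eventType) => !supported.has(eventType));
  if (missing.length > 0) {
    throw new IncompatibleWorldError(missing, true);
  }
}

// Usage: a world missing 'attr_set' is rejected before any run executes.
try {
  assertWorldCompatible({ supportedEventTypes: ['step_started'] });
} catch (err) {
  console.error((err as Error).message);
}
```

Comparing vocabularies rather than a single number keeps the error message specific about what is missing, at the cost of a larger declared surface; a numeric worldSpecVersion would be the more compact alternative.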

Workarounds today

Pin the world to the same @workflow/* release line as core, e.g. for a 5.x beta core:

pnpm add @workflow/world-postgres@beta   # or an explicit @5.0.0-beta.x

Note re: vercel/eve

vercel/eve is adding docs guidance plus a shallow boot-time guard (detecting the mismatch and emitting an actionable message) as a stopgap on the consumer side. That helps eve users, but the durable, framework-wide fix belongs in @workflow/core/@workflow/world: core should detect or clearly diagnose an incompatible world rather than letting a ZodError: invalid_union escape from replay.
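A shallow boot-time guard of that kind might compare release lines, in the spirit of approach 2. The sketch below is an assumption about what such a guard could look like, not eve's implementation: majorOf and diagnoseReleaseLine are hypothetical helpers, and how the installed versions are discovered (for example, by reading each package's package.json) is left out.

```typescript
// Hypothetical helpers; not part of eve or the workflow SDK.

// Extract the major version from strings like '4.2.0' or '5.0.0-beta.24'.
function majorOf(version: string): number {
  const match = /^(\d+)\./.exec(version.trim());
  if (!match) {
    throw new Error(`Unrecognized version string: '${version}'`);
  }
  return Number(match[1]);
}

// Returns an actionable message when core and world are on different major
// lines, or null when the lines match. A matching major line is only a
// coarse signal and does not prove wire-level compatibility.
function diagnoseReleaseLine(
  coreVersion: string,
  worldPackage: string,
  worldVersion: string
): string | null {
  const coreMajor = majorOf(coreVersion);
  const worldMajor = majorOf(worldVersion);
  if (coreMajor === worldMajor) return null;
  return (
    `@workflow/core@${coreVersion} is on the ${coreMajor}.x line, but ` +
    `${worldPackage}@${worldVersion} is on the ${worldMajor}.x line. ` +
    `Install a ${worldPackage} release from the ${coreMajor}.x line.`
  );
}

// Usage with the versions from the repro above.
const problem = diagnoseReleaseLine(
  '5.0.0-beta.24',
  '@workflow/world-postgres',
  '4.2.0'
);
if (problem !== null) {
  console.error(problem);
}
```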

🤖 Generated with Claude Code
