
[bug] S3 checkpoint syncer ignores --checkpointSyncer.endpoint and AWS_ENDPOINT_URL_S3 — breaks DO Spaces, MinIO, R2, B2 #8630

@davidtai


Summary

The validator and relayer always send checkpoint S3 traffic to real AWS, even when configured for an S3-compatible store. Two compounding bugs are in play. First, --checkpointSyncer.endpoint is parsed but never reaches the SDK. Second, the standard AWS escape hatch AWS_ENDPOINT_URL_S3 is also ignored, because the aws-config dependency pin predates the version that added support for it. The result is InvalidAccessKeyId from real AWS S3, and the validator panics on its first report_agent_metadata call.

Reproduction

Run gcr.io/abacus-labs-dev/hyperlane-agent:main with --checkpointSyncer.endpoint=https://nyc3.digitaloceanspaces.com and DO Spaces credentials in AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY. Setting AWS_ENDPOINT_URL_S3 to the same URL does not change the behavior.
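A minimal invocation of that shape looks like the following. The bucket/region/credential values are placeholders, and the surrounding validator flags are the usual ones rather than an exact transcript:

```shell
docker run --rm \
  -e AWS_ACCESS_KEY_ID="<spaces-key>" \
  -e AWS_SECRET_ACCESS_KEY="<spaces-secret>" \
  -e AWS_ENDPOINT_URL_S3="https://nyc3.digitaloceanspaces.com" \
  gcr.io/abacus-labs-dev/hyperlane-agent:main \
  ./validator \
  --checkpointSyncer.type s3 \
  --checkpointSyncer.bucket "<bucket>" \
  --checkpointSyncer.region us-east-1 \
  --checkpointSyncer.endpoint https://nyc3.digitaloceanspaces.com
  # ... plus the usual chain/signing flags ...
```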

thread 'main' panicked at agents/validator/src/validator.rs:287:14:
Failed to report agent metadata: service error
Caused by:
   1: Error { code: "InvalidAccessKeyId", aws_request_id: "<AWS-format-id>" }
Location: hyperlane-base/src/types/s3_storage.rs:67:9

The AWS-format aws_request_id confirms the request reached real AWS, not the configured endpoint.

Root cause

Bug 1 — the CLI flag is dropped at the SDK boundary. S3Storage carries only bucket, region, and folder fields, and it builds the SDK config with .region(...) only. .endpoint_url(...) is never called, so the parsed CLI value is discarded.

Bug 2 — the documented env-var fallback can't save you. rust/main/Cargo.toml pins aws-config = "1.1.7", released February 2024. Support for service-specific endpoint env vars like AWS_ENDPOINT_URL_S3 landed in aws-config 1.2 via smithy-rs#3568, two months later. As a result, ConfigLoader::default() in this build silently ignores the variable.

Either fix alone would unblock every S3-compatible store, but currently neither does.

Proposed fix

  1. Bump aws-config = "1.1.7" → "1.5" in rust/main/Cargo.toml. This is a one-line change that makes AWS_ENDPOINT_URL_S3 work as documented.
  2. Add endpoint: Option<String> to S3Storage and wire --checkpointSyncer.endpoint through to .endpoint_url(...). This is roughly 20 lines and makes the CLI flag work as the docs imply.
  3. Optionally add --checkpointSyncer.forcePathStyle for backends that require path-style addressing (Cloudflare R2, MinIO defaults).
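For step 1, the change in rust/main/Cargo.toml is just:

```toml
# was: aws-config = "1.1.7"
aws-config = "1.5"  # 1.2+ reads AWS_ENDPOINT_URL_S3 in ConfigLoader::default()
```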

Workaround

A re-signing reverse-proxy sidecar inside the validator/relayer pod lets the agent keep talking to an S3-compatible backend without any code change to Hyperlane. The pod's hostAliases route the AWS S3 hostnames the agent uses (both the bare service host and the bucket-prefixed virtual-host form) to 127.0.0.1, where the sidecar listens on :443.

The sidecar terminates TLS with a server certificate signed by a locally generated CA. An init container writes both the cert and a merged CA bundle into a shared emptyDir, and the agent container mounts the bundle over /etc/ssl/certs/ca-certificates.crt via a subPath mount. AWS_CA_BUNDLE is also unsupported in this build, so the system trust store is the only path that works.

For each request, the sidecar strips the AWS SigV4 authorization headers, rewrites the host to its DO Spaces equivalent (for example <bucket>.s3.us-east-1.amazonaws.com becomes <bucket>.nyc3.digitaloceanspaces.com), re-signs with the same credentials, and forwards. The re-signing is straightforward with github.com/aws/aws-sdk-go-v2/aws/signer/v4's SignHTTP; the only non-obvious detail is that you have to set the X-Amz-Content-Sha256 header explicitly before signing, otherwise DO Spaces rejects the signature.
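A sketch of the per-request rewrite step, assuming the sidecar is written in Go. Only the pure host-rewrite and payload-hash logic is shown; the actual SignHTTP call from aws-sdk-go-v2 would follow these two steps, and the hostname list is the one from this setup rather than anything general:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// rewriteHost maps an AWS S3 hostname the agent dialed onto its DO
// Spaces equivalent, handling both the bare service host and the
// bucket-prefixed virtual-host form:
//   mybucket.s3.us-east-1.amazonaws.com -> mybucket.nyc3.digitaloceanspaces.com
//   s3.us-east-1.amazonaws.com          -> nyc3.digitaloceanspaces.com
func rewriteHost(host, upstreamHost string) string {
	for _, awsSuffix := range []string{
		"s3.us-east-1.amazonaws.com",
		"s3.amazonaws.com",
	} {
		if host == awsSuffix {
			return upstreamHost
		}
		if bucket, ok := strings.CutSuffix(host, "."+awsSuffix); ok {
			return bucket + "." + upstreamHost
		}
	}
	return host // not an AWS S3 host; pass through unchanged
}

// payloadHash returns the hex SHA-256 of the request body. This value
// must be placed in X-Amz-Content-Sha256 before calling the SDK signer,
// otherwise DO Spaces rejects the re-signed request.
func payloadHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(rewriteHost("mybucket.s3.us-east-1.amazonaws.com", "nyc3.digitaloceanspaces.com"))
	// For a bodyless GET this is the well-known empty-payload constant
	// e3b0c442... that signed requests carry in X-Amz-Content-Sha256.
	fmt.Println(payloadHash(nil))
}
```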

The relevant pod-spec wiring looks like this:

spec:
  hostAliases:
    - ip: "127.0.0.1"
      hostnames:
        - "s3.us-east-1.amazonaws.com"
        - "s3.amazonaws.com"
        - "<bucket>.s3.us-east-1.amazonaws.com"
        - "<bucket>.s3.amazonaws.com"
  initContainers:
    - name: gen-cert
      image: <sigproxy-image>
      args: ["--mode=init", "--cert-dir=/shared"]
      volumeMounts:
        - { name: sigproxy-shared, mountPath: /shared }
  containers:
    - name: sigproxy
      image: <sigproxy-image>
      args: ["--mode=serve", "--cert-dir=/shared",
             "--upstream=https://nyc3.digitaloceanspaces.com"]
      env:
        - { name: AWS_ACCESS_KEY_ID,     valueFrom: { secretKeyRef: { name: do-spaces, key: AWS_ACCESS_KEY_ID } } }
        - { name: AWS_SECRET_ACCESS_KEY, valueFrom: { secretKeyRef: { name: do-spaces, key: AWS_SECRET_ACCESS_KEY } } }
      volumeMounts:
        - { name: sigproxy-shared, mountPath: /shared, readOnly: true }
    - name: validator
      image: gcr.io/abacus-labs-dev/hyperlane-agent:main
      # ... usual validator args ...
      volumeMounts:
        - name: sigproxy-shared
          mountPath: /etc/ssl/certs/ca-certificates.crt
          subPath: ca-bundle.crt
          readOnly: true
  volumes:
    - name: sigproxy-shared
      emptyDir: { medium: Memory, sizeLimit: 1Mi }

Environment

  • Image: hyperlane-agent:main (commit c558a9f)
  • aws-config 1.1.7, aws-sdk-s3 1.65.0, aws-smithy-runtime 1.8.1
  • Backend: DigitalOcean Spaces (nyc3)

Happy to send a PR — the dep bump alone is one line. Let me know which scope you'd prefer.
