ops(caddy+docs): nefariousplan apex cutover + blog/www 301 + TOPOLOGY/DECISIONS#156
TSavo wants to merge 2 commits into
…vision overlay in cloud-init
The prod compose had all services on `platform` (bridge), but fleet-spawned
tenant containers (warm pools, customer instances) live on the `platform-overlay`
swarm network. `core` and `node-agent` were on `platform` only, so fleet
containers calling `http://core:3001` for auth/billing/provision got SERVFAIL,
and node-agent couldn't be called back by tenants it spawned.
Surfaced today (2026-04-21) as runpaperclip.com signup returning HTTP 502:
something between the 2026-04-18 deploy and now removed the
`core-server-core-1` container, and when I recreated it from the stale
`/opt/core-server/docker-compose.yml` (no -f flag) it landed on
platform-overlay only while caddy sat on platform. DNS lookup of `core`
SERVFAIL'd until I hand-patched networks at runtime.
Changes:
* ops/core-server/docker-compose.prod.yml
  - Declare `platform-overlay` as external alongside `platform`.
  - core: dual-attach with aliases [core] on each net (caddy on platform and
    fleet containers on overlay both resolve `core`).
  - node-agent: dual-attach with aliases [node-agent] (tenants on overlay
    call back to node-agent on overlay; compose services reach it via platform).
  - postgres: intentionally stays on `platform` only. Tenants must go through
    core's API, not raw SQL; putting postgres on overlay would have broadened
    attack surface for no gain. (Flagged by CodeRabbit; verified no consumer
    on overlay needs direct DB access.)
  - Updated the networks-block comment to document the split explicitly.
* ops/core-server/cloud-init.sh
  - Add guarded swarm init + `platform-overlay` network create after Docker
    install. `external: true` otherwise fails on a fresh host with a cryptic
    'network not found'; this closes that gap. Idempotent: re-running is a
    no-op. (Flagged by CodeRabbit.)
Caddy/UIs/holyship/holyship-ui stay on `platform` only; they only need core
and holyship, both reachable on platform.
Runtime has been patched out-of-band on the current droplet
(`docker network connect --alias`). This PR makes the next CI deploy preserve
that shape instead of regressing to single-net.
Unexplained and not fixed here: what removed core-server-core-1 and
core-server-node-agent-1 between 18:17 UTC on 2026-04-18 (end of the last
successful deploy) and today. Worth auditing docker events / shell history
before the next deploy.
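For reference, the out-of-band runtime patch mentioned above can be sketched roughly as follows. This is an illustration only, not the exact commands that were run on the droplet: `ensure_attached` is a hypothetical helper name, and the check-before-connect step is an assumption added so that re-running it is harmless.

```shell
#!/usr/bin/env bash
# Sketch only: attach a running container to a network under a DNS alias,
# skipping the call when the container is already on that network.
# `ensure_attached` is a hypothetical helper, not something from this PR.
ensure_attached() {
  local network="$1" alias="$2" container="$3"
  local attached
  # Print the names of the networks the container is attached to, one per line.
  attached="$(docker inspect \
    --format '{{range $name, $conf := .NetworkSettings.Networks}}{{println $name}}{{end}}' \
    "$container")" || return 1
  if printf '%s\n' "$attached" | grep -Fxq -- "$network"; then
    echo "$container already on $network"
    return 0
  fi
  docker network connect --alias "$alias" "$network" "$container"
}
```

Under these assumptions, the hand patch for the recreated core container would look like `ensure_attached platform core core-server-core-1`, with the equivalent call for node-agent.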
Reviewer's Guide
Adds nefariousplan.com apex vhost to Caddy with DNS-01 via Cloudflare, replaces the ad-hoc blog subdomain reverse proxy with a canonical apex + 301 redirects for blog/www, and wires core/node-agent into a new Docker Swarm overlay network (with matching cloud-init provisioning) so tenant containers can reach them by stable hostnames.
Sequence diagram for nefariousplan.com apex and blog/www redirects
sequenceDiagram
actor User
participant Browser
participant CloudflareDNS
participant Caddy
participant OpsWeb1
%% Direct apex request
User->>Browser: Enter nefariousplan.com
Browser->>CloudflareDNS: Resolve nefariousplan.com
CloudflareDNS-->>Browser: 138.68.30.247
Browser->>Caddy: HTTPS GET /
Caddy->>OpsWeb1: HTTP GET / on port 3000
OpsWeb1-->>Caddy: 200 OK HTML
Caddy-->>Browser: 200 OK HTML (apex canonical)
%% blog subdomain redirect
User->>Browser: Enter blog.nefariousplan.com
Browser->>CloudflareDNS: Resolve blog.nefariousplan.com
CloudflareDNS-->>Browser: 138.68.30.247
Browser->>Caddy: HTTPS GET /
Caddy-->>Browser: 301 Moved Permanently
Browser->>Browser: Follow Location https://nefariousplan.com/
Browser->>Caddy: HTTPS GET / on nefariousplan.com
Caddy->>OpsWeb1: HTTP GET / on port 3000
OpsWeb1-->>Caddy: 200 OK HTML
Caddy-->>Browser: 200 OK HTML (apex canonical)
%% www subdomain redirect
User->>Browser: Enter www.nefariousplan.com
Browser->>CloudflareDNS: Resolve www.nefariousplan.com
CloudflareDNS-->>Browser: 138.68.30.247
Browser->>Caddy: HTTPS GET /
Caddy-->>Browser: 301 Moved Permanently
Browser->>Browser: Follow Location https://nefariousplan.com/
Browser->>Caddy: HTTPS GET / on nefariousplan.com
Caddy->>OpsWeb1: HTTP GET / on port 3000
OpsWeb1-->>Caddy: 200 OK HTML
Caddy-->>Browser: 200 OK HTML (apex canonical)
Flow diagram for cloud-init Swarm and platform-overlay provisioning
flowchart TD
A_Start["Start cloud-init"] --> B_CheckDocker["Check if docker is installed"]
B_CheckDocker -->|docker not found| C_InstallDocker["Install Docker Engine and plugins"]
B_CheckDocker -->|docker present| D_CheckSwarm["Check Swarm LocalNodeState"]
C_InstallDocker --> D_CheckSwarm
D_CheckSwarm -->|state != active| E_SwarmInit["docker swarm init"]
D_CheckSwarm -->|state == active| F_CheckOverlayNet["Check if platform-overlay network exists"]
E_SwarmInit --> F_CheckOverlayNet
F_CheckOverlayNet -->|network missing| G_CreateOverlayNet["docker network create -d overlay --attachable platform-overlay"]
F_CheckOverlayNet -->|network exists| H_NoOp["No-op (platform-overlay already present)"]
G_CreateOverlayNet --> I_Continue["Continue remaining cloud-init steps"]
H_NoOp --> I_Continue
I_Continue --> J_End["cloud-init complete"]
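The swarm and overlay branches of the flow above can be sketched as a single guarded function. This is a sketch under stated assumptions, not the script in the PR: the `OVERLAY_NET` variable and the `log` helper are hypothetical additions (the PR hard-codes the network name), and the Docker install step is left out.

```shell
#!/usr/bin/env bash
# Sketch of the swarm/overlay part of the flow. OVERLAY_NET and log() are
# assumptions for illustration; the PR hard-codes the name in both files.
OVERLAY_NET="${OVERLAY_NET:-platform-overlay}"

log() { printf '[cloud-init] %s\n' "$*" >&2; }

ensure_swarm_and_overlay() {
  local state
  # An empty state (daemon unreachable or swarm inactive) counts as "not active".
  state="$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null || true)"
  if [ "$state" != "active" ]; then
    log "swarm state is '${state:-unknown}', initializing"
    docker swarm init || { log "docker swarm init failed"; return 1; }
  fi
  if docker network inspect "$OVERLAY_NET" >/dev/null 2>&1; then
    log "$OVERLAY_NET already present, nothing to do"
    return 0
  fi
  log "creating overlay network $OVERLAY_NET"
  docker network create -d overlay --attachable "$OVERLAY_NET" \
    || { log "docker network create failed"; return 1; }
}
```

Calling `ensure_swarm_and_overlay` twice in a row should only create the network on the first call; the second call takes the no-op branch.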
📝 Walkthrough
Adds Caddy routing for nefariousplan.com with Cloudflare TLS and redirects from legacy hostnames, ensures the host is a Docker Swarm manager and creates an attachable
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Cloudflare as Cloudflare (DNS / TLS)
participant Caddy as Caddy (wopr VPS)
participant Web as ops-web-1:3000
Client->>Cloudflare: Request to nefariousplan.com
Cloudflare->>Caddy: Forward TLS-terminated request
Caddy->>Web: reverse_proxy -> ops-web-1:3000\nforwards X-Real-IP/X-Forwarded-For/X-Forwarded-Proto
Web-->>Caddy: Response
Caddy-->>Cloudflare: Response
Cloudflare-->>Client: Response
sequenceDiagram
participant Tenant as Tenant container
participant Overlay as platform-overlay (Swarm overlay)
participant Core as core (service)
participant Node as node-agent (service)
Tenant->>Overlay: Join network / Resolve `core`/`node-agent`
Overlay->>Core: DNS lookup -> `core:3001`
Overlay->>Node: DNS lookup -> `node-agent:...`
Core-->>Tenant: Service response
Node-->>Tenant: Service response
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Hey - I've found 1 issue, and left some high level feedback:
- In `cloud-init.sh`, `docker swarm init` is run whenever the local node state is not `active`; consider guarding this further (e.g., behind a config flag or hostname check) so the script can be reused on non-swarm nodes without unexpectedly initializing a swarm manager.
- The `platform-overlay` network name is duplicated between `docker-compose.prod.yml` and `cloud-init.sh`; consider centralizing this (e.g., via an env var or shared config) to avoid drift if the network name ever changes.
- The swarm/overlay setup in `cloud-init.sh` does not check or log failures for `docker swarm init` or `docker network create`; capturing errors or adding brief logging would make debugging bootstrap issues on new hosts easier.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `cloud-init.sh`, `docker swarm init` is run whenever the local node state is not `active`; consider guarding this further (e.g., behind a config flag or hostname check) so the script can be reused on non-swarm nodes without unexpectedly initializing a swarm manager.
- The `platform-overlay` network name is duplicated between `docker-compose.prod.yml` and `cloud-init.sh`; consider centralizing this (e.g., via an env var or shared config) to avoid drift if the network name ever changes.
- The swarm/overlay setup in `cloud-init.sh` does not check or log failures for `docker swarm init` or `docker network create`; capturing errors or adding brief logging would make debugging bootstrap issues on new hosts easier.
## Individual Comments
### Comment 1
<location path="ops/core-server/cloud-init.sh" line_range="70-62" />
<code_context>
+# customer Paperclip / WOPR / NemoPod instances) attach to it, and `core` +
+# `node-agent` are dual-attached in the compose file so DNS resolves from
+# either network. Guarded so re-running cloud-init is a no-op.
+if [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" != "active" ]; then
+ docker swarm init
+fi
+if ! docker network inspect platform-overlay &>/dev/null; then
+ docker network create -d overlay --attachable platform-overlay
</code_context>
<issue_to_address>
**issue:** Guard swarm init against a not-yet-started Docker daemon
If `docker info` runs before the daemon is fully started, it will fail and the command substitution will be empty, making the condition true and triggering `docker swarm init`, which will also fail. This can stop the script or leave the environment half-configured. Please add an explicit readiness check (e.g., a small retry/backoff loop around `docker info` or a dedicated daemon-health check) so this logic is reliable when Docker starts slowly or under load.
</issue_to_address>
@@ -61,6 +61,19 @@ if ! command -v docker &>/dev/null; then
    apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
fi
Greptile Summary
This PR commits the
Confidence Score: 4/5
Safe to merge for the live server (already hot-reloaded), but cloud-init.sh has a P1 gap that will break fresh provisioning on multi-NIC droplets. The Caddy and compose changes are clean and already validated live. The single P1 is in cloud-init.sh: ops/core-server/cloud-init.sh
| Filename | Overview |
|---|---|
| ops/core-server/Caddyfile | Adds nefariousplan.com apex vhost (DNS-01, reverse-proxy to ops-web-1:3000) and a unified 301 block for blog.* + www.*; matches the pattern used by other tenants on this box. |
| ops/core-server/cloud-init.sh | Adds Swarm init + platform-overlay overlay network creation; docker swarm init without --advertise-addr will fail on multi-NIC DigitalOcean droplets, aborting the script before the network is created. |
| ops/core-server/docker-compose.prod.yml | Adds platform-overlay external overlay network; dual-attaches core and node-agent to both platform and platform-overlay with DNS aliases so fleet-spawned containers can reach them; postgres intentionally stays bridge-only. |
Sequence Diagram
sequenceDiagram
participant Client
participant Caddy
participant apex as ops-web-1:3000 (nefariousplan app)
Client->>Caddy: GET https://nefariousplan.com/
Caddy->>apex: reverse_proxy (X-Real-IP, X-Forwarded-*)
apex-->>Caddy: 200 OK
Caddy-->>Client: 200 OK
Client->>Caddy: GET https://blog.nefariousplan.com/path
Caddy-->>Client: 301 → https://nefariousplan.com/path
Client->>Caddy: GET https://www.nefariousplan.com/path
Caddy-->>Client: 301 → https://nefariousplan.com/path
Client->>Caddy: GET https://nefariousplan.com/path
Caddy->>apex: reverse_proxy
apex-->>Caddy: 200 OK
Caddy-->>Client: 200 OK
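A quick way to spot-check the redirect behavior in the diagram above is to summarize the response headers for each hostname. The helper below is a hypothetical sketch (the name `redirect_summary` is not from the PR); it only parses headers on stdin and makes no request itself.

```shell
#!/usr/bin/env bash
# Sketch: read raw HTTP response headers on stdin (for example from
# `curl -sI`) and print "<status> <location>". Hypothetical helper name.
redirect_summary() {
  awk '
    BEGIN { status = ""; location = "" }
    { sub(/\r$/, "") }                            # strip CR from CRLF endings
    NR == 1 { status = $2 }                       # status line, e.g. "HTTP/2 301"
    tolower($1) == "location:" { location = $2 }  # header names are case-insensitive
    END { print status, location }
  '
}
```

For example, `curl -sI https://blog.nefariousplan.com/path | redirect_summary` should report a 301 pointing at the apex if the redirect block behaves as the diagram describes; the same check applies to `www.nefariousplan.com`.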
Prompt To Fix All With AI
This is a comment left during a code review.
Path: ops/core-server/cloud-init.sh
Line: 70-71
Comment:
**`docker swarm init` may fail on multi-NIC hosts**
DigitalOcean droplets commonly have both a public (`eth0`) and a private (`eth1`) interface. Running `docker swarm init` without `--advertise-addr` will error with `could not choose an IP address to advertise since this system has multiple addresses on different interfaces` when Docker can't auto-select. The cloud-init script would then abort (set -e), leaving the overlay network un-created and the compose stack broken on first provision.
Alternatively, `--advertise-addr eth0` is cleaner and more portable across DigitalOcean droplet types.
```suggestion
docker swarm init --advertise-addr eth0
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: ops/core-server/Caddyfile
Line: 85
Comment:
**Comment omits `www.*` from the legacy-alias description**
The comment says "blog.* is a legacy alias" but the redirect block directly below also covers `www.nefariousplan.com`. Minor documentation drift that could confuse the next reader.
```suggestion
# nefariousplan — apex canonical, blog.* and www.* are legacy aliases that 301
```
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "ops(caddy): nefariousplan apex cutover +..."
Pull request overview
Updates the core-server production ops configuration to support the nefariousplan.com apex cutover (canonical hostname) and consolidate legacy subdomain behavior, while also improving fleet networking between compose-managed services and swarm-spawned tenant containers.
Changes:
- Add `nefariousplan.com` apex vhost reverse-proxying to `ops-web-1:3000` with Cloudflare DNS-01 TLS.
- Replace the previously live-only `blog.nefariousplan.com` behavior with a versioned 301 redirect covering both `blog.*` and `www.*` → apex.
- Dual-attach `core` and `node-agent` to both `platform` and external `platform-overlay`, and add cloud-init provisioning for the overlay/swarm prerequisites.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ops/core-server/docker-compose.prod.yml | Dual-attaches core/node-agent to platform + platform-overlay and documents the overlay network purpose. |
| ops/core-server/cloud-init.sh | Adds swarm init + overlay network creation steps for hosts that need platform-overlay. |
| ops/core-server/Caddyfile | Adds nefariousplan apex reverse proxy and unified blog/www 301 redirect to apex. |
# docker-compose.prod.yml declares `platform-overlay` as an external network.
# It lives on a single-host swarm: fleet-spawned tenant containers (warm pools,
# customer Paperclip / WOPR / NemoPod instances) attach to it, and `core` +
# `node-agent` are dual-attached in the compose file so DNS resolves from
# either network. Guarded so re-running cloud-init is a no-op.
The new Swarm/overlay section claims docker-compose.prod.yml dual-attaches core/node-agent, but this cloud-init script actually writes and instructs operators to run $INSTALL_DIR/docker-compose.yml, which currently has neither platform-overlay nor node-agent. Either update cloud-init to deploy the prod compose file (or generate an equivalent that includes the overlay + node-agent), or adjust this section so it reflects what cloud-init really provisions.
Suggested change:
- # docker-compose.prod.yml declares `platform-overlay` as an external network.
- # It lives on a single-host swarm: fleet-spawned tenant containers (warm pools,
- # customer Paperclip / WOPR / NemoPod instances) attach to it, and `core` +
- # `node-agent` are dual-attached in the compose file so DNS resolves from
- # either network. Guarded so re-running cloud-init is a no-op.
+ # Prepare a single-host swarm and an attachable `platform-overlay` network
+ # for fleet-spawned tenant containers (warm pools, customer Paperclip /
+ # WOPR / NemoPod instances). This script only creates the swarm/network; it
+ # does not itself deploy the prod compose topology or attach application
+ # services such as `core` or `node-agent` to `platform-overlay`.
+ # Guarded so re-running cloud-init is a no-op.
# customer Paperclip / WOPR / NemoPod instances) attach to it, and `core` +
# `node-agent` are dual-attached in the compose file so DNS resolves from
# either network. Guarded so re-running cloud-init is a no-op.
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" != "active" ]; then
With set -euo pipefail, a failure of docker info inside the command substitution can terminate the whole script (e.g., if the Docker daemon isn’t up yet). Consider capturing the swarm state with a fallback (or guarding with || true) before comparing, so provisioning stays idempotent and robust on first boot.
Suggested change:
- if [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" != "active" ]; then
+ SWARM_STATE="$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null || true)"
+ if [ "$SWARM_STATE" != "active" ]; then
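Both this comment and the earlier readiness comment are concerned with `docker info` failing on first boot. One possible shape for addressing them together is sketched below: wait for the daemon with a bounded retry loop, then capture the swarm state with a fallback. The function names, attempt count, and delay are assumptions for illustration, not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch: intended to behave the same with or without `set -euo pipefail`
# in the calling script. Names and defaults here are illustrative only.

# Poll `docker info` until the daemon answers, up to a bounded number of tries.
wait_for_docker() {
  local attempts="${1:-30}" delay="${2:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if docker info >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "docker daemon not ready after $attempts attempts" >&2
  return 1
}

# Print the swarm state; `|| true` turns a failed `docker info` into empty
# output so the caller's string comparison is always well defined.
swarm_state() {
  docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null || true
}
```

In the script this would read roughly as `wait_for_docker || exit 1`, followed by `if [ "$(swarm_state)" != "active" ]; then ...`, so a slow daemon produces a clear error instead of a premature `docker swarm init`.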
reverse_proxy ops-web-1:3000 {
    header_up X-Real-IP {remote_host}
    header_up X-Forwarded-For {remote_host}
    header_up X-Forwarded-Proto {scheme}
}
reverse_proxy ops-web-1:3000 points at an upstream that isn’t defined anywhere in this repo’s prod compose stack (and isn’t referenced elsewhere in ops/), so a clean deploy of this stack will return 502s unless an external container named ops-web-1 is already attached to the platform network. Consider documenting/provisioning that dependency in-repo (or making the upstream address configurable) to avoid hidden runtime coupling.
| # nefariousplan — apex canonical, blog.* is a legacy alias that 301s | ||
nefariousplan.com {
    tls {
        dns cloudflare {env.CF_API_TOKEN}
    }
    reverse_proxy ops-web-1:3000 {
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}
blog.nefariousplan.com, www.nefariousplan.com {
The comment says only blog.* is a legacy alias, but this redirect block also covers www.nefariousplan.com. Consider updating the comment to match the actual set of legacy aliases to avoid confusion during future edits.
Actionable comments posted: 2
🧹 Nitpick comments (2)
ops/core-server/cloud-init.sh (1)
64-75: Inline compose file is out of sync with the overlay bootstrap it enables.
The comment on Lines 64–69 refers to `docker-compose.prod.yml` (which dual-attaches `core`/`node-agent` and declares `platform-overlay` external), but the compose file this script actually writes at `$INSTALL_DIR/docker-compose.yml` (Lines 177–445) only uses the `platform` bridge, has no `node-agent` service, and never references `platform-overlay`. A fresh droplet that never gets a CI overlay deploy will have a dangling `platform-overlay` network with nothing attached, and any `docker compose up -d` from the inline file will succeed without the behavior the comment describes.
Either drop the inline compose block (and rely entirely on CI to sync `docker-compose.prod.yml`), or bring it in line with `docker-compose.prod.yml` (overlay + `node-agent` + dual-attach). At minimum, call out in the comment that the overlay is only consumed once CI syncs the prod compose file.
Also applies to: 176-445
ops/core-server/docker-compose.prod.yml (1)
107-113: Network aliases are redundant with the service name.
Compose automatically publishes each service under its service name on every network it attaches to, so `aliases: [core]` on the `core` service (and `aliases: [node-agent]` on `node-agent` at Lines 362–368) is a no-op vs. defaults. Harmless and arguably self-documenting, but if the intent is just DNS-as-service-name, these blocks can be simplified to the short-form list.
♻️ Optional simplification
  networks:
-   platform:
-     aliases:
-       - core
-   platform-overlay:
-     aliases:
-       - core
+   - platform
+   - platform-overlay
Apply the same to `node-agent` at Lines 362–368. Keep the current long-form if you plan to add additional aliases (e.g. legacy names) later.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ops/core-server/Caddyfile`:
- Around line 86-95: Confirm that the upstream host ops-web-1 referenced in the
Caddyfile reverse_proxy block is attached to the Docker "platform" network and
document its deployment location: verify on the wopr VPS that ops-web-1 is
connected via docker network connect (or note the compose stack that provides
it), then add a clear comment above the Caddyfile block referencing the
deployment/stack or the provisioning playbook that creates ops-web-1 so future
operators can find it; also update the caddy configuration (or deployment) to
include a readiness/fail-fast measure (for example healthcheck or probe to
detect unreachable upstreams and fail start) so Caddy does not silently proxy to
a detached external service.
In `@ops/core-server/cloud-init.sh`:
- Around line 70-72: The current docker swarm init call in the conditional block
that checks Swarm.LocalNodeState must be made explicit about which IP to
advertise to avoid failing on DO droplets; replace the plain docker swarm init
with a call that passes --advertise-addr set to the host's default-route IPv4
address (determine the default-route interface IP by querying the kernel routing
table, e.g. via ip route get and extracting the src field) so the single-node
overlay uses that address; update the docker swarm init invocation in the same
if block accordingly so cloud-init does not abort under set -euo pipefail.
---
Nitpick comments:
In `@ops/core-server/cloud-init.sh`:
- Around line 64-75: The comment and overlay bootstrap around the
platform-overlay network are out of sync with the inline compose written to
$INSTALL_DIR/docker-compose.yml: either remove the inline compose block and rely
on CI's docker-compose.prod.yml, or update the inline compose to match
docker-compose.prod.yml by adding an external platform-overlay network and a
node-agent service and dual-attaching core and node-agent to both platform and
platform-overlay; alternatively, at minimum change the comment near the
platform-overlay creation (and the docker network create call referencing
platform-overlay) to state explicitly that the overlay is only used after CI
syncs docker-compose.prod.yml so the current inline compose will not consume
platform-overlay until CI deploys the prod compose.
In `@ops/core-server/docker-compose.prod.yml`:
- Around line 107-113: The network alias entries for the core service (aliases:
- core) are redundant because Docker Compose already exposes the service under
its service name on attached networks; remove the long-form aliases block under
the core service’s networks (and do the same for the node-agent service’s
aliases: - node-agent) or replace each long-form network entry with the
short-form list of network names (platform, platform-overlay) to simplify the
docker-compose.yml while preserving DNS resolution by service name.
📒 Files selected for processing (3)
ops/core-server/Caddyfile
ops/core-server/cloud-init.sh
ops/core-server/docker-compose.prod.yml
nefariousplan.com {
    tls {
        dns cloudflare {env.CF_API_TOKEN}
    }
    reverse_proxy ops-web-1:3000 {
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# On the wopr VPS, verify ops-web-1 is joined to the platform network:
# docker inspect ops-web-1 --format '{{json .NetworkSettings.Networks}}' | jq 'keys'
# Expected: includes "platform"
rg -n 'ops-web-1' -C2
Repository: wopr-network/platform
Length of output: 358
🏁 Script executed:
# Find all mentions of ops-web-1 across the entire repo
rg 'ops-web-1' -n --no-heading
Repository: wopr-network/platform
Length of output: 125
🏁 Script executed:
# Check if ops-web-1 is defined in docker-compose.prod.yml
cat ops/core-server/docker-compose.prod.yml | grep -n -A5 -B5 'ops-web-1' || echo "Not found in docker-compose.prod.yml"
Repository: wopr-network/platform
Length of output: 101
🏁 Script executed:
# Look at the services defined in docker-compose.prod.yml to confirm ops-web-1 is absent
rg -n 'services:' -A100 ops/core-server/docker-compose.prod.yml | head -150
Repository: wopr-network/platform
Length of output: 3582
🏁 Script executed:
# Check for any documentation about external services or deployment
fd -t f -e md -e txt -e rst | xargs rg -l 'ops-web-1|external.*service|undeclared.*depend' 2>/dev/null || echo "No docs found"
Repository: wopr-network/platform
Length of output: 470
🏁 Script executed:
# Get the complete docker-compose.prod.yml to see all services
cat ops/core-server/docker-compose.prod.yml | wc -l
Repository: wopr-network/platform
Length of output: 68
🏁 Script executed:
# View the caddy service definition and network configuration
rg -n 'caddy:' -A50 ops/core-server/docker-compose.prod.yml
Repository: wopr-network/platform
Length of output: 2066
🏁 Script executed:
# Look at the network definitions in docker-compose.prod.yml
rg -n 'networks:' -A20 ops/core-server/docker-compose.prod.yml | tail -50
Repository: wopr-network/platform
Length of output: 2241
🏁 Script executed:
# Check for comments around the ops-web-1 reference in Caddyfile
sed -n '80,100p' ops/core-server/Caddyfile
Repository: wopr-network/platform
Length of output: 576
🏁 Script executed:
# Search more broadly for "ops-web" to see if there's an external reference or docs
rg 'ops-web|nefariousplan' --type md
Repository: wopr-network/platform
Length of output: 347
Document ops-web-1 deployment and verify its network attachment before merge.
`ops-web-1` is not a service in `docker-compose.prod.yml` and appears nowhere in deployment documentation. The caddy service (attached only to the `platform` network per lines 306–307) proxies to `ops-web-1:3000` with no explicit dependency or network configuration in this repo. While tests passed (confirming the service is deployed externally and reachable on wopr), this creates a silent cross-stack coupling: any future recreation of the `platform` network, or operator running `docker compose down`, will orphan Caddy from `ops-web-1` and break nefariousplan.com with no signal in this codebase.
Before merge:
- Confirm `ops-web-1` is `docker network connect`-ed to `platform` on the wopr VPS
- Add a comment to the Caddyfile block (lines 86–95) pointing to where `ops-web-1` is deployed
- Consider a readiness check or fail-fast mechanism if the upstream becomes unreachable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ops/core-server/Caddyfile` around lines 86 - 95, Confirm that the upstream
host ops-web-1 referenced in the Caddyfile reverse_proxy block is attached to
the Docker "platform" network and document its deployment location: verify on
the wopr VPS that ops-web-1 is connected via docker network connect (or note the
compose stack that provides it), then add a clear comment above the Caddyfile
block referencing the deployment/stack or the provisioning playbook that creates
ops-web-1 so future operators can find it; also update the caddy configuration
(or deployment) to include a readiness/fail-fast measure (for example
healthcheck or probe to detect unreachable upstreams and fail start) so Caddy
does not silently proxy to a detached external service.
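The attachment check requested above can be sketched roughly as follows. This is a minimal sketch, assuming the docker CLI is available on the wopr VPS and that the container and network are named ops-web-1 and platform as in the comment; the helper names network_in_list and container_networks are hypothetical and do not exist in this repo.

```shell
# Return 0 if network name $1 appears in the whitespace-separated list $2.
network_in_list() {
  local want="$1" name
  for name in $2; do
    [ "$name" = "$want" ] && return 0
  done
  return 1
}

# Print the names of the networks a container is attached to
# (needs the docker CLI and an existing container).
container_networks() {
  docker inspect --format \
    '{{range $k, $v := .NetworkSettings.Networks}}{{$k}} {{end}}' "$1"
}

# Usage on the host (assumed names):
#   network_in_list platform "$(container_networks ops-web-1)" \
#     || echo "ops-web-1 is not attached to platform" >&2
```

Keeping the membership test separate from the docker call means the comparison can be exercised without a running daemon; it is only a detection aid and does not replace documenting where ops-web-1 is deployed.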
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" != "active" ]; then
  docker swarm init
fi
🧩 Analysis chain
🌐 Web query:
docker swarm init multiple addresses different interfaces error advertise-addr
💡 Result:
The error "could not choose an IP address to advertise since this system has multiple addresses on different interfaces" occurs when running docker swarm init on a host with multiple network interfaces (e.g., Ethernet and WiFi). Docker cannot automatically select which IP to advertise to other swarm nodes for API access and overlay networking.
To resolve it, explicitly specify the --advertise-addr flag with the IP address of the desired interface (one reachable by other nodes, preferably static):
docker swarm init --advertise-addr <ip-address>
Example (using Ethernet IP):
docker swarm init --advertise-addr 192.168.130.117
You can also specify a network interface name (Docker picks an IP from it), optionally with a port:
docker swarm init --advertise-addr enp1s0
docker swarm init --advertise-addr enp1s0:2377
To identify interfaces and IPs:
ip addr show
ip route # default route often indicates primary interface
For production, use static IPs on manager nodes to avoid instability on reboot. If separating control plane (management) and data (overlay) traffic across interfaces, add --data-path-addr:
docker swarm init --advertise-addr <management-ip> --data-path-addr <data-ip>
VPNs or dynamic IPs may contribute; disable unnecessary interfaces or the VPN temporarily if testing locally. For single-node testing, 127.0.0.1 works but limits multi-node swarms. If an interface has multiple IPv6 addresses, it may error; remove extras with ip addr del or specify IPv4 explicitly.
Citations:
- 1: https://stackoverflow.com/questions/43662237/in-docker-error-response-from-daemon-could-not-choose-an-ip-address-to-advert
- 2: https://stackoverflow.com/questions/38602903/docker-swarm-init-could-not-choose-an-ip-address-error
- 3: https://docs.docker.com/engine/swarm/swarm-mode
- 4: https://docs.docker.com/reference/cli/docker/swarm/init/
- 5: https://docs.docker.com/reference/cli/docker/swarm/init
🏁 Script executed:
# Test the proposed ip command to extract the default-route interface's IPv4
echo "Testing the ip command syntax:"
OUTPUT=$(ip -4 -o route get 1.1.1.1 2>/dev/null | awk '{for(i=1;i<=NF;i++) if ($i=="src") {print $(i+1); exit}}')
echo "Result: '$OUTPUT'"
# Also test without the 2>/dev/null to see what the normal output looks like
echo -e "\nFull output without error suppression:"
ip -4 -o route get 1.1.1.1 2>&1 || echo "(command failed)"
# Test the fallback logic
echo -e "\nTesting fallback:"
ADVERTISE_ADDR="${OUTPUT:-127.0.0.1}"
echo "Advertise address would be: '$ADVERTISE_ADDR'"
# Test with invalid dest to simulate error case
echo -e "\nSimulating error case (no route):"
OUTPUT2=$(ip -4 -o route get 999.999.999.999 2>/dev/null | awk '{for(i=1;i<=NF;i++) if ($i=="src") {print $(i+1); exit}}')
ADVERTISE_ADDR2="${OUTPUT2:-127.0.0.1}"
echo "Fallback result: '$ADVERTISE_ADDR2'"
Repository: wopr-network/platform
Length of output: 327
🏁 Script executed:
cat -n ops/core-server/cloud-init.sh | head -80
Repository: wopr-network/platform
Length of output: 3882
docker swarm init will fail on DigitalOcean droplets with both public and private networking enabled.
On DigitalOcean droplets configured with both public and private interfaces (the default setup), docker swarm init without --advertise-addr exits with "could not choose an IP address to advertise since this system has multiple addresses on different interfaces". Combined with set -euo pipefail (line 30), this failure on line 71 aborts cloud-init mid-execution, leaving the host half-provisioned—missing the deploy user, compose files, and project directory created by subsequent steps.
Since this is a single-host overlay network (no multi-node swarm traffic), binding to the default-route interface's IPv4 is safe and prevents the abort.
🛠️ Proposed fix
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" != "active" ]; then
- docker swarm init
+ # Pick the default-route interface's IPv4 so swarm init doesn't abort on
+ # droplets that have both public + private networking enabled.
+ ADVERTISE_ADDR=$(ip -4 -o route get 1.1.1.1 2>/dev/null | awk '{for(i=1;i<=NF;i++) if ($i=="src") {print $(i+1); exit}}')
+ docker swarm init --advertise-addr "${ADVERTISE_ADDR:-127.0.0.1}"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ops/core-server/cloud-init.sh` around lines 70 - 72, The current docker swarm
init call in the conditional block that checks Swarm.LocalNodeState must be made
explicit about which IP to advertise to avoid failing on DO droplets; replace
the plain docker swarm init with a call that passes --advertise-addr set to the
host's default-route IPv4 address (determine the default-route interface IP by
querying the kernel routing table, e.g. via ip route get and extracting the src
field) so the single-node overlay uses that address; update the docker swarm
init invocation in the same if block accordingly so cloud-init does not abort
under set -euo pipefail.
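One detail worth checking when applying this: under set -euo pipefail, a failing ip inside a bare command-substitution assignment makes the assignment itself return non-zero, so the script could exit before the 127.0.0.1 fallback is ever used. The sketch below guards the lookup with || true. It is a minimal illustration, assuming bash and the iproute2 ip tool; the function names extract_src and pick_advertise_addr are hypothetical.

```shell
# Print the address that follows "src" in `ip route get` output on stdin.
extract_src() {
  awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}'
}

# Resolve the default-route IPv4, falling back to 127.0.0.1.
# `|| true` keeps a failing `ip` from aborting a `set -euo pipefail` script.
pick_advertise_addr() {
  local addr
  addr=$(ip -4 -o route get 1.1.1.1 2>/dev/null | extract_src || true)
  printf '%s\n' "${addr:-127.0.0.1}"
}

# Usage inside the existing guard (assumed placement):
#   docker swarm init --advertise-addr "$(pick_advertise_addr)"
```

Splitting the parsing into extract_src lets the src extraction be tested against canned ip output without touching the host's routing table.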
878912d to 0ead186
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ops/TOPOLOGY.md`:
- Around line 592-593: Update the wording that currently reads "Next.js Astro
site" to accurately state "Next.js site" (or "Astro site" if the repo actually
uses Astro); specifically edit the sentence mentioning NEXT_PUBLIC_SITE_URL so
it reads something like "Next.js site. Gets canonical URL from
NEXT_PUBLIC_SITE_URL = https://nefariousplan.com (apex)." to remove the
contradictory "Astro" token and keep the environment variable reference
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 62a8aead-a288-4511-a67a-a63018eedb75
📒 Files selected for processing (3)
- ops/DECISIONS.md
- ops/TOPOLOGY.md
- ops/core-server/Caddyfile
✅ Files skipped from review due to trivial changes (1)
- ops/core-server/Caddyfile
│ Next.js Astro site. Gets canonical URL from NEXT_PUBLIC_SITE_URL
│ = https://nefariousplan.com (apex).
Clarify: "Next.js Astro site" is contradictory.
A web application is built with either Next.js or Astro, not both. Based on the context (mentions of NEXT_PUBLIC_SITE_URL in line 593 and the deploy workflow description), this appears to be a Next.js site.
📝 Proposed fix
- ├─ ops-web-1 (registry.wopr.bot/nefariousplan-web:latest) (3000)
- │ Next.js Astro site. Gets canonical URL from NEXT_PUBLIC_SITE_URL
+ ├─ ops-web-1 (registry.wopr.bot/nefariousplan-web:latest) (3000)
+ │ Next.js site. Gets canonical URL from NEXT_PUBLIC_SITE_URL
📝 Committable suggestion
- │ Next.js Astro site. Gets canonical URL from NEXT_PUBLIC_SITE_URL
- │ = https://nefariousplan.com (apex).
+ │ Next.js site. Gets canonical URL from NEXT_PUBLIC_SITE_URL
+ │ = https://nefariousplan.com (apex).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ops/TOPOLOGY.md` around lines 592 - 593, Update the wording that currently
reads "Next.js Astro site" to accurately state "Next.js site" (or "Astro site"
if the repo actually uses Astro); specifically edit the sentence mentioning
NEXT_PUBLIC_SITE_URL so it reads something like "Next.js site. Gets canonical
URL from NEXT_PUBLIC_SITE_URL = https://nefariousplan.com (apex)." to remove the
contradictory "Astro" token and keep the environment variable reference
unchanged.
Summary
Two things:
- Caddy: add apex vhost nefariousplan.com → ops-web-1:3000 on the wopr VPS, using DNS-01 via Cloudflare (matches every other apex on this box). Replace the previously hand-added (live-only, unversioned) blog.nefariousplan.com reverse-proxy block with a unified 301 redirect covering both blog.* and www.*. Apex is now canonical; blog and www are legacy aliases that 301 to it.
- Docs: describe this in the places the next person will actually look:
  - ops/TOPOLOGY.md: new "Nefariousplan" section under the other product architectures; Droplets table updated so wopr-platform is no longer labeled "WOPR platform" as if WOPR were the only tenant.
  - ops/DECISIONS.md: 2026-04-23 entry explaining why the cutover happened, what changed, and the rehearsable order that worked (including the first-attempt DNS-01 SERVFAIL transient).
Context
blog.nefariousplan.com had been hand-edited into the live Caddyfile on 138.68.30.247 with a TODO comment "blog subdomain for QA until top-level cutover." This PR does the cutover, commits the drift back to the repo, and extends the redirect block to cover www. Cloudflare DNS was flipped out-of-band (apex CNAME → pages.dev removed, apex + www A records added pointing at 138.68.30.247, DNS-only). MCP consumers retargeted at apex.
Paired with the nefariousplan repo (separate repo, TSavo/nefariousplan):
- NEXT_PUBLIC_SITE_URL → https://nefariousplan.com
- DEPLOY.md for the VPS flow, delete the broken-and-now-wrong scheduled-deploy.yml (wrangler → dead CF Pages), fold the daily-rebuild cron into build-and-deploy.yml.
Test plan
- curl https://nefariousplan.com/ → 200 from 138.68.30.247, Let's Encrypt E8 cert, SAN=nefariousplan.com
- curl https://blog.nefariousplan.com/ → 301 → https://nefariousplan.com/
- curl https://www.nefariousplan.com/ → 301 → https://nefariousplan.com/
- /mcp reconnect with new NP_API)
🤖 Generated with Claude Code
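The redirect checks in the test plan could be scripted along these lines. This is a minimal sketch, assuming curl is installed and the hosts are reachable from wherever it runs; probe and is_301_to are hypothetical helper names, not something this PR adds.

```shell
# Print "<status> <redirect target>" for a URL without following redirects.
probe() {
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "$1"
}

# Succeed only if a probe result is a 301 pointing at the expected target.
is_301_to() {
  [ "$1" = "301 $2" ]
}

# Usage (needs network access):
#   for host in blog www; do
#     is_301_to "$(probe "https://$host.nefariousplan.com/")" \
#       "https://nefariousplan.com/" || echo "$host: unexpected response" >&2
#   done
```

Comparing the status and the redirect target together catches both a missing redirect and a 301 that points somewhere other than the apex.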
Summary by CodeRabbit