Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 8:26pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 8:26pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 8:26pm
rivet-site Ignored Ignored Preview Nov 4, 2025 8:26pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

PR Review: fix(runner): close active requests after onActorStop has finished running

Summary

This PR modifies the actor shutdown sequence to close active HTTP requests and WebSocket connections after the onActorStop hook completes, rather than immediately when the actor is removed. This allows the onActorStop hook to send final messages over the tunnel before it's torn down.

Code Quality & Best Practices

Positive Changes

  1. Better separation of concerns: Renaming unregisterActor to closeActiveRequests makes the intent clearer - this method is specifically about closing requests, not general actor unregistration.

  2. Improved comments: The added documentation explaining the Service Unavailable behavior during onActorStop execution and the WebSocket retry semantics is helpful.

  3. Correct sequencing: Moving closeActiveRequests to after onActorStop is the right approach for allowing cleanup operations to use the tunnel.

Issues & Concerns

1. Incomplete refactoring - potential bug ⚠️

The PR adds a comment at mod.ts:227:

// IMPORTANT: Make sure to call stopActiveRequests if calling #removeActor

However, the method is actually called closeActiveRequests, not stopActiveRequests. This is a documentation bug.

More critically, I don't see closeActiveRequests being called anywhere else that #removeActor is called. Let me check:

  • #stopAllActors() at line 190-192 calls forceStopActor(), which properly calls closeActiveRequests
  • But #removeActor is a private method that only removes the actor from the map - it doesn't close requests

The comment suggests there may be other code paths that call #removeActor directly without properly closing requests. If such paths exist, this could leak active requests/WebSockets.

Recommendation:

  • Fix the comment to say closeActiveRequests
  • Audit all callers of #removeActor to ensure they properly close requests first
  • Consider making #removeActor itself call closeActiveRequests to make it impossible to forget

2. Race condition potential ⚠️

In mod.ts:166-180, there's a window where:

  1. onActorStop is running (lines 172-175)
  2. New requests could arrive during this time
  3. Those requests will get a 503 Service Unavailable and retry (per the comment)
  4. closeActiveRequests is called (line 178)

Question: What happens if a request arrives during onActorStop execution but before closeActiveRequests?

Looking at tunnel.ts:212-222, if the actor doesn't exist, it returns 503. But the actor still exists during onActorStop because #removeActor was already called before this method. So the request might succeed or get partially processed.

Potential issue: The actor is removed from this.#actors at line 163 (#removeActor), so by the time onActorStop runs, hasActor() checks will fail. This means incoming requests during onActorStop will correctly get 503 responses. ✅

Actually, reviewing more carefully: #removeActor is called, then onActorStop runs. During onActorStop, the actor is already removed from the map, so new requests will correctly fail with 503. This is good.

3. Removed code not replaced

The PR removes the call to unregisterActor at mod.ts:245-246 (now deleted). The comment above #removeActor says to call stopActiveRequests (should be closeActiveRequests) if calling this method, but #removeActor itself doesn't enforce this.

This creates a footgun - future developers might call #removeActor and forget to close requests.

Recommendation: Consider one of:

  • Make #removeActor always call closeActiveRequests before removing the actor
  • Make closeActiveRequests idempotent and safe to call even if the actor is already removed
  • Add a TypeScript type-level safeguard or assertion

Performance Considerations

✅ No significant performance concerns. The reordering doesn't add overhead, and closing requests after cleanup is actually better as it gives cleanup code a chance to complete.

Security Concerns

✅ No new security issues introduced. The 503 retry behavior during onActorStop is actually good for availability and prevents dropped requests.

Test Coverage

⚠️ Missing test coverage for the new behavior:

  1. No test verifying that tunnel operations work during onActorStop
  2. No test verifying that requests arriving during onActorStop get 503 responses
  3. No test verifying that WebSockets are properly closed after onActorStop completes

Recommendation: Add integration tests covering:

  • Actor shutdown with active HTTP requests
  • Actor shutdown with active WebSocket connections
  • Sending messages over the tunnel during onActorStop
  • Request retry behavior when arriving during shutdown

Additional Observations

Documentation Issue

The comment at tunnel.ts:473-478 references a specific line in the codebase:

// https://github.com/rivet-dev/rivet/blob/222dae87e3efccaffa2b503de40ecf8afd4e31eb/engine/packages/pegboard-gateway/src/lib.rs#L238

This is good documentation, but the commit SHA is hardcoded. Consider linking to a stable branch or documenting the behavior inline to prevent link rot.

Naming Clarity

The rename from unregisterActor to closeActiveRequests is excellent - it's much clearer about what the method does.

Summary of Required Changes

Must fix:

  1. Fix the comment at mod.ts:227 to say closeActiveRequests instead of stopActiveRequests

Should consider:
2. Audit all code paths that call or might call #removeActor to ensure requests are closed
3. Make #removeActor safer by either always calling closeActiveRequests or documenting why callers must do it
4. Add test coverage for the new shutdown sequence

Nice to have:
5. Add integration tests for tunnel operations during onActorStop

Overall Assessment

The core change is sound and improves the shutdown sequence. The logic is correct: allowing onActorStop to use the tunnel before closing it makes sense.

However, there's a documentation bug and potential for future bugs if #removeActor is called without properly closing requests first.

Recommendation: Request changes to fix the comment and clarify the contract around #removeActor.


Review generated by Claude Code. Questions or concerns? Feel free to push back on any feedback.

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from 65f206c to b372bb1 Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_close_active_requests_after_onactorstop_has_finished_running branch from 30b896a to 09512fc Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

Code Review - PR #3323

Summary

This PR fixes a critical bug in the actor lifecycle management by ensuring that closeActiveRequests() is called after onActorStop() completes, rather than before actor removal. This allows the onActorStop handler to send final messages over the tunnel before connections are terminated.

Changes Analysis

engine/sdks/typescript/runner/src/mod.ts

Positive Changes:

  1. Correct lifecycle ordering (lines 177-178): Moving closeActiveRequests() after onActorStop() is the right fix. This ensures user code in onActorStop can send final messages/responses before connections are severed.

  2. Helpful comment added (line 227): The comment about calling stopActiveRequests serves as good documentation for future maintainers.

  3. Improved documentation (lines 167-170): The comment explaining Service Unavailable behavior during onActorStop execution is valuable for understanding the system behavior.

engine/sdks/typescript/runner/src/tunnel.ts

Positive Changes:

  1. Method rename (line 186): Renaming unregisterActor() to closeActiveRequests() better describes the method's actual behavior.

  2. Enhanced documentation (lines 473-478): The comment explaining WebSocket close behavior and linking to the specific code reference is excellent.

Concerns & Recommendations

1. Potential Race Condition (Medium Priority)

If requests arrive during onActorStop() execution, they receive a 503 Service Unavailable error and will be retried by Guard. However, there's a window where the retry might fail if the actor is fully removed before the retry completes.

Recommendation: Consider adding a state flag (e.g., stopping: boolean) to the ActorInstance interface to track actors in the stopping state.

2. Error Handling in onActorStop (Low Priority)

The comment mentions Pegboard will handle timeout with ACTOR_STOP_THRESHOLD_DURATION_MS but I don't see timeout enforcement in this code.

Recommendation: Consider adding explicit timeout handling here, or clarify in comments where the timeout is enforced.

3. Manual Enforcement Risk (Low Priority)

The comment at line 227 relies on developers remembering to call closeActiveRequests(). Since #removeActor() is private and only called from forceStopActor(), this is currently safe, but it's fragile.

Recommendation: Consider making #removeActor() take a boolean flag closeRequests and enforce it, or move closeActiveRequests() inside #removeActor().

4. Test Coverage (Medium Priority)

I don't see test coverage for this specific change. The test file is commented out.

Recommendation: Add a test case that verifies onActorStop() can send messages before connections close.

Security Considerations

No security concerns identified. The change maintains proper cleanup and doesn't introduce new attack vectors.

Performance Considerations

Minimal performance impact. The change only reorders operations without adding new work.

Final Verdict

Approved with suggestions. This is a well-reasoned fix that addresses a real bug. The implementation is clean and well-documented. The suggestions above are minor improvements that could enhance robustness but are not blockers.

Recommended Actions Before Merge:

  1. Core fix is correct - no changes needed
  2. Consider adding test coverage for this scenario
  3. Consider clarifying timeout enforcement location
  4. Optional: Add actor state tracking to better handle the stopping window

Great work on identifying and fixing this issue!

@claude
Copy link

claude bot commented Nov 4, 2025

Pull Request Review: Fix Runner Actor Stop Request Handling

Summary

This PR fixes a critical timing issue in the TypeScript runner where active requests were being closed before onActorStop could complete, preventing cleanup code from sending messages over the tunnel during shutdown.

Code Quality ✅

The changes are well-structured and minimal with clear separation of concerns and appropriate documentation.

Correctness ✅

The fix is correct:

  • Moving closeActiveRequests() to after onActorStop() allows cleanup code to use the tunnel
  • Method rename from unregisterActor() to closeActiveRequests() better reflects actual behavior
  • Added helpful comments explaining retry behavior

Potential Issues ⚠️

1. Comment Inconsistency (mod.ts:227)

The comment refers to stopActiveRequests but the method is named closeActiveRequests. Should be updated.

2. Race Condition Consideration

Actor is removed from map before onActorStop completes. New requests during this window will get 503 (which triggers retry). This appears intentional but worth verifying.

3. Test Coverage

The lifecycle test file is commented out. Should add tests for:

  • Sending messages from onActorStop handler
  • Handling concurrent requests during shutdown
  • WebSocket cleanup timing

Performance ✅

No concerns - simply reorders existing operations.

Security ✅

Proper resource cleanup with error handling.

Final Verdict ✅

Approve with minor suggestions

Core logic is sound and necessary. Main suggestions:

  1. Fix comment typo (stopActiveRequests → closeActiveRequests)
  2. Add test coverage for this scenario
  3. Consider adding debug logging at cleanup point

The fix correctly addresses the issue and allows cleanup code to function properly during actor shutdown.


Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from 102e612 to bd0a1b7 Compare November 4, 2025 19:24
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_close_active_requests_after_onactorstop_has_finished_running branch from 77aaf9c to 6655e70 Compare November 4, 2025 19:24
@claude
Copy link

claude bot commented Nov 4, 2025

Pull Request Review

Summary

This PR fixes a timing issue where active requests were being closed before onActorStop finished running. The change moves the closeActiveRequests call to after the onActorStop hook completes, allowing the hook to send messages over the tunnel during shutdown.

Code Quality ✅

Strengths:

  • Clear intent: The code change directly addresses the issue described in the commit message
  • Good documentation: Added helpful comments explaining the behavior
  • Proper error handling: The onActorStop call is already wrapped in try-catch
  • Follows existing patterns: Maintains consistency with the codebase style
  • Method rename: unregisterActorcloseActiveRequests is more descriptive

Potential Issues ⚠️

1. Race condition during onActorStop execution

Between when #removeActor is called and when closeActiveRequests is called, new requests could theoretically arrive. The comment states requests will receive Service Unavailable errors, but this depends on how the tunnel handles requests for actors not in #actors.

Recommendation: Verify that the tunnel properly rejects requests for removed actors with Service Unavailable.

2. Comment inconsistency in #removeActor

The comment says "Make sure to call stopActiveRequests" but the method is now closeActiveRequests. This should be updated.

3. Potential memory leak scenario

If onActorStop hangs, requests remain tracked until closeActiveRequests is called. While Pegboard handles timeout, it's unclear if that also triggers cleanup of the requests.

Test Coverage ⚠️

Missing test coverage:

  • No tests verify the new behavior
  • The existing test file (lifecycle.test.ts) is entirely commented out
  • Should test: requests during shutdown, tunnel usage during onActorStop, proper cleanup

Recommendation: Add integration tests verifying the new shutdown behavior.

Suggestions

  1. Update the comment in #removeActor to use correct method name: closeActiveRequests
  2. Add tests to verify tunnel messages work during onActorStop and cleanup happens correctly
  3. Verify race condition handling for requests arriving during shutdown
  4. Consider adding comment about Pegboard timeout behavior and request cleanup

Verdict: ✅ Approve with minor suggestions

This is a good fix that addresses a real timing issue. The code is clean, well-documented, and follows conventions. Main concerns are test coverage and verifying the race condition handling. The core logic change is sound and improves the actor shutdown process.

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review - PR #3323

Summary

This PR fixes a critical timing issue in the actor shutdown lifecycle. Previously, active requests and WebSocket connections were closed before onActorStop completed, preventing cleanup code from sending final messages over the tunnel. The fix moves closeActiveRequests() to execute after onActorStop() completes.


Positive Changes

  1. Correct lifecycle ordering: Moving closeActiveRequests() after onActorStop() (mod.ts:177-178) is the right fix. This allows user code in onActorStop to send final cleanup messages/responses before connections are terminated.

  2. Better method naming: Renaming unregisterActor()closeActiveRequests() (tunnel.ts:186) accurately describes what the method does.

  3. Improved documentation:

    • Clear explanation of Service Unavailable behavior during onActorStop (mod.ts:167-170)
    • Excellent comment linking to specific code showing WebSocket retry behavior (tunnel.ts:473-478)
  4. Maintains proper error handling: The existing try-catch around onActorStop is preserved.


⚠️ Issues Found

1. Comment Inconsistency (Minor Bug) 🐛

Location: mod.ts:227

The comment says:

// IMPORTANT: Make sure to call stopActiveRequests if calling #removeActor

But the method is actually named closeActiveRequests, not stopActiveRequests.

Fix: Update comment to match actual method name.


2. Missing Test Coverage ⚠️

The lifecycle test file (tests/lifecycle.test.ts) is entirely commented out. There are no tests verifying:

  • Tunnel operations work during onActorStop execution
  • WebSockets are properly closed after onActorStop completes
  • HTTP requests are properly terminated after cleanup
  • Requests arriving during onActorStop receive proper 503 responses

Recommendation: Add integration tests covering:

- Actor shutdown with active HTTP requests
- Actor shutdown with active WebSocket connections  
- Sending messages over tunnel during onActorStop
- Request retry behavior during shutdown window

3. Potential Edge Case Consideration 🤔

During the window when onActorStop is executing:

  1. Actor is already removed from this.#actors (via #removeActor at line 163)
  2. New incoming requests will fail hasActor() checks and receive 503 responses
  3. Guard will retry these requests (per the comment)

This behavior appears correct and intentional - but it would be good to verify:

  • Does Pegboard/Guard have proper retry limits to prevent infinite retries?
  • Are there metrics/logs to track these retry events?

This is more of a systems-level question than a code issue, but worth confirming.


📋 Code Quality

✅ Follows existing patterns and conventions
✅ Proper error handling maintained
✅ Clear documentation with helpful comments
✅ Minimal, focused change addressing the specific issue


🔒 Security & Performance

Security: No concerns - proper resource cleanup with error handling
Performance: No concerns - simply reorders existing operations without adding overhead


📝 Recommendations

Before merge:

  1. ✏️ Fix comment typo at mod.ts:227 (stopActiveRequestscloseActiveRequests)

Follow-up work:
2. 🧪 Add test coverage for the new shutdown sequence
3. 📊 Consider adding metrics/logging for requests arriving during shutdown
4. 📖 Document the shutdown sequence in a README or architecture doc


Final Verdict

APPROVED - This is a well-reasoned fix that correctly addresses a real bug. The core logic is sound, and the implementation is clean.

The comment typo should be fixed before merge, but it's a trivial change. The missing test coverage is a concern for future maintenance but not a blocker since this is part of a larger stack.

Great work identifying and fixing this timing issue! The change allows proper cleanup during actor shutdown, which is critical for graceful termination.


Review generated using Claude Code with repository context from CLAUDE.md

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:49 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:50 PM UTC: CI is running for this pull request on a draft pull request (#3352) due to your merge queue CI optimization settings.
  • Nov 4, 8:52 PM UTC: Merged by the Graphite merge queue via draft PR: #3352.

@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_close_active_requests_after_onactorstop_has_finished_running branch from 6655e70 to 856a3ff Compare November 4, 2025 20:25
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_runner_fix_http_requests_not_returning_503_when_actor_removed branch from bd0a1b7 to 6f5d613 Compare November 4, 2025 20:25
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Fix runner to close active requests after onActorStop

Overview

This PR fixes a critical timing issue in the actor lifecycle by ensuring that active HTTP requests and WebSocket connections are closed after onActorStop has completed, rather than before. This allows user code in onActorStop to send final messages through the tunnel before cleanup.

Code Quality & Best Practices

Positive Changes

  1. Improved Comment Documentation - The added comments clearly explain the behavior and expectations during actor shutdown
  2. Better Separation of Concerns - Renaming unregisterActor() to closeActiveRequests() more accurately describes what the function does
  3. Correct Ordering - Moving the tunnel cleanup to occur after onActorStop completes is the correct approach for graceful shutdown

Potential Issues

  1. Comment Mismatch - Line 227 in mod.ts says "stopActiveRequests" but the method is actually called "closeActiveRequests". Please update the comment.

  2. Race Condition Handling - The comment states requests during onActorStop will get 503, which is correct since the actor is removed first. Consider adding an explicit isStopping state flag for better observability.

  3. Code Coupling - Removing tunnel cleanup from #removeActor creates tighter coupling. Consider making #removeActor more explicit about requiring external cleanup.

Performance Considerations

The onActorStop callback now blocks cleanup of active requests. If user code hangs or takes too long, requests remain open until ACTOR_STOP_THRESHOLD_DURATION_MS timeout fires. Consider documenting expected behavior and timeout values.

Test Coverage

The test file is entirely commented out. Recommend adding specific tests for:

  • Sending messages during onActorStop
  • Request handling during actor stop (503 behavior)
  • Shutdown ordering

Summary

This is a good fix that addresses a real lifecycle ordering issue. The core change is correct and important.

Must Fix:

  • Update the comment on line 227 (stopActiveRequests → closeActiveRequests)

Should Consider:

  • Add test coverage for this critical path
  • Add explicit state tracking for the stopping phase
  • Document timeout behavior for users

Overall: Approve with minor changes recommended

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review - PR #3323

Summary

This PR fixes an important timing issue in the actor shutdown lifecycle by moving the call to close active requests after the onActorStop hook completes. This allows cleanup code in onActorStop to send final messages over the tunnel before connections are terminated.

Code Quality & Best Practices ✅

Strengths:

  1. Clear intent: Renaming unregisterActorcloseActiveRequests makes the purpose much clearer
  2. Good documentation: Added helpful comments explaining the Service Unavailable behavior and retry semantics
  3. Minimal changes: Focused fix without unnecessary refactoring
  4. Proper sequencing: The new order (remove actor → run onActorStop → close requests) is logical

Issues Found

1. Documentation Bug 🐛 (mod.ts:227)

// IMPORTANT: Make sure to call stopActiveRequests if calling #removeActor

The comment says stopActiveRequests but the method is actually closeActiveRequests. Please fix this typo.

2. Error Handling During Shutdown 💭

Currently if onActorStop throws an error (line 174), we log it but still proceed to close requests. This is correct behavior, but worth noting that errors in cleanup won't prevent connection closure.

Architecture & Design ✅

The change makes sense architecturally:

  • Actor is removed from the map first (line 163), so new requests will immediately get 503 responses
  • onActorStop runs with the tunnel still active, allowing final messages to be sent
  • Active requests/WebSockets are closed after cleanup completes
  • This prevents "Actor stopped" errors during legitimate cleanup operations

Testing Considerations 🧪

I notice there's a tests/lifecycle.test.ts file. Consider adding test coverage for:

  1. Verifying tunnel messages can be sent during onActorStop
  2. Confirming requests arriving during onActorStop get 503 responses
  3. Validating WebSockets are properly closed after onActorStop completes

Performance & Security ✅

  • Performance: No concerns - just reordering existing operations
  • Security: Proper resource cleanup with appropriate error handling

Related Code Patterns

The comment at tunnel.ts:473-478 references specific Pegboard Gateway behavior with a commit SHA. This is good documentation, though the hardcoded SHA may eventually become outdated. Consider if this should reference a stable branch or be documented inline.

Recommendation

Approve with minor changes requested:

  1. Fix the typo in the comment at mod.ts:227 (stopActiveRequests → closeActiveRequests)
  2. Consider adding test coverage for this scenario

The core fix is correct and necessary. Great work identifying and resolving this lifecycle issue!


Generated by Claude Code

@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-fix_runner_close_active_requests_after_onactorstop_has_finished_running branch November 4, 2025 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants