Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 8:26pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 8:26pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 8:26pm
rivet-site Ignored Ignored Preview Nov 4, 2025 8:26pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

Code Review: PR #3327 - Log explicit GC reason for pending requests

Summary

This PR improves debugging visibility by adding explicit logging of the reason why in-flight requests are being garbage collected. The refactoring changes the GC logic from a boolean flag to an enum-based reason system.

Positive Aspects

Improved Observability: Adding explicit GC reasons (MessageNotAcked vs WebSocketMessageNotAcked) makes debugging timeout issues much easier.

Better Code Structure: The enum-based approach is more explicit and self-documenting than the previous boolean keep variable.

Follows Project Conventions:

  • Uses structured logging with tracing (?reason, request_id=?Uuid::from_bytes(*request_id))
  • Lowercase log message ("gc collecting in flight request")
  • Proper use of let-chains for nested Option checks

Logic Correctness: The timeout logic has been inverted correctly (using <= instead of > and reversing the return logic).

Issues & Concerns

🐛 Critical Logic Bug

Lines 428-432 and 438-442: The timeout comparison logic is inverted. The current code breaks to return Some(reason) when the duration is less than or equal to the timeout, which means requests that are still within the valid time window will be marked for GC.

Current (incorrect) logic:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);  // Wrong!
}

Should be:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

This will cause premature garbage collection of requests that are still waiting for acknowledgments within the valid 30-second timeout window.

Original code comparison:

// Old code (correct):
keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;
// If duration > timeout, keep=true (don't GC)
// If duration <= timeout, keep=false (do GC) ❌ Wait, this was also wrong!

Actually, reviewing the original code more carefully:

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;
// Then at the end: if !keep { gc() }

So the original logic was: if duration > timeout, then keep=true (don't GC). The new code should maintain this behavior.

The new code returns reason.is_none() to indicate keep, so:

  • None = keep (don't GC)
  • Some(reason) = don't keep (do GC)

Therefore, you want to set Some(reason) when duration exceeds timeout:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

⚠️ Minor: Named Loop Label

Line 426: Using a labeled block 'reason: is clever but might be less idiomatic. Consider extracting this into a helper method for better readability:

fn check_gc_reason(req: &InFlightRequest, now: Instant) -> Option<MsgGcReason> {
    if let Some(earliest_pending_msg) = req.pending_msgs.first() {
        if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
            return Some(MsgGcReason::MessageNotAcked);
        }
    }

    if let Some(hs) = &req.hibernation_state
        && let Some(earliest_pending_ws_msg) = hs.pending_ws_msgs.first()
    {
        if now.duration_since(earliest_pending_ws_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
            return Some(MsgGcReason::WebSocketMessageNotAcked);
        }
    }

    None
}

This would make the main GC loop cleaner and easier to test.

Recommendations

  1. Fix the timeout comparison logic (critical - see above)
  2. Consider adding unit tests for the GC logic to prevent regression
  3. Consider extracting the reason check into a separate function for testability

Test Coverage

No tests are modified in this PR. Given that this changes critical timeout logic, consider adding tests for:

  • GC behavior when messages timeout
  • GC behavior when messages are still within timeout window
  • Different GC reasons being logged correctly

Overall, the approach is good and improves debugging capabilities, but the logic bug needs to be fixed before merging.

🤖 Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 7895f28 to 546e885 Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_pegboard_fix_tunnel_ack_from_gateway branch from 2769824 to 43b9be8 Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR improves observability in the garbage collection process by adding explicit logging for why pending requests are being GC'd. The refactoring makes the timeout logic clearer and adds structured logging to track which condition triggered the cleanup.

Positive Observations

Improved Observability: Adding the MsgGcReason enum with debug logging significantly improves troubleshooting capabilities. This will make it much easier to diagnose timeout issues in production.

Better Code Structure: The refactored logic using a labeled block ('reason:) is cleaner and more explicit than the previous approach with the keep flag.

Structured Logging: Following the project's structured logging conventions correctly (using ?reason instead of formatting into the message).

Correct Naming: Using lowercase log messages per CLAUDE.md conventions.

Issues & Concerns

🐛 Critical Bug: Inverted Logic

Lines 428-432 and 438-442: The timeout logic appears to be inverted. The code currently GCs requests when the duration is less than or equal to the timeout, but it should GC when the duration is greater than:

// Current (incorrect):
if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

// Should be:
if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

The original code had:

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

Which meant "keep the request if the timeout has NOT been exceeded" (duration is still within timeout). The new logic incorrectly inverts this.

Impact: This will cause requests to be GC'd immediately if they're within the timeout period, rather than waiting for them to actually timeout. This is a critical functional regression.

Recommendation: Change both <= operators to > on lines 429 and 439.

📝 Minor: Incomplete Logging Context

The request_id is logged, but additional context like send_instant or elapsed_duration could be helpful for debugging timeout issues:

tracing::debug!(
    request_id=?Uuid::from_bytes(*request_id),
    ?reason,
    elapsed_ms = now.duration_since(earliest_send_instant).as_millis(),
    "gc collecting in flight request"
);

This would help operators understand how far past the timeout a request was when collected.

Testing Recommendations

  • Add a unit test to verify requests are only GC'd after MESSAGE_ACK_TIMEOUT has elapsed, not before
  • Verify the logging output in logs to ensure the reason is captured correctly
  • Test both GC paths (regular messages and WebSocket messages)

Performance Considerations

No performance concerns - the refactoring maintains similar performance characteristics.

Security Considerations

No security issues identified.


Verdict: ❌ Request Changes - The inverted timeout logic is a critical bug that needs to be fixed before merging.

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR improves observability in the garbage collection logic by adding explicit logging of GC reasons when pending requests timeout. The changes refactor the conditional logic to be more explicit and easier to debug.

Positive Changes ✓

  1. Better Observability: Adding structured logging with explicit GC reasons makes debugging timeout issues much easier
  2. Improved Code Clarity: The refactored logic using a labeled block ('reason) makes the intent clearer than the previous keep variable approach
  3. Structured Logging: Correctly uses tracing's structured logging pattern with ?reason and request_id=?Uuid::from_bytes(*request_id), following the project conventions from CLAUDE.md
  4. Documentation: The enum variants include doc comments explaining what each GC reason represents

Logic Issue Found ⚠️

There's a critical logic bug in the timeout comparison operators. The conditions are inverted:

Lines 428-430 and 438-440:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

This says "if the message age is less than or equal to the timeout, mark it for GC". This is backwards!

Expected behavior: Messages should be GC'd when they've been pending longer than the timeout
Current behavior: Messages are GC'd when they're newer than the timeout

Fix Required

Change <= to > in both comparisons:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

And similarly for line 438-440 with WebSocketMessageNotAcked.

Original Logic Verification

Looking at the original code, it used:

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

This was correct - it kept messages that hadn't exceeded the timeout (> returns false for old messages). The refactor accidentally inverted this by using <= instead of >.

Additional Observations

  1. Let-chain usage: Good use of let-chain syntax on lines 435-436 combining optional checks
  2. Comment clarity: The comment on line 457 "Return true if the request was not gc'd" is helpful
  3. Consistent style: Code follows project formatting with hard tabs

Recommendation

Fix the comparison operators before merging to prevent incorrect GC behavior that would cause requests to be prematurely terminated.

@vercel
Copy link

vercel bot commented Nov 4, 2025

Deployment failed with the following error:

Too many requests - try again in 5 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR improves observability in the pegboard gateway's garbage collection logic by adding explicit logging when in-flight requests are collected, along with the specific reason for collection.

Positive Changes

Improved Observability: The addition of explicit GC reasons (MessageNotAcked vs WebSocketMessageNotAcked) makes debugging much easier. When requests are timing out, you'll now know exactly which type of message failed to be acknowledged.

Better Code Structure: The refactoring using a labeled block ('reason) and enum makes the control flow clearer and more maintainable than the previous boolean-based approach.

Follows Logging Conventions: The logging follows the project's structured logging patterns correctly:

  • Uses tracing::debug! appropriately for GC events
  • Properly formats the message in lowercase: "gc collecting in flight request"
  • Uses structured fields with ?reason and request_id rather than string interpolation

Logic Correctness: The timeout logic has been inverted correctly. The old code kept requests when duration > timeout (incorrect), while the new code GCs requests when duration > timeout (correct, based on the logic inversion at line 458).

Minor Observations

Logic Clarity: The condition on lines 428-431 might be slightly confusing on first read:

The code breaks with a GC reason when the message is not yet timed out, which feels counterintuitive. However, this is correct because the final check inverts the logic - requests are kept when reason.is_none(), meaning they haven't exceeded the timeout yet.

Consider adding a clarifying comment to make it more explicit that we're checking if the timeout has been exceeded, but this is optional - the current implementation is correct.

Test Coverage

No tests were added, but this is reasonable for a logging/observability improvement that doesn't change functional behavior.

Performance & Security

  • ✅ No performance concerns - the enum is zero-cost and only instantiated when GC occurs
  • ✅ No security concerns

Conclusion

Approved - This is a solid observability improvement that will help with debugging timeout issues in production. The code is correct, follows conventions, and makes the GC logic more maintainable.


🤖 Review generated by Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from 0423a08 to dcca850 Compare November 4, 2025 19:24
@claude
Copy link

claude bot commented Nov 4, 2025

Code Review

Summary

This PR improves observability in the garbage collection process by adding explicit logging for the reason why pending requests are being collected. The changes introduce a MsgGcReason enum to distinguish between two timeout scenarios.

Positive Aspects ✅

  1. Improved Observability: The addition of explicit GC reason logging will make debugging timeout issues significantly easier. Being able to distinguish between regular message timeouts and WebSocket message timeouts is valuable for troubleshooting.

  2. Code Clarity: The refactored logic using labeled blocks ('reason) is more explicit about the control flow and makes the timeout detection logic easier to follow.

  3. Follows Logging Conventions: The logging properly uses structured logging with ?reason and request_id, following the tracing patterns from CLAUDE.md.

Issues & Concerns ⚠️

Critical: Logic Inversion Bug 🐛

Location: Lines 428-432 and 438-442

The timeout logic appears to be inverted. The current code breaks with Some(MsgGcReason::...) when the duration is LESS THAN OR EQUAL TO the timeout, but it should be the opposite.

Current (incorrect) logic:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

This means:

  • If the message is fresh (< 30s old) → it gets marked for GC ❌
  • If the message is stale (> 30s old) → it stays alive ❌

Expected (correct) logic:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

Original code for comparison:
The original code used:

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

Where keep = true means "keep the request" and keep = false means "GC it". The new code inverted this by checking <= MESSAGE_ACK_TIMEOUT instead of > MESSAGE_ACK_TIMEOUT.

Impact: This is a critical bug that would cause premature GC of active requests and keep stale requests alive indefinitely.

Recommendations

  1. Fix the logic inversion by changing <= to > on lines 428-429 and 438-439
  2. Add unit tests for the GC logic to prevent similar issues. Consider testing:
    • Requests with messages within timeout window should be kept
    • Requests with messages beyond timeout should be GC'd
    • WebSocket hibernation timeout scenarios
    • Closed channel handling
  3. Consider integration testing with actual timeout scenarios to validate the end-to-end behavior

Minor Suggestions

  1. The comment on line 406 says "Any tunnel message not acked (TunnelAck)" but could be clearer as "Tunnel message exceeded ack timeout"
  2. Consider adding a metric counter for GC events by reason to track this in production

Style & Conventions ✅

  • Follows Rust formatting conventions
  • Proper use of structured logging per CLAUDE.md
  • Good use of labeled blocks for early exit logic
  • Enum naming follows conventions

Action Required: Please fix the logic inversion bug before merging. This is blocking.

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:52 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:53 PM UTC: CI is running for this pull request on a draft pull request (#3353) due to your merge queue CI optimization settings.
  • Nov 4, 8:56 PM UTC: Merged by the Graphite merge queue via draft PR: #3353.

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch from dcca850 to e4d3b77 Compare November 4, 2025 20:25
@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: Log explicit GC reason for pending requests

Summary

This PR improves observability in the garbage collection logic by adding explicit logging of why in-flight requests are being collected. The refactoring makes the code more maintainable and debuggable.

Positive Aspects

Improved Observability: Adding explicit logging with the GC reason (MessageNotAcked vs WebSocketMessageNotAcked) will make debugging timeout issues much easier.

Better Code Structure: The refactoring from a mutable keep flag to an explicit reason: Option<MsgGcReason> makes the control flow clearer and easier to reason about.

Follows Logging Conventions: The logging follows the repository's structured logging patterns using tracing::debug! with field parameters (request_id=?, ?reason).

Appropriate Debug Level: Using debug! level for GC collection events is appropriate - this is detailed operational information that's useful for troubleshooting but not critical for normal operation.

Issues Found

🐛 Critical Bug: Inverted Timeout Logic

The timeout comparison logic is inverted on lines 428-430 and 438-440:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

This says "if the message age is less than or equal to the timeout, mark it for GC". This is backwards!

Expected behavior: GC should happen when duration > MESSAGE_ACK_TIMEOUT (message is too old)
Current behavior: GC happens when duration <= MESSAGE_ACK_TIMEOUT (message is still fresh)

The original code had:

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

Which correctly kept the request when duration was greater than timeout (meaning it should be GC'd).

Fix: Change both conditions from <= to >:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked);
}

And similarly for the WebSocket message check.

Minor Suggestions

Documentation: Consider adding a doc comment to the MsgGcReason enum explaining the timeout threshold (30s) and what "not acked" means in this context.

Variable Naming: The variable reason might be more clearly named gc_reason to be explicit about what it represents.

Test Coverage

⚠️ Testing Recommendation: This bug would likely be caught by integration tests that verify:

  1. Requests are NOT GC'd before MESSAGE_ACK_TIMEOUT (30s)
  2. Requests ARE GC'd after MESSAGE_ACK_TIMEOUT
  3. The correct timeout reason is logged

Consider adding tests in /home/runner/work/rivet/rivet/engine/packages/pegboard-gateway/tests/ to verify this behavior.

Security & Performance

✔️ Security: No security concerns identified
✔️ Performance: The refactoring has minimal performance impact. The additional enum allocation is negligible.


Action Required: Please fix the inverted timeout logic before merging.

@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-chore_pegboard_log_explicit_gc_reason_for_pending_requests branch November 4, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants