Skip to content

Conversation

@thedotmack
Copy link
Owner

Summary

Fixes #514 - Excessive observer sessions created during startup-recovery (13,000+ orphaned .jsonl files)

Root Cause

When memorySessionId was null or equaled contentSessionId (placeholder), no resume parameter was passed to the SDK's query(). This caused the SDK to create a new session file on every call. If queries aborted before capturing the SDK's session_id, the placeholder remained, leading to cascading creation of orphaned files.

Changes

  • Generate deterministic ID mem-${contentSessionId} upfront instead of waiting to capture from SDK
  • Always pass it to resume parameter (no more conditional spreading)
  • Persist immediately to database before query starts
  • If SDK returns different ID, capture and use that going forward

Files Changed

  • src/services/worker/SDKAgent.ts - Session ID generation and resume logic

Test plan

  • Build succeeds: npm run build
  • Verify no conditional resume: grep -n "hasRealMemorySessionId" src/services/worker/SDKAgent.ts returns nothing
  • Verify deterministic ID: grep -n 'mem-\${session.contentSessionId}' src/services/worker/SDKAgent.ts returns line 74
  • Verify always resume: grep -n "resume:" src/services/worker/SDKAgent.ts shows unconditional usage
  • Manual test: Restart worker, trigger session, verify no new orphaned .jsonl files created

🤖 Generated with Claude Code

Fixes #514 - Excessive observer sessions created during startup-recovery

Root cause: When memorySessionId was null or equaled contentSessionId
(placeholder), no `resume` parameter was passed to the SDK's query().
This caused the SDK to create a NEW session file (.jsonl) on every call.
If queries aborted before capturing the SDK's session_id, the placeholder
remained, leading to cascading creation of 13,000+ orphaned files.

Fix:
- Generate deterministic ID `mem-${contentSessionId}` upfront
- Always pass it to `resume` parameter
- Persist immediately to database before query starts
- If SDK returns different ID, capture and use that going forward

This ensures one session file per contentSessionId, eliminating orphans.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@claude
Copy link

claude bot commented Jan 1, 2026

Code Review: PR #516 - Fix Orphaned Session Files

Summary

This PR addresses issue #514 by fixing excessive observer session creation during startup-recovery. The solution generates a deterministic memory session ID upfront and always passes it to the SDK's resume parameter, preventing the creation of 13,000+ orphaned .jsonl files.

✅ Strengths

1. Root Cause Analysis

The PR correctly identifies that the bug occurred when memorySessionId was null or equaled contentSessionId (placeholder), causing the SDK to create new session files on every call. The fix is well-targeted and addresses the core issue.

2. Deterministic ID Generation

The new approach using mem-${contentSessionId} is elegant:

  • Generates predictable, unique IDs
  • Maintains 1:1 relationship between content sessions and memory sessions
  • Prevents the race condition where SDK session IDs weren't captured before aborts

3. Immediate Persistence

The fix persists the memory session ID to the database before starting the SDK query (lines 77-80), which is critical for:

  • Recovery scenarios (worker restarts)
  • Preventing orphaned sessions if the query aborts early
  • Ensuring consistency across restarts

4. Fallback Handling

The code gracefully handles the case where the SDK returns a different session ID than requested (lines 106-118), logging a warning and updating to use the SDK's ID. This defensive programming is excellent.

5. Code Quality

  • Clear, detailed comments explaining the fix
  • Good logging for debugging
  • Maintains backward compatibility

🔍 Potential Issues & Concerns

1. CRITICAL: Potential UNIQUE Constraint Violation

Location: src/services/worker/SDKAgent.ts:79

The database schema (from SessionStore.ts:79) defines:

memory_session_id TEXT UNIQUE

The Problem:
When you call updateMemorySessionId(session.sessionDbId, memorySessionId) on line 79, you're setting memory_session_id to mem-${contentSessionId}. However, the database was initialized with memory_session_id = contentSessionId (placeholder) at session creation.

If another session already exists with memorySessionId = contentSessionId, this update will fail with a UNIQUE constraint violation because:

  1. Session A has memorySessionId = "content-123" (as placeholder)
  2. You try to update Session B to memorySessionId = "mem-content-123"
  3. Database rejects because "content-123" is already taken by Session A's placeholder

Evidence: Looking at SessionStore.ts:1195:

INSERT OR IGNORE INTO sdk_sessions
(content_session_id, memory_session_id, project, user_prompt, started_at, started_at_epoch, status)
VALUES (?, ?, ?, ?, ?, ?, 'active')
`).run(contentSessionId, contentSessionId, project, userPrompt, now.toISOString(), nowEpoch);

The second parameter sets memory_session_id = contentSessionId initially.

Suggested Fix:

// Generate deterministic memory session ID if not already set
const memorySessionId = (session.memorySessionId && session.memorySessionId !== session.contentSessionId)
  ? session.memorySessionId
  : `mem-${session.contentSessionId}`;

// Persist immediately if we just generated it
if (!session.memorySessionId || session.memorySessionId === session.contentSessionId) {
  session.memorySessionId = memorySessionId;
  
  // Use try-catch to handle potential UNIQUE constraint violations
  try {
    this.dbManager.getSessionStore().updateMemorySessionId(session.sessionDbId, memorySessionId);
  } catch (error) {
    logger.warn('SDK', 'Failed to update memory session ID, possibly due to constraint violation', {
      sessionDbId: session.sessionDbId,
      memorySessionId,
      error: String(error)
    });
    // Fallback: if update fails, let the SDK generate its own ID naturally
  }
}

2. Missing Test Coverage

Issue: There are no tests validating this new behavior, specifically:

  • Test that deterministic ID generation works correctly
  • Test that the ID is persisted before SDK query starts
  • Test recovery scenarios where worker restarts
  • Test the edge case where SDK returns a different ID

Recommendation: Add tests to tests/session_id_refactor.test.ts:

describe('Deterministic Memory Session ID Generation (Issue #514)', () => {
  it('should generate deterministic mem- prefixed ID on first run', () => {
    const contentSessionId = 'test-content-session';
    const sessionDbId = store.createSDKSession(contentSessionId, 'test-project', 'Test');
    
    // Simulate SDKAgent generating deterministic ID
    const deterministicId = `mem-${contentSessionId}`;
    store.updateMemorySessionId(sessionDbId, deterministicId);
    
    const session = store.getSessionById(sessionDbId);
    expect(session?.memory_session_id).toBe(deterministicId);
  });

  it('should not update if memory_session_id already differs from content_session_id', () => {
    const contentSessionId = 'existing-session';
    const existingMemoryId = 'sdk-captured-id-abc';
    
    const sessionDbId = store.createSDKSession(contentSessionId, 'test-project', 'Test');
    store.updateMemorySessionId(sessionDbId, existingMemoryId);
    
    // Verify it doesn't get overwritten by deterministic generation
    const session = store.getSessionById(sessionDbId);
    expect(session?.memory_session_id).toBe(existingMemoryId);
    expect(session?.memory_session_id).not.toBe(`mem-${contentSessionId}`);
  });
});

3. Documentation Gap

Issue: The comment on line 187 in SessionStore.ts references the old conditional resume logic that was removed:

// NOTE: memory_session_id is initialized to contentSessionId as a placeholder for FK purposes.
// The REAL memory session ID is captured by SDKAgent from the first SDK response
// and stored via updateMemorySessionId(). The resume logic checks if memorySessionId
// differs from contentSessionId before using it - see SDKAgent.startSession().

This comment is now outdated. The new behavior generates a deterministic ID upfront, not from the SDK response.

Recommendation: Update the comment to reflect the new behavior:

// NOTE: memory_session_id is initialized to contentSessionId as a placeholder.
// SDKAgent generates a deterministic ID (mem-${contentSessionId}) before starting
// the query to prevent orphaned session files (Issue #514). If the SDK returns
// a different ID, SDKAgent will update to use that instead.

4. Edge Case: Concurrent Updates

Scenario: If multiple workers or threads try to update the same session's memorySessionId concurrently, there could be a race condition.

Current Risk: Low, because typically only one SDKAgent processes a given session at a time, but worth documenting or adding a database transaction if needed.

5. Logging Verbosity

Minor Issue: The warning log on line 107 will fire even in normal operation if the SDK chooses to generate its own ID instead of using the deterministic one. This might create noise in logs.

Recommendation: Consider downgrading to logger.debug or adding context about whether this is expected:

logger.info('SDK', 'SDK returned different session ID, updating to use SDK-provided ID', {
  sessionDbId: session.sessionDbId,
  requested: session.memorySessionId,
  received: message.session_id
});

🔒 Security Considerations

No security issues identified. The deterministic ID generation doesn't expose sensitive information and maintains proper database constraints.

🚀 Performance Considerations

Positive Impact:

  • Eliminates 13,000+ unnecessary session file creations
  • Reduces API calls (each orphaned session made at least one LLM call)
  • Improves database query performance by reducing orphaned records

Neutral:

  • The additional updateMemorySessionId call before SDK query starts is negligible overhead

✅ Test Plan Verification

The PR's test plan is good but could be strengthened:

  • ✅ Build verification
  • ✅ Code pattern checks (grep commands)
  • ✅ Manual testing
  • ❌ Missing: Automated unit tests
  • ❌ Missing: Integration test for recovery scenarios

📋 Recommendations

High Priority

  1. Add error handling for the UNIQUE constraint violation scenario
  2. Add unit tests for the new deterministic ID behavior
  3. Update outdated comment in SessionStore.ts:1187

Medium Priority

  1. Consider logging level adjustment for the "different session ID" warning
  2. Add integration test for worker restart recovery scenario

Low Priority

  1. Document the mem- prefix convention in code or architecture docs

🎯 Verdict

Approval Status: ✅ APPROVE with recommended fixes

This PR successfully addresses the critical bug that was causing thousands of orphaned session files. The core solution is sound and well-implemented. The concerns raised (especially the UNIQUE constraint issue) should be addressed before merging, but the overall approach is excellent.

The fix demonstrates good understanding of the SDK lifecycle and implements a clever solution to ensure session resumption works correctly even across worker restarts.

Impact:

  • ✅ Fixes high-severity bug (13,000+ orphaned files)
  • ✅ Reduces unnecessary API costs
  • ✅ Improves user experience (claude -r now usable)
  • ✅ Maintains backward compatibility

Great work on diagnosing and fixing this complex issue! 🎉


Review completed by Claude Code Assistant on 2026-01-01

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@claude
Copy link

claude bot commented Jan 1, 2026

PR Review: Fix for Orphaned Session Files (Issue #514)

Summary

This PR successfully addresses the critical bug where 13,000+ orphaned .jsonl files were created during startup-recovery. The fix implements deterministic session ID generation and ensures the resume parameter is always passed to the SDK.


Code Quality: ✅ Excellent

Strengths

  1. Clean, focused changes - Only modifies what's necessary to fix the issue
  2. Well-documented logic - Clear comments explain the reasoning (lines 69-74, 90-95)
  3. Consistent with codebase patterns - Follows existing conventions from CLAUDE.md
  4. Proper error handling - Captures SDK session ID mismatches (lines 106-118)

Root Cause Analysis

The PR correctly identifies and fixes the core issue:

  • Before: When memorySessionId was null or equaled contentSessionId (placeholder), no resume parameter was passed
  • After: Always generates deterministic ID mem-${contentSessionId} upfront and passes it to resume

This prevents the SDK from creating new session files on every call.


Code Review by Section

1. Session ID Generation (Lines 69-80)

const memorySessionId = (session.memorySessionId && session.memorySessionId !== session.contentSessionId)
  ? session.memorySessionId
  : `mem-${session.contentSessionId}`;

✅ Correct Logic:

  • Reuses existing memorySessionId if it's valid (not null, not placeholder)
  • Generates deterministic ID if missing or placeholder
  • Persists immediately to database before query starts

Potential Edge Case (minor):
The logic correctly handles empty strings since they're falsy in JavaScript.

2. Resume Parameter (Lines 90-100)

const queryResult = query({
  prompt: messageGenerator,
  options: {
    model: modelId,
    resume: memorySessionId,  // Always pass - prevents orphaned sessions
    disallowedTools,
    // ...
  }
});

✅ Critical Fix: Always passing resume prevents SDK from creating new session files. This is the heart of the fix.

3. SDK Session ID Validation (Lines 104-118)

if (message.session_id && message.session_id !== session.memorySessionId) {
  logger.warn('SDK', 'SDK returned different session ID than requested', {
    sessionDbId: session.sessionDbId,
    requested: session.memorySessionId,
    received: message.session_id
  });
  session.memorySessionId = message.session_id;
  this.dbManager.getSessionStore().updateMemorySessionId(
    session.sessionDbId,
    message.session_id
  );
}

✅ Excellent defensive programming:

  • Handles case where SDK doesn't honor our deterministic ID
  • Captures and persists the actual ID for future calls
  • Logs warning for debugging

Potential Issues & Edge Cases

1. Session ID Format Validation ⚠️

The deterministic ID uses format mem-${contentSessionId}.

Question: Does the Agent SDK have any restrictions on session ID format? Consider:

  • Max length limits
  • Character restrictions (alphanumeric, hyphens, etc.)
  • Prefix restrictions

Recommendation: Add validation or document the expected format of contentSessionId.

2. Database Update Failure Handling 🔍

Line 79: updateMemorySessionId is called but there's no error handling if the database update fails.

Recommendation: Consider wrapping in try-catch or checking return value:

try {
  this.dbManager.getSessionStore().updateMemorySessionId(session.sessionDbId, memorySessionId);
} catch (error) {
  logger.error('SDK', 'Failed to persist memory session ID', { sessionDbId: session.sessionDbId }, error);
  // Consider whether to throw or continue - depends on criticality
}

Performance Considerations: ✅ Good

Positive Impacts

  1. Eliminates excessive file creation - Primary performance win
  2. Reduces API calls - No more orphaned sessions means no wasted API tokens
  3. Database writes are minimal - Only one extra write on first session initialization

No Performance Regressions

  • Deterministic ID generation is O(1) string concatenation
  • Database update is a single row update
  • No additional loops or blocking operations

Security Considerations: ✅ Safe

No Security Issues Identified

  1. Session ID is deterministic but not guessable - Based on contentSessionId which should already be random
  2. No injection risks - contentSessionId should be validated elsewhere in the codebase
  3. No exposure of sensitive data - Session IDs are internal to the SDK

Test Coverage: ⚠️ Needs Attention

The PR description includes a manual test plan:

  • ✅ Build succeeds
  • ✅ Verify no conditional resume
  • ✅ Verify deterministic ID
  • ✅ Manual test for orphaned files

Recommendations:

  1. Add unit tests for the ID generation logic
  2. Add integration test that verifies:
    • Same session uses same memory session ID across restarts
    • No orphaned .jsonl files are created
    • SDK session ID capture works correctly
  3. Consider adding a regression test that counts session files before/after a recovery scenario

Example Test Cases:

describe('SDKAgent session ID management', () => {
  test('generates deterministic memory session ID', () => {
    const contentSessionId = 'test-session-123';
    expect(memorySessionId).toBe('mem-test-session-123');
  });

  test('reuses existing valid memory session ID', () => {
    const existing = 'existing-mem-id';
    const contentSessionId = 'test-session';
    // Should return existing, not generate new
  });

  test('replaces placeholder memory session ID', () => {
    const memorySessionId = contentSessionId; // Placeholder case
    // Should generate new deterministic ID
  });
});

Additional Observations

1. Logging is Comprehensive ✅

Lines 82-88 provide excellent observability:

logger.info('SDK', 'Starting SDK query', {
  sessionDbId: session.sessionDbId,
  contentSessionId: session.contentSessionId,
  memorySessionId: memorySessionId,
  resume_parameter: memorySessionId,
  lastPromptNumber: session.lastPromptNumber
});

This will be invaluable for debugging in production.

2. Version Bump is Appropriate ✅

plugin/package.json: 8.5.18.5.2 is correct for a patch-level bug fix.

3. Built Files Updated ✅

plugin/scripts/worker-service.cjs shows the bundled changes, confirming the build process works.


Final Recommendations

Critical (Must Fix)

None - the core fix is solid.

High Priority (Should Fix)

  1. Add database error handling around updateMemorySessionId call (line 79)
  2. Add unit/integration tests to prevent regression

Medium Priority (Nice to Have)

  1. Document session ID format requirements in code comments
  2. Consider downgrading SDK ID mismatch log from warning to info if it's expected behavior
  3. Validate contentSessionId format if there's any possibility of external input

Low Priority (Future Enhancement)

  1. Add metrics/telemetry to track:
    • How often SDK returns different session IDs
    • Session resumption success rate
    • Orphaned file count (should be zero)

Conclusion

Overall Assessment: ✅ APPROVED

This is a well-crafted fix that directly addresses the root cause of issue #514. The code is clean, well-documented, and follows best practices. The deterministic session ID approach is sound and should prevent the cascading creation of orphaned session files.

Strengths

  • Directly solves the stated problem
  • Minimal, focused changes
  • Good defensive programming (SDK ID capture)
  • Excellent logging for observability

Areas for Improvement

  • Add error handling for database operations
  • Increase test coverage
  • Minor code documentation enhancements

Risk Assessment

Low Risk - This change is isolated to session initialization and uses a deterministic approach that's predictable and testable.


Suggested Next Steps

  1. ✅ Merge this PR (it's ready)
  2. Add follow-up PR for test coverage
  3. Monitor production logs for "SDK returned different session ID" warnings
  4. Consider adding automated cleanup for any remaining orphaned files as a one-time migration

Great work fixing this critical bug! The approach is sound and the implementation is solid. 🎉


Reviewed with Claude Code 🤖

- Removed deterministic memory session ID generation logic.
- Updated query options to use contentSessionId for resume parameter.
- Eliminated checks for SDK returning different session IDs.
- Streamlined logging and session ID updates for clarity and efficiency.
@claude
Copy link

claude bot commented Jan 1, 2026

Code Review

Summary

This PR simplifies session ID management in the SDKAgent by using contentSessionId directly as the resume parameter, eliminating the complex logic that tracked and captured memorySessionId from SDK responses. This change addresses issue #514 where 13,000+ orphaned observer session files were created.


✅ Positive Aspects

  1. Significant Code Simplification: The PR removes 34 lines of complex session ID tracking logic, making the code more maintainable and easier to understand.

  2. Clear Root Cause Fix: The issue was that when memorySessionId was null or equaled contentSessionId (placeholder), no resume parameter was passed to the SDK, causing it to create a new session file on every call. The fix ensures a deterministic session ID is always passed.

  3. Well-Documented: The PR description clearly explains the root cause, the fix, and includes a test plan.

  4. Reduces Complexity: Eliminates conditional logic (hasRealMemorySessionId) and database updates for capturing SDK-returned session IDs.


⚠️ Concerns & Questions

1. Semantic Change in Session ID Usage (Critical)

The old code used memorySessionId (captured from SDK) while the new code uses contentSessionId (from the primary session). This is a fundamental architectural change.

Questions:

  • Is contentSessionId guaranteed to be compatible with the SDK's session ID format?
  • What happens if the SDK expects to manage its own session IDs and receives a foreign ID?
  • Could this cause session state corruption or unexpected behavior in the SDK?

Recommendation: This needs validation that the SDK accepts externally-provided session IDs and that contentSessionId is in the correct format.

2. Removed SDK Session ID Capture (High Priority)

The old code captured the SDK's session ID and persisted it to the database via updateMemorySessionId(). The new code removes this entirely.

Questions:

  • Is the memory_session_id column in the database still needed?
  • Are there other parts of the codebase that rely on session.memorySessionId being set?
  • What about cross-restart recovery scenarios mentioned in the old comments?

Recommendation: Verify that removing this capture does not break other functionality. Run a grep for memorySessionId usage across the codebase.

3. Test Coverage (High Priority)

There are no unit or integration tests for SDKAgent, making it difficult to verify the fix works correctly and does not introduce regressions.

Questions:

  • How was this tested beyond manual verification?
  • What prevents a similar bug from being reintroduced in the future?

Recommendation: Add integration tests that verify:

  • No orphaned session files are created on restart
  • Resume functionality works correctly
  • Session state is properly maintained across SDK queries

4. Database Schema Implications (Medium Priority)

The commit history shows three commits with the middle one being a refactor. The final commit message suggests the deterministic ID generation was removed.

Questions:

  • Why was the deterministic mem-contentSessionId generation removed in the third commit?
  • What changed between commit 2 and commit 3?
  • Is the current approach (using contentSessionId directly) the final intended solution?

Recommendation: Clarify the evolution of the fix and ensure the final approach is intentional, not an intermediate state.

5. Missing Logging (Low Priority)

The old code had detailed logging for session ID capture and resume parameters. The new code removes this.

Recommendation: Consider adding at least debug-level logging to track the resume parameter value for troubleshooting.

6. Edge Cases (Medium Priority)

Questions:

Recommendation: Add defensive checks and error handling for edge cases.


🔍 Additional Investigation Needed

  1. Database Migration: Check if the memory_session_id column needs to be deprecated or if there is existing data that relies on the old behavior.

  2. Backward Compatibility: Verify what happens to sessions that were created with the old code and have a different memorySessionId than contentSessionId.

  3. SDK Documentation: Review the Agent SDK documentation for the resume parameter to understand its expected format and behavior.

  4. Related Code Paths: Check SessionManager, DatabaseManager.updateMemorySessionId(), and any other code that references memorySessionId.


📋 Test Plan Validation

The PR's test plan is minimal:

  • ✅ Build succeeds
  • ✅ Grep for removed code
  • ✅ Grep for deterministic ID (but this was removed in commit 3!)
  • ✅ Manual test for orphaned files

Missing Tests:

  • No verification that resume functionality actually works
  • No test for session state persistence across restarts
  • No test for concurrent sessions
  • No test for edge cases (null IDs, etc.)

🎯 Recommendations

Must Address Before Merge:

  1. Validate SDK Compatibility: Confirm that using contentSessionId as the resume parameter is valid and will not cause SDK issues.
  2. Search for memorySessionId Usage: Run rg memorySessionId to find all references and ensure removal does not break anything.
  3. Clarify Commit History: Explain why the deterministic ID generation was removed in commit 3.
  4. Add Tests: At minimum, add an integration test that verifies no orphaned files are created during restart scenarios.

Should Address:

  1. Add debug logging for the resume parameter value
  2. Add null checks for session.contentSessionId
  3. Document the architectural decision to use contentSessionId directly
  4. Consider deprecating the memory_session_id database column if no longer needed

Nice to Have:

  1. Add comprehensive test coverage for SDKAgent
  2. Add performance benchmarks to ensure the fix does not degrade performance
  3. Document migration path for existing sessions

📊 Code Quality: 6/10

Pros:

  • Simplifies complex logic
  • Addresses a critical bug
  • Reduces lines of code

Cons:

  • Lacks test coverage
  • Missing documentation for architectural change
  • Unclear commit history evolution
  • Potential breaking changes not fully addressed

✋ Verdict: Request Changes

While this PR addresses a critical bug and simplifies the code significantly, it makes fundamental architectural changes without sufficient validation and test coverage. The semantic change from SDK-managed session IDs to externally-provided IDs needs careful verification to avoid introducing new bugs.

Please address the "Must Address Before Merge" items before this PR can be safely merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] v8.5.2: Excessive observer sessions created during startup-recovery - 13,000+ orphaned files

2 participants