
Conversation


@sitaowang1998 sitaowang1998 commented May 29, 2025

Description

Adds fine-grained failure recovery by rolling back the minimum number of tasks.

When tasks fail, Spider finds the minimal subgraph of the task graph that:

  1. Contains all failed tasks in the job.
  2. Contains only finished, running, and ready tasks.
  3. Has all data passed into the subgraph persisted.
  4. Has no children outside the subgraph that are finished, running, or ready.

Spider rolls back every task in the subgraph to the ready or pending state and executes them again.
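
A minimal sketch of the subgraph search, assuming stand-in types rather than the actual Spider classes: the computation amounts to a breadth-first traversal that closes the set over active children and over parents whose outputs are not persisted.

#include <deque>
#include <unordered_set>
#include <vector>

// Stand-in task record; the real implementation walks Spider's task graph.
struct TaskNode {
    std::vector<int> parents;   // producers of this task's inputs
    std::vector<int> children;  // consumers of this task's outputs
    bool inputs_persisted;      // are all input Data checkpointed?
    bool active;                // finished, running, or ready (not pending)
};

// Close the recovery set: every active child of a member joins the set, and
// so does every parent whose outputs a member still needs re-produced.
std::unordered_set<int> compute_recovery_subgraph(
        std::vector<TaskNode> const& tasks,
        std::vector<int> const& failed
) {
    std::unordered_set<int> subgraph(failed.begin(), failed.end());
    std::deque<int> queue(failed.begin(), failed.end());
    while (!queue.empty()) {
        TaskNode const& task = tasks[queue.front()];
        queue.pop_front();
        for (int const child : task.children) {
            if (tasks[child].active && subgraph.insert(child).second) {
                queue.push_back(child);
            }
        }
        if (!task.inputs_persisted) {
            for (int const parent : task.parents) {
                if (subgraph.insert(parent).second) {
                    queue.push_back(parent);
                }
            }
        }
    }
    return subgraph;
}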

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • GitHub workflows pass.
  • Unit tests pass in dev container.
  • Integration tests pass in dev container.

Summary by CodeRabbit

  • New Features
    • Introduced job recovery functionality to automatically detect and recover failed jobs, resetting task states as needed.
    • Added support for marking data as checkpointed (persisted), ensuring important data is retained during recovery.
  • Bug Fixes
    • Improved logic for resetting and retrying failed tasks, including better handling of task dependencies and data persistence.
  • Tests
    • Added comprehensive unit tests for job recovery and data persistence features.
    • Enhanced storage tests to verify correct handling of persisted data and partial job resets.

@sitaowang1998 sitaowang1998 requested a review from a team as a code owner May 29, 2025 20:34

coderabbitai bot commented May 29, 2025

Warning

Rate limit exceeded

@sitaowang1998 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 15 minutes and 41 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between f412bc3 and 55bfdaf.

📒 Files selected for processing (2)
  • src/spider/scheduler/scheduler.cpp (5 hunks)
  • src/spider/storage/mysql/MySqlStorage.cpp (8 hunks)

Walkthrough

This change introduces a job recovery mechanism to the scheduler, enabling detection and recovery of failed jobs and tasks by analyzing task graphs and data persistence. It adds new interfaces and implementations for marking data as persisted, resetting tasks, and retrieving failed jobs. Comprehensive tests and build configuration updates are included.

Changes

  • src/spider/CMakeLists.txt, tests/CMakeLists.txt: Added new source/header and test files for JobRecovery to the build/test targets.
  • src/spider/client/Data.hpp, src/spider/core/Data.hpp: Added checkpointing/persisted-state methods and flags to Data and its Builder.
  • src/spider/core/JobRecovery.cpp, src/spider/core/JobRecovery.hpp: Implemented and declared the JobRecovery class for reconstructing minimal subgraphs for job recovery.
  • src/spider/scheduler/scheduler.cpp: Added a recovery loop thread for failed jobs; integrates JobRecovery and task resetting into the scheduler.
  • src/spider/storage/DataStorage.hpp, src/spider/storage/mysql/MySqlStorage.cpp, src/spider/storage/mysql/MySqlStorage.hpp: Added/implemented the set_data_persisted method for marking data persisted in storage.
  • src/spider/storage/MetadataStorage.hpp, src/spider/storage/mysql/MySqlStorage.cpp, src/spider/storage/mysql/MySqlStorage.hpp: Added/implemented the get_failed_jobs and reset_tasks methods for job/task recovery.
  • tests/scheduler/test-JobRecovery.cpp: Added comprehensive unit tests for job recovery scenarios and task graph handling.
  • tests/storage/test-DataStorage.cpp: Extended data storage tests to verify persisted-state logic.
  • tests/storage/test-MetadataStorage.cpp: Added a test for partial job reset and task state transitions.
  • tests/utils/CoreDataUtils.hpp: Updated the data_equal utility to compare persisted state.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler
    participant StorageFactory
    participant MetadataStorage
    participant DataStorage
    participant JobRecovery

    loop Every 1000 seconds
        Scheduler->>StorageFactory: get_connection()
        Scheduler->>MetadataStorage: get_failed_jobs(conn)
        MetadataStorage-->>Scheduler: [failed_job_ids]
        loop for each failed_job_id
            Scheduler->>JobRecovery: construct(job_id, conn, data_store, metadata_store)
            Scheduler->>JobRecovery: compute_graph()
            JobRecovery->>MetadataStorage: load_task_graph(job_id)
            JobRecovery->>DataStorage: get_data(data_id)
            JobRecovery-->>Scheduler: ready_tasks, pending_tasks
            Scheduler->>MetadataStorage: reset_tasks(conn, ready_tasks, pending_tasks)
        end
    end
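Translated into C++, the loop in the diagram has roughly this shape. This is a self-contained toy with stubbed storage calls; the real scheduler threads a storage connection through every call and checks a StorageErr at each step.

#include <chrono>
#include <thread>
#include <vector>

// Toy stubs standing in for MetadataStorage, DataStorage, and JobRecovery.
std::vector<int> get_failed_jobs() { return {}; }
struct JobRecovery {
    explicit JobRecovery(int /*job_id*/) {}
    void compute_graph() {}  // loads the task graph and data persistence states
    std::vector<int> get_ready_tasks() const { return {}; }
    std::vector<int> get_pending_tasks() const { return {}; }
};
void reset_tasks(std::vector<int> const& /*ready*/, std::vector<int> const& /*pending*/) {}

void recovery_loop() {
    constexpr auto cRecoveryInterval = std::chrono::seconds{1000};
    while (true) {
        for (int const job_id : get_failed_jobs()) {
            JobRecovery recovery{job_id};
            recovery.compute_graph();
            reset_tasks(recovery.get_ready_tasks(), recovery.get_pending_tasks());
        }
        std::this_thread::sleep_for(cRecoveryInterval);
    }
}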

Suggested reviewers

  • davidlion



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
src/spider/core/JobRecovery.cpp (1)

1-1: Use angle brackets for header inclusion.

For consistency with other includes in this file, use angle brackets instead of quotes.

-#include "JobRecovery.hpp"
+#include <spider/core/JobRecovery.hpp>
src/spider/core/JobRecovery.hpp (3)

41-43: Consider returning const references for better performance.

The getter methods currently return vectors by value, which involves copying. Since these are likely read-only operations, returning const references would be more efficient.

-    auto get_ready_tasks() -> std::vector<boost::uuids::uuid>;
+    auto get_ready_tasks() const -> std::vector<boost::uuids::uuid> const&;

-    auto get_pending_tasks() -> std::vector<boost::uuids::uuid>;
+    auto get_pending_tasks() const -> std::vector<boost::uuids::uuid> const&;

Note: This would require converting the internal absl::flat_hash_set to std::vector during compute_graph() instead of in the getters.


52-53: Parameter could be passed by const reference.

The not_persisted parameter is an output parameter but the method signature doesn't clearly indicate this. Consider using a more explicit approach.

-    auto check_task_input(Task const& task, absl::flat_hash_set<boost::uuids::uuid>& not_persisted)
+    auto check_task_input(Task const& task, absl::flat_hash_set<boost::uuids::uuid>* not_persisted)
             -> StorageErr;

Using a pointer makes it clearer at the call site that this is an output parameter.
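
The difference is most visible at the call site (illustrative):

check_task_input(task, not_persisted);   // by reference: nothing signals the write
check_task_input(task, &not_persisted);  // by pointer: the output parameter stands out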


23-28: Constructor parameters could benefit from named parameter idiom.

With four shared_ptr parameters of different types, there's a risk of passing arguments in the wrong order. Consider using a builder pattern or configuration struct.

Consider creating a configuration struct:

struct JobRecoveryConfig {
    boost::uuids::uuid job_id;
    std::shared_ptr<StorageConnection> storage_connection;
    std::shared_ptr<DataStorage> data_store;
    std::shared_ptr<MetadataStorage> metadata_store;
};

JobRecovery(JobRecoveryConfig config);
tests/scheduler/test-JobRecovery.cpp (1)

114-115: Avoid shadowing member variables.

The local variables ready_tasks and pending_tasks could be confused with potential member variables. Consider using different names for clarity.

-    auto ready_tasks = recovery.get_ready_tasks();
-    auto pending_tasks = recovery.get_pending_tasks();
+    auto const recovered_ready_tasks = recovery.get_ready_tasks();
+    auto const recovered_pending_tasks = recovery.get_pending_tasks();
src/spider/storage/mysql/MySqlStorage.cpp (1)

1178-1181: Consider indexing for performance.

The SQL query filters on state and compares retry with max_retry. Ensure these columns are properly indexed for efficient query execution.

Consider adding a composite index on (state, retry, max_retry) or at least on state if not already present:

CREATE INDEX idx_tasks_state_retry ON tasks(state, retry, max_retry);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b7e5ff and accbdfd.

📒 Files selected for processing (15)
  • src/spider/CMakeLists.txt (2 hunks)
  • src/spider/client/Data.hpp (4 hunks)
  • src/spider/core/Data.hpp (1 hunks)
  • src/spider/core/JobRecovery.cpp (1 hunks)
  • src/spider/core/JobRecovery.hpp (1 hunks)
  • src/spider/scheduler/scheduler.cpp (5 hunks)
  • src/spider/storage/DataStorage.hpp (1 hunks)
  • src/spider/storage/MetadataStorage.hpp (1 hunks)
  • src/spider/storage/mysql/MySqlStorage.cpp (8 hunks)
  • src/spider/storage/mysql/MySqlStorage.hpp (2 hunks)
  • tests/CMakeLists.txt (1 hunks)
  • tests/scheduler/test-JobRecovery.cpp (1 hunks)
  • tests/storage/test-DataStorage.cpp (2 hunks)
  • tests/storage/test-MetadataStorage.cpp (1 hunks)
  • tests/utils/CoreDataUtils.hpp (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/spider/client/Data.hpp (3)
src/spider/client/TaskContext.hpp (1)
  • m_data_store (263-263)
src/spider/client/Exception.hpp (2)
  • ConnectionException (14-15)
  • ConnectionException (14-14)
src/spider/storage/DataStorage.hpp (16)
  • conn (23-23)
  • conn (26-27)
  • conn (30-31)
  • conn (33-33)
  • conn (44-49)
  • conn (60-65)
  • conn (67-67)
  • conn (68-68)
  • conn (69-69)
  • conn (71-72)
  • conn (74-78)
  • conn (80-84)
  • conn (86-90)
  • conn (92-92)
  • conn (94-94)
  • conn (96-96)
tests/scheduler/test-JobRecovery.cpp (2)
src/spider/core/Data.hpp (1)
  • gen (45-48)
src/spider/core/JobRecovery.hpp (1)
  • task (52-53)
src/spider/core/JobRecovery.cpp (2)
src/spider/core/JobRecovery.hpp (4)
  • JobRecovery (23-28)
  • task_id (76-76)
  • task (52-53)
  • data_id (63-63)
src/spider/client/TaskContext.hpp (2)
  • m_metadata_store (265-265)
  • m_data_store (263-263)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non-storage-unit-tests (ubuntu-22.04)
  • GitHub Check: non-storage-unit-tests (ubuntu-24.04)
  • GitHub Check: lint
🔇 Additional comments (30)
tests/CMakeLists.txt (1)

13-13: LGTM! Test integration follows established patterns.

The addition of the JobRecovery test file to the build system is correctly placed and follows the existing alphabetical ordering convention.

src/spider/storage/DataStorage.hpp (1)

68-68: Well-designed interface addition for persistence tracking.

The new set_data_persisted method follows the established interface patterns with appropriate parameter types and return value. This is a breaking change for existing implementations, but that's expected when adding new functionality.

src/spider/core/Data.hpp (2)

34-36: Clean implementation of persistence state management.

The set_persisted and is_persisted methods follow established naming conventions and const-correctness patterns. The [[nodiscard]] attribute on the getter is consistent with other query methods in the class.


43-43: Appropriate default initialization for persistence flag.

The m_persisted member is correctly initialized to false, which is the expected default state for newly created data objects that haven't been checkpointed yet.

src/spider/CMakeLists.txt (2)

7-7: Correct integration of JobRecovery source file.

The addition follows the established pattern for core components and maintains alphabetical ordering within the sources list.


31-31: Proper header file integration.

The JobRecovery header is correctly added to the headers list with appropriate placement maintaining alphabetical order.

tests/storage/test-DataStorage.cpp (2)

43-43: LGTM! Necessary change to enable mutation.

Removing the const qualifier is required to allow calling set_persisted(true) on the data object in the extended test.


59-64: LGTM! Comprehensive persistence state testing.

The test properly verifies the complete persistence workflow:

  • Sets the persistence flag on the data object
  • Calls the storage interface to persist the state
  • Retrieves the data again to verify persistence is maintained
  • Uses the updated data_equal utility that includes persistence comparison

This provides good coverage for the new persistence functionality.
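
Condensed, the flow looks something like the following. Catch2-style assertions are assumed, and the exact getters and error-checking helpers are guesses at the real test's shape, not its actual code.

spider::core::Data data{"payload"};
data.set_persisted(true);                               // 1. flag the in-memory object
REQUIRE(data_store->set_data_persisted(*conn, data).success());             // 2. persist
spider::core::Data retrieved;
REQUIRE(data_store->get_data(*conn, data.get_id(), &retrieved).success());  // 3. re-read
REQUIRE(data_equal(data, retrieved));                   // 4. now compares persisted state too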

tests/utils/CoreDataUtils.hpp (1)

23-25: LGTM! Essential addition for comprehensive data comparison.

The persistence state check follows the same pattern as other property comparisons and is necessary for accurate test assertions now that Data objects have persistence state.

src/spider/scheduler/scheduler.cpp (3)

11-11: LGTM! Necessary includes for recovery functionality.

The new includes support the recovery feature:

  • vector for job ID storage
  • boost/uuid/uuid_io.hpp for UUID string conversion in logging
  • JobRecovery.hpp for the recovery logic implementation

Also applies to: 21-21, 27-27


48-48: LGTM! Consistent timing configuration.

The recovery interval constant follows the same pattern as the cleanup interval, maintaining consistency in the codebase.


330-335: LGTM! Consistent thread management.

The recovery thread follows the established pattern:

  • Created with the same parameter passing style as other threads
  • Properly joined before server shutdown
  • Maintains consistency with existing thread lifecycle management

Also applies to: 339-339
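
The lifecycle pattern being described is the usual create-then-join (names illustrative):

std::thread recovery_thread{recovery_loop};
// ... scheduler server runs ...
recovery_thread.join();  // joined before shutdown, like the heartbeat and cleanup threads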

src/spider/storage/MetadataStorage.hpp (2)

81-89: LGTM! Well-documented interface addition.

The get_failed_jobs method is properly documented and follows established interface patterns:

  • Clear documentation explaining the method's purpose
  • Consistent parameter style with storage connection and output parameter
  • Appropriate return type for error handling

90-102: LGTM! Comprehensive task reset interface.

The reset_tasks method provides flexible task state management:

  • Detailed documentation explaining the dual-state reset capability
  • Separate parameters for ready and pending tasks allow fine-grained control
  • Consistent with other interface methods in style and error handling
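
Inferred from this review, the two declarations take approximately the following shape; the parameter types are assumptions based on the conventions described, not the header's verbatim contents.

// Approximate interface shape only.
virtual auto get_failed_jobs(
        StorageConnection& conn,
        std::vector<boost::uuids::uuid>* job_ids              // output: failed job IDs
) -> StorageErr = 0;

virtual auto reset_tasks(
        StorageConnection& conn,
        std::vector<boost::uuids::uuid> const& ready_tasks,   // reset to 'ready'
        std::vector<boost::uuids::uuid> const& pending_tasks  // reset to 'pending'
) -> StorageErr = 0;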
tests/storage/test-MetadataStorage.cpp (1)

549-633: Well-structured test for partial reset functionality.

The test case properly validates the new reset_tasks method by:

  • Setting up a complex task graph with dependencies
  • Executing only parent_1 before the reset
  • Verifying that parent_1 maintains its Succeed state and outputs after partial reset
  • Confirming that parent_2 and child_task are appropriately reset to Ready and Pending states respectively

The test complements the existing full job reset test and provides good coverage for the partial recovery feature.

src/spider/client/Data.hpp (2)

81-99: Consistent implementation of checkpointing functionality.

The set_checkpointed() method follows the established pattern from set_locality(), properly handling connection reuse and error cases. The integration with the storage layer via set_data_persisted() is appropriate.


129-137: Clean extension of the builder pattern.

The builder properly supports the checkpointing feature with:

  • A fluent set_checkpointed() method
  • Appropriate default value (false) for backward compatibility
  • Correct propagation of the persisted flag to the constructed Data object

Also applies to: 152-152, 200-200
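
A hypothetical call chain, where only set_checkpointed() is new in this PR and the other method names and the build signature are assumed:

auto data = spider::Data<int>::Builder{}
                    .set_checkpointed(true)  // new: the constructed Data starts persisted
                    .build(42);              // assumed build signature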

src/spider/storage/mysql/MySqlStorage.hpp (1)

78-84: Appropriate interface extensions for job recovery.

The new methods follow established patterns in the codebase:

  • get_failed_jobs() uses output parameters consistent with other getter methods
  • reset_tasks() clearly separates ready and pending task lists for precise state control
  • set_data_persisted() mirrors the existing set_data_locality() interface

These additions provide the necessary storage layer support for the fine-grained failure recovery feature.

Also applies to: 173-173

src/spider/core/JobRecovery.cpp (4)

35-58: Well-implemented graph traversal for recovery computation.

The breadth-first approach correctly identifies all tasks that need to be recovered by:

  1. Starting with failed tasks
  2. Processing their descendants and ancestors based on data persistence
  3. Propagating errors appropriately

The use of deque for BFS is appropriate for maintaining traversal order.


60-71: Efficient caching implementation.

The data caching logic appropriately:

  • Checks cache before making storage calls
  • Only caches successfully retrieved data
  • Returns the same error if retrieval fails

This optimization is important given the potentially large number of data objects in a task graph.
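
The memoization described is the standard check-then-insert shape; a sketch with assumed member names (m_conn, m_data_store, m_data_cache) and an assumed get_data signature and error type:

// Cache-first lookup: only successful fetches are cached.
auto JobRecovery::get_data_cached(boost::uuids::uuid const& data_id)
        -> std::optional<Data> {
    if (auto const it = m_data_cache.find(data_id); it != m_data_cache.end()) {
        return it->second;  // cache hit: skip the storage round trip
    }
    Data data;
    StorageErr const err = m_data_store->get_data(*m_conn, data_id, &data);
    if (!err.success()) {
        return std::nullopt;  // propagate the failure; the cache stays clean
    }
    m_data_cache.emplace(data_id, data);
    return data;
}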


73-99: Correct logic for identifying non-persisted dependencies.

The method properly:

  • Handles optional data IDs and parent task references
  • Collects parent tasks whose outputs haven't been persisted
  • Propagates storage errors appropriately

This ensures that all tasks producing non-persisted data consumed by the current task are included in the recovery subgraph.


101-148: Sound implementation of the core recovery algorithm.

The method correctly:

  1. Includes non-pending child tasks in the recovery subgraph (as they may have consumed failed task outputs)
  2. Determines task readiness based on input data persistence
  3. Propagates recovery to parent tasks when their outputs aren't persisted

The distinction between ready and pending tasks based on data persistence aligns with the PR's objective of fine-grained recovery.
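
Per task, the split reduces to a classification along these lines, where all_inputs_available is a hypothetical helper:

// A recovered task whose inputs are all immediately available restarts as
// 'ready'; one that must wait for upstream re-execution becomes 'pending'.
if (all_inputs_available(task)) {  // persisted Data or literal values
    ready_tasks.insert(task_id);
} else {
    pending_tasks.insert(task_id);
}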

tests/scheduler/test-JobRecovery.cpp (3)

26-70: LGTM! Well-structured test for basic recovery scenario.

The test properly sets up a single failed task and verifies that recovery correctly identifies it as ready. Good cleanup at the end.


502-505: Good use of STL algorithms for verification.

Using std::ranges::find to check task presence in the pending list is clean and idiomatic. The assertions properly verify all expected tasks are in the pending state.


179-377: Excellent documentation with visual diagrams.

The Graphviz dot notation in comments clearly illustrates the task dependency graphs being tested. This makes the test scenarios much easier to understand.

src/spider/storage/mysql/MySqlStorage.cpp (5)

1258-1283: Complex logic for data persistence management needs careful review.

The logic to determine which data should be marked as not persisted is complex and could have edge cases. The current implementation:

  1. Gets all persisted data referenced by tasks
  2. Filters out data referenced by tasks not in the ready/pending lists
  3. Updates remaining data to not persisted

Please verify that this logic correctly handles all scenarios, especially:

  • Data referenced by multiple tasks (some being reset, some not)
  • Data that should remain persisted due to external references

Consider adding detailed comments explaining the logic.
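
One way to read the described filtering, sketched with hypothetical container names:

// Start from every persisted Data the job's tasks reference...
absl::flat_hash_set<boost::uuids::uuid> to_unpersist = persisted_data_ids;
// ...drop anything still referenced by a task that is NOT being reset...
for (auto const& [task_id, data_ids] : task_data_refs) {
    if (!reset_task_ids.contains(task_id)) {
        for (auto const& data_id : data_ids) {
            to_unpersist.erase(data_id);
        }
    }
}
// ...and mark only the remainder as not persisted in storage.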


1851-1875: Good addition of retry logic before marking job as failed.

The modification properly checks retry count against max retries before marking the job as failed. This aligns well with the retry semantics needed for job recovery.


1775-1781: SQL query change from NOT EXISTS to NOT IN.

The change from NOT EXISTS to NOT IN for updating tasks to ready state could have different performance characteristics. NOT IN can be problematic with NULL values.

Verify that:

  1. The task_id column cannot contain NULL values
  2. Performance is acceptable with the NOT IN approach

Consider using NOT EXISTS for potentially better performance:

UPDATE `tasks` SET `state` = 'ready' 
WHERE `id` IN (SELECT `task_id` FROM `task_inputs` WHERE `output_task_id` = ?) 
AND `state` = 'pending' 
AND NOT EXISTS (
    SELECT 1 FROM `task_inputs` 
    WHERE `task_id` = `tasks`.`id` 
    AND `output_task_id` = ? 
    AND `value` IS NULL 
    AND `data_id` IS NULL
)

2167-2169: Consistent handling of persisted flag in data operations.

Good addition of the persisted column in INSERT statements for both driver and task data. The implementation is consistent across both methods.

Also applies to: 2217-2219


2411-2428: Well-implemented set_data_persisted method.

The new method properly updates the persisted flag with appropriate error handling and transaction management.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/spider/scheduler/scheduler.cpp (1)

48-48: Consider documenting the recovery interval choice.

The 1000-second (16.7 minutes) recovery interval is quite long. While this may be appropriate for production workloads to avoid excessive overhead, consider adding a comment explaining the rationale for this interval.

+// Recovery check interval in seconds (16.7 minutes to balance responsiveness with system load)
 constexpr int cRecoveryInterval = 1000;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between accbdfd and f412bc3.

📒 Files selected for processing (4)
  • src/spider/core/JobRecovery.cpp (1 hunks)
  • src/spider/core/JobRecovery.hpp (1 hunks)
  • src/spider/scheduler/scheduler.cpp (5 hunks)
  • tests/scheduler/test-JobRecovery.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/scheduler/test-JobRecovery.cpp
  • src/spider/core/JobRecovery.hpp
  • src/spider/core/JobRecovery.cpp
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/spider/scheduler/scheduler.cpp (1)
src/spider/storage/MetadataStorage.hpp (16)
  • conn (27-27)
  • conn (29-29)
  • conn (30-30)
  • conn (32-33)
  • conn (35-36)
  • conn (40-43)
  • conn (45-51)
  • conn (53-54)
  • conn (56-57)
  • conn (59-60)
  • conn (62-66)
  • conn (69-70)
  • conn (72-76)
  • conn (78-78)
  • conn (80-80)
  • conn (87-88)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: lint
  • GitHub Check: non-storage-unit-tests (ubuntu-24.04)
  • GitHub Check: non-storage-unit-tests (ubuntu-22.04)
🔇 Additional comments (3)
src/spider/scheduler/scheduler.cpp (3)

11-11: LGTM: Include statements are necessary for the new functionality.

The added includes are all required for the recovery feature:

  • <vector> for job_ids storage
  • <boost/uuid/uuid_io.hpp> for UUID string conversion in logging
  • <spider/core/JobRecovery.hpp> for the JobRecovery class

Also applies to: 21-21, 27-27


153-212: Excellent implementation of the recovery loop with comprehensive error handling.

The recovery loop follows established patterns and includes proper error handling at each stage. The previous critical issue with reset_tasks parameters has been correctly resolved - the function now properly passes recovery.get_ready_tasks() and recovery.get_pending_tasks() as separate parameters.

Key strengths:

  • Consistent pattern with other loop functions
  • Comprehensive error handling with appropriate logging
  • Proper resource management with storage connections
  • Graceful degradation on failures (continues with next job)

330-335: Proper thread management for the recovery functionality.

The recovery thread is correctly created and joined, following the same pattern as existing heartbeat and cleanup threads. The thread lifecycle management ensures clean shutdown.

Also applies to: 339-339
