feat: Add fine-grained failure recovery. #149
base: main
Conversation
Walkthrough

This change introduces a job recovery mechanism to the scheduler, enabling detection and recovery of failed jobs and tasks by analyzing task graphs and data persistence. It adds new interfaces and implementations for marking data as persisted, resetting tasks, and retrieving failed jobs. Comprehensive tests and build configuration updates are included.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler
    participant StorageFactory
    participant MetadataStorage
    participant DataStorage
    participant JobRecovery

    loop Every 1000 seconds
        Scheduler->>StorageFactory: get_connection()
        Scheduler->>MetadataStorage: get_failed_jobs(conn)
        MetadataStorage-->>Scheduler: [failed_job_ids]
        loop for each failed_job_id
            Scheduler->>JobRecovery: construct(job_id, conn, data_store, metadata_store)
            Scheduler->>JobRecovery: compute_graph()
            JobRecovery->>MetadataStorage: load_task_graph(job_id)
            JobRecovery->>DataStorage: get_data(data_id)
            JobRecovery-->>Scheduler: ready_tasks, pending_tasks
            Scheduler->>MetadataStorage: reset_tasks(conn, ready_tasks, pending_tasks)
        end
    end
```
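For orientation, the storage-interface additions named in the walkthrough and diagram reduce to roughly the following shape. This is a hedged sketch: the stand-in types and exact signatures are assumptions inferred from the diagram, not the PR's actual declarations.

```cpp
#include <vector>

#include <boost/uuid/uuid.hpp>

// Stand-ins for the real spider types (assumed shapes).
class StorageConnection {};
struct StorageErr {
    bool success = true;
};

class MetadataStorageRecoveryExt {
public:
    virtual ~MetadataStorageRecoveryExt() = default;

    // Fetch the ids of jobs that currently contain failed tasks.
    virtual auto get_failed_jobs(
            StorageConnection& conn,
            std::vector<boost::uuids::uuid>& job_ids
    ) -> StorageErr = 0;

    // Reset the given tasks to the ready and pending states, respectively.
    virtual auto reset_tasks(
            StorageConnection& conn,
            std::vector<boost::uuids::uuid> const& ready_tasks,
            std::vector<boost::uuids::uuid> const& pending_tasks
    ) -> StorageErr = 0;
};

class DataStorageRecoveryExt {
public:
    virtual ~DataStorageRecoveryExt() = default;

    // Mark a data object as persisted (checkpointed) in storage.
    virtual auto set_data_persisted(
            StorageConnection& conn,
            boost::uuids::uuid const& data_id
    ) -> StorageErr = 0;
};
```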
Actionable comments posted: 2
🧹 Nitpick comments (6)
src/spider/core/JobRecovery.cpp (1)
1-1: Use angle brackets for header inclusion.

For consistency with other includes in this file, use angle brackets instead of quotes.

```diff
-#include "JobRecovery.hpp"
+#include <spider/core/JobRecovery.hpp>
```

src/spider/core/JobRecovery.hpp (3)
41-43: Consider returning const references for better performance.

The getter methods currently return vectors by value, which involves copying. Since these are likely read-only operations, returning const references would be more efficient.

```diff
-    auto get_ready_tasks() -> std::vector<boost::uuids::uuid>;
+    auto get_ready_tasks() const -> std::vector<boost::uuids::uuid> const&;

-    auto get_pending_tasks() -> std::vector<boost::uuids::uuid>;
+    auto get_pending_tasks() const -> std::vector<boost::uuids::uuid> const&;
```

Note: This would require converting the internal absl::flat_hash_set to std::vector during compute_graph() instead of in the getters.
52-53: Make the output parameter explicit.

The not_persisted parameter is an output parameter, but the method signature doesn't clearly indicate this. Consider using a more explicit approach.

```diff
-    auto check_task_input(Task const& task, absl::flat_hash_set<boost::uuids::uuid>& not_persisted)
+    auto check_task_input(Task const& task, absl::flat_hash_set<boost::uuids::uuid>* not_persisted)
             -> StorageErr;
```

Using a pointer makes it clearer at the call site that this is an output parameter.
23-28: Constructor parameters could benefit from a named-parameter idiom.

With four constructor parameters, three of them shared_ptrs of different types, there's a risk of passing arguments in the wrong order. Consider using a builder pattern or configuration struct.

Consider creating a configuration struct:

```cpp
struct JobRecoveryConfig {
    boost::uuids::uuid job_id;
    std::shared_ptr<StorageConnection> storage_connection;
    std::shared_ptr<DataStorage> data_store;
    std::shared_ptr<MetadataStorage> metadata_store;
};

JobRecovery(JobRecoveryConfig config);
```

tests/scheduler/test-JobRecovery.cpp (1)
114-115: Avoid shadowing member variables.

The local variables ready_tasks and pending_tasks could be confused with potential member variables. Consider using different names for clarity.

```diff
-    auto ready_tasks = recovery.get_ready_tasks();
-    auto pending_tasks = recovery.get_pending_tasks();
+    auto const recovered_ready_tasks = recovery.get_ready_tasks();
+    auto const recovered_pending_tasks = recovery.get_pending_tasks();
```

src/spider/storage/mysql/MySqlStorage.cpp (1)
1178-1181: Consider indexing for performance.

The SQL query filters on state and compares retry with max_retry. Ensure these columns are properly indexed for efficient query execution.

Consider adding a composite index on (state, retry, max_retry), or at least on state if not already present:

```sql
CREATE INDEX idx_tasks_state_retry ON tasks(state, retry, max_retry);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- src/spider/CMakeLists.txt (2 hunks)
- src/spider/client/Data.hpp (4 hunks)
- src/spider/core/Data.hpp (1 hunks)
- src/spider/core/JobRecovery.cpp (1 hunks)
- src/spider/core/JobRecovery.hpp (1 hunks)
- src/spider/scheduler/scheduler.cpp (5 hunks)
- src/spider/storage/DataStorage.hpp (1 hunks)
- src/spider/storage/MetadataStorage.hpp (1 hunks)
- src/spider/storage/mysql/MySqlStorage.cpp (8 hunks)
- src/spider/storage/mysql/MySqlStorage.hpp (2 hunks)
- tests/CMakeLists.txt (1 hunks)
- tests/scheduler/test-JobRecovery.cpp (1 hunks)
- tests/storage/test-DataStorage.cpp (2 hunks)
- tests/storage/test-MetadataStorage.cpp (1 hunks)
- tests/utils/CoreDataUtils.hpp (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/spider/client/Data.hpp (3)
- src/spider/client/TaskContext.hpp (1)
  - m_data_store (263-263)
- src/spider/client/Exception.hpp (2)
  - ConnectionException (14-15)
  - ConnectionException (14-14)
- src/spider/storage/DataStorage.hpp (16)
  - conn (23-23), conn (26-27), conn (30-31), conn (33-33), conn (44-49), conn (60-65), conn (67-67), conn (68-68), conn (69-69), conn (71-72), conn (74-78), conn (80-84), conn (86-90), conn (92-92), conn (94-94), conn (96-96)

tests/scheduler/test-JobRecovery.cpp (2)
- src/spider/core/Data.hpp (1)
  - gen (45-48)
- src/spider/core/JobRecovery.hpp (1)
  - task (52-53)

src/spider/core/JobRecovery.cpp (2)
- src/spider/core/JobRecovery.hpp (4)
  - JobRecovery (23-28), task_id (76-76), task (52-53), data_id (63-63)
- src/spider/client/TaskContext.hpp (2)
  - m_metadata_store (265-265), m_data_store (263-263)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: non-storage-unit-tests (ubuntu-22.04)
- GitHub Check: non-storage-unit-tests (ubuntu-24.04)
- GitHub Check: lint
🔇 Additional comments (30)
tests/CMakeLists.txt (1)
13-13: LGTM! Test integration follows established patterns.

The addition of the JobRecovery test file to the build system is correctly placed and follows the existing alphabetical ordering convention.
src/spider/storage/DataStorage.hpp (1)
68-68: Well-designed interface addition for persistence tracking.

The new set_data_persisted method follows the established interface patterns with appropriate parameter types and return value. This is a breaking change for existing implementations, but that's expected when adding new functionality.

src/spider/core/Data.hpp (2)

34-36: Clean implementation of persistence state management.

The set_persisted and is_persisted methods follow established naming conventions and const-correctness patterns. The [[nodiscard]] attribute on the getter is consistent with other query methods in the class.

43-43: Appropriate default initialization for persistence flag.

The m_persisted member is correctly initialized to false, which is the expected default state for newly created data objects that haven't been checkpointed yet.

src/spider/CMakeLists.txt (2)

7-7: Correct integration of JobRecovery source file.

The addition follows the established pattern for core components and maintains alphabetical ordering within the sources list.

31-31: Proper header file integration.

The JobRecovery header is correctly added to the headers list with appropriate placement maintaining alphabetical order.

tests/storage/test-DataStorage.cpp (2)

43-43: LGTM! Necessary change to enable mutation.

Removing the const qualifier is required to allow calling set_persisted(true) on the data object in the extended test.

59-64: LGTM! Comprehensive persistence state testing.

The test properly verifies the complete persistence workflow:

- Sets the persistence flag on the data object
- Calls the storage interface to persist the state
- Retrieves the data again to verify persistence is maintained
- Uses the updated data_equal utility that includes persistence comparison

This provides good coverage for the new persistence functionality.

tests/utils/CoreDataUtils.hpp (1)

23-25: LGTM! Essential addition for comprehensive data comparison.

The persistence state check follows the same pattern as other property comparisons and is necessary for accurate test assertions now that Data objects have persistence state.

src/spider/scheduler/scheduler.cpp (3)

11-11: LGTM! Necessary includes for recovery functionality.

The new includes support the recovery feature:

- vector for job ID storage
- boost/uuid/uuid_io.hpp for UUID string conversion in logging
- JobRecovery.hpp for the recovery logic implementation

Also applies to: 21-21, 27-27

48-48: LGTM! Consistent timing configuration.

The recovery interval constant follows the same pattern as the cleanup interval, maintaining consistency in the codebase.

330-335: LGTM! Consistent thread management.

The recovery thread follows the established pattern:

- Created with the same parameter passing style as other threads
- Properly joined before server shutdown
- Maintains consistency with existing thread lifecycle management

Also applies to: 339-339

src/spider/storage/MetadataStorage.hpp (2)

81-89: LGTM! Well-documented interface addition.

The get_failed_jobs method is properly documented and follows established interface patterns:

- Clear documentation explaining the method's purpose
- Consistent parameter style with storage connection and output parameter
- Appropriate return type for error handling

90-102: LGTM! Comprehensive task reset interface.

The reset_tasks method provides flexible task state management:

- Detailed documentation explaining the dual-state reset capability
- Separate parameters for ready and pending tasks allow fine-grained control
- Consistent with other interface methods in style and error handling

tests/storage/test-MetadataStorage.cpp (1)

549-633: Well-structured test for partial reset functionality.

The test case properly validates the new reset_tasks method by:

- Setting up a complex task graph with dependencies
- Executing only parent_1 before the reset
- Verifying that parent_1 maintains its Succeed state and outputs after partial reset
- Confirming that parent_2 and child_task are appropriately reset to Ready and Pending states respectively

The test complements the existing full job reset test and provides good coverage for the partial recovery feature.

src/spider/client/Data.hpp (2)

81-99: Consistent implementation of checkpointing functionality.

The set_checkpointed() method follows the established pattern from set_locality(), properly handling connection reuse and error cases. The integration with the storage layer via set_data_persisted() is appropriate.

129-137: Clean extension of the builder pattern.

The builder properly supports the checkpointing feature with:

- A fluent set_checkpointed() method
- Appropriate default value (false) for backward compatibility
- Correct propagation of the persisted flag to the constructed Data object
Also applies to: 152-152, 200-200
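Taken together, the persistence-flag surface discussed above is small. A minimal standalone sketch of how the flag plausibly flows from builder to core object; the classes here are hypothetical stand-ins, not the real spider types:

```cpp
#include <cassert>

// Hypothetical stand-in for spider::core::Data with the new persistence flag.
class Data {
public:
    void set_persisted(bool const persisted) { m_persisted = persisted; }

    [[nodiscard]] auto is_persisted() const -> bool { return m_persisted; }

private:
    bool m_persisted = false;  // New objects start out not checkpointed.
};

// Hypothetical stand-in for the client-side builder extension.
class DataBuilder {
public:
    auto set_checkpointed(bool const checkpointed) -> DataBuilder& {
        m_checkpointed = checkpointed;  // Fluent setter, defaults to false.
        return *this;
    }

    [[nodiscard]] auto build() const -> Data {
        Data data;
        data.set_persisted(m_checkpointed);  // Propagate the flag on build.
        return data;
    }

private:
    bool m_checkpointed = false;
};

int main() {
    Data const data = DataBuilder{}.set_checkpointed(true).build();
    assert(data.is_persisted());
    return 0;
}
```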
src/spider/storage/mysql/MySqlStorage.hpp (1)
78-84: Appropriate interface extensions for job recovery.

The new methods follow established patterns in the codebase:

- get_failed_jobs() uses output parameters consistent with other getter methods
- reset_tasks() clearly separates ready and pending task lists for precise state control
- set_data_persisted() mirrors the existing set_data_locality() interface

These additions provide the necessary storage layer support for the fine-grained failure recovery feature.
Also applies to: 173-173
src/spider/core/JobRecovery.cpp (4)
35-58: Well-implemented graph traversal for recovery computation.

The breadth-first approach correctly identifies all tasks that need to be recovered by:
- Starting with failed tasks
- Processing their descendants and ancestors based on data persistence
- Propagating errors appropriately
The use of deque for BFS is appropriate for maintaining traversal order.
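The traversal shape described here is a plain BFS over the task graph. A self-contained sketch under stated assumptions: integer ids stand in for UUIDs and a toy node type replaces the real TaskGraph and absl containers.

```cpp
#include <deque>
#include <map>
#include <set>
#include <vector>

// Toy task node; integer ids stand in for UUIDs.
struct TaskNode {
    std::vector<int> parents;
    std::vector<int> children;
    bool output_persisted = false;
};

// Collect every task that must be re-run, starting from the failed ones:
// descendants are always enqueued (they may have consumed lost outputs),
// while ancestors are enqueued only when the data they produced was never
// persisted and so must be regenerated.
auto collect_recovery_set(
        std::map<int, TaskNode> const& graph,
        std::vector<int> const& failed_tasks
) -> std::set<int> {
    std::deque<int> queue(failed_tasks.begin(), failed_tasks.end());
    std::set<int> visited;
    while (!queue.empty()) {
        int const task_id = queue.front();
        queue.pop_front();
        if (!visited.insert(task_id).second) {
            continue;  // Already in the recovery set.
        }
        TaskNode const& node = graph.at(task_id);
        for (int const child : node.children) {
            queue.push_back(child);
        }
        for (int const parent : node.parents) {
            if (!graph.at(parent).output_persisted) {
                queue.push_back(parent);
            }
        }
    }
    return visited;
}
```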
60-71: Efficient caching implementation.

The data caching logic appropriately:
- Checks cache before making storage calls
- Only caches successfully retrieved data
- Returns the same error if retrieval fails
This optimization is important given the potentially large number of data objects in a task graph.
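The pattern being praised is a get-or-fetch cache. A minimal sketch, assuming stub types in place of the real spider::core::Data and storage connection:

```cpp
#include <map>
#include <string>

// Stub types; the real code caches data objects keyed by UUID.
struct Data {
    std::string value;
};
enum class StorageErr : char { Success, NotFound };

class DataCache {
public:
    // Consult the local cache first and only hit storage on a miss, caching
    // the result on success so repeated lookups during graph computation
    // stay cheap. Failures are returned as-is and never cached.
    auto get_data(int const data_id, Data& out) -> StorageErr {
        if (auto const it = m_cache.find(data_id); it != m_cache.end()) {
            out = it->second;
            return StorageErr::Success;
        }
        Data fetched;
        StorageErr const err = fetch_from_storage(data_id, fetched);
        if (err != StorageErr::Success) {
            return err;
        }
        m_cache.emplace(data_id, fetched);
        out = fetched;
        return StorageErr::Success;
    }

private:
    auto fetch_from_storage(int /*data_id*/, Data& /*out*/) -> StorageErr {
        return StorageErr::NotFound;  // Placeholder for the real storage call.
    }

    std::map<int, Data> m_cache;
};
```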
73-99: Correct logic for identifying non-persisted dependencies.

The method properly:
- Handles optional data IDs and parent task references
- Collects parent tasks whose outputs haven't been persisted
- Propagates storage errors appropriately
This ensures that all tasks producing non-persisted data consumed by the current task are included in the recovery subgraph.
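The input check described above reduces to a short loop. A sketch under stated assumptions: the input descriptor is illustrative, and integer ids stand in for UUIDs and task references.

```cpp
#include <optional>
#include <set>
#include <vector>

// Illustrative input descriptor; the real Task exposes optional data ids and
// the id of the task that produced each input.
struct TaskInput {
    std::optional<int> data_id;      // Absent for literal-value inputs.
    std::optional<int> parent_task;  // Producer of this input, if any.
};

// For each input backed by a data object that is not persisted, record the
// parent task that produced it, so it can be pulled into the recovery set.
void collect_not_persisted_parents(
        std::vector<TaskInput> const& inputs,
        std::set<int> const& persisted_data,
        std::set<int>& not_persisted_parents
) {
    for (TaskInput const& input : inputs) {
        if (!input.data_id.has_value() || !input.parent_task.has_value()) {
            continue;  // Literals and external data need no recovery.
        }
        if (persisted_data.count(*input.data_id) == 0) {
            not_persisted_parents.insert(*input.parent_task);
        }
    }
}
```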
101-148: Sound implementation of the core recovery algorithm.

The method correctly:
- Includes non-pending child tasks in the recovery subgraph (as they may have consumed failed task outputs)
- Determines task readiness based on input data persistence
- Propagates recovery to parent tasks when their outputs aren't persisted
The distinction between ready and pending tasks based on data persistence aligns with the PR's objective of fine-grained recovery.
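That ready/pending distinction can be stated as a one-function predicate. A sketch with an assumed input-state type, not the PR's actual representation:

```cpp
#include <vector>

// Illustrative view of one task input after a failure: an input is still
// satisfiable if it is a literal value or its backing data is persisted.
struct InputState {
    bool is_literal = false;
    bool data_persisted = false;
};

// A recovered task is "ready" only when every input survives the failure;
// otherwise it stays "pending" until its re-run parents regenerate the data.
auto is_ready_after_recovery(std::vector<InputState> const& inputs) -> bool {
    for (InputState const& input : inputs) {
        if (!input.is_literal && !input.data_persisted) {
            return false;  // Must wait on a parent to produce this input again.
        }
    }
    return true;
}
```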
tests/scheduler/test-JobRecovery.cpp (3)
26-70: LGTM! Well-structured test for basic recovery scenario.

The test properly sets up a single failed task and verifies that recovery correctly identifies it as ready. Good cleanup at the end.
502-505: Good use of STL algorithms for verification.

Using std::ranges::find to check task presence in the pending list is clean and idiomatic. The assertions properly verify all expected tasks are in the pending state.
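The idiom in isolation, wrapped in a hypothetical helper (not a function from the test itself):

```cpp
#include <algorithm>
#include <vector>

#include <boost/uuid/uuid.hpp>

// Returns true if task_id appears in the recovered pending list.
auto contains_task(
        std::vector<boost::uuids::uuid> const& pending_tasks,
        boost::uuids::uuid const& task_id
) -> bool {
    return std::ranges::find(pending_tasks, task_id) != pending_tasks.end();
}
```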
179-377: Excellent documentation with visual diagrams.

The Graphviz dot notation in comments clearly illustrates the task dependency graphs being tested. This makes the test scenarios much easier to understand.
src/spider/storage/mysql/MySqlStorage.cpp (5)
1258-1283: Complex logic for data persistence management needs careful review.

The logic to determine which data should be marked as not persisted is complex and could have edge cases. The current implementation:
- Gets all persisted data referenced by tasks
- Filters out data referenced by tasks not in the ready/pending lists
- Updates remaining data to not persisted
Please verify that this logic correctly handles all scenarios, especially:
- Data referenced by multiple tasks (some being reset, some not)
- Data that should remain persisted due to external references
Consider adding detailed comments explaining the logic.
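One way to make the intended filtering explicit is a set difference. A sketch of the logic described above with integer ids standing in for UUIDs; this is an in-memory restatement, not the actual SQL implementation:

```cpp
#include <set>

// Start from all persisted data referenced by the job's tasks, keep only the
// entries not referenced by any task that survives the reset, and mark that
// remainder as not persisted.
auto data_to_unpersist(
        std::set<int> const& persisted_data,
        std::set<int> const& referenced_by_surviving_tasks
) -> std::set<int> {
    std::set<int> result;
    for (int const data_id : persisted_data) {
        if (referenced_by_surviving_tasks.count(data_id) == 0) {
            result.insert(data_id);  // Only tasks being reset reference it.
        }
    }
    return result;
}
```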
1851-1875: Good addition of retry logic before marking job as failed.

The modification properly checks the retry count against max retries before marking the job as failed. This aligns well with the retry semantics needed for job recovery.
1775-1781: SQL query change from NOT EXISTS to NOT IN.

The change from NOT EXISTS to NOT IN for updating tasks to the ready state could have different performance characteristics. NOT IN can also be problematic with NULL values: if the subquery returns any NULL, the NOT IN predicate evaluates to NULL rather than true, so no rows match.

Verify that:

- The task_id column cannot contain NULL values
- Performance is acceptable with the NOT IN approach
Consider using NOT EXISTS for potentially better performance:
```sql
UPDATE `tasks`
SET `state` = 'ready'
WHERE `id` IN (SELECT `task_id` FROM `task_inputs` WHERE `output_task_id` = ?)
  AND `state` = 'pending'
  AND NOT EXISTS (
    SELECT 1 FROM `task_inputs`
    WHERE `task_id` = `tasks`.`id`
      AND `output_task_id` = ?
      AND `value` IS NULL
      AND `data_id` IS NULL
  )
```
2167-2169: Consistent handling of persisted flag in data operations.

Good addition of the persisted column in INSERT statements for both driver and task data. The implementation is consistent across both methods.

Also applies to: 2217-2219
2411-2428: Well-implemented set_data_persisted method.

The new method properly updates the persisted flag with appropriate error handling and transaction management.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/spider/scheduler/scheduler.cpp (1)
48-48: Consider documenting the recovery interval choice.

The 1000-second (16.7 minutes) recovery interval is quite long. While this may be appropriate for production workloads to avoid excessive overhead, consider adding a comment explaining the rationale for this interval.

```diff
+// Recovery check interval in seconds (16.7 minutes to balance responsiveness with system load)
 constexpr int cRecoveryInterval = 1000;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/spider/core/JobRecovery.cpp (1 hunks)
- src/spider/core/JobRecovery.hpp (1 hunks)
- src/spider/scheduler/scheduler.cpp (5 hunks)
- tests/scheduler/test-JobRecovery.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/scheduler/test-JobRecovery.cpp
- src/spider/core/JobRecovery.hpp
- src/spider/core/JobRecovery.cpp
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/spider/scheduler/scheduler.cpp (1)
src/spider/storage/MetadataStorage.hpp (16)
- conn (27-27), conn (29-29), conn (30-30), conn (32-33), conn (35-36), conn (40-43), conn (45-51), conn (53-54), conn (56-57), conn (59-60), conn (62-66), conn (69-70), conn (72-76), conn (78-78), conn (80-80), conn (87-88)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: lint
- GitHub Check: non-storage-unit-tests (ubuntu-24.04)
- GitHub Check: non-storage-unit-tests (ubuntu-22.04)
🔇 Additional comments (3)
src/spider/scheduler/scheduler.cpp (3)
11-11: LGTM: Include statements are necessary for the new functionality.

The added includes are all required for the recovery feature:

- <vector> for job_ids storage
- <boost/uuid/uuid_io.hpp> for UUID string conversion in logging
- <spider/core/JobRecovery.hpp> for the JobRecovery class

Also applies to: 21-21, 27-27
153-212: Excellent implementation of the recovery loop with comprehensive error handling.

The recovery loop follows established patterns and includes proper error handling at each stage. The previous critical issue with the reset_tasks parameters has been correctly resolved: the function now properly passes recovery.get_ready_tasks() and recovery.get_pending_tasks() as separate parameters.

Key strengths (see the sketch after this list):
- Consistent pattern with other loop functions
- Comprehensive error handling with appropriate logging
- Proper resource management with storage connections
- Graceful degradation on failures (continues with next job)
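As promised above, a sketch of the loop's control flow. The stub types and helper names here are assumptions; the real implementation lives in scheduler.cpp and uses the actual storage interfaces and logging.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Stub types standing in for the real Spider interfaces.
struct StorageErr {
    bool ok = true;
};
struct Connection {};
using JobId = int;

auto get_failed_jobs(Connection& /*conn*/, std::vector<JobId>& /*job_ids*/) -> StorageErr {
    return {};  // Placeholder for the metadata-store call.
}

auto recover_job(Connection& /*conn*/, JobId /*job_id*/) -> StorageErr {
    // Placeholder for: construct JobRecovery, compute_graph(), then
    // reset_tasks(conn, ready_tasks, pending_tasks).
    return {};
}

void recovery_loop(Connection& conn, bool const& keep_running) {
    while (keep_running) {
        std::vector<JobId> job_ids;
        if (get_failed_jobs(conn, job_ids).ok) {
            for (JobId const job_id : job_ids) {
                if (!recover_job(conn, job_id).ok) {
                    continue;  // Log and move on to the next failed job.
                }
            }
        }
        // Interval taken from the PR's cRecoveryInterval constant (1000 s).
        std::this_thread::sleep_for(std::chrono::seconds{1000});
    }
}
```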
330-335: Proper thread management for the recovery functionality.

The recovery thread is correctly created and joined, following the same pattern as existing heartbeat and cleanup threads. The thread lifecycle management ensures clean shutdown.
Also applies to: 339-339
Description
Adds fine-grained failure recovery by rolling back the minimum number of tasks.
When tasks fail, Spider finds the minimum subgraph of the task graph such that all Data passed into the subgraph is persisted. Spider then rolls back every task in that subgraph to the ready or pending state and executes them again. For example, in a chain A -> B -> C where B fails and A's output is persisted, only B and C are rolled back and re-run, while A's persisted result is reused.

Checklist
breaking change.
Validation performed