Skip to content

Conversation

@sitaowang1998
Copy link
Collaborator

@sitaowang1998 sitaowang1998 commented May 21, 2025

Description

Add support for client to set a Data as persisted when building the data or through Data::set_persisted at runtime. The Data marked persisted will not be garbage collected.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • GitHub workflows pass.
  • Unit tests pass in dev container.
  • Integration tests pass in dev container.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features
    • Added the ability to mark data as persisted, preventing it from being cleaned up.
    • Users can now set and update the persisted state for data both during creation and afterwards.
  • Bug Fixes
    • Improved test coverage to verify correct handling of the persisted state in data storage and retrieval.
  • Tests
    • Enhanced tests to ensure persisted state is correctly set, updated, and validated during data operations.

@sitaowang1998 sitaowang1998 requested a review from a team as a code owner May 21, 2025 16:53
@coderabbitai
Copy link
Contributor

coderabbitai bot commented May 21, 2025

Walkthrough

A persisted state flag was introduced for data objects, enabling them to be marked as persisted both during construction via the builder and after creation via the data instance. This required updates to the core data class, client-side builder, storage interface, MySQL storage implementation, and related tests to support storing, updating, and verifying the persisted flag.

Changes

File(s) Change Summary
src/spider/client/Data.hpp Added set_persisted() method to Data class and Builder nested class; introduced m_persisted flag in Builder; updated builder to set persisted state on build and update storage accordingly.
src/spider/core/Data.hpp Added private m_persisted boolean member; added set_persisted(bool) setter and [[nodiscard]] auto is_persisted() const getter to Data class.
src/spider/storage/DataStorage.hpp Added pure virtual method set_data_persisted(StorageConnection&, Data const&) to DataStorage interface.
src/spider/storage/mysql/MySqlStorage.hpp, src/spider/storage/mysql/MySqlStorage.cpp Modified insert and select queries to handle persisted field; added set_data_persisted method to update persisted flag in MySQL storage.
tests/storage/test-DataStorage.cpp Modified test to remove const from data, set persisted flag, update storage persisted state, and verify persisted flag after retrieval.
tests/utils/CoreDataUtils.hpp Extended data_equal function to compare is_persisted() states between two core::Data objects.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DataBuilder
    participant Data
    participant DataStore
    participant StorageFactory
    participant StorageConnection

    Client->>DataBuilder: set_persisted()
    DataBuilder->>DataBuilder: m_persisted = true
    Client->>DataBuilder: build()
    DataBuilder->>Data: set_persisted(m_persisted)
    DataBuilder->>DataStore: add_data(Data)
    DataStore->>StorageFactory: get_connection()
    StorageFactory-->>DataStore: StorageConnection
    DataStore->>StorageConnection: store Data (with persisted flag)
Loading
sequenceDiagram
    participant Client
    participant Data
    participant DataStore
    participant StorageFactory
    participant StorageConnection

    Client->>Data: set_persisted()
    Data->>Data: set_persisted(true)
    Data->>DataStore: update_data_persisted(Data)
    DataStore->>StorageFactory: get_connection()
    StorageFactory-->>DataStore: StorageConnection
    DataStore->>StorageConnection: update persisted flag in storage
Loading

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 027c835 and 62f56a7.

📒 Files selected for processing (5)
  • src/spider/client/Data.hpp (4 hunks)
  • src/spider/storage/DataStorage.hpp (1 hunks)
  • src/spider/storage/mysql/MySqlStorage.cpp (5 hunks)
  • src/spider/storage/mysql/MySqlStorage.hpp (1 hunks)
  • tests/storage/test-DataStorage.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/spider/storage/DataStorage.hpp
  • src/spider/storage/mysql/MySqlStorage.hpp
  • tests/storage/test-DataStorage.cpp
  • src/spider/storage/mysql/MySqlStorage.cpp
  • src/spider/client/Data.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non-storage-unit-tests (ubuntu-22.04)
  • GitHub Check: non-storage-unit-tests (ubuntu-24.04)
  • GitHub Check: lint
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/spider/storage/DataStorage.hpp (1)

68-68: Add documentation for the new method.

The new method lacks the documentation comment block that other methods in this interface have.

+    /**
+     * Update the persisted state of a data object in storage.
+     * @param conn
+     * @param data The data object with the desired persisted state
+     * @return StorageErr::Success if the update succeeds. Error types otherwise.
+     */
     virtual auto set_data_persisted(StorageConnection& conn, Data const& data) -> StorageErr = 0;
tests/storage/test-DataStorage.cpp (1)

59-65: Consider testing both true and false states.

While the test confirms the ability to set and retrieve the persisted state, it only tests setting it to false. For more thorough testing, consider also verifying that setting it to true works correctly.

 // Set data persisted should succeed
 data.set_persisted(false);
 REQUIRE(data_storage->set_data_persisted(*conn, data).success());
 // Get data should match
 REQUIRE(data_storage->get_data(*conn, data.get_id(), &result).success());
 REQUIRE(spider::test::data_equal(data, result));
+
+// Set data persisted to true should also succeed
+data.set_persisted(true);
+REQUIRE(data_storage->set_data_persisted(*conn, data).success());
+// Get data should match with persisted=true
+REQUIRE(data_storage->get_data(*conn, data.get_id(), &result).success());
+REQUIRE(spider::test::data_equal(data, result));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a29306 and 041679f.

📒 Files selected for processing (7)
  • src/spider/client/Data.hpp (4 hunks)
  • src/spider/core/Data.hpp (1 hunks)
  • src/spider/storage/DataStorage.hpp (1 hunks)
  • src/spider/storage/mysql/MySqlStorage.cpp (5 hunks)
  • src/spider/storage/mysql/MySqlStorage.hpp (1 hunks)
  • tests/storage/test-DataStorage.cpp (2 hunks)
  • tests/utils/CoreDataUtils.hpp (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/spider/core/Data.hpp (1)
src/spider/client/Data.hpp (1)
  • m_persisted (135-138)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: lint
  • GitHub Check: non-storage-unit-tests (ubuntu-24.04)
  • GitHub Check: non-storage-unit-tests (ubuntu-22.04)
🔇 Additional comments (13)
src/spider/core/Data.hpp (2)

34-37: Clean implementation of persisted state getters and setters.

The implementation of persisted state handling follows the class's existing patterns with proper use of the [[nodiscard]] attribute for the getter.


43-43: Appropriate default initialization.

The m_persisted flag is correctly initialized to false by default, ensuring data isn't considered persisted unless explicitly marked as such.

src/spider/storage/mysql/MySqlStorage.hpp (1)

164-164: Method declaration correctly overrides the interface.

The declaration properly overrides the pure virtual method from the parent interface, maintaining consistency with other storage methods.

tests/storage/test-DataStorage.cpp (1)

43-43: Dropping const qualifier to enable mutation.

The const qualifier has been removed from the data object to enable mutation for testing the new persisted flag functionality.

tests/utils/CoreDataUtils.hpp (1)

23-25: Good addition to handle the new persisted state in equality checks.

The comparison of is_persisted() is properly added to the data_equal function, ensuring that two Data objects are only considered equal if their persisted states match. This change is consistent with the existing pattern of comparing all relevant properties.

src/spider/storage/mysql/MySqlStorage.cpp (4)

1990-1991: SQL statement properly updated to store persisted state.

The addition of the persisted column to the SQL insert statement and corresponding parameter binding ensures that the persistence state is correctly stored in the database when creating new Data objects.

Also applies to: 1998-1998


2040-2041: Good consistency in parallel implementation.

The same changes for the persisted state have been correctly applied to the add_task_data method, maintaining consistency with add_driver_data.

Also applies to: 2048-2048


2088-2088: Data retrieval correctly includes persisted state.

The SQL select statement and subsequent object initialization properly retrieve and set the persisted state when loading Data objects from the database.

Also applies to: 2104-2104


2234-2251: Well-implemented method for updating persisted state.

The new set_data_persisted method follows the established pattern for database operations in this class, with proper:

  • SQL statement preparation
  • Parameter binding
  • Exception handling
  • Transaction management (commit/rollback)

This implementation enables updating the persisted state of existing Data objects in the database.

src/spider/client/Data.hpp (4)

82-100: Good implementation of post-creation checkpointing.

The set_checkpointed() method provides a clear API for marking data as persisted after creation. The implementation:

  • Sets the persisted flag on the core data object
  • Properly manages database connections (reusing existing or creating new)
  • Follows the established error handling pattern
  • Has clear documentation with appropriate exception specification

This method enables clients to checkpoint existing data objects, preventing them from being cleaned up.


130-138: Builder API properly extended for checkpointing during construction.

The set_checkpointed() method in the Builder class follows the fluent interface pattern used throughout the class. It has good documentation and properly maintains the internal persistence state.


210-210: Member variable correctly initialized.

The m_persisted member variable is properly initialized to false by default, ensuring that data is not persisted unless explicitly requested.


153-153: Build method correctly applies persisted state.

The addition to the build() method ensures that the persisted state from the Builder is correctly transferred to the core::Data object during construction.

@sitaowang1998 sitaowang1998 requested a review from davidlion May 21, 2025 17:46
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/spider/storage/mysql/MySqlStorage.cpp (1)

2234-2251: Consider adding validation for robustness.

The implementation correctly follows the established pattern for database updates with proper transaction management. However, consider these improvements:

  1. Validate that the data record exists before attempting the update
  2. Check the update count to ensure the operation affected a record
  3. Consider optimizing by checking if the persisted state is actually changing
 auto MySqlDataStorage::set_data_persisted(StorageConnection& conn, Data const& data) -> StorageErr {
     try {
         sql::bytes id_bytes = uuid_get_bytes(data.get_id());
         std::unique_ptr<sql::PreparedStatement> statement(
                 static_cast<MySqlConnection&>(conn)->prepareStatement(
                         "UPDATE `data` SET `persisted` = ? WHERE `id` = ?"
                 )
         );
         statement->setBoolean(1, data.is_persisted());
         statement->setBytes(2, &id_bytes);
-        statement->executeUpdate();
+        int32_t const update_count = statement->executeUpdate();
+        if (update_count == 0) {
+            static_cast<MySqlConnection&>(conn)->rollback();
+            return StorageErr{StorageErrType::KeyNotFoundErr, "Data record not found"};
+        }
     } catch (sql::SQLException& e) {
         static_cast<MySqlConnection&>(conn)->rollback();
         return StorageErr{StorageErrType::OtherErr, e.what()};
     }
     static_cast<MySqlConnection&>(conn)->commit();
     return StorageErr{};
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9bcb0e and 027c835.

📒 Files selected for processing (6)
  • src/spider/client/Data.hpp (4 hunks)
  • src/spider/storage/DataStorage.hpp (1 hunks)
  • src/spider/storage/mysql/MySqlStorage.cpp (5 hunks)
  • src/spider/storage/mysql/MySqlStorage.hpp (1 hunks)
  • tests/storage/test-DataStorage.cpp (2 hunks)
  • tests/utils/CoreDataUtils.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/utils/CoreDataUtils.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: lint
  • GitHub Check: non-storage-unit-tests (ubuntu-24.04)
  • GitHub Check: non-storage-unit-tests (ubuntu-22.04)
🔇 Additional comments (9)
src/spider/storage/mysql/MySqlStorage.hpp (1)

164-164: LGTM! Method declaration follows established patterns.

The new set_data_persisted method declaration correctly overrides the pure virtual method from the DataStorage interface. The method signature is consistent with other storage methods in the class, and the placement maintains logical grouping with other data manipulation methods.

tests/storage/test-DataStorage.cpp (2)

43-43: Correct removal of const qualifier to enable mutation.

Removing the const qualifier from the data variable is necessary to allow calling data.set_persisted(true) in the subsequent test code. This change enables testing the persistence functionality while maintaining the existing test logic.


59-65: Comprehensive test coverage for data persistence functionality.

The new test sequence thoroughly validates the persistence feature:

  • Sets the data as persisted using data.set_persisted(true)
  • Calls the storage method to persist the flag
  • Retrieves and verifies the data matches the updated state

This follows the established testing patterns and ensures the new persistence functionality works correctly with the storage layer.

src/spider/storage/DataStorage.hpp (1)

68-68: Well-designed interface extension for persistence functionality.

The new set_data_persisted pure virtual method appropriately extends the DataStorage interface. The method signature is consistent with existing interface methods, and the placement maintains logical grouping with other data manipulation operations. This provides a clean abstraction for setting data persistence across different storage implementations.

src/spider/storage/mysql/MySqlStorage.cpp (2)

1990-1991: LGTM: SQL statements properly updated for persisted column support.

The INSERT statements correctly include the new persisted column and use secure parameter binding. The implementation is consistent with existing patterns in the codebase.

Also applies to: 2040-2041, 1998-1998, 2048-2048


2088-2088: LGTM: SELECT statements correctly updated to retrieve persisted state.

The queries properly include the persisted column and the boolean value is correctly retrieved and set on the data object.

Also applies to: 2104-2104

src/spider/client/Data.hpp (3)

82-100: LGTM: Well-implemented checkpointing functionality.

The set_checkpointed() method correctly follows the established pattern from set_locality(), with proper error handling and support for both existing and new database connections. The documentation is clear and the exception handling is appropriate.


130-138: LGTM: Clean builder pattern implementation.

The builder's set_checkpointed() method is well-implemented, following the established builder pattern and providing clear documentation.


153-153: LGTM: Proper integration of persisted state in builder.

The addition of the m_persisted member variable with appropriate default value and its usage in the build() method correctly integrates the checkpointing functionality into the builder pattern.

Also applies to: 210-210

Copy link
Member

@davidlion davidlion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add a real PR description.
  2. Is there a reason we use both "checkpointed" and "persisted" as terms for this? Are there plans for different types of "persistent" data? It would be less complex to stick to a single term.

@sitaowang1998
Copy link
Collaborator Author

2. Is there a reason we use both "checkpointed" and "persisted" as terms for this? Are there plans for different types of "persistent" data? It would be less complex to stick to a single term.

For theoretical part, especially for the paper, all the task inputs/outputs that are plain value are considered persisted in the storage. However, in practice, these two terms are used interchangeably.

@davidlion
Copy link
Member

2. Is there a reason we use both "checkpointed" and "persisted" as terms for this? Are there plans for different types of "persistent" data? It would be less complex to stick to a single term.

For theoretical part, especially for the paper, all the task inputs/outputs that are plain value are considered persisted in the storage. However, in practice, these two terms are used interchangeably.

I see. As discussed offline let's only use one for the code.

@sitaowang1998 sitaowang1998 changed the title feat: Add data checkpoint. feat: Add data persistence. May 29, 2025
@sitaowang1998 sitaowang1998 requested a review from davidlion May 29, 2025 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants