[wip] Invalidate the DMatrix cache after modification.#11885
[wip] Invalidate the DMatrix cache after modification.#11885trivialfis wants to merge 2 commits intodmlc:masterfrom
Conversation
|
How about deleting caches instead of erroring out on modifications? |
|
Yes, we can do that. But I would like to see the impact of making the DMatrix immutable first. I will update the PR to work with the cache. |
a808277 to
410cf1d
Compare
There was a problem hiding this comment.
Pull request overview
This PR implements a cache invalidation mechanism for DMatrix objects after their metadata has been modified. It introduces a has_been_read_ flag in the MetaInfo class to track whether data has been read via GetBatches() methods, and resets this flag when data is modified via SetInfo(). The cache eviction logic is updated to invalidate entries for DMatrix objects that haven't been marked as read.
- Adds tracking mechanism with
has_been_read_flag in MetaInfo class - Updates all GetBatches() template specializations to mark data as read
- Modifies cache invalidation logic to consider the read flag
- Includes minor documentation updates and spelling corrections
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| include/xgboost/data.h | Adds has_been_read_ flag, accessor methods, and read flag setting in all GetBatches() specializations; updates documentation style |
| src/data/data.cc | Adds SetReadFlag(false) call in MetaInfo::SetInfo() to invalidate cache when metadata is modified |
| include/xgboost/cache.h | Updates cache eviction logic to check the read flag and invalidate unread entries; fixes spelling of "repeatedly" |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * \param key The key of the information. | ||
| * \param interface_str String representation of json format array interface. | ||
| /** | ||
| * @brief Set data in the meta info with array interface. |
There was a problem hiding this comment.
The documentation change from "information" to "data" may be inaccurate. The SetInfo method accepts various types of metadata (labels, weights, base_margin, etc.) through the array interface, not just "data". The term "information" more accurately describes what this method sets. Consider reverting this documentation change or clarifying that "data" refers to metadata fields.
| CHECK(it != container_.cend()); | ||
| if (it->second.ref.expired()) { | ||
| // Re-new the cache if this has never been read. | ||
| if (it->second.ref.expired() || !it->second.ref.lock()->Info().HasBeenRead()) { |
There was a problem hiding this comment.
The cache invalidation logic may be incorrect. The condition checks if the DMatrix has NOT been read (!HasBeenRead()), which seems backwards. According to the comment "Re-new the cache if this has never been read", the logic appears to invalidate caches that haven't been read yet. However, the intent should be to invalidate caches for DMatrix objects that have been modified (where has_been_read_ was reset to false after being true). Consider whether the flag should be renamed to something like "is_modified_" or "needs_cache_refresh_" to better reflect its purpose, or verify that the invalidation condition is correct.
| } // namespace | ||
|
|
||
| void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) { | ||
| this->SetReadFlag(false); |
There was a problem hiding this comment.
The SetReadFlag(false) call should be placed after the metadata modification is complete, not before. If the modification fails (e.g., Json::Load throws an exception), the flag will be incorrectly set to false even though no actual modification occurred. Consider moving this call to the end of the function, after all validation and modification operations have completed successfully.
| } // namespace | ||
|
|
||
| void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) { | ||
| this->SetReadFlag(false); |
There was a problem hiding this comment.
The has_been_read_ flag should be reset when MetaInfo is modified through methods other than SetInfo(). Consider adding SetReadFlag(false) calls to other modification methods such as SetFeatureInfo(), Extend(), Clear(), LoadBinary(), and Cats(std::shared_ptr). Without this, cache invalidation will not work correctly when data is modified through these alternative paths.
| /*! \brief argsort of labels */ | ||
| mutable std::vector<size_t> label_order_cache_; | ||
| bool has_categorical_{false}; | ||
| bool has_been_read_{false}; |
There was a problem hiding this comment.
The has_been_read_ member variable is not thread-safe. The flag is being read and written from multiple locations without synchronization, which could lead to data races in multi-threaded scenarios. Consider either making this field std::atomic or adding proper synchronization mechanisms, especially since the cache operations in cache.h are already protected by a mutex.
|
This is not going to be an easy change. Internally, the DMatrix doesn't have a concept of when the construction is "completed" and hence there's no easy way to check whether a modification has been made. |
Uh oh!
There was an error while loading. Please reload this page.