@dantengsky dantengsky commented Sep 29, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

🤖 Generated with Claude Code

This PR implements a data integrity guard that blocks undrop operations on tables whose data may have been partially or fully cleaned up by vacuum. The core principle is: once a vacuum run has started with a given retention cutoff, tables dropped at or before that cutoff can never be restored, so users cannot accidentally restore tables with incomplete or inconsistent data.

Problem Statement

Previously, there was a dangerous race condition where:

  1. User drops a table
  2. Vacuum process starts cleaning up the table's data
  3. User attempts undrop while vacuum is in progress
  4. Undrop succeeds but table data is incomplete/corrupted

This could lead to silent data corruption and inconsistent database state.

Design Philosophy

Tenant-Level Global Watermark Design

We chose a tenant-level global vacuum watermark approach instead of per-table or per-database granularity for several key reasons:

  1. Simplicity & Clarity: A single timestamp per tenant is easier to reason about, implement, and maintain
  2. Sufficient for Use Case: Vacuum operations are typically tenant-wide administrative tasks, making global coordination natural
  3. Consistent Semantics: All undrop operations within a tenant follow the same rules, reducing cognitive overhead
  4. Performance: Single KV read/write per tenant vs. potentially thousands for per-table tracking
  5. Smooth Evolution Path: The current design can be extended to table/DB-level granularity without breaking changes

Implementation Approach

Core Mechanism: Monotonic Timestamp Protection

VacuumWatermark {
    time: DateTime<Utc>, // Monotonically increasing, never decreases
}
  • Vacuum Phase: Sets the watermark to the retention cutoff when vacuum starts (watermark = now() - retention_days)
  • Undrop Phase: Compares table's drop_time against vacuum timestamp
  • Safety Rule: drop_time <= vacuum_timestamp → undrop FORBIDDEN
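
The safety rule above can be sketched as a small predicate. This is a minimal illustration, not the PR's code: timestamps are plain Unix seconds (`i64`) instead of `DateTime<Utc>` so the snippet needs only the standard library, and `undrop_allowed` is a hypothetical name.

```rust
/// Simplified stand-in for the PR's `VacuumWatermark`. The real struct holds
/// a `DateTime<Utc>`; here `time` is Unix seconds (an assumption for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VacuumWatermark {
    pub time: i64,
}

/// Hypothetical helper: returns true when undrop may proceed.
/// `None` means vacuum has never set a watermark for this tenant.
pub fn undrop_allowed(drop_time: i64, watermark: Option<VacuumWatermark>) -> bool {
    match watermark {
        // No vacuum has run yet, so nothing can have been cleaned.
        None => true,
        // drop_time <= watermark is forbidden; only strictly later drops pass.
        Some(w) => drop_time > w.time,
    }
}
```

Modelling the unset state as `None` matches the later commit that replaces the epoch default with `Option<VacuumWatermark>`.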

Atomic Operations & Concurrent Safety

Monotonic Timestamp Updates: Uses crud_upsert_with with CAS semantics to ensure vacuum watermark only advances forward, preventing timestamp rollback.
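
A rough sketch of the monotonic update follows. It does not use the real `crud_upsert_with` API; `WatermarkSlot`, `cas`, and `fetch_set` are hypothetical names for an in-memory stand-in, and timestamps are Unix seconds. It only illustrates the "advance forward or leave unchanged" behaviour.

```rust
/// In-memory stand-in for the per-tenant watermark entry (assumption: the
/// real value lives in the meta service and carries a sequence number).
#[derive(Debug, Default)]
pub struct WatermarkSlot {
    seq: u64,          // bumped on every successful write
    time: Option<i64>, // Unix seconds; None = never set
}

impl WatermarkSlot {
    /// Compare-and-swap: write `new_time` only if `seq` is unchanged.
    fn cas(&mut self, expected_seq: u64, new_time: i64) -> bool {
        if self.seq != expected_seq {
            return false;
        }
        self.time = Some(new_time);
        self.seq += 1;
        true
    }

    /// Hypothetical monotonic fetch-set: returns the previous value and
    /// advances the watermark only when `candidate` is strictly later.
    pub fn fetch_set(&mut self, candidate: i64) -> Option<i64> {
        loop {
            let (seq, prev) = (self.seq, self.time);
            match prev {
                // Never move backwards: keep the existing, later watermark.
                Some(t) if t >= candidate => return prev,
                _ => {
                    // With `&mut self` the CAS cannot lose a race; the retry
                    // loop only mirrors what a remote store would require.
                    if self.cas(seq, candidate) {
                        return prev;
                    }
                }
            }
        }
    }

    /// Current watermark, if any.
    pub fn current(&self) -> Option<i64> {
        self.time
    }
}
```

For instance, calling `fetch_set(100)` on a fresh slot and then `fetch_set(50)` leaves the watermark at 100, because the second candidate is earlier.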

Safety & Behavior

Protection Matrix (Sample Scenarios)

| Scenario | Vacuum Watermark | Table Drop Time | Undrop Result | Reason |
|---|---|---|---|---|
| Pre-vacuum State | None (never set) | Any time | ALLOWED | No vacuum has run yet - data guaranteed safe |
| Post-vacuum Risk | Set to 2023-12-01 (example) | Dropped 2023-11-20 (example) | BLOCKED | Drop predates vacuum - data may be cleaned |
| Post-vacuum Safe | Set to 2023-12-01 (example) | Dropped 2023-12-05 (example) | ALLOWED | Drop postdates vacuum - data guaranteed intact |

Safety Guarantees

Tables whose data may have been cleaned by vacuum processes cannot be restored via undrop, preventing restoration of incomplete or corrupted data.

Technical Implementation

Data Structures

// Rust structure
pub struct VacuumWatermark {
    pub time: DateTime<Utc>,
}

// Protobuf serialization (v152)
message VacuumWatermark {
  uint64 ver = 100;
  uint64 min_reader_ver = 101;
  string time = 1;         // Timestamp string
}

MetaStore Storage

Key Format:   __fd_vacuum_watermark_ts/{tenant_name}
Key Example:  __fd_vacuum_watermark_ts/default
Value Type:   VacuumWatermark (protobuf serialized)
Scope:        Global per tenant (one watermark per tenant)
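
To illustrate the key layout, here is a small sketch that renders the per-tenant key. The PR itself goes through `VacuumWatermarkIdent`; the constant and function names below are hypothetical, and any escaping of tenant names is ignored.

```rust
/// Key prefix quoted in the PR description.
pub const VACUUM_WATERMARK_PREFIX: &str = "__fd_vacuum_watermark_ts";

/// Hypothetical helper that renders the per-tenant key as
/// `__fd_vacuum_watermark_ts/{tenant_name}`.
pub fn vacuum_watermark_key(tenant: &str) -> String {
    format!("{}/{}", VACUUM_WATERMARK_PREFIX, tenant)
}
```

Because the tenant name is the only variable part of the key, each tenant has exactly one watermark entry.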

Integration Points

  1. Vacuum Trigger: VacuumDropTablesInterpreter::execute2() sets watermark before cleanup
  2. Protection Check: handle_undrop_table() validates drop_time vs watermark

Critical Flow

1. VACUUM DROP TABLE → Set watermark (fail-safe: abort if fails)
2. Data cleanup proceeds only after watermark is established
3. UNDROP TABLE → Check drop_time <= watermark → REJECT if true
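
The ordering in steps 1 and 2 can be sketched as follows. This is an assumption-laden illustration rather than the body of `VacuumDropTablesInterpreter::execute2()`: `vacuum_drop_tables` and its closure parameters are hypothetical, and errors are plain strings.

```rust
/// Hypothetical sketch of the vacuum ordering: the watermark write must
/// succeed before any cleanup runs.
pub fn vacuum_drop_tables<S, C>(set_watermark: S, cleanup: C) -> Result<usize, String>
where
    S: FnOnce() -> Result<(), String>,
    C: FnOnce() -> usize,
{
    // Fail-safe: abort the whole vacuum if the watermark cannot be set,
    // so no data is removed without undrop protection in place.
    set_watermark().map_err(|e| format!("vacuum aborted, watermark not set: {}", e))?;

    // Cleanup is reached only after the watermark is established.
    Ok(cleanup())
}
```

If `set_watermark` returns an error, `cleanup` is never invoked, which is the fail-safe behaviour named in step 1.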

Concurrent Safety Example

Race condition protection during undrop:

1. `Undrop` operation reads watermark with seq=N
2. Concurrent `vacuum` updates watermark (seq=N+1)
3. KV transaction submitted by `Undrop` operation fails due to seq mismatch → Safe abort
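
The seq check in this example can be sketched with an in-memory entry. `SeqEntry` and `commit_undrop` are hypothetical names, and the real implementation expresses the check as a condition on the KV transaction rather than as a function call.

```rust
/// In-memory stand-in for a KV entry with a sequence number (assumption:
/// the meta service bumps `seq` on every write to a key).
#[derive(Debug, Default)]
pub struct SeqEntry {
    pub seq: u64,
    pub watermark: Option<i64>, // Unix seconds
}

impl SeqEntry {
    /// Simulates a concurrent vacuum updating the watermark.
    pub fn write(&mut self, time: i64) {
        self.watermark = Some(time);
        self.seq += 1;
    }
}

/// Hypothetical undrop commit: succeeds only if the watermark's seq still
/// equals the seq observed when the guard check was made.
pub fn commit_undrop(entry: &SeqEntry, seen_seq: u64) -> Result<(), String> {
    if entry.seq != seen_seq {
        return Err(format!(
            "watermark changed (seq {} -> {}), undrop aborted",
            seen_seq, entry.seq
        ));
    }
    Ok(())
}
```

An undrop that read seq=N and then sees seq=N+1 at commit time is rejected; the caller can retry and re-evaluate the guard against the new watermark.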

Timeline Example

Timeline (Scenario: data_retention_time_in_days = 30):

Oct-15        Nov-01       Nov-20       Dec-01       Jan-05
│             │            │            │            │
TableA        │            TableB       VACUUM       UNDROP
Dropped       │            Dropped     EXECUTION    Requests
│             │                         (sets        │
│             │                         watermark    │
│             │                         = Nov-01)    │
│             │                                      │
│             │                                      └─ TableA: ❌ BLOCKED
│             │                                         (Oct-15 ≤ Nov-01)
│             │
│             │                                      └─ TableB: ✅ ALLOWED
│             │                                         (Nov-20 > Nov-01)
│             │
│             └─ Watermark boundary
│                (retention cutoff)
│
└─ TableA dropped before watermark
   (data potentially cleaned)

Note: Watermark = vacuum_execution_time - data_retention_time_in_days
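
As a worked instance of the note above, the sketch below computes the cutoff in Unix seconds. `watermark_for` is a hypothetical helper, and the year 2023 in the usage note is chosen arbitrarily since the timeline gives no year.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Hypothetical helper: retention cutoff in Unix seconds, i.e.
/// vacuum_execution_time - data_retention_time_in_days.
pub fn watermark_for(vacuum_execution_time: i64, data_retention_time_in_days: i64) -> i64 {
    vacuum_execution_time - data_retention_time_in_days * SECS_PER_DAY
}
```

With a vacuum executed at 2023-12-01 00:00:00 UTC (1701388800) and a 30-day retention, the result is 1698796800, which is 2023-11-01 00:00:00 UTC, matching the Nov-01 boundary in the timeline.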

Test Coverage

  • Unit Tests: Core API behavior and monotonic property validation
  • Integration Tests: End-to-end vacuum-undrop workflows
  • Concurrency Tests: Race condition handling validation
  • Compatibility Tests: Protobuf serialization/deserialization (v152)
  • Error Handling: Failure mode validation and fail-safe behavior

Files Modified

Core Implementation

  • src/meta/api/src/garbage_collection_api.rs - Vacuum watermark timestamp management
  • src/meta/api/src/schema_api.rs - Undrop protection logic with concurrent safety
  • src/query/service/src/interpreters/interpreter_vacuum_drop_tables.rs - Integration point for vacuum operations

Data Model & Serialization

  • src/meta/app/src/schema/vacuum_watermark.rs - Core VacuumWatermark structure
  • src/meta/app/src/schema/vacuum_watermark_ident.rs - Storage identifier
  • src/meta/proto-conv/src/vacuum_watermark_from_to_protobuf_impl.rs - Protobuf conversion
  • src/meta/protos/proto/vacuum_watermark.proto - Protobuf definition

Error Handling

  • src/meta/app/src/app_error.rs - UndropTableRetentionGuard error handling
  • src/common/exception/src/exception_code.rs - New error code for vacuum protection

Tests

  • src/meta/api/src/schema_api_test_suite.rs - Comprehensive test coverage
  • src/meta/proto-conv/tests/it/v152_vacuum_retention.rs - Backward compatibility tests

Migration Safety

  • No Breaking Changes: Existing functionality preserved when no vacuum watermark exists
  • Backward Compatible: Protobuf v152 maintains compatibility with existing deployments
  • Graceful Migration: Systems without watermarks continue to work normally
  • Safe Rollback: Can be disabled without data loss or corruption

Performance Impact

  • Minimal Overhead: Single KV read/write per tenant during vacuum operations
  • Efficient Storage: Compact protobuf representation for watermark timestamps
  • Fast Validation: Simple timestamp comparison for undrop protection
  • No Query Impact: Zero performance impact on normal table operations

This implementation provides robust data integrity protection while maintaining performance and operational simplicity. The tenant-level design offers a balanced approach between safety, simplicity, and future extensibility.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

Add retention guard mechanism to prevent undrop operations after vacuum
has started, ensuring data consistency by blocking restoration of tables
whose data may have been partially or fully cleaned up.

Key changes:
- Add VacuumRetention metadata with monotonic timestamp semantics
- Implement fetch_set_vacuum_timestamp API with correct CAS behavior
- Integrate retention checks in vacuum drop table workflow
- Add retention guard validation in undrop table operations
- Include comprehensive error handling and user-friendly messages
- Add protobuf serialization support with v151 compatibility
- Provide full integration test coverage

Fixes data integrity issue where undrop could succeed on tables with
incomplete S3 data after vacuum cleanup has begun.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Sep 29, 2025
dantengsky and others added 28 commits September 30, 2025 00:18
Make error message more user-friendly by:
- Using clear language about why undrop is blocked
- Including vacuum start timestamp for better context
- Removing technical jargon like 'retention guard' and 'precedes'

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Rename VacuumRetention to VacuumWatermark for clarity
- Remove unnecessary fields: updated_by, updated_at, version
- Keep only essential 'time' field for monotonic timestamp tracking
- Update protobuf conversion and tests accordingly
- Maintain API compatibility and retention guard functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Move vacuum timestamp setting from gc_drop_tables to VacuumDropTablesInterpreter::execute2
- Use actual retention settings instead of hardcoded 7 days
- Set timestamp before vacuum operation starts for better timing
- Simplify gc_drop_tables to focus only on metadata cleanup
- Improve separation of concerns between business logic and cleanup operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Only set vacuum timestamp when NOT in dry run mode
- Maintains consistency with existing dry run behavior for metadata cleanup
- Dry run should not modify any state including vacuum watermark
- Preserves read-only nature of dry run operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Returns Option<VacuumWatermark> instead
…alues

- Replace unwrap_or_default() with explicit Option handling in fetch_set_vacuum_timestamp
- Use clear match semantics: None = never set, Some = previous value
- Update get_vacuum_timestamp to return Option<VacuumWatermark> instead of using epoch default
- Fix schema_api retention guard to only check when vacuum timestamp is actually set
- Update tests to handle proper Option semantics for first-time vs subsequent calls
- Remove dependency on artificial epoch time as "unset" indicator
- Improve type safety by letting Option express the "possibly unset" state

This eliminates confusion around artificial default values and makes the
vacuum watermark semantics clearer through the type system.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Adds vacuum timestamp seq checking to undrop transaction conditions to prevent
race conditions where vacuum and undrop operations could execute concurrently,
leading to data inconsistency.

- Read vacuum timestamp with seq before undrop transaction
- Add vacuum timestamp seq check to transaction conditions
- Ensures undrop fails atomically if vacuum timestamp changes during operation
- Existing test coverage in vacuum_retention_timestamp validates concurrent scenario

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The vacuum_retention_timestamp test was failing because it tried to create
a table without first creating the database. Added util.create_db() call
to fix the test execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Fixed timing assertion failure by:
- Capturing drop_time after the actual drop operation completes
- Using relative time comparison instead of exact equality
- Ensuring vacuum_time is always after drop_time as expected

The test now properly validates that undrop is blocked when vacuum timestamp
is set after the drop time, without timing precision issues.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Ran `cargo fmt --all` to ensure consistent code formatting across
vacuum retention implementation files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Updated test_decode_v152_vacuum_retention to match the standard proto-conv test format:
- Added complete copyright header
- Added serialized bytes array for backward compatibility testing
- Added test_load_old call to test deserialization from v152 format
- Added proper documentation comments about byte array immutability
- Used correct serialized bytes generated by test framework

Follows the same pattern as other version tests like v150_role_comment.rs
to ensure proper backward compatibility testing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The get_vacuum_timestamp method was only used in tests and had no other
consumers. Removed it to follow YAGNI principle and simplify the API:

- Removed get_vacuum_timestamp from GarbageCollectionApi trait
- Updated test to use direct KV API call (kv_api.get_pb)
- Added VacuumRetentionIdent import to test file

This aligns with the existing pattern in undrop logic which also uses
direct KV API access for reading vacuum timestamps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Make vacuum timestamp setting a blocking operation that must succeed
before any data cleanup begins. This prevents a critical race condition
where data could be cleaned up without proper undrop protection.

Changes:
- Convert timestamp setting from best-effort to critical operation
- Vacuum operation now fails fast if timestamp cannot be set
- Added detailed error message explaining the safety abort
- Ensures vacuum watermark always precedes any data cleanup

This maintains the core safety guarantee: tables that may have incomplete
data after vacuum can never be restored via undrop.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Improve consistency with existing codebase naming conventions:

1. Rename protobuf message from VacuumRetention to VacuumWatermark
   - Aligns with Rust struct name (VacuumWatermark)
   - Follows existing pattern: Rust struct name = Protobuf message name
   - Examples: DatabaseMeta, CatalogMeta, IndexMeta all use same names

2. Remove unused protobuf fields
   - Removed: updated_by, updated_at, version (all set to empty/0)
   - Kept: ver, min_reader_ver (required for versioning), time (core data)
   - Maintains same serialization bytes (backward compatible)

3. Simplify proto-conv implementation
   - Removed unused field mappings in to_pb()
   - Cleaner, more maintainable conversion code

The protobuf now only contains the essential fields actually used,
following the principle of minimal necessary data representation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Rename protobuf definition file to match the message name inside it.
This improves consistency and makes the codebase easier to navigate:

- File: vacuum_retention.proto → vacuum_watermark.proto
- Message: VacuumWatermark (unchanged)
- Rust struct: VacuumWatermark (unchanged)

Following the common pattern where protobuf files are named after
their primary message type.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Replace extreme epoch timestamp (1970-01-01) with realistic test value:
- Old: timestamp(0, 0) → 1970-01-01 00:00:00 UTC (too extreme)
- New: timestamp(1702603569, 0) → 2023-12-15 01:26:09 UTC (realistic)

This aligns with other protobuf tests in the codebase that use similar
recent timestamps for better test readability and maintainability.
Updated corresponding serialized bytes array to match new timestamp.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Adds missing ErrorCode import to resolve compilation error in
VacuumDropTablesInterpreter after refactoring vacuum watermark
error handling.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…rity

- Renamed vacuum_retention_ident.rs to vacuum_watermark_ident.rs
- Updated VacuumRetentionIdent type alias to VacuumWatermarkIdent
- Updated VacuumRetentionRsc to VacuumWatermarkRsc
- Updated all references in garbage_collection_api.rs and schema_api.rs
- Updated test files and module exports
- Verified compilation and test execution

The new name better reflects the actual purpose: storing vacuum
watermark timestamps rather than general retention policies.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…duplicate test

- Enhanced error message to include both drop time and vacuum start time
- Improved wording: "vacuum cleanup started" → "vacuum started"
- Added explanatory text: "Data may have been cleaned up"
- Formatted timestamps in human-readable format: "YYYY-MM-DD HH:MM:SS UTC"
- Removed duplicate "Test concurrent vacuum-undrop safety" test case

The improved error message provides users with clear temporal context:
- When the table was dropped
- When vacuum started
- Why undrop is not possible

Example: "Cannot undrop table 'test': table was dropped at 2023-11-15 10:30:45 UTC
before vacuum started at 2023-12-01 01:26:09 UTC. Data may have been cleaned up."

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…tion

- Consolidated error message into single #[error(...)] annotation
- Removed duplicate AppErrorMessage::message() implementation
- Enhanced #[error(...)] to include complete time information and explanation
- Use default AppErrorMessage implementation (calls to_string())

This eliminates code duplication while maintaining full error context.
Both Display and AppErrorMessage now use the same comprehensive message:
"Cannot undrop table 'name': table was dropped at <time> before vacuum
started at <time>. Data may have been cleaned up."

The custom AppErrorMessage implementation was unnecessary since there's
no sensitive information to strip for this error type.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…atermark

- Removed Default implementation that used Unix epoch (1970-01-01)
  - No business meaning for vacuum watermark default value
  - Conflicts with Option<VacuumWatermark> design (None = unset)
  - Not used anywhere in codebase
- Fixed grammar in doc comment: "marking" → "marker"
  - "marker indicating when" is grammatically correct
  - More precise technical terminology

VacuumWatermark should only be created with explicit, meaningful timestamps.
The Option<VacuumWatermark> pattern properly handles the "unset" state.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Changed PREFIX from "__fd_vacuum_retention_ts" to "__fd_vacuum_watermark_ts"
- Updated test expectation to match new key format
- Ensures naming consistency throughout the codebase

The old PREFIX used "retention" terminology from early design iterations.
Since MetaStore contains no existing data with the old prefix, this is
a safe non-breaking change that aligns storage keys with current naming.

New key format: "__fd_vacuum_watermark_ts/{tenant}"

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Renamed vacuum_retention.rs → vacuum_watermark.rs
- Renamed vacuum_retention_from_to_protobuf_impl.rs → vacuum_watermark_from_to_protobuf_impl.rs
- Updated all module references and imports
- Fixed module documentation comments

This completes the naming transition from "retention" to "watermark" terminology:
- All source files now use consistent "watermark" naming
- Module structure reflects the actual purpose (watermark timestamps)
- Eliminates historical naming confusion from early design iterations

File structure now:
- vacuum_watermark.rs (core struct)
- vacuum_watermark_ident.rs (identifier)
- vacuum_watermark_from_to_protobuf_impl.rs (serialization)
- vacuum_watermark.proto (protobuf definition)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Replace Unix timestamp with ISO 8601 string parsing
- More readable: "2023-12-15T01:26:09Z".parse() vs from_timestamp(1702603569, 0)
- Self-documenting: timestamp value is immediately clear without calculation
- Better for code review: no need to decode Unix timestamp mentally

The test functionality remains identical, but the code is much more
maintainable and reviewer-friendly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@dantengsky dantengsky marked this pull request as ready for review October 10, 2025 06:36
@dantengsky dantengsky requested a review from SkyFan2002 October 10, 2025 06:36