
Conversation

@20001020ycx 20001020ycx commented Nov 27, 2025

Description

This PR implements support for metadata projection.

Currently, projection results for a user SQL query are returned directly from the splits, either decompressed from SFA or read from IR stream files. However, when a projection field references a metadata column, the query is rejected because such a column does not exist in the original dataset.

For example, consider the SQL query below:

SELECT split_path, message FROM table LIMIT 1

where:

  • split_path is a column from the metadata database that records the file path to the split (not present in the log file).
  • message is a column from the original dataset.

This query currently fails because split_path cannot be retrieved from the split data.

This PR allows query results to include metadata values appended alongside dataset columns. The expected result of the above SQL query is shown below:

split_path              message
/path/to/split/file.ir  "Hello world"

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

We manually submitted the queries below to test the end-to-end behaviour of the implemented code:

  • All unit tests (TestClpTopN currently fails; this is a known issue tracked here)
  • Basic Functionality
    • Data-column-only projection: SELECT timestamp FROM clp.DEFAULT.cockroachdb_ir LIMIT 100;
    • Metadata and data column projection with different types: SELECT timestamp, tpath, num_messages FROM clp.DEFAULT.cockroachdb_ir LIMIT 100; where tpath is a VARCHAR metadata column and num_messages is a BIGINT metadata column
    • Metadata column projection with name mapping: SELECT host_name FROM clp.DEFAULT.cockroachdb_ir LIMIT 100; where host_name is the exposed name of the VARCHAR-typed metadata column hostname
    • Metadata column projection with range mapping: SELECT ts FROM clp.DEFAULT.cockroachdb_ir LIMIT 100; where ts is a range-mapped metadata column that maps to creationtime and lastmodifiedtime
  • Metadata projection integration with other features:
    • With split pruning: SELECT zone FROM clp.DEFAULT.cockroachdb_ir WHERE host_name = 'host4' LIMIT 100; and SELECT tpath, timestamp FROM clp.DEFAULT.cockroachdb_ir WHERE num_messages > 100 LIMIT 100;
    • With pushdown: SELECT tpath, timestamp FROM clp.DEFAULT.cockroachdb_ir WHERE severity = 'INFO' LIMIT 100;
    • With UDF: SELECT tpath, CLP_GET_JSON_STRING() AS event FROM clp.DEFAULT.cockroachdb_ir WHERE severity = 'INFO' LIMIT 100;
  • Error handling:
    • Non-existing metadata column: SELECT non_exist_metdata_column FROM table
    • Incorrect type metadata column: SELECT wrong_type_metadata_column FROM table

Summary by CodeRabbit

Release Notes

  • New Features

    • Added metadata projection support for enhanced query processing
    • Extended error handling for unsupported metadata operations
  • Improvements

    • Optimized table scanning with metadata-aware query planning


coderabbitai bot commented Nov 27, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces metadata projection support to the CLP Presto plugin. It adds MessagePack serialization for metadata values in split objects, extends the table layout handle to track metadata column names, implements metadata-aware table scan rewriting in the query optimizer, and enhances split providers to extract and serialize metadata columns from multiple data sources.

Changes

  • Build & Dependencies (presto-clp/pom.xml): Added msgpack-core (0.9.8) and mockito-core (4.6.1, test scope) dependencies.
  • Error Codes (presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java): Added two new error code enum constants: CLP_UNSUPPORTED_METADATA_PROJECTION and CLP_SPLIT_METADATA_TYPE_MISMATCH_METADATA_DATABASE_TYPE.
  • Core Data Structures (presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java, ClpTableLayoutHandle.java): ClpSplit: added Base64/MessagePack-encoded metadata projection values with constructor and getter (a minimal encoding sketch follows this list); ClpTableLayoutHandle: added optional splitMetadataColumnNames field with getter, equals, hashCode, and toString updates.
  • Query Optimization (presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java): Implemented visitTableScan to compute and attach metadata projections to table scans, with validation against range-bound metadata columns and error signalling; updated TopN pushdown to accommodate the new layout signature.
  • Split Metadata Configuration (presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java): Added getMetadataColumnsWithRangeBounds() method to identify metadata columns used as range bounds.
  • Split Providers (presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java, ClpPinotSplitProvider.java, ClpUberPinotSplitProvider.java): Updated split creation to include optional metadata column maps; ClpPinotSplitProvider: added getTypedValue() and extractMetadataColumns() methods with type coercion and metadata row mapping; ClpUberPinotSplitProvider: refactored to fetch and map metadata from split queries.
  • Tests (presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpTopN.java, optimization/TestClpComputePushDown.java, split/TestClpPinotSplitProvider.java, split/TestClpUberPinotSplitProvider.java): Updated TestClpTopN constructor calls; added TestClpComputePushDown with three test scenarios (data + metadata projection, existing layout preservation, error handling); added TestClpPinotSplitProvider with metadata extraction validation.
  • Submodule (presto-native-execution/velox): Updated submodule reference commit.
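
As additional context for the ClpSplit row above, here is a minimal, hypothetical sketch of how a metadata name/value map could be MessagePack-encoded and Base64-wrapped. The class and method names (MetadataEncodingSketch, encode) are assumptions for illustration, not the actual ClpSplit implementation; only the value types mentioned in the walkthrough (String, Long, Double) are handled.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;

public final class MetadataEncodingSketch
{
    // Hypothetical helper: packs a metadata map into MessagePack and wraps the bytes in Base64.
    public static String encode(Map<String, Object> metadata)
            throws Exception
    {
        try (MessageBufferPacker packer = MessagePack.newDefaultBufferPacker()) {
            packer.packMapHeader(metadata.size());
            for (Map.Entry<String, Object> entry : metadata.entrySet()) {
                packer.packString(entry.getKey());
                Object value = entry.getValue();
                if (value instanceof String) {
                    packer.packString((String) value);
                }
                else if (value instanceof Long) {
                    packer.packLong((Long) value);
                }
                else if (value instanceof Double) {
                    packer.packDouble((Double) value);
                }
                else {
                    throw new IllegalArgumentException("Unsupported metadata value type: " + value.getClass());
                }
            }
            return Base64.getEncoder().encodeToString(packer.toByteArray());
        }
    }

    public static void main(String[] args)
            throws Exception
    {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("tpath", "/path/to/split/file.ir");
        metadata.put("num_messages", 128L);
        System.out.println(encode(metadata));
    }
}
```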

Sequence Diagram(s)

sequenceDiagram
    participant Optimizer as Query Optimizer<br/>(ClpComputePushDown)
    participant Layout as Table Layout<br/>(ClpTableLayoutHandle)
    participant Split as Split Object<br/>(ClpSplit)
    participant Provider as Split Provider<br/>(Pinot/MySql/Uber)
    participant Metadata as Metadata Config<br/>(ClpSplitMetadataConfig)

    rect rgb(230, 240, 250)
    Note over Optimizer: Metadata Projection Computation
    Optimizer->>Optimizer: Collect projection columns from output
    Optimizer->>Metadata: Get metadata columns for table
    Optimizer->>Metadata: Validate against range bounds
    end

    rect rgb(240, 250, 240)
    Note over Layout: Layout Enhancement
    Optimizer->>Layout: Create/augment layout with<br/>splitMetadataColumnNames
    Optimizer->>Optimizer: Perform copy-on-write<br/>for new TableScanNode
    end

    rect rgb(250, 240, 240)
    Note over Provider: Metadata Extraction & Serialization
    Provider->>Provider: Build split selection query<br/>with metadata projections
    Provider->>Provider: Extract metadata row as<br/>Map<String, JsonNode>
    Provider->>Provider: Convert to typed values<br/>(String, Long, Double)
    end

    rect rgb(250, 245, 230)
    Note over Split: Split Creation with Metadata
    Provider->>Split: Call ClpSplit constructor with<br/>Optional<Map<String, Object>>
    Split->>Split: Encode metadata map to<br/>MessagePack + Base64
    Split->>Split: Store as<br/>metadataProjectionNameValueEncoded
    end
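
To make the "convert to typed values" step in the diagram more concrete, below is a hedged sketch of a getTypedValue()-style coercion using Jackson's JsonNode (the Changes list mentions such a method in ClpPinotSplitProvider). The helper name, the type-name strings, and the error handling are assumptions for illustration, not the provider's actual code.

```java
import com.fasterxml.jackson.databind.JsonNode;

public final class TypedValueSketch
{
    // Hypothetical coercion helper: maps a JSON cell to the Java type expected for the
    // metadata column (String for VARCHAR, Long for BIGINT, Double for DOUBLE).
    public static Object toTypedValue(JsonNode node, String prestoType)
    {
        if (node == null || node.isNull()) {
            return null; // in this sketch, null cells are simply skipped by the caller
        }
        switch (prestoType) {
            case "varchar":
                return node.asText();
            case "bigint":
                if (!node.canConvertToLong()) {
                    throw new IllegalArgumentException("Expected BIGINT-compatible value, got: " + node);
                }
                return node.asLong();
            case "double":
                // asDouble() also accepts scientific notation such as 1.5e3 or 1.5E3
                return node.asDouble();
            default:
                throw new IllegalArgumentException("Unsupported metadata type: " + prestoType);
        }
    }
}
```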

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • ClpComputePushDown.java: Metadata projection filtering logic, range-bound validation, and layout augmentation require careful verification of error handling paths and edge cases
  • ClpSplit.java: MessagePack serialization, Base64 encoding, and type validation need scrutiny for correctness and error propagation
  • ClpPinotSplitProvider.java: Type coercion logic in getTypedValue() and metadata column mapping require validation of type safety across all supported types (String, Long, Double)
  • TestClpComputePushDown.java: New test scenarios validate optimization behaviour, particularly the error handling path for unsupported range-bound metadata projections

Possibly related PRs

Suggested reviewers

  • anlowee

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.48%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description provides a clear explanation of the metadata projection feature, motivation, and validation approach, but lacks structured sections matching the template requirements. Resolution: add formal sections (Motivation and Context, Impact, Test Plan, and Release Notes) following the template; include checklist item completion and clearer documentation of public API changes.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title 'feat: Support metadata projection' clearly summarizes the main feature being added across the changeset: metadata projection support for CLP queries.

@20001020ycx 20001020ycx marked this pull request as ready for review November 28, 2025 05:00
@20001020ycx (Author) commented:

@CodeRabbit review

coderabbitai bot commented Nov 28, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpTopN.java (1)

289-301: Top-N tests correctly updated for new layout/split signatures

Using Optional.empty() for both splitMetadataColumnNames in ClpTableLayoutHandle and the metadata Optional in ClpSplit keeps the existing Top-N semantics unchanged while aligning with the new APIs. This is consistent with the current focus on data-column-only Top-N.

If you plan to support projecting metadata columns in Top-N queries later, it may be worth adding dedicated tests once that functionality is implemented.

Also applies to: 318-323

presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)

99-113: Metadata projection may not be applied when filter remains.

When processFilter returns a FilterNode (with remaining predicate), the code returns early without rewriting. The inner TableScanNode won't have visitTableScan called on it, potentially missing metadata projection attachment.

Consider restructuring to ensure visitTableScan is called on the underlying TableScanNode regardless of whether filters remain:

         @Override
         public PlanNode visitFilter(FilterNode node, RewriteContext<Void> context)
         {
             if (!(node.getSource() instanceof TableScanNode)) {
                 return node;
             }

             PlanNode processedNode = processFilter(node, (TableScanNode) node.getSource());

-            if (processedNode instanceof TableScanNode) {
-                return context.rewrite(processedNode, null);
-            }
-
-            return processedNode;
+            // Ensure the TableScanNode (whether standalone or as FilterNode source) gets metadata projection
+            return context.rewrite(processedNode, null);
         }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 223cb60 and 14c1d78.

📒 Files selected for processing (14)
  • presto-clp/pom.xml (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java (3 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java (6 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (6 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java (2 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java (11 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java (2 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpTopN.java (2 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/optimization/TestClpComputePushDown.java (1 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpPinotSplitProvider.java (1 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpUberPinotSplitProvider.java (2 hunks)
  • presto-native-execution/velox (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-08T02:05:14.212Z
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.212Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.

Applied to files:

  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.

Applied to files:

  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
📚 Learning: 2025-06-13T12:56:06.325Z
Learnt from: wraymo
Repo: y-scope/presto PR: 15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.

Applied to files:

  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
📚 Learning: 2025-07-30T15:27:04.862Z
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.

Applied to files:

  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.

Applied to files:

  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
🧬 Code graph analysis (3)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java (1)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/presto_protocol_clp.cpp (1)
  • ClpTableLayoutHandle (160-162)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
  • PrestoException (18-56)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java (2)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/presto_protocol_clp.cpp (1)
  • ClpSplit (108-110)
presto-native-execution/presto_cpp/main/operators/PartitionAndSerialize.cpp (1)
  • size (198-210)
🪛 GitHub Actions: maven checks
presto-native-execution/velox

[error] 1-1: Submodule update failed: fatal: remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346

🪛 GitHub Actions: prestissimo-worker-images-build
presto-native-execution/velox

[error] 1-1: Submodule fetch failed: git submodule update --init --force --depth=1 --recursive produced remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346.

🪛 GitHub Actions: test
presto-native-execution/velox

[error] 1-1: Git submodule update failed. remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346. Command: 'git -c protocol.version=2 submodule update --init --force --depth=1 --recursive'.

🔇 Additional comments (12)
presto-native-execution/velox (1)

1-1: I'm unable to access the repository to verify the submodule reference issue due to a cloning failure. Without access to the codebase, I cannot confirm:

  1. The actual commit hash in the presto-native-execution/velox file
  2. Whether the commit 0c4a140361c8895fa183f171608a08c45675c346 exists in the velox remote repository
  3. The actual CI/CD pipeline failure logs and error messages

Submodule commit reference requires verification.

The review identifies 0c4a140361c8895fa183f171608a08c45675c346 as an invalid submodule reference causing CI/CD failures. However, this assessment needs manual verification:

  • Confirm the commit hash in presto-native-execution/velox matches the stated value
  • Verify the commit exists in the upstream velox repository
  • Check CI/CD pipeline logs to confirm the reported "upload-pack: not our ref" errors
  • If the commit is correct but unpushed, push it to the remote; otherwise, update to the correct hash
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java (1)

32-37: New CLP error codes look consistent with existing taxonomy

The added error codes fit the existing numbering and error-type scheme and are appropriately classified as EXTERNAL and USER_ERROR. No issues from an API/semantics standpoint.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java (1)

130-146: ClpSplit constructor updates are consistent with new metadata Optional

Passing Optional.empty() for the new metadata argument keeps MySQL-based splits behaviourally identical while aligning with the extended ClpSplit API. This is appropriate for a provider that does not yet attach per-split metadata.

presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpUberPinotSplitProvider.java (1)

26-27: Test update correctly exercises new query-builder signature

Passing new ArrayList<>() into buildSplitSelectionQuery matches the new metadata-projection-aware signature, and the assertions still validate that tpath is projected via requiredProjectionColumns. This keeps the Uber-specific provider aligned with the base class behaviour.

Also applies to: 223-225

presto-clp/pom.xml (1)

21-32: Confirm msgpack/mockito versions and transitive impact

Adding msgpack-core and mockito-core here makes sense for metadata serialisation and testing, but it would be good to confirm:

  • These versions are compatible with the Presto 0.293 stack and any shading/packaging rules you follow.
  • msgpack-core does not unintentionally conflict with any existing msgpack usage elsewhere in the project or deployment.

If you have a dependency policy, please double-check these coordinates against it (e.g., via your usual dependency-audit tooling).

presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpPinotSplitProvider.java (2)

76-118: Test coverage looks solid for the metadata extraction logic.

The test covers:

  • Null value filtering (correctly excluded from result)
  • String, BigInt, and Double type coercion
  • Scientific notation for doubles (both e and E)
  • Name mapping from original to exposed column names

120-145: LGTM!

The error handling test correctly validates that a type mismatch (text for BIGINT) raises PrestoException.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java (1)

112-118: Top-N path intentionally excludes metadata projection values.

The Optional.empty() for metadata columns in the Top-N path is appropriate since Top-N splits use archive metadata for selection rather than projection. This asymmetry with the non-TopN path is by design.

presto-clp/src/test/java/com/facebook/presto/plugin/clp/optimization/TestClpComputePushDown.java (2)

224-227: Identity comparison to verify layout mutation may indicate a design concern.

The assertion assertTrue(layout == existingLayout) verifies the optimizer mutates the existing layout in place. While this tests the current behaviour, mutating layout handles can be fragile. Consider whether immutable layouts with copy-on-write would be safer long-term.

Is the mutable layout design intentional? If so, consider adding a comment in ClpTableLayoutHandle documenting that getSplitMetadataColumnNames() returns a mutable set that can be modified by optimizers.
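
As a rough illustration of the copy-on-write alternative suggested above, a layout handle could expose an immutable "with" method instead of a mutable set. The class shape and method name here (LayoutHandleSketch, withSplitMetadataColumnNames) are assumptions for discussion only, not the actual ClpTableLayoutHandle API.

```java
import java.util.Optional;
import java.util.Set;

import com.google.common.collect.ImmutableSet;

// Hypothetical, simplified layout handle illustrating a copy-on-write update:
// instead of mutating the existing handle, the optimizer would build a new one.
public final class LayoutHandleSketch
{
    private final String tableName;
    private final Optional<Set<String>> splitMetadataColumnNames;

    public LayoutHandleSketch(String tableName, Optional<Set<String>> splitMetadataColumnNames)
    {
        this.tableName = tableName;
        this.splitMetadataColumnNames = splitMetadataColumnNames;
    }

    // Returns a new handle with the metadata column names attached, leaving this one untouched.
    public LayoutHandleSketch withSplitMetadataColumnNames(Set<String> columnNames)
    {
        return new LayoutHandleSketch(tableName, Optional.of(ImmutableSet.copyOf(columnNames)));
    }
}
```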


97-164: Test coverage for metadata projection with data projection looks comprehensive.

The test validates:

  • Layout is set on rewritten scan
  • Metadata projections are correctly mapped using exposed-to-original mapping
  • TableScan ID changes after rewrite (immutability respected)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)

278-282: Layout constructor updated for Top-N path.

The additional Optional.empty() parameter for split metadata column names is appropriate for Top-N pushdown, which handles metadata differently (for archive selection rather than projection).

presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java (1)

172-189: Based on the provided context and learnings, I can now provide a rewritten review comment.

The learnings document indicates that metadata projections are used for archive selection during the split filtering stage and cannot be evaluated at the row level. This provides important context for understanding the design decision. The exclusion of metadataProjectionNameValueEncoded from hashCode and equals appears intentional: two ClpSplit instances with the same query parameters (path, type, kqlQuery) but different metadata projections represent the same logical split for row-level query execution, even though they may filter different archives.

Metadata projection exclusion from hashCode/equals appears intentional for split identity.

Two ClpSplit instances with the same path, type, and kqlQuery but different metadata projections are considered equal. This is by design: metadata projections determine which archives are scanned during split filtering, but do not affect the split's identity for row-level query execution. Both instances will execute the same query logic on their respective archive contents.
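
For readers skimming the thread, a minimal sketch of the equality contract described above (identity based only on path, type, and kqlQuery, with the encoded metadata excluded) might look like the following. It illustrates the described behaviour under assumed field types and is not the actual ClpSplit source.

```java
import java.util.Objects;

// Simplified sketch of split identity that ignores the encoded metadata projection values,
// matching the behaviour described in the review comment above.
public final class SplitIdentitySketch
{
    private final String path;
    private final String type;
    private final String kqlQuery;
    private final String metadataProjectionNameValueEncoded; // intentionally excluded below

    public SplitIdentitySketch(String path, String type, String kqlQuery, String metadataProjectionNameValueEncoded)
    {
        this.path = path;
        this.type = type;
        this.kqlQuery = kqlQuery;
        this.metadataProjectionNameValueEncoded = metadataProjectionNameValueEncoded;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o) {
            return true;
        }
        if (!(o instanceof SplitIdentitySketch)) {
            return false;
        }
        SplitIdentitySketch other = (SplitIdentitySketch) o;
        return Objects.equals(path, other.path)
                && Objects.equals(type, other.type)
                && Objects.equals(kqlQuery, other.kqlQuery);
    }

    @Override
    public int hashCode()
    {
        return Objects.hash(path, type, kqlQuery);
    }
}
```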

20001020ycx and others added 5 commits November 28, 2025 08:40
…it.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…zation/ClpComputePushDown.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>