-
Notifications
You must be signed in to change notification settings - Fork 4
feat: Support metadata projection #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-0.293-clp-connector-snapshot-metadata-refactor
Are you sure you want to change the base?
feat: Support metadata projection #109
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis pull request introduces metadata projection support to the CLP Presto plugin. It adds MessagePack serialization for metadata values in split objects, extends the table layout handle to track metadata column names, implements metadata-aware table scan rewriting in the query optimizer, and enhances split providers to extract and serialize metadata columns from multiple data sources. Changes
Sequence Diagram(s)sequenceDiagram
participant Optimizer as Query Optimizer<br/>(ClpComputePushDown)
participant Layout as Table Layout<br/>(ClpTableLayoutHandle)
participant Split as Split Object<br/>(ClpSplit)
participant Provider as Split Provider<br/>(Pinot/MySql/Uber)
participant Metadata as Metadata Config<br/>(ClpSplitMetadataConfig)
rect rgb(230, 240, 250)
Note over Optimizer: Metadata Projection Computation
Optimizer->>Optimizer: Collect projection columns from output
Optimizer->>Metadata: Get metadata columns for table
Optimizer->>Metadata: Validate against range bounds
end
rect rgb(240, 250, 240)
Note over Layout: Layout Enhancement
Optimizer->>Layout: Create/augment layout with<br/>splitMetadataColumnNames
Optimizer->>Optimizer: Perform copy-on-write<br/>for new TableScanNode
end
rect rgb(250, 240, 240)
Note over Provider: Metadata Extraction & Serialization
Provider->>Provider: Build split selection query<br/>with metadata projections
Provider->>Provider: Extract metadata row as<br/>Map<String, JsonNode>
Provider->>Provider: Convert to typed values<br/>(String, Long, Double)
end
rect rgb(250, 245, 230)
Note over Split: Split Creation with Metadata
Provider->>Split: Call ClpSplit constructor with<br/>Optional<Map<String, Object>>
Split->>Split: Encode metadata map to<br/>MessagePack + Base64
Split->>Split: Store as<br/>metadataProjectionNameValueEncoded
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
… into 11-25-metadata-projection
…esto into 11-25-metadata-projection
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpTopN.java (1)
289-301: Top-N tests correctly updated for new layout/split signaturesUsing
Optional.empty()for bothsplitMetadataColumnNamesinClpTableLayoutHandleand the metadata Optional inClpSplitkeeps the existing Top-N semantics unchanged while aligning with the new APIs. This is consistent with the current focus on data-column-only Top-N.If you plan to support projecting metadata columns in Top-N queries later, it may be worth adding dedicated tests once that functionality is implemented.
Also applies to: 318-323
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)
99-113: Metadata projection may not be applied when filter remains.When
processFilterreturns aFilterNode(with remaining predicate), the code returns early without rewriting. The innerTableScanNodewon't havevisitTableScancalled on it, potentially missing metadata projection attachment.Consider restructuring to ensure
visitTableScanis called on the underlyingTableScanNoderegardless of whether filters remain:@Override public PlanNode visitFilter(FilterNode node, RewriteContext<Void> context) { if (!(node.getSource() instanceof TableScanNode)) { return node; } PlanNode processedNode = processFilter(node, (TableScanNode) node.getSource()); - if (processedNode instanceof TableScanNode) { - return context.rewrite(processedNode, null); - } - - return processedNode; + // Ensure the TableScanNode (whether standalone or as FilterNode source) gets metadata projection + return context.rewrite(processedNode, null); }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (14)
presto-clp/pom.xml(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java(3 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java(6 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java(6 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java(2 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java(11 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java(1 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java(2 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpTopN.java(2 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/optimization/TestClpComputePushDown.java(1 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpPinotSplitProvider.java(1 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpUberPinotSplitProvider.java(2 hunks)presto-native-execution/velox(1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-08T02:05:14.212Z
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.212Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
📚 Learning: 2025-06-13T12:56:06.325Z
Learnt from: wraymo
Repo: y-scope/presto PR: 15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
📚 Learning: 2025-07-30T15:27:04.862Z
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
🧬 Code graph analysis (3)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java (1)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/presto_protocol_clp.cpp (1)
ClpTableLayoutHandle(160-162)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java (2)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/presto_protocol_clp.cpp (1)
ClpSplit(108-110)presto-native-execution/presto_cpp/main/operators/PartitionAndSerialize.cpp (1)
size(198-210)
🪛 GitHub Actions: maven checks
presto-native-execution/velox
[error] 1-1: Submodule update failed: fatal: remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346
🪛 GitHub Actions: prestissimo-worker-images-build
presto-native-execution/velox
[error] 1-1: Submodule fetch failed: git submodule update --init --force --depth=1 --recursive produced remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346.
🪛 GitHub Actions: test
presto-native-execution/velox
[error] 1-1: Git submodule update failed. remote error: upload-pack: not our ref 0c4a140361c8895fa183f171608a08c45675c346. Command: 'git -c protocol.version=2 submodule update --init --force --depth=1 --recursive'.
🔇 Additional comments (12)
presto-native-execution/velox (1)
1-1: I'm unable to access the repository to verify the submodule reference issue due to a cloning failure. Without access to the codebase, I cannot confirm:
- The actual commit hash in the
presto-native-execution/veloxfile- Whether the commit
0c4a140361c8895fa183f171608a08c45675c346exists in the velox remote repository- The actual CI/CD pipeline failure logs and error messages
Submodule commit reference requires verification.
The review identifies
0c4a140361c8895fa183f171608a08c45675c346as an invalid submodule reference causing CI/CD failures. However, this assessment needs manual verification:
- Confirm the commit hash in
presto-native-execution/veloxmatches the stated value- Verify the commit exists in the upstream velox repository
- Check CI/CD pipeline logs to confirm the reported "upload-pack: not our ref" errors
- If the commit is correct but unpushed, push it to the remote; otherwise, update to the correct hash
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java (1)
32-37: New CLP error codes look consistent with existing taxonomyThe added error codes fit the existing numbering and error-type scheme and are appropriately classified as
EXTERNALandUSER_ERROR. No issues from an API/semantics standpoint.presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java (1)
130-146: ClpSplit constructor updates are consistent with new metadata OptionalPassing
Optional.empty()for the new metadata argument keeps MySQL-based splits behaviourally identical while aligning with the extendedClpSplitAPI. This is appropriate for a provider that does not yet attach per-split metadata.presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpUberPinotSplitProvider.java (1)
26-27: Test update correctly exercises new query-builder signaturePassing
new ArrayList<>()intobuildSplitSelectionQuerymatches the new metadata-projection-aware signature, and the assertions still validate thattpathis projected viarequiredProjectionColumns. This keeps the Uber-specific provider aligned with the base class behaviour.Also applies to: 223-225
presto-clp/pom.xml (1)
21-32: Confirm msgpack/mockito versions and transitive impactAdding
msgpack-coreandmockito-corehere makes sense for metadata serialisation and testing, but it would be good to confirm:
- These versions are compatible with the Presto 0.293 stack and any shading/packaging rules you follow.
msgpack-coredoes not unintentionally conflict with any existing msgpack usage elsewhere in the project or deployment.If you have a dependency policy, please double-check these coordinates against it (e.g., via your usual dependency-audit tooling).
presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpPinotSplitProvider.java (2)
76-118: Test coverage looks solid for the metadata extraction logic.The test covers:
- Null value filtering (correctly excluded from result)
- String, BigInt, and Double type coercion
- Scientific notation for doubles (both
eandE)- Name mapping from original to exposed column names
120-145: LGTM!The error handling test correctly validates that a type mismatch (text for BIGINT) raises
PrestoException.presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java (1)
112-118: Top-N path intentionally excludes metadata projection values.The
Optional.empty()for metadata columns in the Top-N path is appropriate since Top-N splits use archive metadata for selection rather than projection. This asymmetry with the non-TopN path is by design.presto-clp/src/test/java/com/facebook/presto/plugin/clp/optimization/TestClpComputePushDown.java (2)
224-227: Identity comparison to verify layout mutation may indicate a design concern.The assertion
assertTrue(layout == existingLayout)verifies the optimizer mutates the existing layout in place. While this tests the current behaviour, mutating layout handles can be fragile. Consider whether immutable layouts with copy-on-write would be safer long-term.Is the mutable layout design intentional? If so, consider adding a comment in
ClpTableLayoutHandledocumenting thatgetSplitMetadataColumnNames()returns a mutable set that can be modified by optimizers.
97-164: Test coverage for metadata projection with data projection looks comprehensive.The test validates:
- Layout is set on rewritten scan
- Metadata projections are correctly mapped using exposed-to-original mapping
- TableScan ID changes after rewrite (immutability respected)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)
278-282: Layout constructor updated for Top-N path.The additional
Optional.empty()parameter for split metadata column names is appropriate for Top-N pushdown, which handles metadata differently (for archive selection rather than projection).presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpSplit.java (1)
172-189: Based on the provided context and learnings, I can now provide a rewritten review comment.The learnings document indicates that metadata projections are used for archive selection during the split filtering stage and cannot be evaluated at the row level. This provides important context for understanding the design decision. The exclusion of
metadataProjectionNameValueEncodedfromhashCodeandequalsappears intentional: twoClpSplitinstances with the same query parameters (path,type,kqlQuery) but different metadata projections represent the same logical split for row-level query execution, even though they may filter different archives.Metadata projection exclusion from
hashCode/equalsappears intentional for split identity.Two
ClpSplitinstances with the samepath,type, andkqlQuerybut different metadata projections are considered equal. This is by design: metadata projections determine which archives are scanned during split filtering, but do not affect the split's identity for row-level query execution. Both instances will execute the same query logic on their respective archive contents.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java
Show resolved
Hide resolved
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
Show resolved
Hide resolved
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
Show resolved
Hide resolved
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpPinotSplitProvider.java
Show resolved
Hide resolved
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java
Show resolved
Hide resolved
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpUberPinotSplitProvider.java
Show resolved
Hide resolved
...to-clp/src/test/java/com/facebook/presto/plugin/clp/optimization/TestClpComputePushDown.java
Show resolved
Hide resolved
presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/TestClpPinotSplitProvider.java
Show resolved
Hide resolved
…it.java Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…zation/ClpComputePushDown.java Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…esto into 11-25-metadata-projection
… into 11-25-metadata-projection
… into 11-25-metadata-projection
…esto into 11-25-metadata-projection
Description
This PR implements support for metadata projection.
Currently, projection results from a user SQL query are returned directly from the splits - either decompressed from SFA or read from IR stream files. However, when a projection field references a metadata column, the query is rejected because such column does not exist in the original dataset.
For example, considering the below SQL user query:
SELECT split_path, message FROM table LIMIT 1Where:
split_pathis a column from the metadata database that records the file path to the split (not present in the log file).messageis a column from the original dataset.This query would currently fail because
split_pathcannot be retrieved from the split data.This PR allowing query results to include metadata values appended alongside dataset columns. The expected result of the above SQL query should is shown below:
Checklist
breaking change.
Validation performed
We manually submit below queries to test the end-to-end result of the implemented code:
SELECT timestamp FROM clp.DEFAULT.cockroachdb_ir LIMIT 100;SELECT timestamp, tpath, num_messages FROM clp.DEFAULT.cockroachdb_ir LIMIT 100;wheretpathis aVARCHARmetadata, andnum_messagesis aBIGINTmetadataSELECT host_name FROM clp.DEFAULT.cockroachdb_ir LIMIT 100;wherehost_nameis an exposed name of aVARCHARtyped metadata columnhostnameSELECT ts FROM clp.DEFAULT.cockroachdb_ir LIMIT 100;wheretsis a ranged mapped metadata column: mappingtstocreationtimeandlastmodifiedtimeSELECT zone FROM clp.DEFAULT.cockroachdb_ir WHERE host_name = 'host4' LIMIT 100;&SELECT tpath, timestamp FROM clp.DEFAULT.cockroachdb_ir WHERE num_messages > 100 LIMIT 100;SELECT tpath, timestamp FROM clp.DEFAULT.cockroachdb_ir WHERE severity = 'INFO' LIMIT 100;SELECT tpath, CLP_GET_JSON_STRING() AS event FROM clp.DEFAULT.cockroachdb_ir WHERE severity = 'INFO' LIMIT 100;Summary by CodeRabbit
Release Notes
New Features
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.