refactor: Improve metadata expression extraction and validation; Update split metadata config format; Add split metadata columns during metadata resolution. #89
base: release-0.293-clp-connector
Conversation
Walkthrough

Replaces the split-filter abstraction with a structured split-metadata configuration: adds ClpSplitMetadataConfig and expression-based metadata handling, removes ClpSplitFilterProvider and related files, updates connector/module/optimizer/split-provider/metadata wiring, converts metadata SQL strings to RowExpression-based flows, updates tests/resources and docs, and removes jackson-core from the POM.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Planner as QueryPlanner
    participant Optimizer as ClpComputePushDown
    participant Config as ClpSplitMetadataConfig
    participant Converter as ClpFilterToKqlConverter
    participant SplitProv as ClpMySqlSplitProvider
    participant DB as MySQL
    Note over Planner,Optimizer: Planning phase
    Planner->>Optimizer: plan with filter RowExpression
    Optimizer->>Config: request metadata columns & required columns
    alt metadata pushdown allowed
        Optimizer->>Converter: convert filter -> {KQL?, metadataExpression}
        Converter-->>Optimizer: returns KQL and metadataExpression
        Optimizer->>SplitProv: include metadataExpression in layout
        SplitProv->>Converter: transform metadataExpression -> SQL predicate
        Converter-->>SplitProv: SQL predicate
        SplitProv->>DB: execute archive query + predicate
    else fallback
        Optimizer->>Planner: leave filter as remaining expression
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:
Possibly related issues
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review details

Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📒 Files selected for processing (2)

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings
📚 Learning: 2025-06-13T12:56:06.325Z — Applied to files:
📚 Learning: 2025-09-16T17:39:45.102Z — Applied to files:
📚 Learning: 2025-11-08T02:05:14.212Z — Applied to files:
📚 Learning: 2025-07-30T15:27:04.862Z — Applied to files:
📚 Learning: 2025-09-12T14:47:20.734Z — Applied to files:

🧬 Code graph analysis (2)

presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java (2)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnectorFactory.java (1)
56-61: Remove duplicate TypeManager binding

We now bind TypeManager twice in the same module (binder.bind(TypeManager.class).toInstance(...) appears on both Line 57 and Line 60). Guice rejects duplicate explicit bindings, so the connector bootstrap will fail with a CreationException as soon as this path is executed. Drop one of the bindings (keep the existing one at Line 60, or the new one, but not both) so the injector can be created successfully.

```diff
     binder.bind(FunctionMetadataManager.class).toInstance(context.getFunctionMetadataManager());
-    binder.bind(TypeManager.class).toInstance(context.getTypeManager());
     binder.bind(NodeManager.class).toInstance(context.getNodeManager());
     binder.bind(StandardFunctionResolution.class).toInstance(context.getStandardFunctionResolution());
     binder.bind(TypeManager.class).toInstance(context.getTypeManager());
```
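Why a duplicate explicit binding is fatal can be illustrated with a minimal plain-Java model of a DI binder. This is a sketch, not Guice itself: the `Binder` class below is illustrative and simply assumes that explicit bindings are keyed by type, so registering the same key twice is an error (Guice surfaces this as a `CreationException` when the injector is created).

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of duplicate-binding rejection: one binding per key,
// and a second registration of the same key fails fast.
public class DuplicateBindingSketch {
    static final class Binder {
        private final Map<Class<?>, Object> bindings = new HashMap<>();

        <T> void bindInstance(Class<T> type, T instance) {
            if (bindings.containsKey(type)) {
                // Guice reports this at injector-creation time as a CreationException
                throw new IllegalStateException("Duplicate binding for " + type.getName());
            }
            bindings.put(type, instance);
        }
    }

    public static void main(String[] args) {
        Binder binder = new Binder();
        binder.bindInstance(String.class, "typeManager");
        boolean duplicateRejected = false;
        try {
            // Second bind of the same key, like the repeated TypeManager binding
            binder.bindInstance(String.class, "typeManagerAgain");
        }
        catch (IllegalStateException e) {
            duplicateRejected = true;
        }
        System.out.println(duplicateRejected);
    }
}
```

Removing one of the two `bind(TypeManager.class)` calls leaves exactly one binding per key, which is what the injector requires.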
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (30)
- presto-clp/pom.xml (0 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConfig.java (2 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.java (3 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnectorFactory.java (1 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java (1 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpModule.java (2 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java (5 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (4 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpExpression.java (4 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java (16 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpPlanOptimizerProvider.java (3 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java (4 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpMySqlSplitFilterProvider.java (0 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterConfig.java (0 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterConfigCustomOptionsDeserializer.java (0 hunks)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterProvider.java (0 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java (0 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java (5 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMySqlSplitMetadataExpressionConverter.java (1 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (1 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpSplit.java (4 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpSplitMetadataConfig.java (1 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java (7 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/filter/TestClpMySqlSplitFilterConfig.java (0 hunks)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/filter/TestClpSplitFilterConfigCommon.java (0 hunks)
- presto-clp/src/test/resources/test-mysql-split-filter.json (0 hunks)
- presto-clp/src/test/resources/test-mysql-split-metadata.json (1 hunks)
- presto-docs/src/main/sphinx/connector/clp.rst (2 hunks)
💤 Files with no reviewable changes (9)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/filter/TestClpMySqlSplitFilterConfig.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterConfigCustomOptionsDeserializer.java
- presto-clp/pom.xml
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpMySqlSplitFilterProvider.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/split/filter/TestClpSplitFilterConfigCommon.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterConfig.java
- presto-clp/src/test/resources/test-mysql-split-filter.json
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/filter/ClpSplitFilterProvider.java
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-07-30T15:27:04.862Z
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Applied to files:
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpPlanOptimizerProvider.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-06-13T12:56:06.325Z
Learnt from: wraymo
Repo: y-scope/presto PR: 15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
Applied to files:
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpPlanOptimizerProvider.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpModule.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpSplit.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.java
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpPlanOptimizerProvider.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpModule.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpErrorCode.java
- presto-docs/src/main/sphinx/connector/clp.rst
📚 Learning: 2025-06-19T08:41:46.288Z
Learnt from: anlowee
Repo: y-scope/presto PR: 0
File: :0-0
Timestamp: 2025-06-19T08:41:46.288Z
Learning: In the y-scope/presto repository, prefer using Guava's Immutable collections over regular collections when possible.
Applied to files:
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java
🧬 Code graph analysis (12)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpPlanOptimizerProvider.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpSplitMetadataConfig.java (3)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1): ClpMySqlSplitMetadataExpressionConverter (60-245)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1): ClpSplitMetadataConfig (58-277)
- presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1): PrestoException (18-56)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java (4)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/mockdb/ClpMockMetadataDatabase.java (1): ClpMockMetadataDatabase (49-288)
- presto-clp/src/test/java/com/facebook/presto/plugin/clp/mockdb/table/ColumnMetadataTableRows.java (1): ColumnMetadataTableRows (30-63)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (1): ClpComputePushDown (44-157)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1): ClpSplitMetadataConfig (58-277)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMySqlSplitMetadataExpressionConverter.java (2)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1): ClpMySqlSplitMetadataExpressionConverter (60-245)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1): ClpSplitMetadataConfig (58-277)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpModule.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpSplit.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java (2)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpExpression.java (1): ClpExpression (29-102)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java (1): ClpFilterToKqlConverter (106-1071)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (2)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConfig.java (1): ClpConfig (21-186)
- presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1): PrestoException (18-56)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (2)
- presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1): ClpSplitMetadataConfig (58-277)
- presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1): PrestoException (18-56)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpTableLayoutHandle.java (1)
presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.h (1)
RowExpression(394-416)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: conventional-commits
- GitHub Check: test (17.0.13, :presto-main-base)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-local-queries)
- GitHub Check: test (17.0.13, :presto-tests -P presto-tests-general)
- GitHub Check: test (8.0.442, :presto-main-base)
- GitHub Check: test (8.0.442, :presto-tests -P presto-tests-general)
- GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (1)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (1)
118-121: LGTM! The new helper method correctly delegates to the existing toRowExpression implementation and enables tests to provide custom TypeProvider instances. This addition supports the metadata-expression testing infrastructure described in the PR objectives.
```java
        TableHandle tableHandle = node.getTable();
        ClpTableHandle clpTableHandle = (ClpTableHandle) tableHandle.getConnectorHandle();
        tableScopeSet.add(format("%s.%s", CONNECTOR_NAME, clpTableHandle.getSchemaTableName()));
        if (!metadataConfig.getRequiredColumns(clpTableHandle.getSchemaTableName()).isEmpty()) {
            throw new PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "required filters must be specified");
        }
```
Don’t throw before metadata filters are processed.
visitTableScan is invoked while rewriting every table scan, including ones that still have a metadata filter sitting above them. Because the new guard executes before the filter is analysed, any table that lists required columns now fails even when the query provides the correct metadata filter. We need to defer the “required filters must be specified” check until after we’ve inspected the filter’s RowExpression (so genuine missing filters still fail, but valid ones pass).
Apply this diff to remove the premature throw:

```diff
-        if (!metadataConfig.getRequiredColumns(clpTableHandle.getSchemaTableName()).isEmpty()) {
-            throw new PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "required filters must be specified");
-        }
```

Then, inside processFilter, enforce the requirement only when no metadata expression was produced:

```diff
@@
-        SchemaTableName schemaTableName = clpTableHandle.getSchemaTableName();
+        SchemaTableName schemaTableName = clpTableHandle.getSchemaTableName();
+        Set<String> requiredColumns = metadataConfig.getRequiredColumns(schemaTableName);
@@
-        Optional<RowExpression> metadataExpression = clpExpression.getMetadataExpression();
+        Optional<RowExpression> metadataExpression = clpExpression.getMetadataExpression();
+
+        if (!requiredColumns.isEmpty() && !metadataExpression.isPresent()) {
+            throw new PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "required filters must be specified");
+        }
```

Don't forget to add import java.util.Set; at the top. This keeps the enforcement intact for missing filters while allowing valid filtered queries to proceed.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java around lines 82-86, remove the premature PrestoException throw that checks metadataConfig.getRequiredColumns(...) in visitTableScan so the check does not run before the metadata filter is processed; then update processFilter to enforce the "required filters must be specified" rule only when no metadata expression was produced (i.e. when the metadata expression is absent/empty), and add an import line for java.util.Set at the top of the file.
There could be no filters. No filter should be treated as a failure if required columns are specified
That's why we need to do it in visitTableScan
@wraymo, you're absolutely right! The check in visitTableScan is necessary and correct for the no-filter case.
The visitor pattern here works as follows:
- When there's no FilterNode above the TableScanNode: visitTableScan is called, and the required-columns check appropriately throws an exception.
- When there IS a FilterNode above the TableScanNode: visitFilter is called, which directly invokes processFilter on the source TableScanNode without delegating to visitTableScan. This means visitTableScan is never invoked in this path.

So the check in visitTableScan only fires when there's no filter, which is exactly when you want to reject queries that don't provide the required metadata filters. It doesn't interfere with queries that have filters because those go through the visitFilter → processFilter path instead.

CodeRabbit's concern about a "premature throw before metadata filters are processed" doesn't apply here: when a metadata filter exists in the plan, visitTableScan isn't called at all.
One minor observation (not blocking): processFilter could also add a check to ensure that when required columns exist and the filter produces no metadataExpression, an exception is thrown. This would catch the edge case where a filter exists but doesn't cover the required metadata columns. However, that might be a separate consideration beyond the scope of this specific discussion.
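The dispatch described above can be sketched in a few lines of standalone Java. This is an illustration only, not the connector's actual classes: the node types, the Rewriter, and the result strings are hypothetical, and the sketch just shows that when a filter sits above a scan, the filter handler consumes the scan directly, so the scan handler (and its required-columns check) never runs on that path.

```java
// Sketch of the visitor dispatch: visitFilter() processes its source
// TableScanNode itself, so visitTableScan() only runs for bare scans.
public class VisitorDispatchSketch {
    interface Node {}
    static final class TableScanNode implements Node {}
    static final class FilterNode implements Node {
        final TableScanNode source;
        FilterNode(TableScanNode source) { this.source = source; }
    }

    static final class Rewriter {
        // Stands in for metadataConfig.getRequiredColumns(...).isEmpty() == false
        final boolean requiredColumnsConfigured = true;

        String visit(Node node) {
            if (node instanceof FilterNode) {
                // processFilter consumes the source scan directly;
                // visitTableScan is never called on this path.
                return processFilter((FilterNode) node);
            }
            return visitTableScan((TableScanNode) node);
        }

        String visitTableScan(TableScanNode scan) {
            if (requiredColumnsConfigured) {
                return "rejected: required filters must be specified";
            }
            return "scan";
        }

        String processFilter(FilterNode filter) {
            return "filter processed";
        }
    }

    public static void main(String[] args) {
        Rewriter rewriter = new Rewriter();
        // A filtered scan goes through processFilter and is not rejected.
        System.out.println(rewriter.visit(new FilterNode(new TableScanNode())));
        // A bare scan hits visitTableScan, where the check fires.
        System.out.println(rewriter.visit(new TableScanNode()));
    }
}
```

This is why placing the check in visitTableScan rejects exactly the no-filter queries while leaving filtered queries to the processFilter path.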
🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
@wraymo, you're absolutely right! I see the distinction now.
The visitor flow means:
- Query with no filter → visitTableScan is called → your check catches it ✓
- Query with filter → visitFilter is called → processFilter handles it directly → visitTableScan is NOT called for that TableScanNode
So the check in visitTableScan specifically guards against queries that have zero filters when required columns are configured. That's the correct placement for that scenario.
One follow-up question: for queries that do have a filter but the filter doesn't reference the required columns (doesn't produce a metadataExpression), where is that validated? I see that processFilter extracts the metadataExpression but doesn't appear to check whether required columns were satisfied. Is that validation happening later in ClpMySqlSplitMetadataExpressionConverter.transform() during split generation, or should we add an explicit check in processFilter to fail fast at plan time?
🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
```diff
+        List<String> pushdownKql = new ArrayList<>();
+        List<RowExpression> metadataExpressions = new ArrayList<>();
+        boolean hasUnpushable = false;
+
         for (RowExpression argument : node.getArguments()) {
-            ClpExpression expression = argument.accept(this, null);
-            // Note: It is possible in the future that an expression cannot be pushed down as a KQL query, but can be
-            // pushed down as a metadata SQL query.
-            if (expression.getRemainingExpression().isPresent() || !expression.getPushDownExpression().isPresent()) {
-                allPushedDown = false;
-                continue;
+            ClpExpression expr = argument.accept(this, null);
+
+            boolean hasRemaining = expr.getRemainingExpression().isPresent();
+            boolean hasKql = expr.getPushDownExpression().isPresent();
+            boolean hasMeta = expr.getMetadataExpression().isPresent();
+
+            // If this arg cannot be pushed down at all, bail early
+            if (hasRemaining || (!hasKql && !hasMeta)) {
+                hasUnpushable = true;
+                break;
             }
-            queryBuilder.append(expression.getPushDownExpression().get());
-            queryBuilder.append(" OR ");
-            if (hasAllMetadataSql && expression.getMetadataSqlQuery().isPresent()) {
-                metadataQueryBuilder.append(expression.getMetadataSqlQuery().get());
-                metadataQueryBuilder.append(" OR ");
+
+            if (hasKql) {
+                pushdownKql.add(expr.getPushDownExpression().get());
             }
-            else {
-                hasAllMetadataSql = false;
+
+            if (hasMeta) {
+                metadataExpressions.add(expr.getMetadataExpression().get());
             }
         }
-        if (allPushedDown) {
-            // Remove the last " OR " from the query
-            return new ClpExpression(
-                    queryBuilder.substring(0, queryBuilder.length() - 4) + ")",
-                    hasAllMetadataSql ? metadataQueryBuilder.substring(0, metadataQueryBuilder.length() - 4) + ")" : null);
+
+        if (hasUnpushable) {
+            return new ClpExpression(node);
         }
-        return new ClpExpression(node);
+
+        String combinedKql = null;
+        if (!pushdownKql.isEmpty()) {
+            combinedKql = "(" + String.join(" OR ", pushdownKql) + ")";
+        }
+
+        // Only use metadata if every argument has metadata and none had KQL
+        RowExpression combinedMetadata = null;
+        if (metadataExpressions.size() == node.getArguments().size() && pushdownKql.isEmpty()) {
+            combinedMetadata = (metadataExpressions.size() == 1)
+                    ? metadataExpressions.get(0)
+                    : new SpecialFormExpression(
+                            node.getSourceLocation(),
+                            OR,
+                            BOOLEAN,
+                            metadataExpressions);
+        }
+
+        return new ClpExpression(combinedKql, combinedMetadata);
     }
```
Do not drop metadata-only branches when mixing with KQL in OR.

When an OR clause contains one operand that is metadata-only (e.g., fare = 0) and another that produces KQL (e.g., city.Name LIKE 'b%'), this block treats the node as pushable, yet the resulting ClpExpression only keeps the KQL branch: no metadata expression and no remaining expression. The metadata predicate is therefore lost, so rows satisfying fare = 0 but not the KQL branch are incorrectly filtered out. Please fall back to the original expression whenever OR mixes metadata-only and KQL-only operands.
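The underlying logic error is easy to demonstrate in isolation: rewriting (metaPred OR kqlPred) as just kqlPred drops every row that matches only the metadata branch. The sketch below is standalone and illustrative (the Row record, field names, and predicates are hypothetical, not connector code):

```java
import java.util.List;
import java.util.function.Predicate;

// Shows why narrowing (metaPred OR kqlPred) to kqlPred alone is unsound:
// rows matching only the metadata-only branch are silently dropped.
public class OrNarrowingSketch {
    record Row(int fare, String cityName) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(0, "austin"),   // matches only the metadata branch (fare = 0)
                new Row(5, "boston"),   // matches only the KQL branch (name LIKE 'b%')
                new Row(7, "chicago")); // matches neither

        Predicate<Row> metaPred = r -> r.fare() == 0;
        Predicate<Row> kqlPred = r -> r.cityName().startsWith("b");

        // Correct semantics: keep rows satisfying either branch.
        long correct = rows.stream().filter(metaPred.or(kqlPred)).count();
        // Buggy narrowing: the metadata-only branch is dropped.
        long narrowed = rows.stream().filter(kqlPred).count();

        System.out.println(correct);  // the fare = 0 row and the "boston" row
        System.out.println(narrowed); // only the "boston" row survives
    }
}
```

Falling back to the original expression whenever the OR mixes metadata-only and KQL-only operands avoids this narrowing entirely.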
Apply this diff to restore correctness:

```diff
-        boolean hasUnpushable = false;
+        boolean hasUnpushable = false;
+        boolean sawMetadataOnly = false;
+        boolean sawKqlOnly = false;
@@
             if (hasRemaining || (!hasKql && !hasMeta)) {
                 hasUnpushable = true;
                 break;
             }
+            if (hasMeta && !hasKql) {
+                sawMetadataOnly = true;
+            }
+            if (hasKql && !hasMeta) {
+                sawKqlOnly = true;
+            }
+
             if (hasKql) {
                 pushdownKql.add(expr.getPushDownExpression().get());
             }
             if (hasMeta) {
                 metadataExpressions.add(expr.getMetadataExpression().get());
             }
         }
-        if (hasUnpushable) {
+        if (hasUnpushable || (sawMetadataOnly && sawKqlOnly)) {
             return new ClpExpression(node);
         }
```
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
        List<String> pushdownKql = new ArrayList<>();
        List<RowExpression> metadataExpressions = new ArrayList<>();
        boolean hasUnpushable = false;
        boolean sawMetadataOnly = false;
        boolean sawKqlOnly = false;

        for (RowExpression argument : node.getArguments()) {
            ClpExpression expr = argument.accept(this, null);

            boolean hasRemaining = expr.getRemainingExpression().isPresent();
            boolean hasKql = expr.getPushDownExpression().isPresent();
            boolean hasMeta = expr.getMetadataExpression().isPresent();

            // If this arg cannot be pushed down at all, bail early
            if (hasRemaining || (!hasKql && !hasMeta)) {
                hasUnpushable = true;
                break;
            }

            if (hasMeta && !hasKql) {
                sawMetadataOnly = true;
            }
            if (hasKql && !hasMeta) {
                sawKqlOnly = true;
            }

            if (hasKql) {
                pushdownKql.add(expr.getPushDownExpression().get());
            }

            if (hasMeta) {
                metadataExpressions.add(expr.getMetadataExpression().get());
            }
        }

        if (hasUnpushable || (sawMetadataOnly && sawKqlOnly)) {
            return new ClpExpression(node);
        }

        String combinedKql = null;
        if (!pushdownKql.isEmpty()) {
            combinedKql = "(" + String.join(" OR ", pushdownKql) + ")";
        }

        // Only use metadata if every argument has metadata and none had KQL
        RowExpression combinedMetadata = null;
        if (metadataExpressions.size() == node.getArguments().size() && pushdownKql.isEmpty()) {
            combinedMetadata = (metadataExpressions.size() == 1)
                    ? metadataExpressions.get(0)
                    : new SpecialFormExpression(
                            node.getSourceLocation(),
                            OR,
                            BOOLEAN,
                            metadataExpressions);
```
| } | |
| return new ClpExpression(combinedKql, combinedMetadata); | |
| } |
Yeah, because it's an OR: if we push down metadata in this case, it will turn it into an AND.
@wraymo, you're absolutely right about the root cause—pushing down both metadata and KQL predicates in an OR would apply them at different stages (metadata filters determine which archives to scan, while KQL filters within those archives), effectively turning the OR into an AND.
However, the current implementation has a correctness issue: when mixing metadata-only and KQL-only predicates in an OR, the metadata predicate is silently dropped.
For example, with WHERE fare = 0 OR city.Name LIKE 'b%':
- fare = 0 (metadata-only) gets collected but not used
- city.Name LIKE 'b%' (KQL-only) is pushed down
- Result: only the KQL filter is applied, so rows satisfying fare = 0 but not the LIKE pattern would be incorrectly filtered out
This produces wrong results. The safe approach is to fall back to the original expression (no pushdown) when OR mixes metadata-only and KQL-only predicates, as CodeRabbit suggested. This ensures Presto evaluates the full OR condition correctly, even though it's less efficient.
Would you like to apply CodeRabbit's fix to prevent incorrect results, or is there a different strategy you have in mind for handling these mixed OR cases?
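The under-selection CodeRabbit describes can be reproduced with a few lines of plain Java. This is an illustration only; the `Row` type and the two predicates are stand-ins for the metadata and KQL disjuncts, not connector types:

```java
import java.util.Arrays;
import java.util.List;

public class OrPushdownSemantics {
    static final class Row {
        final int fare;
        final String name;
        Row(int fare, String name) { this.fare = fare; this.name = name; }
    }

    // Full OR semantics: fare = 0 OR name LIKE 'b%'
    static long fullOrCount(List<Row> rows) {
        return rows.stream().filter(r -> r.fare == 0 || r.name.startsWith("b")).count();
    }

    // What a KQL-only pushdown returns when the metadata disjunct is dropped
    static long kqlOnlyCount(List<Row> rows) {
        return rows.stream().filter(r -> r.name.startsWith("b")).count();
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(0, "alpha"), new Row(5, "bravo"), new Row(0, "bravo"));
        System.out.println(fullOrCount(rows));  // 3: every row matches the true OR
        System.out.println(kqlOnlyCount(rows)); // 2: (0, "alpha") is lost
    }
}
```

The row `(0, "alpha")` satisfies only the metadata disjunct, so any evaluation that applies just the KQL part drops it.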
🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
fare is a metadata column, dropping it is the expected behaviour
@wraymo, thank you for clarifying! I understand now that dropping the metadata predicate in mixed OR scenarios is the intended design behavior.
To confirm my understanding: metadata columns (like fare in your example) are used for archive selection at the split/metadata filtering stage, while KQL predicates operate at the row level within archives. When you have fare = 0 OR city.Name LIKE 'b%', the OR semantics would require evaluating both predicates at the same level, but since metadata columns can't be evaluated at the row level, the expected behavior is to:
- Drop the metadata predicate (fare = 0)
- Push down only the KQL part (city.Name LIKE 'b%')
- Scan all archives (no metadata filtering) and apply the KQL filter
This is a trade-off where the connector prioritizes pushing down what it can (KQL) while accepting that metadata-based archive filtering isn't possible in this mixed OR scenario. The alternative would be to throw an error for such queries, but silently dropping the metadata predicate allows the query to proceed (albeit scanning more archives than optimal).
This design makes sense given the two-stage filtering architecture. Thanks for the clarification!
✏️ Learnings added
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.192Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
| for (String ns : namespaces) { | ||
| TableConfig cfg = tableConfigs.get(ns); | ||
| if (cfg != null) { | ||
| merged.metaColumns.putAll(cfg.metaColumns); | ||
| for (FilterRule rule : cfg.filterRules) { | ||
| boolean exists = merged.filterRules.stream() | ||
| .anyMatch(r -> r.column.equals(rule.column)); | ||
| if (!exists) { | ||
| merged.filterRules.add(rule); | ||
| } | ||
| } |
Allow table-level filter rules to override broader scopes.
This merge keeps the first occurrence of each rule (global) and ignores schema/table overrides, so a table-specific rule (e.g., marking a column required) never takes effect. That contradicts the documented precedence and breaks configuration correctness. Please let later scopes replace earlier ones.
Apply this diff to honour overrides:
- for (String ns : namespaces) {
- TableConfig cfg = tableConfigs.get(ns);
- if (cfg != null) {
- merged.metaColumns.putAll(cfg.metaColumns);
-
- for (FilterRule rule : cfg.filterRules) {
- boolean exists = merged.filterRules.stream()
- .anyMatch(r -> r.column.equals(rule.column));
- if (!exists) {
- merged.filterRules.add(rule);
- }
- }
- }
- }
+ Map<String, FilterRule> mergedRules = new LinkedHashMap<>();
+ for (String ns : namespaces) {
+ TableConfig cfg = tableConfigs.get(ns);
+ if (cfg != null) {
+ merged.metaColumns.putAll(cfg.metaColumns);
+ for (FilterRule rule : cfg.filterRules) {
+ mergedRules.put(rule.column, rule);
+ }
+ }
+ }
+ merged.filterRules.addAll(mergedRules.values());
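The "later scope wins" behaviour that the suggested `LinkedHashMap` merge relies on can be sketched in isolation. `mergeRules` and the string-valued rule maps below are hypothetical stand-ins for the connector's `FilterRule` handling, not its actual API:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScopeMergeDemo {
    // Merge scope maps keyed by column name; LinkedHashMap keeps first-seen
    // order while letting later scopes overwrite earlier entries.
    public static Map<String, String> mergeRules(List<Map<String, String>> scopes) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> scope : scopes) {
            merged.putAll(scope); // later scopes (schema, table) override earlier (global)
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> global = new LinkedHashMap<>();
        global.put("timestamp", "optional");
        Map<String, String> table = new LinkedHashMap<>();
        table.put("timestamp", "required"); // table-level override must win
        System.out.println(mergeRules(Arrays.asList(global, table)).get("timestamp")); // required
    }
}
```

This is exactly why the original "keep first occurrence" loop was wrong: `put`/`putAll` naturally gives the documented last-scope-wins precedence.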
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
📚 Learning: 2025-11-08T02:05:14.192Z
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.192Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
🧬 Code graph analysis (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-queries)
- GitHub Check: test (:presto-docs)
- GitHub Check: maven-checks (8.0.442)
- GitHub Check: prestissimo-worker-images-build
- GitHub Check: maven-checks (17.0.13)
🔇 Additional comments (4)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (4)
94-104: LGTM: Required-column tracking correctly reset. The addition of seenRequired.clear() on line 96 ensures that each transform invocation independently validates required columns without contamination from prior calls.
156-178: LGTM: Constant handling is correct and secure. The method properly handles:
- SQL string escaping (line 168: replace("'", "''")),
- decimal scale preservation for both Slice and Long representations, and
- safe conversion of numeric types.
180-186: LGTM: Column reference handling is correct. The method correctly:
- tracks the exposed name for required-column validation (line 184), and
- returns the original MySQL column name for SQL generation (line 185).
188-245: LGTM: Range-bound rewriting logic is correct. The method correctly rewrites comparisons using metadata bounds:
- greater-than operators check the upper bound,
- less-than operators check the lower bound, and
- equality rewrites to a range containment check: (lower <= literal) AND (upper >= literal).

The fallback cases (lines 233-238) reasonably handle single-bound scenarios.
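The bound-rewriting rules described above can be sketched as a small string-level transform. The `<col>_lower`/`<col>_upper` naming and the `rewrite` method are illustrative only, not the connector's actual schema or API:

```java
public class RangeBoundRewrite {
    // Rewrite a row-level comparison on a column into an archive-level bounds
    // check: the archive can only contain a matching row if its [lower, upper]
    // range could satisfy the predicate.
    public static String rewrite(String column, String op, String literal) {
        String lower = column + "_lower";
        String upper = column + "_upper";
        switch (op) {
            case ">":  return upper + " > " + literal;
            case ">=": return upper + " >= " + literal;
            case "<":  return lower + " < " + literal;
            case "<=": return lower + " <= " + literal;
            case "=":  return "(" + lower + " <= " + literal + ") AND (" + upper + " >= " + literal + ")";
            default: throw new IllegalArgumentException("unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("ts", "=", "100"));
        // (ts_lower <= 100) AND (ts_upper >= 100)
    }
}
```

Note these checks only prune archives that definitely cannot match; rows inside a surviving archive are still filtered at scan time.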
e5bfe33 to ddfca65 (Compare)
ddfca65 to bfa18ff (Compare)
…lution; fixed a few bugs
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMySqlMetadataProvider.java(5 hunks)presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java(4 hunks)presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMetadata.java(6 hunks)presto-docs/src/main/sphinx/connector/clp.rst(2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rstpresto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMySqlMetadataProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMetadata.java
📚 Learning: 2025-06-13T12:56:06.325Z
Learnt from: wraymo
Repo: y-scope/presto PR: 15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMySqlMetadataProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMetadata.java
📚 Learning: 2025-11-08T02:05:14.212Z
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.212Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-07-30T15:27:04.862Z
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
🧬 Code graph analysis (3)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMySqlMetadataProvider.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java (2)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpMetadata.java (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java (1)
ClpSplitMetadataConfig(58-277)
| if (!metadataExpression.isPresent() && | ||
| !metadataConfig.getRequiredColumns(clpTableHandle.getSchemaTableName()).isEmpty()) { | ||
| throw new PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "required filters must be specified"); | ||
| } |
Required columns still go unenforced.
metadataExpression.isPresent() only tells us that some metadata predicate was extracted; it does not guarantee that every entry in metadataConfig.getRequiredColumns(...) appeared in the filter. For example, with a config that marks msg.timestamp as required, a query filtering only on a different metadata column (e.g., level = 5) yields a metadata expression, so this guard passes and the query is accepted even though the required column is absent. That regresses the “must supply timestamp” contract and re-opens the expensive full-scan path the config was supposed to forbid.
Please track the specific metadata/data columns referenced during conversion and compare them against the required set, e.g. have ClpFilterToKqlConverter return the set of satisfied columns (or extend ClpExpression to expose it) and reject whenever requiredColumns - satisfiedColumns is non-empty.
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
around lines 120-123, the current check only tests
metadataExpression.isPresent() which does not guarantee that all configured
required columns were actually referenced; change the flow so the filter-to-KQL
conversion also returns (or exposes via ClpExpression) the set of metadata/data
columns that were satisfied, then in this location compute missing =
requiredColumns - satisfiedColumns and throw
PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "...") whenever missing is
non-empty; implement the minimal API change by extending ClpFilterToKqlConverter
to return both the converted expression and the Set<String> of satisfied column
names (or augment ClpExpression to expose satisfiedColumns), update callers
accordingly, and use that satisfied set here to enforce requiredColumns instead
of relying solely on metadataExpression.isPresent().
It's enforced in the converter.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check ClpFilterToKqlConverter for required column enforcement
cat presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java | grep -A 20 -B 5 "required"
Length of output: 1243
🏁 Script executed:
#!/bin/bash
# Check ClpMySqlSplitMetadataExpressionConverter for required column enforcement
cat presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java | grep -A 30 "transform\|required"
Length of output: 4969
🏁 Script executed:
#!/bin/bash
# Check how ClpMySqlSplitMetadataExpressionConverter is used in ClpMySqlSplitProvider
cat presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitProvider.java | grep -A 40 -B 10 "ClpMySqlSplitMetadataExpressionConverter"
Length of output: 2667
@wraymo You're correct! I can see that required column enforcement happens in ClpMySqlSplitMetadataExpressionConverter. The architecture uses a two-tier validation approach:
- Early fail-fast check (lines 120-123 in ClpComputePushDown): If no metadata expression was generated at all but required columns are configured, it fails immediately with "required filters must be specified."
- Fine-grained enforcement (in ClpMySqlSplitMetadataExpressionConverter.transform()): When a metadata expression IS present, this converter tracks which required columns were actually referenced via the seenRequired set. After transformation, it computes missing = requiredColumns - seenRequired and throws CLP_MANDATORY_COLUMN_NOT_IN_FILTER with "Missing required filter columns: " + missing if any required columns weren't referenced.
🧬 Code Graph Analysis Results
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpComputePushDown.java
- Summary: Enforcement of mandatory columns is performed directly in the CLP push-down converter/rewriter logic.
- Where to look:
  - In ClpComputePushDown.Rewriter.visitTableScan: if there are any required columns for the target CLP table (as reported by metadataConfig.getRequiredColumns(schemaTableName)) and the scan does not have those columns specified, a PrestoException is thrown with error code CLP_MANDATORY_COLUMN_NOT_IN_FILTER and message "required filters must be specified".
  - In ClpComputePushDown.Rewriter.processFilter: after converting the filter to a KQL/metadata expression, if no metadata expression is produced (i.e., nothing was pushed down to metadata) and there are still required columns for the target table, the same PrestoException is thrown with the same message, enforcing that required columns must be included in the filter/request.
- Key behavior:
  - Throws PrestoException(CLP_MANDATORY_COLUMN_NOT_IN_FILTER, "required filters must be specified") when mandatory metadata/filter columns are not present in the input filter table scan.
  - This enforcement ensures that queries cannot proceed without pushing down the required metadata-filter columns.
- Related snippet for context (non-code excerpt):
  - getRequiredColumns(...) is invoked to determine the set of columns deemed mandatory for filtering, from ClpSplitMetadataConfig, which aggregates global/schema/table-level rules.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpSplitMetadataConfig.java
- Summary: getRequiredColumns(SchemaTableName name)
- Description: Returns the set of metadata columns that are marked as required for filtering for the given table. This set is used by the enforcement logic in ClpComputePushDown to ensure mandatory filters are supplied.
- Parameters:
- name: SchemaTableName of the target table.
- Returns:
- Set of column names that are required for filtering.
- Exceptions:
- None described; it relies on the merged configuration built from global/schema/table rules.
- Important implementation details:
- Walks the merged TableConfig (including hierarchical global/schema/table configs) and collects rule.column for rules where rule.required is true.
- Used by CLP push-down logic to decide whether to error if required columns are not part of the filter.
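The collection step described above ("collects rule.column for rules where rule.required is true") can be sketched as a simple stream pipeline. `Rule` here is a hypothetical stand-in for the connector's FilterRule:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RequiredColumnCollector {
    static final class Rule {
        final String column;
        final boolean required;
        Rule(String column, boolean required) { this.column = column; this.required = required; }
    }

    // Walk the merged rules and keep only the columns flagged as required,
    // preserving their declaration order.
    public static Set<String> requiredColumns(List<Rule> mergedRules) {
        return mergedRules.stream()
                .filter(r -> r.required)
                .map(r -> r.column)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(new Rule("msg.timestamp", true), new Rule("level", false));
        System.out.println(requiredColumns(rules)); // [msg.timestamp]
    }
}
```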
🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/presto PR: 46
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java:105-105
Timestamp: 2025-07-30T15:27:04.862Z
Learning: In ClpPlanOptimizer, the local construction of tableScope in visitFilter() method is necessary and correct because join queries can have multiple TableScanNodes for different tables. Each FilterNode needs to determine its scope from its immediate source TableScanNode, not from a shared instance field that gets overwritten.
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.212Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
| "default": { | ||
| "metaColumns": { | ||
| "partition_date": { | ||
| "type": "DATE", | ||
| "exposedAs": "partition_date", | ||
| "description": "Logical partition of the data file" | ||
| } | ||
| }, | ||
| }, | ||
| "default.table_1": { |
Fix the JSON example’s dangling comma.
The example namespace object under "default" ends with a trailing comma after metaColumns, leaving no subsequent property. That’s invalid JSON and will break anyone who copies it verbatim. Drop that comma (or add the intended filterRules block) so the sample parses cleanly.
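For reference, a minimal corrected form of the "default" namespace object — with the trailing comma removed — parses as valid JSON (the filterRules block the docs may intend is omitted here rather than guessed at):

```json
{
  "default": {
    "metaColumns": {
      "partition_date": {
        "type": "DATE",
        "exposedAs": "partition_date",
        "description": "Logical partition of the data file"
      }
    }
  }
}
```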
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst around lines 235 to 244, the
JSON example for the "default" namespace contains a trailing comma after the
"metaColumns" object which results in invalid JSON; remove the dangling comma
(or alternatively add the intended "filterRules" property if that was omitted)
so the "default" object is valid JSON and the sample parses correctly.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
📚 Learning: 2025-11-08T02:05:14.212Z
Learnt from: wraymo
Repo: y-scope/presto PR: 89
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:835-884
Timestamp: 2025-11-08T02:05:14.212Z
Learning: In ClpFilterToKqlConverter's handleOr method, when an OR expression mixes metadata-only predicates with KQL-only predicates, dropping the metadata predicate is expected behavior. Metadata columns are used for archive selection (split filtering stage) and cannot be evaluated at the row level, so in mixed OR scenarios, only the KQL predicates are pushed down and all archives are scanned. This is a design trade-off, not a correctness issue.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
🧬 Code graph analysis (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: prestocpp-linux-build-for-test
- GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (4)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (4)
95-105: LGTM! Required column validation is correct. The method properly resets state before each transformation and validates that all required columns appear in the expression.
140-156: LGTM! Special form handling is correct. The method properly handles AND, OR, and IS_NULL with appropriate parenthesization and throws for unsupported forms.
158-180: LGTM! Constant formatting handles decimals and strings correctly. The SQL escaping and decimal handling (both short and long forms) are implemented correctly.
182-188: LGTM! Variable reference handling correctly tracks exposed names. The method properly tracks exposed names for required-column validation and returns original names for SQL output.
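As a concrete illustration of the short-decimal formatting noted above: Presto represents short decimals as unscaled longs, so rendering a SQL literal means reapplying the type's scale. The sketch below is a standalone example, not the connector's code; the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalLiteralSketch
{
    // Render a short decimal (unscaled long + scale) as a plain SQL literal.
    static String formatShortDecimal(long unscaledValue, int scale)
    {
        return new BigDecimal(BigInteger.valueOf(unscaledValue), scale).toPlainString();
    }

    public static void main(String[] args)
    {
        // A DECIMAL(5, 2) value of 123.45 is stored as the unscaled long 12345
        System.out.println(formatShortDecimal(12345, 2)); // 123.45
        System.out.println(formatShortDecimal(-7, 3));    // -0.007
    }
}
```

`toPlainString` is used instead of `toString` so that large or small values are never rendered in scientific notation, which SQL would reject.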
```java
public ClpMySqlSplitMetadataExpressionConverter(
        FunctionMetadataManager functionManager,
        StandardFunctionResolution functionResolution,
        Map<String, String> exposedToOriginal,
        Map<String, Map<String, String>> dataToMetadataBounds,
        Set<String> requiredColumns)
{
    this.functionManager = requireNonNull(functionManager, "functionManager is null");
    this.functionResolution = requireNonNull(functionResolution, "functionResolution is null");
    this.exposedToOriginal = exposedToOriginal;
    this.dataToMetadataBounds = dataToMetadataBounds;
    this.requiredColumns = requiredColumns;
}
```
🛠️ Refactor suggestion | 🟠 Major
Add null checks for remaining constructor parameters.
The constructor validates functionManager and functionResolution with requireNonNull, but the Map and Set parameters (exposedToOriginal, dataToMetadataBounds, requiredColumns) are not checked. If any are null, the code will fail with an NPE during traversal rather than at construction time.
Apply this diff to add the missing checks:
```diff
 public ClpMySqlSplitMetadataExpressionConverter(
         FunctionMetadataManager functionManager,
         StandardFunctionResolution functionResolution,
         Map<String, String> exposedToOriginal,
         Map<String, Map<String, String>> dataToMetadataBounds,
         Set<String> requiredColumns)
 {
     this.functionManager = requireNonNull(functionManager, "functionManager is null");
     this.functionResolution = requireNonNull(functionResolution, "functionResolution is null");
-    this.exposedToOriginal = exposedToOriginal;
-    this.dataToMetadataBounds = dataToMetadataBounds;
-    this.requiredColumns = requiredColumns;
+    this.exposedToOriginal = requireNonNull(exposedToOriginal, "exposedToOriginal is null");
+    this.dataToMetadataBounds = requireNonNull(dataToMetadataBounds, "dataToMetadataBounds is null");
+    this.requiredColumns = requireNonNull(requiredColumns, "requiredColumns is null");
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
public ClpMySqlSplitMetadataExpressionConverter(
        FunctionMetadataManager functionManager,
        StandardFunctionResolution functionResolution,
        Map<String, String> exposedToOriginal,
        Map<String, Map<String, String>> dataToMetadataBounds,
        Set<String> requiredColumns)
{
    this.functionManager = requireNonNull(functionManager, "functionManager is null");
    this.functionResolution = requireNonNull(functionResolution, "functionResolution is null");
    this.exposedToOriginal = requireNonNull(exposedToOriginal, "exposedToOriginal is null");
    this.dataToMetadataBounds = requireNonNull(dataToMetadataBounds, "dataToMetadataBounds is null");
    this.requiredColumns = requireNonNull(requiredColumns, "requiredColumns is null");
}
```
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
around lines 71 to 83, the constructor currently null-checks functionManager and
functionResolution but not exposedToOriginal, dataToMetadataBounds, and
requiredColumns; add requireNonNull checks for each of those three parameters
(e.g., this.exposedToOriginal = requireNonNull(exposedToOriginal,
"exposedToOriginal is null"); and similarly for dataToMetadataBounds and
requiredColumns) so that any null is detected at construction time and clear
error messages are provided.
```java
if (operatorType == OperatorType.NEGATION) {
    String value = node.getArguments().get(0).accept(this, null);
    return "-" + value;
}
```
🧹 Nitpick | 🔵 Trivial
Consider adding parentheses around negation operand for robustness.
While the NEGATION operator typically applies to simple numeric values or column references, wrapping the operand in parentheses would ensure correctness if it ever receives a complex expression.
Apply this diff for defensive coding:
```diff
 if (operatorType == OperatorType.NEGATION) {
     String value = node.getArguments().get(0).accept(this, null);
-    return "-" + value;
+    return "-(" + value + ")";
 }
```
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
around lines 119 to 122, the NEGATION branch returns "-" + value without
parentheses which can produce incorrect SQL for complex operands; wrap the
operand in parentheses by returning "-(" + value + ")" so the negation always
applies to the entire expression.
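To see why the parentheses matter, consider an operand that is itself rendered as a compound SQL expression. The standalone sketch below (hypothetical helper names, not the converter's actual methods) contrasts the two renderings.

```java
public class NegationParens
{
    // Naive rendering: prepend "-" to whatever SQL the operand produced.
    static String naiveNegate(String operandSql)
    {
        return "-" + operandSql;
    }

    // Defensive rendering: parenthesize so negation binds to the whole operand.
    static String safeNegate(String operandSql)
    {
        return "-(" + operandSql + ")";
    }

    public static void main(String[] args)
    {
        String operand = "a - b";
        // "-a - b" parses as (-a) - b, which is not equal to -(a - b)
        System.out.println(naiveNegate(operand)); // -a - b
        System.out.println(safeNegate(operand));  // -(a - b)
    }
}
```

For a simple column reference or numeric literal both renderings are equivalent, which is why the bug is easy to miss; it only surfaces once a compound operand reaches the NEGATION branch.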
```java
private String rewriteComparisonWithBounds(String variableName, OperatorType operator, String literal)
{
    String original = exposedToOriginal.getOrDefault(variableName, variableName);
    Map<String, String> bounds = dataToMetadataBounds.get(original);
    if (bounds == null) {
        return null;
    }
```
🛠️ Refactor suggestion | 🟠 Major
Remove redundant mapping—variableName is already the original name.
Line 209 attempts to map variableName from exposed to original, but variableName is already the original name (returned by visitVariableReference at line 187). Since exposedToOriginal maps exposed→original, this lookup will fail and fall back to the default, making line 209 a no-op. This is confusing and could break if the calling logic changes.
Apply this diff to clarify the code:
```diff
 private String rewriteComparisonWithBounds(String variableName, OperatorType operator, String literal)
 {
-    String original = exposedToOriginal.getOrDefault(variableName, variableName);
-    Map<String, String> bounds = dataToMetadataBounds.get(original);
+    // variableName is already the original name (from visitVariableReference)
+    Map<String, String> bounds = dataToMetadataBounds.get(variableName);
     if (bounds == null) {
         return null;
     }
```
Alternatively, rename the parameter to originalName for clarity:
```diff
-private String rewriteComparisonWithBounds(String variableName, OperatorType operator, String literal)
+private String rewriteComparisonWithBounds(String originalName, OperatorType operator, String literal)
 {
-    String original = exposedToOriginal.getOrDefault(variableName, variableName);
-    Map<String, String> bounds = dataToMetadataBounds.get(original);
+    Map<String, String> bounds = dataToMetadataBounds.get(originalName);
```
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
around lines 207 to 214, remove the redundant exposedToOriginal lookup for
variableName (it is already the original name from visitVariableReference) —
either replace the lookup with direct use of variableName when fetching bounds
from dataToMetadataBounds, or rename the method parameter to originalName and
use that variable directly; update any references and javadoc/comments
accordingly to avoid confusion.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java (1)
presto-spi/src/main/java/com/facebook/presto/spi/PrestoException.java (1)
PrestoException(18-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-resource-manager)
- GitHub Check: test (17.0.13, :presto-main)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-plan-determinism)
- GitHub Check: test (17.0.13, :presto-main-base)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-local-queries)
- GitHub Check: test (8.0.442, :presto-main)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-plan-determinism)
- GitHub Check: test (17.0.13, :presto-tests -P presto-tests-general)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
- GitHub Check: test (17.0.13, :presto-tests -P ci-only-resource-manager)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-non-hash-gen)
- GitHub Check: test (8.0.442, :presto-main-base)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-tpch-distributed-queries)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-aggregation-queries)
- GitHub Check: test (8.0.442, :presto-tests -P presto-tests-general)
- GitHub Check: test (:presto-docs)
- GitHub Check: maven-checks (17.0.13)
- GitHub Check: maven-checks (8.0.442)
- GitHub Check: prestocpp-linux-build-for-test
- GitHub Check: prestissimo-worker-images-build
```java
if (type instanceof DecimalType && value instanceof Long) {
    DecimalType decimalType = (DecimalType) type;
    BigDecimal decimalValue = new BigDecimal(BigInteger.valueOf((Long) value), decimalType.getScale());
    return decimalValue.toPlainString();
}

return value.toString();
```
Fix literal formatting for DATE/TIMESTAMP types.
visitConstant currently falls back to value.toString() for every non-Slice value. For DATE, TIME, and TIMESTAMP RowExpressions, Presto encodes the literal as a long (days or micros since the epoch). With the current logic, a predicate like metadata_date >= DATE '2024-01-01' is emitted as metadata_date >= 19723, which MySQL interprets numerically and yields the wrong archives. This breaks metadata pushdown for every non-decimal temporal column.
Please materialize proper SQL literals before returning. For example, convert DateType longs via LocalDate.ofEpochDay, and format TimestampType/TimeType values using an appropriate precision-aware helper (e.g., Instant/LocalDateTime plus DateTimeFormatter), quoting the result so MySQL receives ISO strings.
```diff
 if (value instanceof Slice) {
     ...
 }
+
+if (type instanceof DateType && value instanceof Long) {
+    LocalDate date = LocalDate.ofEpochDay((Long) value);
+    return format("DATE '%s'", date);
+}
+if (type instanceof TimestampType && value instanceof Long) {
+    long micros = (Long) value;
+    LocalDateTime dateTime = LocalDateTime.ofEpochSecond(
+            floorDiv(micros, MICROSECONDS_PER_SECOND),
+            (int) floorMod(micros, MICROSECONDS_PER_SECOND) * NANOSECONDS_PER_MICROSECOND,
+            ZoneOffset.UTC);
+    return format("TIMESTAMP '%s'", dateTime);
+}
+// add similar handling for TIME / TIMESTAMP WITH TIME ZONE as needed
+
 return value.toString();
```
Make sure to add the necessary java.time imports and constants when implementing this. Until these conversions are in place, temporal metadata filters will continue to misfire.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/split/ClpMySqlSplitMetadataExpressionConverter.java
around lines 172 to 179, the code falls back to value.toString() for non-Slice
literals which causes DATE/TIME/TIMESTAMP longs to be emitted as raw numbers;
update visitConstant to detect DateType, TimeType and TimestampType and
materialize SQL literals: for DateType convert the long (days) with
LocalDate.ofEpochDay(...) and format as an ISO date string quoted for SQL; for
TimestampType and TimeType treat the long as microseconds (or the Presto-encoded
unit), convert to Instant/LocalTime/LocalDateTime using
Instant.ofEpochSecond(micros/1_000_000, (micros%1_000_000)*1000) or appropriate
epoch/micro->nano math, format with a precision-aware DateTimeFormatter, and
return the quoted ISO string; keep existing Decimal handling intact; add
necessary java.time and java.time.format imports and any constants (e.g.,
MICROS_PER_SECOND, NANOS_PER_MICRO) used for conversions.
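For reference, the epoch-encoded longs can be materialized into SQL literals along these lines. This is a standalone sketch under the assumption that DATE is encoded as epoch days and TIMESTAMP as epoch microseconds; the class and helper names are hypothetical, and real code would branch on the Presto Type of the constant.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TemporalLiteralSketch
{
    private static final long MICROSECONDS_PER_SECOND = 1_000_000L;
    private static final int NANOSECONDS_PER_MICROSECOND = 1_000;
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");

    // DATE literal: the long is days since the epoch.
    static String formatDate(long epochDays)
    {
        return String.format("DATE '%s'", LocalDate.ofEpochDay(epochDays));
    }

    // TIMESTAMP literal: the long is assumed to be microseconds since the epoch.
    static String formatTimestamp(long epochMicros)
    {
        LocalDateTime dateTime = LocalDateTime.ofEpochSecond(
                Math.floorDiv(epochMicros, MICROSECONDS_PER_SECOND),
                (int) Math.floorMod(epochMicros, MICROSECONDS_PER_SECOND) * NANOSECONDS_PER_MICROSECOND,
                ZoneOffset.UTC);
        return String.format("TIMESTAMP '%s'", TIMESTAMP_FORMAT.format(dateTime));
    }

    public static void main(String[] args)
    {
        System.out.println(formatDate(19723));  // DATE '2024-01-01'
        System.out.println(formatTimestamp(0)); // TIMESTAMP '1970-01-01 00:00:00.000000'
    }
}
```

Math.floorDiv/floorMod are used rather than plain division so that negative microsecond values (timestamps before 1970) round toward the correct second.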
Description
Background
To support metadata filtering, we maintain a split metadata configuration file that specifies available metadata columns and the possible mappings between data columns and metadata columns. However, the current design has several major issues:
- The columnName field is inconsistent. It is intended to represent a metadata column name, but when a range mapping is included, it actually refers to a data column name, which is confusing.
To address these issues and provide more robust support for metadata column projection, we redesigned the configuration format and reimplemented metadata expression extraction and validation using Presto RowExpression instead of plain strings. Validation and SQL translation are now handled in the SplitProvider, since split filtering logic is inherently tied to the split provider rather than the plan optimizer.
Implementation
This PR includes:
New metadata config format
- Documented in presto-docs/src/main/sphinx/connector/clp.rst.
- Added ClpSplitMetadataConfig.
- Added ClpMySqlSplitMetadataExpressionConverter, which translates RowExpression trees to SQL filters for metadata pushdown.
Refactored ClpFilterToKqlConverter
- Produces a RowExpression for metadata pushdown instead of strings.
- Handles AND and OR.
- Supports VARCHAR metadata filters.
Updated ClpTableLayoutHandle and ClpExpression
- Both now carry a RowExpression instead of strings.
Removed old metadata-related classes
Added split metadata columns in ClpMySqlMetadataProvider::listColumnHandles.
Removed ClpMetadataDbSetup.
Integrated with ClpMySqlSplitProvider listSplits.
Updated tests and documentation
Checklist
breaking change.
Validation performed
And here are the query results.
In this setup, we have two archives containing identical data except for the begin_timestamp metadata column. From the results of the first two queries, we can see that metadata filters are correctly extracted and applied during split filtering. In the third query, the system successfully verifies whether a required metadata column is present in the filter.
Summary by CodeRabbit
New Features
Bug Fixes / API
Tests
Documentation
Chores