
Incompatible schema changes break file skipping #712

Open
scovich opened this issue Feb 21, 2025 · 2 comments
Labels: bug (Something isn't working)

scovich commented Feb 21, 2025

Describe the bug

Suppose a query includes a skipping-eligible predicate over LONG column c.

Then we expect the `add.stats` column to include min/max stats that can parse as LONG.

However, it is possible that a recent table replace operation changed the schema of c -- previously it was a STRING column, which is incompatible with c's new LONG type. In that case, every file that had been in the table will be canceled by a remove (table replacement always truncates the original table), ensuring that no incompatible file actions survive log replay.

Unfortunately, kernel currently attempts to parse the entire `add.stats` column before deduplicating (in order to avoid tracking pruned files), and is thus exposed to parsing failures for rows that contain canceled add actions (e.g. `add.stats.minValues.c = 'A'` cannot parse as LONG).

This issue has a second aspect: Data skipping doesn't track or exclude partition columns directly. So we attempt data skipping over c, with the same risk of parsing failures, even if it's now a partition column. Fixing the general problem would make this harmless, but it's probably worth specifically tracking and excluding partition columns from the data skipping machinery so we don't waste time trying to parse (usually non-existent) stats and evaluating (provably useless) data skipping expressions for partition columns.

NOTE: Ideally, this issue should not arise if column mapping is enabled, because the physical names of the new columns should differ from the originals even if their logical names still seem to match.
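
For concreteness, the offending log history looks roughly like this (paths and stats are hypothetical and abbreviated):

```
# commit 0: c is a STRING column
{"add":{"path":"part-0.parquet","stats":"{\"minValues\":{\"c\":\"A\"},\"maxValues\":{\"c\":\"Z\"}}", ...}}

# commit 1: table replace, c is now LONG
{"metaData":{ ... new schema with c: long ... }}
{"remove":{"path":"part-0.parquet", ...}}
{"add":{"path":"part-1.parquet","stats":"{\"minValues\":{\"c\":1},\"maxValues\":{\"c\":10}}", ...}}
```

During (newest-first) log replay, the remove from commit 1 marks part-0.parquet as seen and should cancel the matching add from commit 0, but today the old add's stats string reaches the JSON parser anyway.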

To Reproduce

Invoke `LogReplayScanner::process_scan_batch` twice -- once with a batch containing an incompatible remove (to mark the file as "seen"), and again with a batch containing a matching incompatible add. It will fail with an error like:

Arrow(JsonError("whilst decoding field 'minValues': whilst decoding field 'c': failed to parse \"A\" as Int64"))
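
In test form, the failing sequence is roughly the following sketch (the scanner construction and the `json_batch` helper are hypothetical stand-ins, and the real `process_scan_batch` signature may differ):

```rust
// Hypothetical reproduction sketch -- helper names and signatures are
// illustrative only, not the kernel's actual API.
let mut scanner = LogReplayScanner::new(/* schema with c: LONG, predicate over c */);

// Batch 1: the remove written by the table replace. Processing it marks
// the file as "seen", so a later add for the same file should be dropped.
let removes = json_batch(r#"{"remove":{"path":"part-0.parquet"}}"#);
scanner.process_scan_batch(removes)?; // ok: removes carry no stats

// Batch 2: the matching add from before the replace. Its stats were
// written while c was a STRING, so parsing them against the new LONG
// schema fails before deduplication gets a chance to discard the row.
let adds = json_batch(
    r#"{"add":{"path":"part-0.parquet","stats":"{\"minValues\":{\"c\":\"A\"}}"}}"#,
);
let err = scanner.process_scan_batch(adds).unwrap_err();
// Arrow(JsonError("... failed to parse \"A\" as Int64"))
```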

Expected behavior

The previously-seen remove should eliminate the add before it gets a chance to cause trouble.

Additional context

No response

scovich added the bug label Feb 21, 2025

scovich commented Feb 24, 2025

NOTE: The Java kernel avoids this issue by deduplicating file actions before attempting to parse `add.stats`, and because its JSON parser honors selection vectors and ignores unselected rows.

The Rust kernel's JSON parser also ignores null rows, but we don't currently (have a way to) update the null mask based on the deduplication the kernel performed. We'll need to figure out how to do that. Additionally, we would want to split the deduplication into "check" and "update" passes, so that we can do:

  1. Sanitize the rows of a batch (eliminate non-file action rows, eliminate previously seen files, etc.)
  2. Parse stats and partition values of surviving rows, apply further pruning
  3. Update the "seen" set only for files that survived pruning

That way, we get the best of both worlds: pruning minimizes the cardinality of the "seen" set, but the "seen" set can still protect pruning attempts from incompatible schema changes.
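
A sketch of that three-pass shape, with illustrative stand-in types (`FileAction`, `Dedup`, and a path-only seen-set key) in place of the kernel's real ones:

```rust
use std::collections::HashSet;

// Illustrative stand-ins; the real kernel keys its seen-set on
// (path, deletion vector id) and carries full action data.
struct FileAction {
    path: String,
    is_add: bool,
    stats_json: Option<String>,
}

struct Dedup {
    seen: HashSet<String>,
}

impl Dedup {
    fn process_batch(&mut self, batch: Vec<FileAction>) -> Vec<FileAction> {
        // Pass 1 ("check"): drop rows for files we already saw. No stats
        // have been parsed yet, so a canceled add with incompatible stats
        // is discarded before it can fail the query.
        let candidates: Vec<FileAction> = batch
            .into_iter()
            .filter(|a| !self.seen.contains(&a.path))
            .collect();

        // Pass 2 ("parse"): parse stats only for surviving adds and apply
        // data skipping; pruned rows simply drop out here.
        let survivors: Vec<FileAction> = candidates
            .into_iter()
            .filter(|a| !a.is_add || self.passes_skipping(a))
            .collect();

        // Pass 3 ("update"): record only the survivors, keeping the
        // seen-set's cardinality as small as pruning allows.
        for a in &survivors {
            self.seen.insert(a.path.clone());
        }
        survivors
    }

    fn passes_skipping(&self, _add: &FileAction) -> bool {
        true // stand-in for parsing `stats_json` + predicate evaluation
    }
}
```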

scovich added a commit that referenced this issue Mar 12, 2025
## What changes are proposed in this pull request?

Add basic support for partition pruning by combining two pieces of
existing infra:
1. The log replay row visitor already needs to parse partition values
and already filters out unwanted rows
2. The default predicate evaluator works directly with scalars

Result: partition pruning gets applied during log replay, just before
deduplication so we don't have to remember pruned files.
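
In miniature, the combination amounts to the following (illustrative types, not the kernel's real API):

```rust
use std::collections::HashMap;

// Illustrative scalar; the kernel's evaluator works over its own Scalar.
#[derive(Debug, PartialEq)]
enum Scalar {
    Long(i64),
    Str(String),
    Null,
}

// Toy predicate: partition_column = literal. The real evaluator handles
// full expression trees.
struct EqPredicate {
    column: String,
    literal: Scalar,
}

// Tri-state evaluation: Some(false) means provably false; None means
// indeterminate (e.g. a missing or null partition value).
fn evaluate(pred: &EqPredicate, values: &HashMap<String, Scalar>) -> Option<bool> {
    match values.get(&pred.column) {
        None | Some(Scalar::Null) => None,
        Some(v) => Some(v == &pred.literal),
    }
}

// Only a provably-false predicate may prune the file.
fn keep_file(pred: &EqPredicate, values: &HashMap<String, Scalar>) -> bool {
    evaluate(pred, values) != Some(false)
}
```

The important property is the tri-state semantics: an indeterminate result keeps the file, so pruning can never wrongly drop data.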

WARNING: The implementation currently has a flaw, in case the history
contains a table-replace that affected partition columns. For example,
changing a value column into a non-nullable partition column, or an
incompatible type change to a partition column. In such cases, the
partition values recorded in the remove actions generated by the table-replace
operation (for old files) would have the wrong type or even be entirely
absent. While the code can
handle an absent partition value, an incompatibly typed value would
cause a parsing error that fails the whole query. Note that stats-based
data skipping already has the same flaw, so we are not making the
problem worse. We will fix the problem for both as a follow-up item,
tracked by #712

NOTE: While this is a convenient way to achieve partition pruning in the
immediate term, Delta
[checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1)
can provide strongly-typed `stats_parsed` and `partitionValues_parsed`
columns, which would require a completely different access path.
* For `stats` vs. `stats_parsed`, the likely solution is simple enough
because we already json-parse `stats` into a strongly-typed nested
struct in order to evaluate the data skipping predicate over its record
batch. We just avoid the parsing overhead if `stats_parsed` is already
available.
* The `partitionValues` field poses a bigger challenge, because it's a
string-string map, not a JSON literal. In order to turn it into a
strongly-typed nested struct, we would need a SQL expression that can
extract the string values and try-cast them to the desired types. That's
ugly enough we might prefer to keep completely different code paths for
parsed vs. string partition values, but then there's a risk that
partition pruning behavior changes depending on which path got invoked.
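
For illustration only, the try-cast over raw string values could look like this sketch (types are hypothetical; whether to express it as a SQL expression or keep a dedicated code path is exactly the open question above):

```rust
// Illustrative try-cast from a raw partitionValues string to a typed
// scalar. A failed parse yields None rather than failing the query --
// the forgiving behavior an incompatible table-replace history needs.
enum DataType {
    Long,
    Str,
}

enum Scalar {
    Long(i64),
    Str(String),
}

fn try_cast(raw: &str, to: &DataType) -> Option<Scalar> {
    match to {
        DataType::Long => raw.parse::<i64>().ok().map(Scalar::Long),
        DataType::Str => Some(Scalar::Str(raw.to_string())),
    }
}
```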

## How was this change tested?

New unit tests, and adjusted one unit test that assumed no partition
pruning.

scovich commented Mar 13, 2025

The same issue affects partition pruning, and the same check-parse-update approach should fix it.
