You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
【Iceberg equality scan】fix duplicate key issue when querying metadata column "$data_sequence_number" from an iceberg table within equality deletes
#24629
Open
Dream-hu opened this issue
Feb 26, 2025
· 3 comments
When update the new TableScanNode, check whether need to add the extra metadata column to the assignments and outputs.
Steps to Reproduce
Create an iceberg table with equality deletes by flink cdc
Configure the IcebergQueryRunner, the properties are:
Running the IcebergQueryRunner and query with the sql:
select "$data_sequence_number", * from tbl_cdc_with_equality;
or
select "$data_sequence_number" from tbl_cdc_with_equality;
Query failed, and receive the error
Screenshots (if appropriate)
Context
The text was updated successfully, but these errors were encountered:
Thank you for reporting this. We will look into it and try to get a fix in before the next release. I will try to reproduce this myself, but if you are able to provide instructions on how to reproduce it without using Flink it would be helpful. Specifically a CREATE TABLE statement and some sequence of queries which insert data from the tpch or tpcds connectors or even just using VALUES statements in order to get the table in a state where this error appears will help speed up the process
However, Flink is the primary tool that supports writing Iceberg equality deletes Currently.
Flink provides two methods to handle data equality in Iceberg tables: upsert and CDC (Change Data Capture). Upsert is the simpler approach. Here's an example of creating and using an Iceberg table with upsert capability.
It`s easier to install flink and write iceberg table on local env then an object store like aws s3.
-- Set up database and table
USE CATALOG hadoop_catalog;
CREATE DATABASE iceberg_flink_db;
USE iceberg_flink_db;
CREATE TABLE hadoop_catalog.iceberg_flink_db.sample ( id INT COMMENT 'unique id', data STRING NOT NULL,
PRIMARY KEY(id) NOT ENFORCED
) WITH (
'format-version' = '2',
'write.upsert.enabled' = 'true'
);
-- Insert test data
INSERT INTO hadoop_catalog.iceberg_flink_db.sample VALUES (1, 'a');
INSERT INTO hadoop_catalog.iceberg_flink_db.sample VALUES
(1, 'b'), (2, 'a'), (3, 'a'), (4, 'a'),
(5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a');
`
Then, Querying select "$data_sequence_number", * from iceberg_flink_db.sample by presto will fail as above.
And even though I have set up Flink and written data to S3, please note that sharing metadata and data files alone here isn't sufficient for direct use.
Your Environment
Expected Behavior
Select the hidden column "$data_sequence_number" successfully
Current Behavior
Query failed, error messages as follow:
Query 20250226_023830_00016_bkm44 failed: Duplicate key $data_sequence_number
Possible Solution
When update the new TableScanNode, check whether need to add the extra metadata column to the assignments and outputs.
Steps to Reproduce
select "$data_sequence_number", * from tbl_cdc_with_equality;
or
select "$data_sequence_number" from tbl_cdc_with_equality;
Screenshots (if appropriate)
Context
The text was updated successfully, but these errors were encountered: