Add Iceberg samples #5
Conversation
From the original thread from @aaneja:
The schema should exactly map to the real table. Any updates to the real table's schema would also happen on the sample table.
This was not mentioned in the RFC, but our prototype implementation included this info. We stored it in the Iceberg table properties. I will add this to the RFC.
#### Sample Maintenance
Sample maintenance can be performed through a set of SQL statements which replace records in the
Can you please add an example for this?
Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been updated without needing to re-sample the entire table.
Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure
I think we can do better than this comment. We can say that, with the help of an orchestration tool, sample maintenance using Presto or Spark is as simple as running the following queries on a fixed cadence, according to the desired sample freshness, right?
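As a rough sketch, such a scheduled maintenance cycle might look like the following (the `$samples` suffix comes from the RFC; the table names, the `ANALYZE` step, and the overall flow are assumptions for illustration, not the final design):

```sql
-- Hypothetical maintenance cycle, run on a fixed cadence by an
-- orchestration tool. All object names here are illustrative only.

-- 1. Discard the stale sample.
DELETE FROM "my_schema.my_table$samples";

-- 2. Rebuild the sample from the current state of the real table.
INSERT INTO "my_schema.my_table$samples"
SELECT reservoir_sample(...)
FROM my_schema.my_table;

-- 3. Refresh any statistics derived from the sample.
ANALYZE "my_schema.my_table$samples";
```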
Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples? CC: @hantangwangd
> Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples?
Sure, in terms of feasibility, I think we can do this through a customized distributed procedure.
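As a sketch, such a procedure might be invoked like this (the procedure name, catalog, and arguments are hypothetical, pending the design in #12):

```sql
-- Hypothetical distributed procedure; name and signature are illustrative.
CALL iceberg.system.update_table_samples(
    schema_name => 'my_schema',
    table_name  => 'my_table'
);
```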
```sql
INSERT INTO "{schema}.{table}$samples"
SELECT reservoir_sample(...)
FROM {schema}.{table};
```
Can reservoir_sample(...) be used directly here? As described in PR #21296, reservoir_sample(...) will output a single row type with two columns.
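If so, the result may need to be expanded before insertion. A hypothetical sketch, assuming the row type carries an array of sampled rows in a field here called `samples` (the field and column names are guesses for illustration, not the actual output of #21296):

```sql
-- reservoir_sample(...) is assumed to return a single row value whose
-- fields include an array of sampled rows; the array would need to be
-- unnested back into individual rows before inserting.
INSERT INTO "my_schema.my_table$samples"
SELECT u.*
FROM (
    SELECT reservoir_sample(...) AS rs
    FROM my_schema.my_table
) t
CROSS JOIN UNNEST(t.rs.samples) AS u (col1, col2);
```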
for sample maintenance is available already once the samples are generated. The biggest hurdle for users will be that sample maintenance will need to be done manually. We can document the maintenance process for samples, but there currently isn't a way to schedule or automatically run the correct set of maintenance queries for the sample tables.
Besides incremental sample maintenance, should we also describe the mechanism for refreshing the sample table: how to refresh it, and what granularity of refresh is supported?
In addition, should we maintain the snapshot-version correspondence between the sample table and the source table? Since we analyze the sample table while querying the source table, we need a way to find the nearest statistics on the sample table for a specified snapshot version of the source table.
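For context, Iceberg already exposes a table's snapshot history through the `$snapshots` metadata table, which could serve as the anchor for such a correspondence; how sample refreshes would be tagged with source snapshot IDs is the open question. A sketch of listing the candidate snapshots:

```sql
-- List the source table's snapshot history (real Iceberg metadata table);
-- a sample-to-snapshot mapping could be built against these IDs.
SELECT snapshot_id, committed_at
FROM "my_schema.my_table$snapshots"
ORDER BY committed_at DESC;
```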
sample with updates, deletes, or inserts that occur on the real table. With the introduction of Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been updated without needing to re-sample the entire table.
The Iceberg changelog table currently does not support V2 row-level deletion. That seems to be a big problem for incremental maintenance.
Thanks @ZacBlanco for this doc. Gives a good picture of the work being done.
logic adjusted accordingly to utilize the iceberg table stored at the sample path, rather than the true table.

One benefit of creating the sample tables as an iceberg table is that we can re-use the
Do you also use the file format of the original table, or have you decided on something else? Would be good to clarify that as well.
Currently, Iceberg does not fully support partition-level statistics[^3]. Once partition statistics are officially released, the Iceberg connector should be updated to support collecting and reading the partition-level stats. As long as the sample table creation code ensures the sample
Can you give a simple explanation of what partition-level statistics are and why they are not affected by sample tables?
Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure for sample maintenance is available already once the samples are generated. The biggest hurdle for users will be that sample maintenance will need to be done manually. We can document the maintenance process for samples, but there currently isn't a way to schedule or automatically run the correct
Typically, maintenance happens using an ETL tool or Kafka pipelines. We could investigate how to integrate with these. WxD has these options: https://www.ibm.com/topics/data-pipeline
### Table-Level Storage in Puffin files

One alternative approach is to generate the samples and store them in Puffin files[^puffin_files].
It's better to write a technical doc in a third-person tone.