Add Iceberg samples #5

Closed
wants to merge 3 commits into from

ZacBlanco
Contributor

No description provided.

@ZacBlanco
Contributor Author

From the original thread from @aaneja:

> Some questions I had:
> What would be the schema for the $samples table? Would there be any extra columns besides the ones from the underlying table?

The schema should map exactly to the real table's schema. Any update to the real table's schema would also be applied to the sample table.

> Would we store some metadata for the $samples table that could be used to identify:
> - What table sampling method was used to seed the sample
> - When the sample was last updated w.r.t. the changelog

This was not mentioned in the RFC, but our prototype implementation included this info. We stored it in the Iceberg table properties. I will add this to the RFC.


#### Sample Maintenance

Sample maintenance can be performed through a set of SQL statements which replace records in the
Contributor

Can you please add an example for this?

Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been
updated without needing to re-sample the entire table.

Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure
Contributor

I think we can do better than this comment. We can say that, with the help of an orchestration tool, sample maintenance using Presto or Spark is as simple as running the following queries on a fixed cadence, chosen according to how up to date users want their samples to be. Right?

Contributor

@tdcmeehan tdcmeehan Jun 28, 2024

Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples? CC: @hantangwangd

Member

> Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples?

Sure, in terms of feasibility, I think we can do this through a customized distributed procedure.

@ZacBlanco ZacBlanco marked this pull request as draft June 28, 2024 19:34
Comment on lines +65 to +67
INSERT INTO "{schema}.{table}$samples"
SELECT reservoir_sample(...)
FROM {schema}.{table};
Member

Can reservoir_sample(...) be used directly here? As described in PR #21296, reservoir_sample(...) outputs a single row type with two columns.
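
For readers unfamiliar with the technique behind the name, here is a minimal Python sketch of classic one-pass reservoir sampling (Algorithm R). It is not the Presto `reservoir_sample(...)` function from PR #21296, and it makes no claim about that function's signature or its two-column output; the helper name `reservoir_sample_rows` and its parameters are hypothetical and used only for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample_rows(
    rows: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of at most k rows in a single pass.

    Hypothetical helper for illustration; not the Presto aggregate.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, row in enumerate(rows, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k rows are always kept.
            reservoir.append(row)
        else:
            # Replace phase: row number `seen` survives with probability k / seen.
            slot = rng.randrange(seen)
            if slot < k:
                reservoir[slot] = row
    return reservoir
```

Because the reservoir holds at most `k` rows no matter how many rows are scanned, the result is a fixed-size collection rather than one output row per input row, which is the kind of shape mismatch the question above is probing.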

Comment on lines +162 to +165
for sample maintenance is available already once the samples are generated. The biggest hurdle for
users will be that sample maintenance will need to be done manually. We can document the maintenance
process for samples, but there currently isn't a way to schedule or automatically run the correct
set of maintenance queries for the sample tables.
Member

Besides incremental sample maintenance, should we also describe the mechanism for refreshing the sample table: how to refresh it, and what granularity of refresh is supported?

In addition, should we maintain the snapshot version correspondence between the sample table and the source table? Since we analyze the sample table while querying the source table, we need a way to find the nearest statistics on the sample table for a specified snapshot version of the source table.
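
One way to picture the correspondence being asked about is a small lookup from source-table commits to sample-table snapshots. The Python sketch below is only an illustration under stated assumptions: the RFC does not define such a mapping, the function name `nearest_sample_snapshot` and the `(source_commit_ms, sample_snapshot_id)` pair layout are hypothetical, and the lookup keys on commit timestamps on the assumption that snapshot IDs themselves carry no ordering.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def nearest_sample_snapshot(
    sample_refreshes: List[Tuple[int, int]], source_commit_ms: int
) -> Optional[int]:
    """Pick the sample snapshot built from the latest source state at or before
    the given source commit time.

    sample_refreshes holds hypothetical (source_commit_ms, sample_snapshot_id)
    pairs, sorted by source_commit_ms in ascending order.
    """
    commit_times = [commit_ms for commit_ms, _ in sample_refreshes]
    idx = bisect_right(commit_times, source_commit_ms)
    if idx == 0:
        # No sample existed yet for a source snapshot this old.
        return None
    return sample_refreshes[idx - 1][1]
```

Returning `None` for source snapshots older than the first refresh leaves it to the caller to decide whether to fall back to stale statistics or to none at all.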

Comment on lines +157 to +159
sample with updates, deletes, or inserts that occur on the real table. With the introduction of
Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been
updated without needing to re-sample the entire table.
Member

The Iceberg changelog table currently does not support V2 row-level deletes. That seems to be a big problem for incremental maintenance.
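
To make concrete why delete records matter here, the Python sketch below applies a stream of change records to an in-memory sample without rescanning the source. Everything in it is an assumption for illustration: the `(operation, key, row)` record shape, the `"INSERT"` and `"DELETE"` operation names, and the `IncrementalSample` class are hypothetical and do not describe the actual changelog schema or the prototype's logic. It is also deliberately simplified: after deletes shrink the sample, later inserts refill it unconditionally, which is not strictly uniform.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, Optional, Tuple

Change = Tuple[str, Hashable, Optional[dict]]  # (operation, key, row), hypothetical


@dataclass
class IncrementalSample:
    capacity: int
    rows: Dict[Hashable, dict] = field(default_factory=dict)
    population: int = 0  # rows believed to be in the source table
    rng: random.Random = field(default_factory=random.Random)

    def apply(self, changes: Iterable[Change]) -> None:
        for operation, key, row in changes:
            if operation == "INSERT":
                self._insert(key, row or {})
            elif operation == "DELETE":
                self._delete(key)
            else:
                raise ValueError(f"unsupported operation: {operation}")

    def _insert(self, key: Hashable, row: dict) -> None:
        self.population += 1
        if len(self.rows) < self.capacity:
            # Room left (initial fill, or a gap left by deletes): keep the row.
            self.rows[key] = row
        elif self.rng.random() < self.capacity / self.population:
            # Reservoir step: evict a random sampled row to make room.
            victim = self.rng.choice(list(self.rows))
            del self.rows[victim]
            self.rows[key] = row

    def _delete(self, key: Hashable) -> None:
        self.population = max(0, self.population - 1)
        # Without delete records, this stale row would stay in the sample.
        self.rows.pop(key, None)
```

If delete records never arrive, `_delete` is never called, so rows removed from the source linger in the sample and `population` drifts upward.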

@aditi-pandit aditi-pandit left a comment

Thanks @ZacBlanco for this doc. Gives a good picture of the work being done.

logic adjusted accordingly to utilize the iceberg table stored at the sample path, rather than the
true table.

One benefit of creating the sample tables as an iceberg table is that we can re-use the

Do you also use the file format of the original table, or have you decided on something else? It would be good to clarify that as well.


Currently, iceberg does not fully support partition-level statistics[^3]. Once partition
statistics are officially released, the iceberg connector should be updated to support collecting
and reading the partition-level stats. As long as the sample table creation code ensures the sample

Can you give a simple explanation of what the partition-level statistics are and why they are not affected by sample tables?

Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure
for sample maintenance is available already once the samples are generated. The biggest hurdle for
users will be that sample maintenance will need to be done manually. We can document the maintenance
process for samples, but there currently isn't a way to schedule or automatically run the correct

Typically, maintenance happens using an ETL tool or Kafka pipelines. We could investigate how to integrate with these. WxD has these options: https://www.ibm.com/topics/data-pipeline


### Table-Level Storage in Puffin files

One alternative approach is to generate the samples and store them in Puffin files[^puffin_files].

It's better to write a technical doc in a third-person tone.

@ZacBlanco ZacBlanco closed this Oct 8, 2024

4 participants