Add ability to estimate file size lazily #51

Merged (1 commit) on Sep 23, 2024

sauliusvl
Contributor

Currently the RecordBatchingSinker checks whether a batch should be finalized after processing every record. To do that, it passes the current record count and the current file size to a given FileCommitStrategy. The problem is that estimating the file size for formats like Parquet can be very expensive: we observe ~15% of total time spent there. Moreover, for Parquet the file size is only an estimate, so computing it this frequently is pointless.

We therefore change the FileCommitStrategy interface to accept the file size lazily, i.e. if the strategy never requests it, it is never calculated. In addition, we give FuzzyReachedAnyOf a configurable fileSizeSamplingBatchSize, i.e. the number of records processed between file size checks (e.g. one check every 1000 records). By default it checks after every record, so the default behavior remains the same.
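To illustrate the idea, here is a minimal, language-neutral sketch in Python. It is not the project's actual Scala API: `SampledReachedAnyOf`, `should_commit`, `write_batch`, and the fixed 100-byte record size are hypothetical names and values chosen for the example. The strategy receives the file size as a callable and only invokes it once every `file_size_sampling_batch_size` records.

```python
from typing import Callable, Tuple


class SampledReachedAnyOf:
    """Hypothetical commit strategy: commit when either limit is reached.

    The file size is passed as a zero-argument callable, so the (possibly
    expensive) estimate is only computed when the strategy asks for it.
    """

    def __init__(self, max_records: int, max_file_size: int,
                 file_size_sampling_batch_size: int = 1):
        self.max_records = max_records
        self.max_file_size = max_file_size
        # 1 means "check the size after every record" (the old behavior).
        self.sampling = file_size_sampling_batch_size

    def should_commit(self, records_written: int,
                      estimate_file_size: Callable[[], int]) -> bool:
        # The record count is cheap, so it is checked on every call.
        if records_written >= self.max_records:
            return True
        # Skip the size estimate unless this record falls on a sampling point.
        if records_written % self.sampling != 0:
            return False
        return estimate_file_size() >= self.max_file_size


def write_batch(strategy: SampledReachedAnyOf,
                record_size: int = 100) -> Tuple[int, int]:
    """Simulate a sinker loop; return (records written, size estimates made)."""
    written = 0
    estimate_calls = 0

    def estimate_file_size() -> int:
        # Stand-in for an expensive estimate; here it only counts invocations.
        nonlocal estimate_calls
        estimate_calls += 1
        return written * record_size

    while True:
        written += 1
        if strategy.should_commit(written, estimate_file_size):
            return written, estimate_calls


if __name__ == "__main__":
    print(write_batch(SampledReachedAnyOf(10**9, 250_000)))
    print(write_batch(SampledReachedAnyOf(10**9, 250_000, 1000)))
```

In this sketch, the default setting makes one size estimate per record, while a sampling batch size of 1000 makes one estimate per 1000 records. The trade-off is that the batch can overshoot the size limit by up to one sampling interval, which matters little when the size is only an estimate anyway.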

@shivam247 merged commit 85d30bd into adform:master on Sep 23, 2024
1 check passed
@sauliusvl deleted the lazy-file-size-check branch on September 23, 2024 13:55