[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation #4045
Comments
@nicklan or @scottsand-db - FYI this is a pretty significant bug. I'd be happy to help if I could be given a pointer or two on what the root cause might be or where to look. I know this bug does not exist in Databricks, so hopefully the fix is not too hard to identify. Thanks! Tagging the reviewers of the original PR as well:
Thanks for the bug report @mwc360! If something's wrong, it's probably in … We will work to get this triaged and have someone investigate ASAP.
Have you tried turning off the conf defined at delta/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSQLConf.scala, line 308 (commit 92a8a22)?
It seems like by default it will try to trigger AC whenever a partition is written to (for unpartitioned tables that would be every write). From my understanding, without this conf it will respect the minFiles count. It would be great if you can try turning this off (remember to prefix the conf name with spark.databricks.delta.) and let me know if it resolves your issue.
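For reference, a minimal sketch of turning that flag off at the session level, assuming the conf at that line is the modified-partitions-only setting (the key name is an assumption; verify it against DeltaSQLConf.scala):

```scala
// Assumed key: buildConf("autoCompact.modifiedPartitionsOnly.enabled") in DeltaSQLConf,
// which is prefixed with spark.databricks.delta. at runtime.
spark.conf.set("spark.databricks.delta.autoCompact.modifiedPartitionsOnly.enabled", "false")
```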
@rahulsmahadev turning that config off resulted in compaction not being triggered on the first write operation when there weren't sufficient small files (which is great, as I raised this as a feature request: #4043). However, it doesn't fix the behavior where both compacted and small files are counted towards the minNumFiles threshold, which results in compaction running with increasing frequency until it runs with every write operation.
Also - I confirmed it isn't a session issue. I ran additional writes from a different session and saw the exact same behavior: both compacted and uncompacted files are counted towards the minNumFiles threshold, but then only uncompacted files get compacted.
FYI @nicklan and @rahulsmahadev, I think I found the issue: AutoCompactUtils.choosePartitionsBasedOnMinNumSmallFiles is supposed to filter to only small files via AutoCompactPartitionStats.filterPartitionsWithSmallFiles. That function ultimately references these two functions, which don't have any filter on the size of the file: delta/spark/src/main/scala/org/apache/spark/sql/delta/stats/AutoCompactPartitionStats.scala, lines 77 to 80 (commit 4676bf4).
It seems like maxFileSize (and the postCommitSnapshot) needs to be passed in and then used to filter, via something like the below:

```scala
def hasSufficientSmallFilesOrHasNotBeenCompacted(minNumFiles: Long, maxFileSize: Long): Boolean =
  !wasAutoCompacted || hasSufficientFiles(minNumFiles, maxFileSize)

def hasSufficientFiles(minNumFiles: Long, maxFileSize: Long): Boolean = {
  // Only count files that auto compaction would still consider small,
  // i.e. at or below half of maxFileSize.
  val smallFilesCount = files.filter(_.size <= maxFileSize / 2).size
  smallFilesCount >= minNumFiles
}
```

I will propose a PR for fixing this.
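To make the intended behavior concrete, here is a self-contained sketch of the size-gated check above, using a hypothetical FileStat case class in place of the real per-file stats and the counts reported below (31 small files, 33 compacted files, minNumFiles = 50, maxFileSize assumed to be 128 MB):

```scala
// Hypothetical stand-in for the per-file stats tracked by AutoCompactPartitionStats.
case class FileStat(size: Long)

// Proposed check: only files at or below maxFileSize / 2 count toward minNumFiles.
def hasSufficientSmallFiles(files: Seq[FileStat], minNumFiles: Long, maxFileSize: Long): Boolean =
  files.count(_.size <= maxFileSize / 2) >= minNumFiles

val maxFileSize = 128L * 1024 * 1024                        // assumed default
val small       = Seq.fill(31)(FileStat(1L * 1024 * 1024))  // uncompacted files
val compacted   = Seq.fill(33)(FileStat(100L * 1024 * 1024))// already right-sized files

println(hasSufficientSmallFiles(small ++ compacted, 50, maxFileSize)) // false: only 31 small files
println((small ++ compacted).size >= 50)                              // true: the current (buggy) count
```

With the current logic the 64 total files exceed minNumFiles and compaction fires; with the size filter it would not.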
Bug
The logic for when auto compaction is triggered does not work as documented: already compacted files (files that are >= minFileSize, i.e. maxFileSize / 2) seem to be counted towards the minNumFiles threshold for compaction to be triggered.
Which Delta project/connector is this regarding?
Describe the problem
Already compacted files (files that are >= minFileSize, i.e. maxFileSize / 2) seem to be counted towards the minNumFiles threshold for compaction to be triggered. This results in compaction running more frequently as the number of compacted files increases and approaches minNumFiles.
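For context, auto compaction in OSS Delta is driven by a handful of session confs; below is a sketch of the ones discussed in this report (the key names are my reading of DeltaSQLConf and the defaults shown are assumed, so double-check before relying on them):

```scala
// Enable auto compaction and set the thresholds referenced above.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "50")        // assumed default
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", "134217728") // 128 MB, assumed default
```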
Steps to reproduce
Observed results
I ran 200 iterations of writing to a Delta table in Databricks vs. OSS Delta and logged the active file count following each write operation. With the exact same configs and code, the active file count in OSS Delta never exceeds the default minNumFiles of 50: as the number of accumulated right-sized files approaches 50, every write operation triggers compaction. In Databricks it is clear that minNumFiles is based only on uncompacted files.
In the above screenshot it can be seen that there are three different points where the compaction frequency increases, until every single addition of ~16 files puts the total file count over 50 and therefore triggers compaction.
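A minimal sketch of the kind of write loop described above (the table path, schema, and batch size are hypothetical; the active file count is read via DeltaTable.detail()):

```scala
import io.delta.tables.DeltaTable

val path = "/tmp/auto_compact_repro"                       // hypothetical location
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

(1 to 200).foreach { i =>
  // Each append produces ~16 small files, as in the report.
  spark.range(10000).toDF("id")
    .repartition(16)
    .write.format("delta").mode("append").save(path)

  // Log the active file count after each commit.
  val numFiles = DeltaTable.forPath(spark, path)
    .detail().select("numFiles").first().getLong(0)
  println(s"write $i: active files = $numFiles")
}
```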
I ran inputFile() stats on various versions that triggered compaction and saw data like the below: the number of uncompacted files does not exceed 50, but the total number of files does.
Uncompacted files below minFileSize: 31
Compacted files below maxFileSize, above minFileSize: 33
Total files: 64
Here's what I'd expect to see (based on running this same code in Databricks):

Expected results
Auto-compaction should only trigger once the number of files below minFileSize is >= 50 (minNumFiles).
Environment information
Willingness to contribute