[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation #4045
Comments
@nicklan or @scottsand-db - FYI this is a pretty significant bug. I'd be happy to help if I could be given a pointer or two on what the root cause might be or where to look. I know this bug does not exist in Databricks, so hopefully the fix is not too hard to identify. Thanks! Tagging the reviewers of the original PR as well:
Thanks for the bug report @mwc360! If something's wrong, it's probably in … We will work to get this triaged and have someone investigate ASAP.
Have you tried turning off the conf defined at delta/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSQLConf.scala, line 308 (commit 92a8a22)?
It seems like by default it will try to trigger AC whenever a partition is written to (for unpartitioned tables that would be every write). From my understanding, without this conf it will respect the minFiles count. It would be great if you can try turning this off (remember to prefix the conf name with spark.databricks.delta.) and let me know if it resolves your issue.
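For reference, a minimal sketch of turning that flag off at the session level, assuming the conf at that line is the modified-partitions-only setting (the key name is an assumption; verify it against DeltaSQLConf.scala):

```scala
// Assumed key: buildConf("autoCompact.modifiedPartitionsOnly.enabled") in DeltaSQLConf,
// which is prefixed with spark.databricks.delta. at runtime.
spark.conf.set("spark.databricks.delta.autoCompact.modifiedPartitionsOnly.enabled", "false")
```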
@rahulsmahadev turning that config off resulted in compaction not being triggered on the first write operation when there weren't sufficient small files (which is great, as I raised this as a feature request: #4043). However, it doesn't fix the behavior where both compacted and small files are counted towards the minNumFiles threshold, which results in compaction running with increasing frequency until it runs with every write operation.
Also - I confirmed it isn't a session issue. I ran additional writes from a different session and saw the exact same behavior: both compacted and uncompacted files are counted towards the minNumFiles threshold, but then only uncompacted files get compacted.
FYI @nicklan and @rahulsmahadev, I think I found the issue: AutoCompactUtils.choosePartitionsBasedOnMinNumSmallFiles is supposed to filter to only small files via AutoCompactPartitionStats.filterPartitionsWithSmallFiles. That function ultimately references these two functions, which don't have any filter on the size of the file: delta/spark/src/main/scala/org/apache/spark/sql/delta/stats/AutoCompactPartitionStats.scala, lines 77 to 80 (commit 4676bf4).
It seems like maxFileSize (and the postCommitSnapshot) needs to be passed in and then used to filter, via something like the below:

```scala
def hasSufficientSmallFilesOrHasNotBeenCompacted(minNumFiles: Long, maxFileSize: Long): Boolean =
  !wasAutoCompacted || hasSufficientFiles(minNumFiles, maxFileSize)

def hasSufficientFiles(minNumFiles: Long, maxFileSize: Long): Boolean = {
  // Only count files that auto compaction would still consider small,
  // i.e. at or below half of maxFileSize.
  val smallFilesCount = files.filter(_.size <= maxFileSize / 2).size
  smallFilesCount >= minNumFiles
}
```

I will propose a PR for fixing this.
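To make the intended behavior concrete, here is a self-contained sketch of the size-gated check above, using a hypothetical FileStat case class in place of the real per-file stats and the counts reported below (31 small files, 33 compacted files, minNumFiles = 50, maxFileSize assumed to be 128 MB):

```scala
// Hypothetical stand-in for the per-file stats tracked by AutoCompactPartitionStats.
case class FileStat(size: Long)

// Proposed check: only files at or below maxFileSize / 2 count toward minNumFiles.
def hasSufficientSmallFiles(files: Seq[FileStat], minNumFiles: Long, maxFileSize: Long): Boolean =
  files.count(_.size <= maxFileSize / 2) >= minNumFiles

val maxFileSize = 128L * 1024 * 1024                        // assumed default
val small       = Seq.fill(31)(FileStat(1L * 1024 * 1024))  // uncompacted files
val compacted   = Seq.fill(33)(FileStat(100L * 1024 * 1024))// already right-sized files

println(hasSufficientSmallFiles(small ++ compacted, 50, maxFileSize)) // false: only 31 small files
println((small ++ compacted).size >= 50)                              // true: the current (buggy) count
```

With the current logic the 64 total files exceed minNumFiles and compaction fires; with the size filter it would not.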
Bug
The logic for when auto compaction is triggered does not work as documented: already compacted files (files that are >= minFileSize, i.e. maxFileSize / 2) seem to be counted towards the minNumFiles threshold for compaction to be triggered.
Which Delta project/connector is this regarding?
Describe the problem
Already compacted files (files that are >= minFileSize, i.e. maxFileSize / 2) seem to be counted towards the minNumFiles threshold for compaction to be triggered. This results in compaction running more frequently as the number of compacted files increases and approaches minNumFiles.
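For context, auto compaction in OSS Delta is driven by a handful of session confs; below is a sketch of the ones discussed in this report (the key names are my reading of DeltaSQLConf and the defaults shown are assumed, so double-check before relying on them):

```scala
// Enable auto compaction and set the thresholds referenced above.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "50")        // assumed default
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", "134217728") // 128 MB, assumed default
```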
Steps to reproduce
Observed results
I ran 200 iterations of writing to a Delta table in Databricks vs. OSS Delta and logged the active file count following each write operation. With the exact same configs and code, the active file count in OSS Delta never exceeds the default minNumFiles of 50: as the number of accumulated right-sized files approaches 50, every write operation triggers compaction. In Databricks it is clear that minNumFiles is based only on uncompacted files.
In the above screenshot it can be seen that there are three different points where the compaction frequency increases, until every single addition of ~16 files puts the total file count over 50 and therefore triggers compaction.
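A minimal sketch of the kind of write loop described above (the table path, schema, and batch size are hypothetical; the active file count is read via DeltaTable.detail()):

```scala
import io.delta.tables.DeltaTable

val path = "/tmp/auto_compact_repro"                       // hypothetical location
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

(1 to 200).foreach { i =>
  // Each append produces ~16 small files, as in the report.
  spark.range(10000).toDF("id")
    .repartition(16)
    .write.format("delta").mode("append").save(path)

  // Log the active file count after each commit.
  val numFiles = DeltaTable.forPath(spark, path)
    .detail().select("numFiles").first().getLong(0)
  println(s"write $i: active files = $numFiles")
}
```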
I ran inputFile() stats on various versions that triggered compaction and saw data like the below: the number of uncompacted files does not exceed 50, but the total number of files does.
Uncompacted files below minFileSize: 31
Compacted files below maxFileSize, above minFileSize: 33
Total files: 64
Here's what I'd expect to see (based on running this same code in Databricks):

Expected results
Auto-compaction should only trigger once the number of files below minFileSize is >= 50 (minNumFiles).
Environment information
Willingness to contribute