[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation #4045

Open
mwc360 opened this issue Jan 14, 2025 · 1 comment
Labels
bug Something isn't working

Comments


mwc360 commented Jan 14, 2025

Bug

The logic for when auto compaction is triggered does not work as documented: already compacted files (files that are >= minFileSize, which defaults to maxFileSize / 2) appear to be counted toward the minNumFiles threshold that triggers compaction.

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Describe the problem

Already compacted files (files that are >= minFileSize, which defaults to maxFileSize / 2) appear to be counted toward the minNumFiles required for compaction to be triggered. This results in compactions running more frequently as the number of compacted files grows and approaches minNumFiles.

Steps to reproduce

# RUN ON CLUSTER w/ 2x8vCore Workers
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "50")
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", "134217728b")

spark.sql("""
    CREATE TABLE dbo.ac_test
    TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")

import pyspark.sql.functions as sf

for i in range(200):
    data = spark.range(1_000_000) \
            .withColumn("id", sf.monotonically_increasing_id()) \
            .withColumn("category", sf.concat(sf.lit("category_"), (sf.col("id") % 10))) \
            .withColumn("value1", sf.round(sf.rand() * (sf.rand() * 1000), 2)) \
            .withColumn("value2", sf.round(sf.rand() * (sf.rand() * 10000), 2)) \
            .withColumn("value3", sf.round(sf.rand() * (sf.rand() * 100000), 2)) \
            .withColumn("date1", sf.date_add(sf.lit("2022-01-01"), sf.round(sf.rand() * 1000, 0).cast("int"))) \
            .withColumn("date2", sf.date_add(sf.lit("2020-01-01"), sf.round(sf.rand() * 2000, 0).cast("int"))) \
            .withColumn("is_cancelled", (sf.col("id") % 3 != 0))

    data.write.mode('append').option("mergeSchema", "true").saveAsTable("dbo.ac_test")

Observed results

I ran 200 iterations of writes to a Delta table in Databricks vs. OSS Delta, logging the active file count after each write operation. With the exact same configs and code, OSS Delta never exceeds the default minNumFiles of 50: as the number of accumulated right-sized files approaches 50, every write operation triggers compaction. In Databricks it is clear that minNumFiles is based only on uncompacted files.

[Screenshot: active file count after each of the 200 write iterations, Databricks vs. OSS Delta]
In the above screenshot it can be seen that at iteration 163, each addition of ~16 files pushes the total file count over 50 and therefore triggers compaction. Details from the MERGE at that iteration return:

  • Uncompacted files below minFileSize: 31
  • Compacted files above minFileSize, below maxFileSize: 33
  • Total files: 64

Expected results

Auto compaction should only trigger once the number of files below minFileSize is >= minNumFiles (50 in this case).
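
To make the discrepancy concrete, here is a minimal, illustrative sketch (these function names are hypothetical and this is not the actual Delta Spark source) of the documented trigger predicate versus the behavior observed above, using the file counts reported at iteration 163:

```python
# Hypothetical sketch of the two trigger predicates; not actual Delta code.
MAX_FILE_SIZE = 134217728           # spark.databricks.delta.autoCompact.maxFileSize
MIN_FILE_SIZE = MAX_FILE_SIZE // 2  # minFileSize defaults to maxFileSize / 2
MIN_NUM_FILES = 50                  # spark.databricks.delta.autoCompact.minNumFiles

def should_compact_documented(file_sizes):
    # Documented behavior: only uncompacted files (below minFileSize) count.
    return sum(1 for s in file_sizes if s < MIN_FILE_SIZE) >= MIN_NUM_FILES

def should_compact_observed(file_sizes):
    # Observed behavior: already-compacted files below maxFileSize also count.
    return sum(1 for s in file_sizes if s < MAX_FILE_SIZE) >= MIN_NUM_FILES

# File counts reported at iteration 163: 31 uncompacted files below
# minFileSize, 33 compacted files between minFileSize and maxFileSize,
# 64 total (illustrative sizes, one byte either side of the threshold).
sizes = [MIN_FILE_SIZE - 1] * 31 + [MIN_FILE_SIZE + 1] * 33

print(should_compact_documented(sizes))  # False: 31 < 50, should not compact
print(should_compact_observed(sizes))    # True: 64 >= 50, compaction fires
```

With the documented rule the table would be left alone until 50 small files accumulate; with the observed rule the 33 already-compacted files push the count over the threshold, so nearly every write recompacts.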

Environment information

  • Delta Lake version: 3.2.0.8
  • Spark version: 3.5.1.5.4.20241017.1
  • Scala version: 2.12.17

Willingness to contribute

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
@mwc360 mwc360 added the bug Something isn't working label Jan 14, 2025

mwc360 commented Jan 14, 2025

tagging @nicklan who made the original PR #2414
