[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation #4045
Bug
The logic for when auto compaction is triggered does not match the documentation.
Which Delta project/connector is this regarding?
Spark
Describe the problem
Already compacted files (files that are >= minFileSize, which defaults to maxFileSize / 2) appear to be counted towards the minNumFiles threshold that triggers compaction. This results in compaction running more frequently as the number of already-compacted files grows and approaches minNumFiles.
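To make the discrepancy concrete, here is a minimal sketch of the two trigger predicates. The AddFile case class and the threshold values are illustrative assumptions, not Delta's actual internals; the config names they mirror are spark.databricks.delta.autoCompact.minNumFiles and spark.databricks.delta.autoCompact.minFileSize.

```scala
// Illustrative sketch only; AddFile is a stand-in for the files in the table snapshot.
case class AddFile(path: String, size: Long)

val minNumFiles = 50                 // default autoCompact.minNumFiles
val minFileSize = 64L * 1024 * 1024  // e.g. maxFileSize / 2 (assumed value)

// Trigger condition as documented: only files still below minFileSize count.
def shouldCompactDocumented(files: Seq[AddFile]): Boolean =
  files.count(_.size < minFileSize) >= minNumFiles

// Trigger condition as observed in OSS Delta: every active file counts,
// including already-compacted files (>= minFileSize).
def shouldCompactObserved(files: Seq[AddFile]): Boolean =
  files.size >= minNumFiles
```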
Steps to reproduce
Observed results
I ran 200 iterations of writing to a Delta table in Databricks vs. OSS Delta and logged the active file count after each write operation. With the exact same configs and code, OSS Delta never exceeds the default minNumFiles of 50: as the accumulated right-sized files approach 50, every write operation triggers compaction. In Databricks it is clear that minNumFiles is based only on uncompacted files.
In the screenshot above it can be seen that at iteration 163 each addition of ~16 files puts the total file count over 50 and therefore triggers compaction. Details from the MERGE in that iteration show:
Uncompacted files below minFileSize: 31
Compacted files below maxFileSize, above minFileSize: 33
Total files: 64
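A minimal loop along these lines should reproduce the behaviour described above. The table path and schema are assumptions on my part, plain appends stand in for the MERGE the report used, and ~16 small files per write mirrors the iteration above:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("auto-compact-trigger-repro")
  .config("spark.databricks.delta.autoCompact.enabled", "true")
  .getOrCreate()
import spark.implicits._

val path = "/tmp/auto_compact_repro"  // hypothetical table location

(1 to 200).foreach { i =>
  // Each iteration appends ~16 small files to the table.
  Seq.tabulate(16)(n => (i, n)).toDF("iteration", "n")
    .repartition(16)
    .write.format("delta").mode("append").save(path)

  // Log the active file count after each write, as in the runs above.
  val numFiles = DeltaTable.forPath(spark, path)
    .detail().select("numFiles").as[Long].head()
  println(s"iteration=$i activeFiles=$numFiles")
}
```

If the trigger counted only uncompacted files, the active file count should be free to climb past 50 as compacted files accumulate; instead it stays pinned below minNumFiles.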
Expected results
Auto compaction should only trigger once the number of files below minFileSize is >= minNumFiles (50 by default).
Environment information
Willingness to contribute