[Feature Request][Spark] Auto Compaction shouldn't be triggered if compaction hasn't been run yet #4043

Open
2 of 8 tasks
mwc360 opened this issue Jan 13, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@mwc360
Contributor

mwc360 commented Jan 13, 2025

Feature request

The OSS implementation runs compaction whenever auto compaction is enabled and compaction hasn't been run yet. For example, running a CTAS with the table property enabled performs compaction after the write even if the small file count doesn't meet minNumFiles.

/**
 * Determine whether this partition can be autocompacted based on the number of small files or
 * if this [[AutoCompactPartitionStats]] instance has not auto compacted it yet.
 * @param minNumFiles The minimum number of files this table-partition should have to trigger
 *                    Auto Compaction in case it has already been compacted once.
 */
def hasSufficientSmallFilesOrHasNotBeenCompacted(minNumFiles: Long): Boolean =
  !wasAutoCompacted || hasSufficientFiles(minNumFiles)

Auto compaction in Databricks does not perform this unnecessary initial compaction operation. The trigger should be evaluated only on whether the number of small files meets or exceeds minNumFiles.
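
To make the difference concrete, here is a minimal sketch contrasting the check quoted above with the behavior being requested. `PartitionStats`, `currentCheck`, `proposedCheck`, and `AutoCompactCheckDemo` are hypothetical names used only for illustration, not Delta Lake classes. `hasSufficientFiles` is assumed to mean "small file count meets or exceeds minNumFiles", and the threshold of 50 is just an example value.

```scala
// Hypothetical, simplified stand-in for the per-partition stats (not the Delta Lake class).
final case class PartitionStats(numSmallFiles: Long, wasAutoCompacted: Boolean) {

  // Assumption: "sufficient" means the small file count meets or exceeds the threshold.
  def hasSufficientFiles(minNumFiles: Long): Boolean =
    numSmallFiles >= minNumFiles

  // Same shape as the quoted predicate: a never-compacted partition always qualifies.
  def currentCheck(minNumFiles: Long): Boolean =
    !wasAutoCompacted || hasSufficientFiles(minNumFiles)

  // Requested behavior: only the small file count is considered.
  def proposedCheck(minNumFiles: Long): Boolean =
    hasSufficientFiles(minNumFiles)
}

object AutoCompactCheckDemo {
  def main(args: Array[String]): Unit = {
    // A freshly created table (e.g. right after a CTAS) with only a few small files.
    val freshTable = PartitionStats(numSmallFiles = 4, wasAutoCompacted = false)
    val minNumFiles = 50L // example threshold

    println(s"current:  ${freshTable.currentCheck(minNumFiles)}")  // current:  true
    println(s"proposed: ${freshTable.proposedCheck(minNumFiles)}") // proposed: false
  }
}
```

With 4 small files and a threshold of 50, the existing predicate still returns true because `wasAutoCompacted` is false, which is the extra compaction this request wants to avoid.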

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Motivation

Improve performance of tables that get created with auto compaction enabled.

Further details

Willingness to contribute

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
mwc360 added the enhancement label Jan 13, 2025
@mwc360
Contributor Author

mwc360 commented Jan 14, 2025

@nicklan - What was the intent behind always triggering auto compact when the feature is enabled? Auto compact in DBX doesn't work the same way.

For example, running a CTAS with the autoCompact property runs optimize right after the CTAS completes, even when the count of small files is well below minNumFiles.
