Compression speed too slow #1895

Open
a10y opened this issue Jan 10, 2025 · 2 comments

@a10y
Contributor

a10y commented Jan 10, 2025

Using the sample file provided in #1749, Vortex compression is roughly 10x slower than the equivalent Parquet compression:

(two images attached)

A couple of thoughts:

  • This schema is large (14,000 columns) and deeply nested (Struct(Struct(List) * 14,000))
  • Tree search for best encodings is probably a large part of this
  • The search could be trivially parallelized
  • The compression step for structs/chunks could similarly be parallelized trivially
  • There is probably a certain amount of wasteful work being done in the single-threaded case that we should fix. We need to dig into a full profile to get a solid breakdown
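
To illustrate why the per-column search parallelizes easily, here is a minimal Python sketch. It is not Vortex code: the candidate "encodings" are hypothetical stand-ins taken from the standard library (`zlib`, `bz2`, `lzma`), and the names `CANDIDATES`, `best_encoding`, and `compress_columns` are made up for this example. The only point is that each column's search is independent, so the columns can be handed to a worker pool without coordination.

```python
from __future__ import annotations

import bz2
import lzma
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for candidate encodings; these are NOT Vortex encodings.
CANDIDATES = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def best_encoding(column: bytes) -> tuple[str, bytes]:
    """Try every candidate on one column and keep the smallest output."""
    results = {name: fn(column) for name, fn in CANDIDATES.items()}
    name = min(results, key=lambda n: len(results[n]))
    return name, results[name]


def compress_columns(columns: dict[str, bytes], workers: int = 4) -> dict[str, tuple[str, bytes]]:
    """Run the per-column search on a thread pool; columns are independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        chosen = pool.map(best_encoding, columns.values())
        return dict(zip(columns.keys(), chosen))
```

Threads are enough in this sketch because, as far as I know, CPython's compression modules release the GIL while compressing. A real implementation in Rust would more likely use a work-stealing pool, but the structure (independent search per column, results gathered in order) would be the same.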
@robert3005
Member

I am curious whether pyarrow parallelizes Parquet compression; if it does, it would be useful to see the comparison without parallelization.

@a10y
Contributor Author

a10y commented Jan 10, 2025

It also appears to be single-threaded.

(image attached)
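
For anyone who wants to repeat this kind of check without a profiler, one rough approach is to compare process CPU time against wall-clock time around the call. The sketch below uses only the Python standard library; the helper name `cpu_to_wall_ratio` and the synthetic `zlib` workload are assumptions for illustration and would be replaced with the actual write call being measured. A ratio near 1 suggests the work ran on a single core, while a ratio well above 1 suggests several threads were busy.

```python
import time
import zlib


def cpu_to_wall_ratio(work) -> float:
    """Return process CPU time divided by wall time while `work()` runs."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # sums CPU time over all threads in the process
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


# Synthetic single-threaded workload standing in for the real compression call.
payload = bytes(range(256)) * 40_000  # about 10 MB
ratio = cpu_to_wall_ratio(lambda: zlib.compress(payload, 9))
print(f"CPU/wall ratio: {ratio:.2f}")
```

This is only a heuristic: time spent waiting on I/O pulls the ratio below 1, and work done in child processes is not counted, so a full profile is still the better source for a detailed breakdown.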
