Compression speed too slow #1895

a10y · 2025-01-10T17:18:45Z

Using the sample file provided in #1749, Vortex compression is roughly 10x slower than the equivalent Parquet compression.

A couple of thoughts:

- This schema is large (14,000 columns) and deeply nested (Struct(Struct(List) * 14,000)).
- The tree search for the best encodings is probably a large part of the cost.
- The search could be trivially parallelized.
- The compression step for structs/chunks could similarly be parallelized trivially.
- There is probably some wasteful work being done in the single-threaded case that we should fix. We need to dig into a full profile to get a solid breakdown.
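
To make the "trivially parallelized" point concrete, here is a minimal sketch of per-column parallel compression. It is not Vortex code: `compress_column`, `compress_columns_serial`, and `compress_columns_parallel` are hypothetical names, and `zlib` stands in for the real encoding search plus compression. The only assumption it illustrates is that each column of a struct can be compressed independently of its siblings.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_column(values: bytes) -> bytes:
    """Hypothetical stand-in for the per-column encoding search and compression."""
    return zlib.compress(values, 6)


def compress_columns_serial(columns: dict) -> dict:
    """Baseline: compress every column one after another."""
    return {name: compress_column(data) for name, data in columns.items()}


def compress_columns_parallel(columns: dict, max_workers=None) -> dict:
    """Compress columns concurrently; each column is an independent task."""
    names = list(columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        compressed = list(pool.map(compress_column, (columns[n] for n in names)))
    return dict(zip(names, compressed))
```

Threads are enough in this sketch because CPython's `zlib` releases the GIL while compressing. With 14,000 small columns, the per-task overhead may matter, so batching several columns per task would be worth measuring before drawing conclusions.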

robert3005 · 2025-01-10T17:26:20Z

I am curious whether pyarrow parallelizes Parquet compression. If it does, it would be useful to see the comparison without parallelization.

a10y · 2025-01-10T22:38:02Z

It appears to be single-threaded as well.
