Write performance disparity between Rust / Python #1804

a10y · 2025-01-03T15:07:14Z

Connected to the sample file in #1749, specifically the 500MB A0.small.50.vortex file.

Loading the file into memory and writing it back out via the VortexFileWriter is really snappy in Rust.

In Python, the same load and write runs for several minutes without completing:

import vortex as vx
arr = vx.io.read_path("A0.small.50.vortex")
vx.io.write_path(arr, "A0.small.50-out.vortex")  # hangs for a really long time


a10y · 2025-01-03T16:01:54Z

Ah ok, apparently we compress by default in the write path in Python.

@danking I'm wondering if we should revisit this default setting? Or perhaps find a better way to decide if we should re-compress?

danking · 2025-01-03T17:20:27Z

Defaulting to compress=False seems to me unexpected behavior if you're coming from Parquet or ORC.

What is the tree display for this file? I suppose if it's incompressible data, then the sampling compressor will always receive a PrimitiveArray which it will attempt to compress.

This also seems like a problem with the compressor. Maybe we should add this file as a compressor benchmark?

a10y · 2025-01-03T21:34:38Z

The file is compressible; it's just that the user has already run vx.compress() on it before trying to write it.

Perhaps we can store something in python-land that acknowledges that an array has already been compressed, so that when we pass it to vx.io.write_path it doesn't re-compress it needlessly.
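
A rough sketch of what that could look like, purely to make the idea concrete. Everything here is hypothetical: `TrackedArray`, `compress_once`, and `write_tracked` are made-up names, and `compress_fn` / `write_fn` are stand-ins for `vx.compress` and `vx.io.write_path`. It assumes the writer accepts a `compress` keyword, as the `compress=False` mentioned above suggests; the exact signature is not confirmed here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class TrackedArray:
    """Hypothetical wrapper: an array plus a flag recording prior compression."""
    array: Any
    compressed: bool = False


def compress_once(tracked: TrackedArray, compress_fn: Callable[[Any], Any]) -> TrackedArray:
    """Compress only if the flag says it has not been done yet.

    compress_fn is a stand-in for something like vx.compress.
    """
    if tracked.compressed:
        return tracked
    return TrackedArray(compress_fn(tracked.array), compressed=True)


def write_tracked(tracked: TrackedArray, path: str, write_fn: Callable[..., None]) -> None:
    """Write the array, asking the writer to compress only when needed.

    write_fn is a stand-in for something like vx.io.write_path and is
    ASSUMED to accept a `compress` keyword argument.
    """
    write_fn(tracked.array, path, compress=not tracked.compressed)
```

With real functions this would be used along the lines of `write_tracked(compress_once(TrackedArray(arr), vx.compress), "out.vortex", vx.io.write_path)`, again assuming those signatures. The flag lives only in Python, so it would be lost across a read from disk unless the reader also sets it.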
