Write performance disparity between Rust / Python #1804

Open
a10y opened this issue Jan 3, 2025 · 3 comments

@a10y
Contributor

a10y commented Jan 3, 2025

Related to the sample file in #1749, specifically the 500MB A0.small.50.vortex file.

Loading the file into memory and writing it back out via the VortexFileWriter is really snappy in Rust.

In Python, the same load-and-write runs for several minutes without completing:

import vortex as vx
arr = vx.io.read_path("A0.small.50.vortex")
vx.io.write_path(arr, "A0.small.50-out.vortex")  # hangs for a really long time
@a10y
Contributor Author

a10y commented Jan 3, 2025

Ah ok, apparently the Python write path compresses by default.

@danking I'm wondering if we should revisit this default? Or perhaps find a better way to decide whether to re-compress?

@danking
Member

danking commented Jan 3, 2025

Defaulting to compress=False seems to me like unexpected behavior if you're coming from Parquet or ORC.

What is the tree display for this file? I suppose if it's incompressible data, then the sampling compressor will always receive a PrimitiveArray which it will attempt to compress.

This also seems like a problem with the compressor. Maybe we should add this file as a compressor benchmark?

@a10y
Contributor Author

a10y commented Jan 3, 2025

The file is compressible; the user had simply already run vx.compress() on it before trying to write it.

Perhaps we can store something in Python-land that records that an array has already been compressed, so that when we pass it to vx.io.write_path it doesn't needlessly re-compress it.
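
A rough sketch of that idea, just to make it concrete. Everything here is hypothetical and not part of the vortex API: `TrackedArray`, `compress_once`, and `write_tracked` are made-up names, and the compress and write functions are passed in as plain callables so the snippet stands on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TrackedArray:
    """Hypothetical wrapper: an array plus a flag recording whether it was already compressed."""
    data: Any
    compressed: bool = False


def compress_once(arr: TrackedArray, compress_fn: Callable[[Any], Any]) -> TrackedArray:
    """Compress only if the flag says we have not done so yet."""
    if arr.compressed:
        return arr  # already compressed: skip the expensive pass
    return TrackedArray(compress_fn(arr.data), compressed=True)


def write_tracked(
    arr: TrackedArray,
    path: str,
    write_fn: Callable[[Any, str], None],
    compress_fn: Callable[[Any], Any],
) -> TrackedArray:
    """Write path that consults the flag instead of always re-compressing."""
    to_write = compress_once(arr, compress_fn)
    write_fn(to_write.data, path)
    return to_write
```

With something like this, calling compress and then write only pays for compression once; the open question is where such a flag would actually live on a real vortex array.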
