Support filter out strip by provided range #126
Could you help me understand the intended use case for this? This API seems a bit unintuitive in requiring the user to specify the exact byte range to read from a file (a stripe only needs to begin inside the specified range; it need not be fully contained within it). Would it be more intuitive to let users specify which stripes they want to read by index, perhaps?
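The selection rule described in this comment, where a stripe is picked when its starting offset falls inside the requested byte range even if the stripe extends past the end of that range, can be sketched roughly as follows. This is only an illustration under assumptions, not the crate's implementation: `StripeInfo`, `stripes_in_range`, and all offsets are hypothetical names and values.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the per-stripe metadata stored in an ORC file footer.
#[derive(Debug, Clone, PartialEq)]
struct StripeInfo {
    /// Byte offset at which the stripe starts in the file.
    offset: u64,
    /// Total length of the stripe in bytes.
    length: u64,
}

/// Keep only the stripes whose *starting offset* lies inside `range`.
/// The stripe's end is deliberately not checked, so a selected stripe
/// may extend beyond `range.end`.
fn stripes_in_range(stripes: &[StripeInfo], range: &Range<u64>) -> Vec<StripeInfo> {
    stripes
        .iter()
        .filter(|s| range.contains(&s.offset))
        .cloned()
        .collect()
}
```

With hypothetical stripes starting at offsets 3 and 1003, a range of `0..2000` selects both even though the second one ends after byte 2000, while a range of `100..1000` selects neither because no stripe starts inside it.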
Spark will slice the file.

Spark filePartitions:
FilePartition(0,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@1fbe9d4): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 0-15485986, partition values: [empty row]

Orc file metadata:
Processing data file part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc [length: 119693586]
Stripe Statistics:
File Statistics:
Stripes:
File length: 119693586 bytes
User Metadata:
I'm still a bit hesitant on this API itself, but I suppose since there is a need for this functionality and we don't provide any alternative (e.g. select stripe index) then we can introduce this.
#[test]
pub fn basic_test_with_range() {
    let path = basic_path("test.orc");
    let reader = new_arrow_reader_range(&path, 0..2000);
    let batch = reader.collect::<Result<Vec<_>, _>>().unwrap();

    assert_eq!(5, batch[0].column(0).len());
}

#[test]
pub fn basic_test_with_range_without_data() {
    let path = basic_path("test.orc");
    let reader = new_arrow_reader_range(&path, 100..2000);
    let batch = reader.collect::<Result<Vec<_>, _>>().unwrap();

    assert_eq!(0, batch.len());
}

#[cfg(feature = "async")]
#[tokio::test]
pub async fn async_basic_test_with_range() {
    let path = basic_path("test.orc");
    let reader = new_arrow_stream_reader_range(&path, 0..2000).await;
    let batch = reader.try_collect::<Vec<_>>().await.unwrap();

    assert_eq!(5, batch[0].column(0).len());
}

#[cfg(feature = "async")]
#[tokio::test]
pub async fn async_basic_test_with_range_without_data() {
    let path = basic_path("test.orc");
    let reader = new_arrow_stream_reader_range(&path, 100..2000).await;
    let batch = reader.try_collect::<Vec<_>>().await.unwrap();

    assert_eq!(0, batch.len());
}
👍
Thanks for this 👍
Since Spark's ORC file format slices a file into multiple ORC splits, supporting stripe filtering by a provided byte range avoids reading the whole ORC data file for each split.
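As a rough sketch of why start-offset filtering suits this use case: when a file is cut into contiguous, non-overlapping byte ranges and each split keeps only the stripes that begin inside its own range, every stripe is read by exactly one split. The function names `split_ranges` and `assign_stripes`, the fixed split size, and the offsets below are assumptions made for illustration; they are not taken from Spark or from this crate.

```rust
use std::ops::Range;

/// Cut a file of `file_len` bytes into contiguous splits of at most
/// `split_size` bytes each (hypothetical splitting policy).
fn split_ranges(file_len: u64, split_size: u64) -> Vec<Range<u64>> {
    assert!(split_size > 0, "split_size must be positive");
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_len {
        let end = (start + split_size).min(file_len);
        ranges.push(start..end);
        start = end;
    }
    ranges
}

/// For each split, collect the indices of the stripes whose starting
/// offset lies inside that split.
fn assign_stripes(stripe_offsets: &[u64], splits: &[Range<u64>]) -> Vec<Vec<usize>> {
    splits
        .iter()
        .map(|split| {
            stripe_offsets
                .iter()
                .enumerate()
                .filter(|(_, off)| split.contains(*off))
                .map(|(i, _)| i)
                .collect::<Vec<usize>>()
        })
        .collect()
}
```

Because the splits do not overlap and together cover the file, each stripe offset falls into exactly one split. A split in which no stripe starts simply yields no data.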