GH-40592: [C++][Parquet] Implement SizeStatistics #40594
A few high level questions/suggestions.
Took another quick pass through and didn't see any major blockers. As long as @pitrou is happy with the changes, I'm OK to merge.
Thanks for the PR, @wgtmac! The approach looks generally fine; some assorted comments below.
Thanks for your thorough review! I think I have addressed all comments except for the benchmark. Could you please take a look again? @pitrou
Gentle ping @pitrou :)
Thanks a lot @wgtmac and sorry for having overlooked this. LGTM, I only have minor comments below.
Should we look into benchmarks to change the default size_statistics_level, or should we leave this for another issue/PR?
Damn, I also hit the ORC timeout issue from the failed CI:
Rebased and got the following (unrelated) failure from the macOS CI:
The macOS failure was fixed in #45057, so you can rebase another time 😁
Here we go!
Thanks a lot @wgtmac , we can merge now.
Thank you for your thorough review! Let me work on the benchmark.
Grats!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f93004f. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Parquet format 2.10.0 introduced SizeStatistics. parquet-mr has already implemented this: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.
What changes are included in this PR?
Implement reading and writing size statistics for parquet-cpp.
Are these changes tested?
Yes, a bunch of test cases have been added.
Are there any user-facing changes?
Yes, now parquet users are able to read and write size statistics.
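For readers unfamiliar with the feature: in the Parquet format, SizeStatistics consists of a repetition-level histogram, a definition-level histogram, and, for BYTE_ARRAY columns, the total unencoded data bytes (excluding the length prefixes). The standalone sketch below only illustrates how those three values could be derived from a column's levels and values. It is not the parquet-cpp code from this PR; `SizeStats`, `LevelHistogram`, and `ComputeSizeStats` are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container mirroring the three fields of the Parquet
// SizeStatistics struct; this is NOT the parquet-cpp class.
struct SizeStats {
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::vector<int64_t> repetition_level_histogram;
  std::vector<int64_t> definition_level_histogram;
};

// Count how often each level in [0, max_level] occurs.
// The result has max_level + 1 entries; entry i is the count of level i.
std::vector<int64_t> LevelHistogram(const std::vector<int16_t>& levels,
                                    int16_t max_level) {
  std::vector<int64_t> histogram(static_cast<std::size_t>(max_level) + 1, 0);
  for (int16_t level : levels) {
    if (level >= 0 && level <= max_level) {
      ++histogram[static_cast<std::size_t>(level)];
    }
  }
  return histogram;
}

// Illustrative helper (an assumption, not a library API): derive size
// statistics for a BYTE_ARRAY column from its levels and non-null values.
// The format allows omitting a histogram when the max level is 0; this
// sketch always fills both for simplicity.
SizeStats ComputeSizeStats(const std::vector<int16_t>& rep_levels,
                           int16_t max_rep_level,
                           const std::vector<int16_t>& def_levels,
                           int16_t max_def_level,
                           const std::vector<std::string>& byte_array_values) {
  SizeStats stats;
  stats.repetition_level_histogram = LevelHistogram(rep_levels, max_rep_level);
  stats.definition_level_histogram = LevelHistogram(def_levels, max_def_level);

  // Sum the raw value bytes only; length prefixes are not counted.
  int64_t total_bytes = 0;
  for (const std::string& value : byte_array_values) {
    total_bytes += static_cast<int64_t>(value.size());
  }
  stats.unencoded_byte_array_data_bytes = total_bytes;
  return stats;
}
```

Readers can use such statistics to estimate the in-memory size of a column chunk or page before decoding it, which is the motivation given in the format specification.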