GH-40592: [C++][Parquet] Implement SizeStatistics #40594

Merged (1 commit) on Dec 18, 2024

Member

@wgtmac wgtmac commented Mar 16, 2024

Rationale for this change

Parquet format 2.10.0 introduced SizeStatistics, and parquet-mr has already implemented it: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.

What changes are included in this PR?

Implement reading and writing size statistics for parquet-cpp.

Are these changes tested?

Yes, a bunch of test cases have been added.

Are there any user-facing changes?

Yes, now parquet users are able to read and write size statistics.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting review label Mar 17, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Mar 19, 2024
@wgtmac wgtmac marked this pull request as ready for review April 5, 2024 15:39
@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Apr 10, 2024
Contributor

@emkornfield emkornfield left a comment

A few high level questions/suggestions.

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 23, 2024
Contributor

@emkornfield emkornfield left a comment

Took another quick pass and didn't see any major blockers. As long as @pitrou is happy with the changes, I'm OK to merge.

@github-actions github-actions bot added the awaiting merge label and removed the awaiting changes label Nov 23, 2024
Member

@pitrou pitrou left a comment

Thanks for the PR, @wgtmac! The approach looks generally fine; some assorted comments below.

Resolved review threads:
cpp/src/parquet/size_statistics.h (outdated)
cpp/src/parquet/size_statistics.h
cpp/src/parquet/size_statistics.h (outdated)
cpp/src/parquet/size_statistics.h
cpp/src/parquet/size_statistics.cc (outdated)
cpp/src/parquet/size_statistics_test.cc (outdated)
cpp/src/parquet/size_statistics_test.cc (outdated)
cpp/src/parquet/size_statistics_test.cc
cpp/src/parquet/size_statistics_test.cc (outdated)
cpp/src/arrow/util/hashing.h
cpp/src/parquet/encoder.cc (outdated)
cpp/src/parquet/column_writer.cc

Member Author

wgtmac commented Nov 27, 2024

Thanks for your thorough review! I think I have addressed all comments except for the benchmark. Could you please take a look again? @pitrou

Member Author

wgtmac commented Dec 16, 2024

Gentle ping @pitrou :)

Member

@pitrou pitrou left a comment

Thanks a lot @wgtmac and sorry for having overlooked this. LGTM, I only have minor comments below.

Should we look into benchmarks to change the default size_statistics_level, or should we leave this for another issue/PR?

Resolved review threads:
cpp/src/parquet/encoder.cc (outdated)
cpp/src/parquet/page_index.cc
cpp/src/parquet/properties.h (outdated)
cpp/src/parquet/size_statistics.cc

Member Author

wgtmac commented Dec 17, 2024

Thanks for the review! @pitrou I've created #45045 to follow up with the benchmark and default value of size_statistics_level.

Member Author

wgtmac commented Dec 18, 2024

Damn, I also hit the ORC timeout issue from the failed CI:

          --- LOG END ---
          error: downloading 'https://archive.apache.org/dist/orc/orc-format-1.0.0/orc-format-1.0.0.tar.gz' failed
          status_code: 28
          status_string: "Timeout was reached"
          log:
          --- LOG BEGIN ---
          Host archive.apache.org:443 was resolved.

Member Author

wgtmac commented Dec 18, 2024

Rebased and got the following (unrelated) failure from the macOS CI:

[549/1117] Building CXX object src/arrow/CMakeFiles/arrow_testing_objlib.dir/testing/process.cc.o
FAILED: src/arrow/CMakeFiles/arrow_testing_objlib.dir/testing/process.cc.o 
/usr/local/bin/ccache /Applications/Xcode_15.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_TESTING_EXPORTING -DBOOST_ATOMIC_DYN_LINK -DBOOST_ATOMIC_NO_LIB -DBOOST_CONTEXT_DYN_LINK -DBOOST_CONTEXT_NO_LIB -DBOOST_DATE_TIME_DYN_LINK -DBOOST_DATE_TIME_NO_LIB -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_FILESYSTEM_NO_LIB -DBOOST_PROCESS_DYN_LINK -DBOOST_PROCESS_HAVE_V1 -DBOOST_PROCESS_HAVE_V2 -DBOOST_PROCESS_NO_LIB -DBOOST_SYSTEM_DYN_LINK -DBOOST_SYSTEM_NO_LIB -I/Users/runner/work/arrow/arrow/build/cpp/src -I/Users/runner/work/arrow/arrow/cpp/src -I/Users/runner/work/arrow/arrow/cpp/src/generated -isystem /Users/runner/work/arrow/arrow/build/cpp/_deps/googletest-src/googletest/include -isystem /Users/runner/work/arrow/arrow/build/cpp/_deps/googletest-src/googletest -isystem /Users/runner/work/arrow/arrow/build/cpp/_deps/googletest-src/googlemock/include -isystem /Users/runner/work/arrow/arrow/build/cpp/_deps/googletest-src/googlemock -isystem /Users/runner/work/arrow/arrow/cpp/thirdparty/flatbuffers/include -isystem /usr/local/Cellar/rapidjson/1.1.0/include -isystem /usr/local/include -fno-aligned-new  -Qunused-arguments -fcolor-diagnostics  -Wall -Wextra -Wdocumentation -DARROW_WARN_DOCUMENTATION -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wdate-time -Wno-unknown-warning-option -Wno-pass-failed -msse4.2  -g -Werror -O0 -ggdb -g1 -std=c++17 -isysroot /Applications/Xcode_15.2.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk -mmacosx-version-min=13.7 -fPIC -MD -MT src/arrow/CMakeFiles/arrow_testing_objlib.dir/testing/process.cc.o -MF src/arrow/CMakeFiles/arrow_testing_objlib.dir/testing/process.cc.o.d -o src/arrow/CMakeFiles/arrow_testing_objlib.dir/testing/process.cc.o -c 
/Users/runner/work/arrow/arrow/cpp/src/arrow/testing/process.cc
/Users/runner/work/arrow/arrow/cpp/src/arrow/testing/process.cc:88:18: error: expected namespace name
namespace asio = BOOST_PROCESS_V2_ASIO_NAMESPACE;
                 ^
/Users/runner/work/arrow/arrow/cpp/src/arrow/testing/process.cc:246:3: error: use of undeclared identifier 'asio'; did you mean 'boost::asio'?
  asio::io_context ctx_;
  ^~~~
  boost::asio
/usr/local/include/boost/asio/writable_pipe.hpp:26:11: note: 'boost::asio' declared here
namespace asio {
          ^
2 errors generated.

Member

pitrou commented Dec 18, 2024

The macOS failure was fixed in #45057, so you can rebase another time 😁

Member Author

wgtmac commented Dec 18, 2024

Here we go!

Member

pitrou commented Dec 18, 2024

Thanks a lot @wgtmac, we can merge now.

@pitrou pitrou merged commit f93004f into apache:main Dec 18, 2024
37 checks passed
@pitrou pitrou removed the awaiting merge label Dec 18, 2024
Member Author

wgtmac commented Dec 18, 2024

Thank you for your thorough review! Let me work on the benchmark.

Member

mapleFU commented Dec 18, 2024

Grats!

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f93004f.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.

5 participants