Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2261: Implement SizeStatistics #1177

Merged
merged 5 commits into from
Feb 27, 2024
Merged

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Oct 19, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@wgtmac wgtmac marked this pull request as draft October 19, 2023 07:49
@wgtmac
Copy link
Member Author

wgtmac commented Oct 19, 2023

I have drafted the POC to read/write SizeStatistics. The feature implementation should be complete and associated tests will be added progressively. Please take a look when you have time. Thanks! @emkornfield

cc @mapleFU

@etseidl
Copy link
Contributor

etseidl commented Oct 19, 2023

Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the column-index command could write out the unencoded sizes and histograms. The former is pretty straightforward, but I'm not entirely sure what the right approach is for the histograms. Also, rewrite currently ignores the new statistics. For a straight copy, all that's needed is to add chunk.getSizeStatistics() to the arg list of ColumnChunkMetadata.get() here.

@wgtmac
Copy link
Member Author

wgtmac commented Oct 20, 2023

Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the column-index command could write out the unencoded sizes and histograms. The former is pretty straightforward, but I'm not entirely sure what the right approach is for the histograms. Also, rewrite currently ignores the new statistics. For a straight copy, all that's needed is to add chunk.getSizeStatistics() to the arg list of ColumnChunkMetadata.get() here.

Thanks for the suggestion! Yes, it definitely should be done. @etseidl

@emkornfield
Copy link
Contributor

@wgtmac took a scan through and this generally seems like what I expected. Thank you for doing it. Agree unit tests are needed.

@wgtmac wgtmac force-pushed the size_stats branch 2 times, most recently from cc0d75d to 0acf99f Compare November 4, 2023 17:25
@wgtmac wgtmac force-pushed the size_stats branch 2 times, most recently from 26ced88 to 4ff9d3d Compare November 22, 2023 04:20
@wgtmac wgtmac marked this pull request as ready for review November 22, 2023 04:21
@wgtmac wgtmac force-pushed the size_stats branch 7 times, most recently from 339d397 to ed3a89e Compare November 23, 2023 06:15
@wgtmac
Copy link
Member Author

wgtmac commented Nov 23, 2023

I have just rebased on the latest master branch and fixed all CI falures. As this PR gets too large, I will add print cli command and rewriter support for SizeStatistics in follow-up PRs. This is now ready for review. @emkornfield @etseidl @gszadovszky @shangxinli

@wgtmac
Copy link
Member Author

wgtmac commented Nov 23, 2023

cc @ConeyLiu as I have modified mergeColumnStatistics method which you've just refactored.

@ConeyLiu
Copy link
Contributor

Thank @wgtmac for your notification.

pom.xml Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor

Gentle ping @emkornfield

Took another pass through, I'm less familiar with Parquet MR but overall looks ok to me (with the exception of confirming if the one place I found we should be changing length to 0)

@shangxinli
Copy link
Contributor

LGTM

@ConeyLiu @etseidl @emkornfield Do you still have pending comments?

@etseidl
Copy link
Contributor

etseidl commented Feb 19, 2024

Looks good to me too. I'd still like to see the CLI changed at some point to print the new statistics, but if no one else has cycles to work on that, I could try cleaning up what I have locally.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 19, 2024

@etseidl I have filed https://issues.apache.org/jira/browse/PARQUET-2433 and https://issues.apache.org/jira/browse/PARQUET-2434 as follow-up work items. Will work on them once this PR gets merged.

Copy link
Contributor

@ConeyLiu ConeyLiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 thanks for the great work.

@emkornfield
Copy link
Contributor

I think all of my suggestions have been addressed. Thanks @wgtmac !

@wgtmac
Copy link
Member Author

wgtmac commented Feb 23, 2024

@gszadovszky It would be good if you can take a look if possible.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some comments but LGTM overall.

This change is about writing these new statistics. Are there any benefits in actually using them at reading? Do we plan to implement?

Side note: This is yet another statistics that we gather during writing data. We already have min/max statistics, null counts, bloom filter, and now some additional ones. I think, we should implement a centralized builder that we call once for each value/dl/rl and can generate the statistics we need. We need to implement the gathering part as optimal as it can be. One check less can significantly decrease write performance.

public void add(int repetitionLevel, int definitionLevel, Binary value) {
add(repetitionLevel, definitionLevel);
if (type.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY && value != null) {
unencodedByteArrayDataBytes = Math.addExact(unencodedByteArrayDataBytes, value.length());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think, we shall fear of an overflow while adding an int to a long.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to keep this check just in case.

Copy link
Contributor

@gszadovszky gszadovszky Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, it is good to have checks like this one. But it can significantly hit performance when used in places called regularly. This method is invoked for every values. We shall be as effective as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. I have removed the overflow check.

pom.xml Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Feb 24, 2024

Thanks for the feedback! I've addressed all the comments and added a new internal ColumnValueCollector class for common stats collections. @gszadovszky

void write(int value, int repetitionLevel, int definitionLevel) {
statistics.updateStats(value);
sizeStatisticsBuilder.add(repetitionLevel, definitionLevel);
if (bloomFilter != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using a no-op BloomFilter implementation instead of a null-check? I am not sure if would perform better, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Could you please review it again?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @wgtmac!

@wgtmac wgtmac merged commit d31a891 into apache:master Feb 27, 2024
9 checks passed
pitrou pushed a commit to apache/arrow that referenced this pull request Dec 18, 2024
### Rationale for this change

Parquet format 2.10.0 has introduced SizeStatistics. parquet-mr has also implemented this: apache/parquet-java#1177. Now it is time for parquet-cpp to pick the ball.

### What changes are included in this PR?

Implement reading and writing size statistics for parquet-cpp.

### Are these changes tested?

Yes, a bunch of test cases have been added.

### Are there any user-facing changes?

Yes, now parquet users are able to read and write size statistics.

* GitHub Issue: #40592

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants