PARQUET-2365 : Fixes NPE when rewriting column without column index #1173

Merged
merged 6 commits into from
Nov 4, 2023

@ConeyLiu
Contributor

ConeyLiu commented Oct 17, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The ColumnIndex can be null in some scenarios, for example when a float/double column contains NaN or when its size exceeds the expected value. In addition, page header statistics are no longer written now that ColumnIndex is supported. As a result, rewriting a column without a ColumnIndex throws an NPE, because the page statistics are null whether they are converted from the ColumnIndex (null) or from the page header statistics (null). For example:

java.lang.NullPointerException
    at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
    at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
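
To make the failure mode concrete, here is a minimal, self-contained sketch of the situation described above. It does not use the Parquet API: `PageStats`, `convert` and `writePage` are hypothetical stand-ins, assumed only for illustration, for the page statistics, the conversion step and the page writer.

```java
class NullStatsSketch {
    /** Hypothetical stand-in for per-page statistics (not a Parquet class). */
    static final class PageStats {
        final long min;
        final long max;

        PageStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Assumed conversion step: prefer statistics recovered from the column
     * index, fall back to the page header, and return null if both are absent.
     */
    static PageStats convert(PageStats fromColumnIndex, PageStats fromPageHeader) {
        return fromColumnIndex != null ? fromColumnIndex : fromPageHeader;
    }

    /** Assumed writer step that dereferences the statistics unconditionally. */
    static long writePage(PageStats stats) {
        return stats.max - stats.min; // NullPointerException if stats is null
    }

    public static void main(String[] args) {
        // Normal case: statistics are available from the page header.
        System.out.println(writePage(convert(null, new PageStats(1, 5))));

        // Failing case: neither source has statistics, so convert returns null.
        try {
            writePage(convert(null, null));
        } catch (NullPointerException e) {
            System.out.println("NPE: no page statistics available");
        }
    }
}
```

The point of the sketch is only that the null originates in the conversion step but surfaces later, inside the writer, which matches the shape of the stack trace above.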

@ConeyLiu
Contributor Author

Hi @wgtmac, could you please review this when you have time? Thanks a lot.

@@ -543,6 +546,11 @@ public static ColumnIndex build(
   * the statistics to be added
   */
  public void add(Statistics<?> stats) {
    if (stats.isEmpty()) {
Contributor Author

It should be invalid if the stats are empty. Previously we set it as a null page. @gszadovszky please correct me if I am wrong.

Contributor

I am not sure how column index writing worked for the "rewriter". This rewriting thing is newer than the column index. In the normal writing scenario stats cannot be empty since we are just creating these objects during the write path.

Contributor Author

This is specifically for the case where the ColumnIndex is null during rewriting. In that case we pass empty statistics to the ColumnIndexBuilder to avoid the NPE.

Member

Let me try to understand what happens here. convertStatistics is used to recover page statistics from ColumnIndex or original page header if the ColumnIndex is unavailable. The problem emerges when ColumnIndex is unavailable. Am I correct? If true, then why do we need those changes in the ColumnIndexBuilder?

Contributor Author

The problem occurs when both the ColumnIndex and the page header statistics are null, because convertStatistics then returns null. However, ParquetFileWriter.writeDataPage requires the page statistics. So here we pass invalid page statistics to avoid the NPE and overwrite the column statistics at the end. Otherwise, we would need to add methods that do not require page statistics.
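
The workaround described in this comment can be sketched as follows. This is a hedged illustration under assumed names, not the code in the PR: `Stats`, `EMPTY`, `orPlaceholder` and `ChunkWriter` are all hypothetical. When conversion yields null, an empty placeholder is handed to the page writer, and the chunk-level statistics are overwritten at the end with totals taken from the input file.

```java
class PlaceholderStatsSketch {
    /** Hypothetical statistics holder; empty means "no usable min/max". */
    static final class Stats {
        final Long min; // null when empty
        final Long max; // null when empty

        Stats(Long min, Long max) {
            this.min = min;
            this.max = max;
        }

        boolean isEmpty() {
            return min == null || max == null;
        }
    }

    static final Stats EMPTY = new Stats(null, null);

    /** Substitutes a placeholder so the writer never receives null. */
    static Stats orPlaceholder(Stats converted) {
        return converted != null ? converted : EMPTY;
    }

    /** Hypothetical writer that merges page statistics into chunk statistics. */
    static final class ChunkWriter {
        private Stats chunkStats = EMPTY;

        void writePage(Stats pageStats) {
            if (pageStats.isEmpty()) {
                return; // placeholder page: nothing to merge
            }
            if (chunkStats.isEmpty()) {
                chunkStats = pageStats;
                return;
            }
            chunkStats = new Stats(
                Math.min(chunkStats.min, pageStats.min),
                Math.max(chunkStats.max, pageStats.max));
        }

        /** Replaces the merged statistics with totals known from elsewhere. */
        void overwriteChunkStats(Stats total) {
            chunkStats = total;
        }

        Stats currentChunkStats() {
            return chunkStats;
        }
    }

    public static void main(String[] args) {
        ChunkWriter writer = new ChunkWriter();
        writer.writePage(orPlaceholder(null));          // no NPE, nothing merged
        writer.overwriteChunkStats(new Stats(3L, 9L));  // totals from the input file
        Stats s = writer.currentChunkStats();
        System.out.println(s.min + ".." + s.max);
    }
}
```

The trade-off in this sketch is that per-page statistics are lost for the affected chunk, while the chunk-level totals remain correct because they are copied rather than recomputed.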

Member

From what you have said, it seems that the problem comes from the input file which has valid aggregate statistics for the column chunk but does not write page statistics in the page header. Should we just fix the NPE in the page header and leave other parts as is?

Contributor Author

> Should we just fix the NPE in the page header and leave other parts as is?

Updated the implementation.

   * @param totalStatistics the column total statistics
   * @throws IOException if there is an error while writing
   */
  public void endColumn(Statistics<?> totalStatistics) throws IOException {
Contributor Author

Making this method public is not ideal. Other suggestions are welcome.

Member

Since this is used only for invalid stats, would it be better to add a void invalidateStatistics() that simply includes lines 988 and 990? Then you would just need to call invalidateStatistics() and endColumn() in this case.

Contributor Author

This is used to set the column's aggregated statistics. Do you mean changing public void endColumn(Statistics<?> totalStatistics) to public void invalidateStatistics(Statistics<?> totalStatistics)?

Contributor Author

Replaced with invalidateStatistics
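
The shape discussed in this thread, a separate `invalidateStatistics` call followed by the regular `endColumn()`, might look roughly like the sketch below. Apart from those two method names, which come from the discussion, everything here is an assumption: `ToyColumnWriter`, its fields and the string-based statistics are hypothetical and do not mirror `ParquetFileWriter` internals.

```java
import java.util.ArrayList;
import java.util.List;

class InvalidateStatsSketch {
    /** Hypothetical writer; only the two method names come from the discussion. */
    static final class ToyColumnWriter {
        private String currentStatistics = "stats accumulated from pages";
        private boolean columnIndexValid = true;
        private final List<String> footer = new ArrayList<>();

        /**
         * Assumed behavior: replace the accumulated statistics with the
         * supplied totals and mark the column index as unusable for this chunk.
         */
        void invalidateStatistics(String totalStatistics) {
            if (totalStatistics == null) {
                throw new IllegalArgumentException("total statistics must not be null");
            }
            currentStatistics = totalStatistics;
            columnIndexValid = false;
        }

        /** Closes the chunk using whatever statistics are current. */
        void endColumn() {
            String suffix = columnIndexValid ? " (with column index)" : " (no column index)";
            footer.add(currentStatistics + suffix);
        }

        List<String> footer() {
            return footer;
        }
    }

    public static void main(String[] args) {
        ToyColumnWriter writer = new ToyColumnWriter();
        // Page statistics were unavailable, so fall back to the chunk totals.
        writer.invalidateStatistics("totals copied from the input file");
        writer.endColumn();
        System.out.println(writer.footer());
    }
}
```

Keeping `endColumn()` unchanged and adding a narrowly named method means callers on the normal write path are unaffected, and the public surface only grows by a method whose name states when it should be used.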

Member

@wgtmac wgtmac left a comment

Thanks. LGTM

@wgtmac
Member

wgtmac commented Nov 3, 2023

It seems that a rebase is required to make CIs happy.

@ConeyLiu
Contributor Author

ConeyLiu commented Nov 3, 2023

Just rebased

@wgtmac wgtmac merged commit ff36d6b into apache:master Nov 4, 2023
9 checks passed
@ConeyLiu
Contributor Author

ConeyLiu commented Nov 4, 2023

Thanks @wgtmac @gszadovszky

@ConeyLiu ConeyLiu deleted the column-index branch November 4, 2023 13:17
@ConeyLiu
Contributor Author

Hi @wgtmac @gszadovszky, should we port this to the 1.13.x branch? And is there any plan for a new release?

@wgtmac
Member

wgtmac commented Nov 10, 2023

IMO, there isn't any plan for a 1.13.2 release, but you may port it to the 1.13.x branch just in case.

@ConeyLiu
Contributor Author

OK, I will port it.

3 participants