Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2060: Fix infinite loop while reading corrupt files #1245

Merged
merged 8 commits into from
Jan 30, 2024

Conversation

rathinb-db
Copy link
Contributor

@rathinb-db rathinb-db commented Jan 3, 2024

When a parquet file is corrupted, decompress can return 0 bytes. This can cause an infinite loop in the JDK readFully that calls the NonBlockedDecompressorStream read. Input streams are never supposed to return 0, but we do if there's a corrupt file.

By throwing an error, we break early and exit fast.

Make sure you have checked all steps below.

Jira

Tests

Add existing unit test. Manually tested with a corrupt file.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@ConeyLiu
Copy link
Contributor

ConeyLiu commented Jan 4, 2024

Could you add a UT for this?

@rathinb-db
Copy link
Contributor Author

rathinb-db commented Jan 4, 2024

Could you add a UT for this?

cc: @ConeyLiu
I've added a unit test. Please take a look and suggest any changes if any. I've added the try .. catch to make the exception handling more explicit.

@wgtmac
Copy link
Member

wgtmac commented Jan 9, 2024

Thanks for the fix! Could you please make the CI happy?

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@rathinb-db
Copy link
Contributor Author

rathinb-db commented Jan 26, 2024

Thanks for the inputs! I'd like to replace the mock with an actual corrupt file that can reliably reproduce this. Are there any simple ways to read the file and call NonBlockedDecompressorStream? cc: @wgtmac

@rathinb-db rathinb-db requested review from wgtmac and ConeyLiu January 26, 2024 18:22
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM. Thanks! @rathinb-db

@rathinb-db rathinb-db requested a review from wgtmac January 28, 2024 15:40
@rathinb-db rathinb-db changed the title PARQUET-2060: Fix infinite loop while reading corrupt file PARQUET-2060: Fix infinite loop while reading corrupt files Jan 29, 2024
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ConeyLiu Do you want to take another pass?

Copy link
Contributor

@ConeyLiu ConeyLiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wgtmac wgtmac merged commit e87b803 into apache:master Jan 30, 2024
9 checks passed
@rathinb-db rathinb-db deleted the PARQUET-2060 branch August 8, 2024 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants