PARQUET-2060: Fix infinite loop while reading corrupt files #1245
Could you add a UT for this?
cc: @ConeyLiu
Thanks for the fix! Could you please make the CI happy?
...t-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestNonBlockedDecompressorStream.java
Rest LGTM
...t-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestNonBlockedDecompressorStream.java
Thanks for the inputs! I'd like to replace the mock with an actual corrupt file that can reliably reproduce this. Are there any simple ways to read the file and call |
Rest LGTM. Thanks! @rathinb-db
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestCompressionCodec.java
+1
@ConeyLiu Do you want to take another pass?
+1
When a Parquet file is corrupted, `decompress` can return 0 bytes. This can cause an infinite loop in the JDK `readFully`, which calls the `NonBlockedDecompressorStream` `read`. Input streams are never supposed to return 0 from a non-empty read, but we do if there's a corrupt file. By throwing an error, we break early and exit fast.
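The failure mode described above can be illustrated with a small, self-contained sketch. This is not the code from this PR: `StalledStream`, `GuardedStream`, and `readFullyCapped` are hypothetical names made up for the example, and the loop is capped so the demonstration terminates (the real `readFully` loop has no cap, which is why a 0-byte read spins forever).

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only; all class and method names here are hypothetical.
class ZeroReadGuardSketch {

    /** Stand-in for a decompressor stream fed corrupt input: every bulk read yields 0 bytes. */
    static class StalledStream extends InputStream {
        @Override
        public int read() {
            return -1; // single-byte read is not exercised in this sketch
        }

        @Override
        public int read(byte[] b, int off, int len) {
            return 0; // violates the InputStream contract when len > 0
        }
    }

    /** Wrapper that turns a 0-byte result for a non-empty request into an IOException. */
    static class GuardedStream extends InputStream {
        private final InputStream in;

        GuardedStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            byte[] one = new byte[1];
            int n = read(one, 0, 1);
            return n < 0 ? -1 : one[0] & 0xFF;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = in.read(b, off, len);
            if (n == 0 && len > 0) {
                throw new IOException("Corrupt input: read returned 0 bytes");
            }
            return n;
        }
    }

    /**
     * Same loop shape as a readFully implementation, but with an iteration cap
     * (an assumption added for the demo). Returns true if the buffer was filled.
     */
    static boolean readFullyCapped(InputStream in, byte[] buf, int maxIterations) throws IOException {
        int n = 0;
        for (int i = 0; i < maxIterations && n < buf.length; i++) {
            int count = in.read(buf, n, buf.length - n);
            if (count < 0) {
                throw new EOFException();
            }
            n += count; // count == 0 makes no progress
        }
        return n == buf.length;
    }

    /** True if the unguarded stream never fills the buffer within the cap. */
    static boolean stallsWithoutGuard() {
        try {
            return !readFullyCapped(new StalledStream(), new byte[8], 1000);
        } catch (IOException e) {
            return false;
        }
    }

    /** True if the guarded stream fails fast with an IOException. */
    static boolean guardThrows() {
        try {
            readFullyCapped(new GuardedStream(new StalledStream()), new byte[8], 1000);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stalls without guard: " + stallsWithoutGuard());
        System.out.println("guard throws: " + guardThrows());
    }
}
```

The guard sits in `read` rather than in the caller because callers such as `readFully` only treat a negative return value as end of stream, so a 0 has to be rejected at the source.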
Make sure you have checked all steps below.
Jira
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Added to an existing unit test. Manually tested with a corrupt file.
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-plugins
Documentation