Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2420 : ThriftParquetWriter Convert byte to LogicalType int32 bitWidth 8 instead of primitive int32 #1254

Merged
merged 3 commits into from
Jan 19, 2024

Conversation

shreyas-dview
Copy link
Contributor

@shreyas-dview shreyas-dview commented Jan 15, 2024

The current implementation of Parquet serialisation from Thrift Definitions results in the incorrect conversion of Thrift byte fields into INT32 without preserving the required LogicalType Metadata in the Parquet file. The correct conversion should result in INT32 with LogicalType of bitWidth 8 and signed = true.

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@shreyas-dview
Copy link
Contributor Author

cc: @Fokko @wgtmac

@shreyas-dview
Copy link
Contributor Author

Thrift Definition

struct TestLogicalType {
1: required i16 test_i16,
2: required byte test_i8
} 

Current Parquet Schema Generated

message ParquetSchema {
  required int32 test_i16 (INTEGER(16,true)) = 1;
  required int32 test_i8 = 2;
} 

Expected Parquet Schema

message ParquetSchema {
  required int32 test_i16 (INTEGER(16,true)) = 1;
  required int32 test_i8 (INTEGER(8,true)) = 2;
}

@shreyas-dview shreyas-dview changed the title PARQUET-2420 : Convert byte to logical int32 bitWidth 8 instead of primitive int32 PARQUET-2420 : ThriftParquetWriter Convert byte to LogicalType int32 bitWidth 8 instead of primitive int32 Jan 15, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks!

@shreyas-dview
Copy link
Contributor Author

@Fokko / @wgtmac Can we merge this ?

@wgtmac wgtmac merged commit d03b49b into apache:master Jan 19, 2024
9 checks passed
@wgtmac
Copy link
Member

wgtmac commented Jan 19, 2024

I just merged this. Thanks @shreyas-dview and @Fokko!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants