-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2413: Support configurable extraMetadata in ParquetWriter #1241
Conversation
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
Outdated
Show resolved
Hide resolved
@@ -113,11 +110,6 @@ public Builder withType(MessageType type) { | |||
return this; | |||
} | |||
|
|||
public Builder withExtraMetaData(Map<String, String> extraMetaData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is now available through the ParquetWriter.Builder
superclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add an overload here to suppress the complaint of japicmp? That's more preferable than an exclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great idea! updated + removed the exclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 thanks @wgtmac ping me. And thanks @clairemcginty for the contribution.
@clairemcginty It seems that we need to add an exclusion to japicmp to make the CI happy. |
13a5601
to
cfd0437
Compare
Added! The CI on my fork's GHA seems to be happy now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change! I've left some minor comments.
extraMetadata = new HashMap<>(writeContext.getExtraMetaData()); | ||
|
||
encodingProps.getExtraMetaData().forEach((metadataKey, metadataValue) -> { | ||
if (metadataKey.equals(OBJECT_MODEL_NAME_PROP)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid specializing any key? IIUC, it can also be caught at line 422 if OBJECT_MODEL_NAME_PROP has been set already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This key is kind of a special case since it's added by the delegated InternalParqueRecordWriter
at the end of writing: https://github.com/apache/parquet-mr/blob/945836c/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L132-L134
So if we remove this extra check and there's a conflicting OBJECT_MODEL_NAME_PROP
key in extraMetaData
, InternalParquetRecordWriter
will silently overwrite it at the end.
@@ -113,11 +110,6 @@ public Builder withType(MessageType type) { | |||
return this; | |||
} | |||
|
|||
public Builder withExtraMetaData(Map<String, String> extraMetaData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add an overload here to suppress the complaint of japicmp? That's more preferable than an exclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks @clairemcginty and @ConeyLiu!
https://issues.apache.org/jira/browse/PARQUET-2413
Adds support for Configurable
extraMetadata
in Parquet file footer. This makes it easier for users to migrate from Avro to Parquet (since Avro supports custom metadata keys).I chose this approach (parsing values from a preset
Configuration
key prefix) because (a)Configuration
is already Stringable, so no need to worry about object-to-String conversion, and (b) it doesn't require any API changes (i.e. addingwithExtraMetadata
Builder options to all Parquet writers/implementations).Alternate approaches:
withExtraMetadata
Builder method toParquetWriter
; inParquetWriter#build
, append all to the value ofWriteContext#getExtraMetadata()
.extraMetadata
class variables to all WriteSupport implementations, and pass toWriteContext
inWriteSupport#init
(i.e. AvroWriteSupport).Let me know if either of those approaches are preferable to this one!
Jira
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-plugins
Documentation