PARQUET-2413: Support configurable extraMetadata in ParquetWriter #1241

Merged: 5 commits from the parquet-writer-metadata branch into apache:master on Jan 28, 2024

@clairemcginty clairemcginty commented Dec 18, 2023

https://issues.apache.org/jira/browse/PARQUET-2413

Adds support for configurable extraMetadata in the Parquet file footer. This makes it easier for users to migrate from Avro to Parquet (since Avro supports custom metadata keys).

I chose this approach (parsing values from a preset Configuration key prefix) because (a) Configuration is already Stringable, so no need to worry about object-to-String conversion, and (b) it doesn't require any API changes (i.e. adding withExtraMetadata Builder options to all Parquet writers/implementations).
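
To make the prefix idea above concrete, here is a minimal sketch that uses a plain `Map<String, String>` as a stand-in for a Hadoop `Configuration`. The prefix string, class name, and method name are hypothetical and chosen only for illustration; they are not the names used in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixedMetadataSketch {
    // Hypothetical prefix for illustration; not the key used by parquet-mr.
    static final String EXTRA_METADATA_PREFIX = "example.extra.metadata.";

    // Collects every setting whose key starts with the prefix and strips the
    // prefix, so "example.extra.metadata.owner" becomes the footer key "owner".
    static Map<String, String> extractExtraMetadata(Map<String, String> settings) {
        Map<String, String> extra = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            String key = entry.getKey();
            // Skip unrelated keys and a bare prefix with no metadata name after it.
            if (key.startsWith(EXTRA_METADATA_PREFIX)
                    && key.length() > EXTRA_METADATA_PREFIX.length()) {
                extra.put(key.substring(EXTRA_METADATA_PREFIX.length()), entry.getValue());
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("unrelated.setting", "ignored");
        settings.put(EXTRA_METADATA_PREFIX + "owner", "data-team");
        System.out.println(extractExtraMetadata(settings)); // {owner=data-team}
    }
}
```

Because the values are already strings, no object-to-String conversion is needed, which is the first advantage cited above.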

Alternate approaches:

  • Add withExtraMetadata Builder method to ParquetWriter; in ParquetWriter#build, append all to the value of WriteContext#getExtraMetadata().
  • Add extraMetadata class variables to all WriteSupport implementations, and pass to WriteContext in WriteSupport#init (i.e. AvroWriteSupport).

Let me know if either of those approaches is preferable to this one!

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • [x] My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@@ -113,11 +110,6 @@ public Builder withType(MessageType type) {
return this;
}

public Builder withExtraMetaData(Map<String, String> extraMetaData) {
Contributor Author

This method is now available through the ParquetWriter.Builder superclass.

Member

Is it possible to add an overload here to suppress the complaint of japicmp? That's preferable to an exclusion.

Contributor Author

great idea! updated + removed the exclusion.
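
For readers unfamiliar with the pattern discussed in this thread, the following is a minimal sketch, assuming a simplified generic base builder. `BaseBuilder` and `ExampleBuilder` are hypothetical stand-ins, not the actual parquet-mr classes. The subclass re-declares the method with its own return type and delegates to the superclass, so a method with the subclass's return type still exists in the compiled subclass, which is presumably what keeps a binary-compatibility checker such as japicmp from reporting a removed method.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a generic base builder.
abstract class BaseBuilder<SELF extends BaseBuilder<SELF>> {
    private final Map<String, String> extraMetaData = new HashMap<>();

    protected abstract SELF self();

    // The shared implementation lives in the superclass.
    public SELF withExtraMetaData(Map<String, String> extraMetaData) {
        this.extraMetaData.putAll(extraMetaData);
        return self();
    }

    public Map<String, String> snapshot() {
        return Collections.unmodifiableMap(extraMetaData);
    }
}

final class ExampleBuilder extends BaseBuilder<ExampleBuilder> {
    @Override
    protected ExampleBuilder self() {
        return this;
    }

    // Kept only for compatibility: callers compiled against the old subclass
    // method (which returned ExampleBuilder) can still link against it.
    @Override
    public ExampleBuilder withExtraMetaData(Map<String, String> extraMetaData) {
        return super.withExtraMetaData(extraMetaData);
    }
}
```

The behavior is unchanged; only the declared return type in the subclass differs from the erased signature of the inherited method.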

Member

wgtmac commented Jan 12, 2024

@Fokko @ConeyLiu Do you have any comment?

Contributor

@ConeyLiu ConeyLiu left a comment

+1. Thanks @wgtmac for pinging me, and thanks @clairemcginty for the contribution.

Member

wgtmac commented Jan 12, 2024

@clairemcginty It seems that we need to add an exclusion to japicmp to make the CI happy.

@clairemcginty clairemcginty force-pushed the parquet-writer-metadata branch from 13a5601 to cfd0437 Compare January 12, 2024 19:25
Contributor Author

> @clairemcginty It seems that we need to add an exclusion to japicmp to make the CI happy.

Added! The CI on my fork's GHA seems to be happy now

Member

@wgtmac wgtmac left a comment

Thanks for the change! I've left some minor comments.

extraMetadata = new HashMap<>(writeContext.getExtraMetaData());

encodingProps.getExtraMetaData().forEach((metadataKey, metadataValue) -> {
if (metadataKey.equals(OBJECT_MODEL_NAME_PROP)) {
Member

Can we avoid specializing any key? IIUC, it can also be caught at line 422 if OBJECT_MODEL_NAME_PROP has been set already.

Contributor Author

This key is kind of a special case since it's added by the delegated InternalParquetRecordWriter at the end of writing: https://github.com/apache/parquet-mr/blob/945836c/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L132-L134

So if we remove this extra check and there's a conflicting OBJECT_MODEL_NAME_PROP key in extraMetaData, InternalParquetRecordWriter will silently overwrite it at the end.
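
The concern in this thread can be illustrated with a minimal sketch of a merge that rejects a reserved key up front. Everything here is an assumption made for illustration: `RESERVED_KEY` is a hypothetical stand-in for `OBJECT_MODEL_NAME_PROP`, and throwing on a conflict is one plausible policy, not necessarily what the PR does.

```java
import java.util.HashMap;
import java.util.Map;

final class MetadataMergeSketch {
    // Hypothetical stand-in for OBJECT_MODEL_NAME_PROP; the real value is not shown here.
    static final String RESERVED_KEY = "example.object.model.name";

    // Merges user-supplied metadata into the write support's metadata,
    // failing fast on a reserved or duplicate key.
    static Map<String, String> merge(Map<String, String> fromWriteSupport,
                                     Map<String, String> fromUser) {
        Map<String, String> merged = new HashMap<>(fromWriteSupport);
        fromUser.forEach((key, value) -> {
            // A later writer stage sets this key itself, so a user value
            // would be silently replaced if it were allowed through.
            if (key.equals(RESERVED_KEY)) {
                throw new IllegalArgumentException("Reserved metadata key: " + key);
            }
            // General conflict check for keys that are already present.
            if (merged.containsKey(key)) {
                throw new IllegalArgumentException("Duplicate metadata key: " + key);
            }
            merged.put(key, value);
        });
        return merged;
    }
}
```

The dedicated check matters when the reserved key is not yet in the map at merge time: the general duplicate check would not fire, and the value would only be replaced later, without an error.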

Member

@wgtmac wgtmac left a comment

+1

Thanks @clairemcginty and @ConeyLiu!

@wgtmac wgtmac merged commit 19f2843 into apache:master Jan 28, 2024
9 checks passed
@clairemcginty clairemcginty deleted the parquet-writer-metadata branch January 29, 2024 15:34