[WIP] [PREVIEW] use ByteBuddy code generation to write proto to parquet faster #3121

Open: gowa wants to merge 2 commits into base: master
@gowa gowa commented Jan 12, 2025

Note: this pull request is not ready to merge. It is a request to check whether the repository owners are interested in the concept of using code generation to address performance issues caused by protobuf reflection when writing or reading parquet files. I decided to verify the concept at a rather early stage because of the significant effort required to implement the change. Should the approach and a new optional dependency on ByteBuddy be found satisfactory and acceptable for inclusion in parquet-java, I will attempt to properly finish first the 'write' part and then the 'read' part (in terms of code quality and tests). Any feedback is appreciated.

Rationale for this change

We read and write a lot of parquet data defined by protobuf schemas from Java, and we have found that this can be done faster than what is offered out of the box today.
This change improves proto-to-parquet file writing performance by means of code generation (in my synthetic tests by around 50% with SNAPPY compression, especially when structures have a lot of primitive-type fields).

What changes are included in this PR?

  1. An extension point in MessageWriter that redirects writing to a class generated on the fly, which calls the getters of the protobuf-generated classes directly rather than going through Protobuf Java Reflection methods.
  2. A separate class where all code generation logic is located.
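
To illustrate the idea behind item 1, here is a minimal sketch that contrasts a reflective field write with a direct getter call. It is not the code in this PR: plain `java.lang.reflect` stands in for Protobuf Java Reflection, the hand-written `DirectWriter` stands in for what a ByteBuddy-generated class might look like, and `Point`, `RecordingConsumer`, and `PointWriter` are hypothetical names introduced only for this example.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class DirectGetterSketch {

    // Hypothetical stand-in for a protobuf-generated message class.
    static final class Point {
        private final int x;
        private final long y;
        Point(int x, long y) { this.x = x; this.y = y; }
        int getX() { return x; }
        long getY() { return y; }
    }

    // Hypothetical stand-in for a Parquet record consumer; it only records calls.
    static final class RecordingConsumer {
        final List<String> events = new ArrayList<>();
        void addInteger(String field, int value) { events.add(field + "=" + value); }
        void addLong(String field, long value) { events.add(field + "=" + value); }
    }

    interface PointWriter {
        void write(Point p, RecordingConsumer out);
    }

    // Reflective path: getters are looked up by name and results come back boxed.
    static final class ReflectiveWriter implements PointWriter {
        @Override
        public void write(Point p, RecordingConsumer out) {
            try {
                Method getX = Point.class.getDeclaredMethod("getX");
                Method getY = Point.class.getDeclaredMethod("getY");
                out.addInteger("x", (Integer) getX.invoke(p)); // boxed, then unboxed
                out.addLong("y", (Long) getY.invoke(p));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    // Direct path: the shape of code a generated class could contain.
    // Getters are called statically and primitives are never boxed.
    static final class DirectWriter implements PointWriter {
        @Override
        public void write(Point p, RecordingConsumer out) {
            out.addInteger("x", p.getX());
            out.addLong("y", p.getY());
        }
    }

    public static void main(String[] args) {
        Point p = new Point(7, 9L);
        RecordingConsumer viaReflection = new RecordingConsumer();
        RecordingConsumer viaDirectCalls = new RecordingConsumer();
        new ReflectiveWriter().write(p, viaReflection);
        new DirectWriter().write(p, viaDirectCalls);
        // Both paths produce the same sequence of writes.
        System.out.println(viaReflection.events.equals(viaDirectCalls.events));
    }
}
```

Both writers emit the same events; the difference is only in how the values are obtained, which is where a generated class could plausibly save time for messages with many primitive fields.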

Are these changes tested?

The current unit tests pass.

Are there any user-facing changes?

A configuration option to disable the code generation logic.

@wgtmac wgtmac (Member) commented Jan 14, 2025

Thanks for your interest in contributing this! This seems to be a large feature and the performance gain is promising! However, I'm afraid that this PR may not get a prompt review due to a lack of active parquet-protobuf maintainers. I do not have any knowledge of ByteBuddy, so it might take a long time to wrap it up. Is it possible to make it pluggable so that the bulk of the codegen logic does not have to live in the parquet-java repo?

cc @gszadovszky @julienledem in case you know someone who can help review this.
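
For illustration, one way the pluggability question above could be approached is a service interface discovered through `java.util.ServiceLoader`, with a fallback to the built-in path when no provider is present or codegen is disabled. This is only a sketch under assumptions: `RecordWriter`, `RecordWriterFactory`, `ReflectionWriter`, and the `codegenEnabled` flag are hypothetical names and are not existing parquet-java APIs.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

final class PluggableWriterSketch {

    // Hypothetical writer abstraction; parquet-java's real MessageWriter differs.
    interface RecordWriter {
        String describe();
    }

    // Hypothetical service interface. An external codegen module would implement it
    // and register the implementation under META-INF/services.
    interface RecordWriterFactory {
        RecordWriter create(Class<?> messageType);
    }

    // Built-in fallback standing in for the existing reflection-based writer.
    static final class ReflectionWriter implements RecordWriter {
        @Override
        public String describe() { return "reflection"; }
    }

    // Picks the first registered factory when codegen is enabled,
    // otherwise (or when none is registered) uses the built-in writer.
    static RecordWriter writerFor(Class<?> messageType, boolean codegenEnabled) {
        if (codegenEnabled) {
            Iterator<RecordWriterFactory> factories =
                    ServiceLoader.load(RecordWriterFactory.class).iterator();
            if (factories.hasNext()) {
                return factories.next().create(messageType);
            }
        }
        return new ReflectionWriter();
    }

    public static void main(String[] args) {
        // With no provider on the classpath this falls back to the built-in writer.
        System.out.println(writerFor(String.class, true).describe());
    }
}
```

With this shape, the core module would only carry the small interface and the lookup, while the ByteBuddy dependency and generation logic could live in a separate artifact.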

@gszadovszky (Contributor)

The noted performance gain is promising indeed. However, it would be nice to see actual numbers for both read and write in different scenarios (flat columns, general nested columns, deeply nested columns). You might even implement your performance tests in the parquet-benchmarks module.

Unfortunately, I'm not an expert in parquet-protobuf either, let alone ByteBuddy. For the final PR review it would be a great help to have someone who has some experience with ByteBuddy, even if they are not a Parquet committer.
