[WIP] [PREVIEW] use ByteBuddy code generation to write proto to parquet faster #3121
Note: this pull request is not ready to merge. It is a request to check whether the concept of using code generation to address performance issues caused by protobuf reflection when writing or reading Parquet files is of interest to the repository owners. I decided to validate the concept at a rather early stage because implementing the change requires significant effort. If the approach and the new optional dependency on ByteBuddy are found satisfactory and potentially acceptable for inclusion in parquet-java, I will properly finish the 'write' part first and then the 'read' part (in terms of code quality and tests). Any feedback is appreciated.
Rationale for this change
We read and write a lot of Parquet data defined by protobuf schemas from Java. This can be done faster than what is offered out of the box today.
This change improves proto-to-Parquet write performance by means of code generation (by around 50% in my synthetic tests with SNAPPY compression, especially when structures have many primitive-type fields).
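To illustrate why generated code can help, here is a minimal sketch using only the JDK. It is not the code in this PR: `Point`, `Sink`, `writeReflective` and `writeSpecialized` are hypothetical names, `java.lang.reflect` stands in for protobuf's descriptor-based reflection, and `Sink` stands in for a Parquet record consumer. The specialized method is hand-written here; the idea is that a runtime code generator such as ByteBuddy could emit something equivalent per message type.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ReflectionVsGenerated {

    // Hypothetical stand-in for a protobuf message with primitive fields.
    public static final class Point {
        private final int x;
        private final long y;
        public Point(int x, long y) { this.x = x; this.y = y; }
        public int getX() { return x; }
        public long getY() { return y; }
    }

    // Hypothetical stand-in for a Parquet record consumer: records what was written.
    public static final class Sink {
        public final List<String> out = new ArrayList<>();
        public void addInteger(int v) { out.add("int:" + v); }
        public void addLong(long v) { out.add("long:" + v); }
    }

    // Generic path: looks up each getter by name, boxes the value,
    // and branches on its runtime type for every field of every record.
    public static void writeReflective(Object msg, String[] getters, Sink sink) {
        for (String name : getters) {
            Object value;
            try {
                Method m = msg.getClass().getMethod(name);
                value = m.invoke(msg);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot read " + name, e);
            }
            if (value instanceof Integer) {
                sink.addInteger((Integer) value);
            } else if (value instanceof Long) {
                sink.addLong((Long) value);
            } else {
                throw new IllegalArgumentException("unsupported type: " + value.getClass());
            }
        }
    }

    // Specialized path: direct getter calls, no lookup, no boxing, no type branching.
    // Generated code for a concrete message type would aim to look like this.
    public static void writeSpecialized(Point msg, Sink sink) {
        sink.addInteger(msg.getX());
        sink.addLong(msg.getY());
    }

    public static void main(String[] args) {
        Point p = new Point(7, 42L);
        Sink reflective = new Sink();
        Sink specialized = new Sink();
        writeReflective(p, new String[] {"getX", "getY"}, reflective);
        writeSpecialized(p, specialized);
        // Both paths should write the same values; only the per-record cost differs.
        System.out.println(reflective.out.equals(specialized.out));
    }
}
```

Both paths write the same values. The difference is that the specialized path avoids per-field lookup, boxing and type dispatch, which is consistent with the gain being most visible on messages with many primitive fields.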
What changes are included in this PR?
Are these changes tested?
The current unit tests pass.
Are there any user-facing changes?
A configuration option to disable the code generation logic.