feat: Filter Parquet pages with ParquetColumnExpr #20714

Open
wants to merge 12 commits into
base: main

coastalwhite
Collaborator

@coastalwhite coastalwhite commented Jan 14, 2025

This PR adds a ParquetColumnExpr, which allows predicate filtering while reading Parquet pages. Although the current implementation has many limitations, it can eventually allow much more granular filtering of items without having to traverse all pages. This is especially beneficial for equality predicates and predicates over dictionary-encoded pages. Another nice side effect is that it should massively reduce memory consumption for strict queries.
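
To illustrate why dictionary-encoded pages benefit in particular, here is a minimal Python sketch of the general idea, not the code in this PR. It assumes a page is represented as a dictionary of distinct values plus one index per row; the function name `filter_dictionary_page` and that representation are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence


def filter_dictionary_page(
    dictionary: Sequence[str],
    indices: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[str]:
    """Return the values of one dictionary-encoded page that satisfy `predicate`.

    Hypothetical helper: `dictionary` holds the distinct values of the page and
    `indices` holds one dictionary index per row.
    """
    # Evaluate the predicate once per distinct dictionary entry, not once per row.
    keep = [predicate(value) for value in dictionary]

    # If no dictionary entry matches, the page can be skipped without touching rows.
    if not any(keep):
        return []

    # Only materialize the rows whose dictionary index passed the predicate.
    return [dictionary[i] for i in indices if keep[i]]


dictionary = ["apple", "banana", "cherry"]
indices = [0, 2, 2, 1, 0, 2]
matches = filter_dictionary_page(dictionary, indices, lambda v: v == "cherry")
print(matches)  # ['cherry', 'cherry', 'cherry']
```

In this sketch the predicate runs three times (once per dictionary entry) rather than six times (once per row), and rows that do not match are never materialized, which is where the memory saving would come from.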

At the moment, this is only triggered if the predicate is a single binary expression with a column on one side and a scalar on the other side.

It can be disabled by setting the environment variable POLARS_NO_PARQUET_EXPR=1.
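
The trigger condition and the opt-out can be sketched together in Python as follows. This is an illustrative model only: the `Column`, `Scalar`, and `BinaryExpr` classes and the `is_page_filter_candidate` function are hypothetical and are not Polars types. How Polars itself parses the variable's value is also an assumption here; the sketch treats only the literal value `1` as disabling the feature.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Scalar:
    value: object


@dataclass(frozen=True)
class BinaryExpr:
    left: "Expr"
    op: str
    right: "Expr"


# Hypothetical expression model, just rich enough to state the rule.
Expr = Union[Column, Scalar, BinaryExpr]


def is_page_filter_candidate(
    expr: Expr, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True if `expr` has the shape described above and the opt-out is unset."""
    env = os.environ if env is None else env

    # Assumption: only the exact value "1" disables the feature.
    if env.get("POLARS_NO_PARQUET_EXPR") == "1":
        return False

    # Must be exactly one binary expression, not a nested tree of them.
    if not isinstance(expr, BinaryExpr):
        return False

    # One side must be a column and the other a scalar, in either order.
    sides = (expr.left, expr.right)
    has_column = any(isinstance(side, Column) for side in sides)
    has_scalar = any(isinstance(side, Scalar) for side in sides)
    return has_column and has_scalar
```

Under this model, `a == 1` and `1 < a` qualify, while `a == b` or `(a == 1) & (b == 2)` do not.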

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 14, 2025
@coastalwhite coastalwhite added the needs-bench Needs a benchmark run label Jan 15, 2025
@coastalwhite coastalwhite marked this pull request as ready for review January 15, 2025 16:03
@coastalwhite coastalwhite changed the title feat: Start working with ParquetColumnExpr feat: Filter Parquet pages with ParquetColumnExpr Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 46.83673% with 1042 lines in your changes missing coverage. Please review.

Project coverage is 79.44%. Comparing base (a0d96f2) to head (332161a).
Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...-parquet/src/arrow/read/deserialize/binview/mod.rs | 18.51% | 110 Missing ⚠️ |
| ...w/read/deserialize/dictionary_encoded/predicate.rs | 54.36% | 94 Missing ⚠️ |
| ...et/src/arrow/read/deserialize/binview/predicate.rs | 0.00% | 83 Missing ⚠️ |
| ...olars-parquet/src/arrow/read/deserialize/simple.rs | 53.65% | 76 Missing ⚠️ |
| ...et/src/arrow/read/deserialize/fixed_size_binary.rs | 25.74% | 75 Missing ⚠️ |
| ...rs-parquet/src/arrow/read/deserialize/utils/mod.rs | 48.76% | 62 Missing ⚠️ |
| crates/polars-parquet/src/arrow/read/expr.rs | 12.69% | 55 Missing ⚠️ |
| ...quet/src/arrow/read/deserialize/primitive/float.rs | 23.52% | 39 Missing ⚠️ |
| crates/polars-io/src/predicates.rs | 63.00% | 37 Missing ⚠️ |
| crates/polars-arrow/src/array/binview/mutable.rs | 0.00% | 34 Missing ⚠️ |

... and 32 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20714      +/-   ##
==========================================
+ Coverage   79.03%   79.44%   +0.40%     
==========================================
  Files        1559     1570      +11     
  Lines      221238   223230    +1992     
  Branches     2529     2530       +1     
==========================================
+ Hits       174851   177340    +2489     
+ Misses      45806    45308     -498     
- Partials      581      582       +1     


Labels
enhancement New feature or an improvement of an existing feature needs-bench Needs a benchmark run python Related to Python Polars rust Related to Rust Polars
1 participant