-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34785: [C++][Parquet] Add bloom filter write support #37400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This is port of #35691 . I'm busy previous days and now I've time on it now. The previous comment are solved. cc @pitrou @wgtmac @emkornfield |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this! I just did an initial review except the test.
|
Thanks for the review, @HuaHuaY. I have addressed all your comments. |
|
As I have made significant changes to this PR, I don't think I have the privilege to approve it. Could you please take a look when you have time? @pitrou |
|
Yes, I should do that. Sorry! |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, I haven't looked at the tests. Overall the APIs look a bit complicated and clumsy to me.
6c05857 to
789d130
Compare
|
@pitrou Thanks for your review! I've addressed all your comments. Please take a look again. |
|
Gentle ping @pitrou. Is it possible to include it to the next release? |
|
Hmm, I'm really sorry. I will make some time to review it quickly, but I'm afraid it's probably too late for this release as feature freeze in already in full force and @raulcd has started producing a RC :( |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much better now, here are some more minor comments. Thanks for tackling this @wgtmac !
| if (value == std::nullopt) { | ||
| continue; | ||
| } | ||
| EXPECT_TRUE(bloom_filter.FindHash(bloom_filter.Hash(value.value()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not mandatory, but if we had FindHash(const Scalar&), it could remove the need for templating in these test functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a separate improvement because the current bloom filter implementation does not deal with any Arrow type.
|
This needs rebasing on git main, by the way, also I think you'll need to update the fuzzing harness, for example in arrow/cpp/src/parquet/arrow/fuzz_internal.cc Lines 160 to 175 in d54a205
|
|
Thanks for your review @pitrou. I think I've addressed all comments and rebased on top of the main branch. I think fuzz_internal.cc has already dealt with reading bloom filter, am I missing something? |
I mean, perhaps the changed APIs mandate some code changes there? Let's see run the CI anyway. |
|
@github-actions crossbow submit -g cpp |
|
Revision: a126e03 Submitted crossbow builds: ursacomputing/crossbow @ actions-b54235b7c8 |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple more minor comments, but LGTM in general.
cpp/src/parquet/properties.h
Outdated
| /// | 100,000 | 0.10 | ~6.0 | 75 KB | **128 KB** | | ||
| /// | 100,000 | 0.01 | ~10.5 | 131 KB | **256 KB** | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how you computed those numbers? Using the formula from OptimalNumOfBits (-8.0 / log(1 - pow(fpp, 1.0 / 8)) I get 5.77 bits / key for fpp = 0.1 (and 9.68 bits / key for fpp = 0.01).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that is unexpected. I'll open an issue on parquet-format. Regardless, it seems this table should list the actual numbers as obtained through our own OptimalNumOfBits function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/parquet-format#547 opened.
| /// | ndv | fpp | bits/key | size | | ||
| /// |:-----------|:------|:---------|:----------| | ||
| /// | 100,000 | 0.10 | 10.5 | 128 KiB | | ||
| /// | 100,000 | 0.01 | 10.5 | 128 KiB | | ||
| /// | 1,000,000 | 0.10 | 8.4 | 1024 KiB | | ||
| /// | 1,000,000 | 0.01 | 16.8 | 2048 KiB | | ||
| /// | 10,000,000 | 0.10 | 6.7 | 8192 KiB | | ||
| /// | 10,000,000 | 0.01 | 13.4 | 16384 KiB | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 0.05 is the default, perhaps show it as well? (sorry, should have noticed before!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense! Updated.
6e35c3e to
0638b11
Compare
Rationale for this change
Currently we allow reading bloom filter for specific column and row group. Now this patch allows writing bloom filters to Parquet files.
What changes are included in this PR?
Allow writing bf:
ParquetPageIndexRoundTripTestAre these changes tested?
Yes
Are there any user-facing changes?
User can create Bloom Filter in parquet with C++ api