Skip to content

Conversation

@mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 26, 2023

Rationale for this change

Currently we allow reading bloom filter for specific column and row group. Now this patch allows writing bloom filters to Parquet files.

What changes are included in this PR?

Allow writing bf:

  • Add WriterProperties config for writing bloom filter, including bf and (per-rowgroup) ndv estimation.
  • Add BloomFilterBuilder for parquet
  • From FileSerializer to ColumnWriter, adding bloomfilter
  • Ensure Bloom Filter info is written to the file
  • Testing logic for BloomFilterBuilder
  • Testing logic for BloomFilter and ColumnWriter
  • Testing whole roundtrip like ParquetPageIndexRoundTripTest

Are these changes tested?

Yes

Are there any user-facing changes?

User can create Bloom Filter in parquet with C++ api

@mapleFU
Copy link
Member Author

mapleFU commented Aug 26, 2023

This is port of #35691 . I'm busy previous days and now I've time on it now.

The previous comment are solved. cc @pitrou @wgtmac @emkornfield

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 26, 2023
@mapleFU mapleFU requested review from emkornfield and pitrou August 26, 2023 18:43
@mapleFU mapleFU changed the title GH-34785: [C++][Parquet] Parquet Bloom Filter Implement GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement Aug 27, 2023
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! I just did an initial review except the test.

@wgtmac wgtmac changed the title GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation Aug 30, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 1, 2023
@wgtmac
Copy link
Member

wgtmac commented Nov 26, 2025

Thanks for the review, @HuaHuaY. I have addressed all your comments.

@wgtmac
Copy link
Member

wgtmac commented Nov 26, 2025

As I have made significant changes to this PR, I don't think I have the privilege to approve it. Could you please take a look when you have time? @pitrou

@pitrou
Copy link
Member

pitrou commented Nov 26, 2025

Yes, I should do that. Sorry!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, I haven't looked at the tests. Overall the APIs look a bit complicated and clumsy to me.

@wgtmac wgtmac force-pushed the parquet/support-write-bloom-filter branch from 6c05857 to 789d130 Compare December 6, 2025 10:32
@wgtmac
Copy link
Member

wgtmac commented Dec 6, 2025

@pitrou Thanks for your review! I've addressed all your comments. Please take a look again.

@wgtmac
Copy link
Member

wgtmac commented Jan 13, 2026

Gentle ping @pitrou. Is it possible to include it to the next release?

@pitrou
Copy link
Member

pitrou commented Jan 13, 2026

Hmm, I'm really sorry. I will make some time to review it quickly, but I'm afraid it's probably too late for this release as feature freeze in already in full force and @raulcd has started producing a RC :(

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much better now, here are some more minor comments. Thanks for tackling this @wgtmac !

if (value == std::nullopt) {
continue;
}
EXPECT_TRUE(bloom_filter.FindHash(bloom_filter.Hash(value.value())));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not mandatory, but if we had FindHash(const Scalar&), it could remove the need for templating in these test functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a separate improvement because the current bloom filter implementation does not deal with any Arrow type.

@pitrou
Copy link
Member

pitrou commented Jan 13, 2026

This needs rebasing on git main, by the way, also I think you'll need to update the fuzzing harness, for example in

Status FuzzReadPageIndex(RowGroupPageIndexReader* reader, const SchemaDescriptor* schema,
int column) {
Status st;
BEGIN_PARQUET_CATCH_EXCEPTIONS
auto offset_index = reader->GetOffsetIndex(column);
if (offset_index) {
offset_index->page_locations();
offset_index->unencoded_byte_array_data_bytes();
}
auto col_index = reader->GetColumnIndex(column);
if (col_index) {
st &= FuzzReadColumnIndex(col_index.get(), schema->Column(column));
}
END_PARQUET_CATCH_EXCEPTIONS
return st;
}

@wgtmac
Copy link
Member

wgtmac commented Jan 14, 2026

Thanks for your review @pitrou. I think I've addressed all comments and rebased on top of the main branch. I think fuzz_internal.cc has already dealt with reading bloom filter, am I missing something?

@pitrou
Copy link
Member

pitrou commented Jan 14, 2026

I think fuzz_internal.cc has already dealt with reading bloom filter, am I missing something?

I mean, perhaps the changed APIs mandate some code changes there? Let's see run the CI anyway.

@pitrou
Copy link
Member

pitrou commented Jan 14, 2026

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: a126e03

Submitted crossbow builds: ursacomputing/crossbow @ actions-b54235b7c8

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-experimental-cpp-gcc-15 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple more minor comments, but LGTM in general.

Comment on lines 193 to 194
/// | 100,000 | 0.10 | ~6.0 | 75 KB | **128 KB** |
/// | 100,000 | 0.01 | ~10.5 | 131 KB | **256 KB** |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you computed those numbers? Using the formula from OptimalNumOfBits (-8.0 / log(1 - pow(fpp, 1.0 / 8)) I get 5.77 bits / key for fpp = 0.1 (and 9.68 bits / key for fpp = 0.01).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is unexpected. I'll open an issue on parquet-format. Regardless, it seems this table should list the actual numbers as obtained through our own OptimalNumOfBits function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac wgtmac changed the title GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation GH-34785: [C++][Parquet] Add bloom filter write support Jan 14, 2026
Comment on lines 192 to 202
/// | ndv | fpp | bits/key | size |
/// |:-----------|:------|:---------|:----------|
/// | 100,000 | 0.10 | 10.5 | 128 KiB |
/// | 100,000 | 0.01 | 10.5 | 128 KiB |
/// | 1,000,000 | 0.10 | 8.4 | 1024 KiB |
/// | 1,000,000 | 0.01 | 16.8 | 2048 KiB |
/// | 10,000,000 | 0.10 | 6.7 | 8192 KiB |
/// | 10,000,000 | 0.01 | 13.4 | 16384 KiB |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 0.05 is the default, perhaps show it as well? (sorry, should have noticed before!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense! Updated.

@wgtmac wgtmac force-pushed the parquet/support-write-bloom-filter branch from 6e35c3e to 0638b11 Compare January 14, 2026 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Allow writing BloomFilter for specific column

10 participants