GH-48728: [C++] Cache compiled regex matchers in string kernels #48729

HyukjinKwon · 2026-01-06T00:32:45Z

Rationale for this change

String kernels that take regex patterns (match, replace, extract) recompiled the pattern on every invocation. This PR caches the compiled object so that each pattern is compiled once and then reused.

A benchmark shows a roughly 36% reduction in wall time (2.52s -> 1.61s for 200 repetitions of one test).

arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

Line 1371 in 727106f

// TODO Cache matcher across invocations (for regex compilation)

arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

Line 1381 in 727106f

// TODO Cache matcher across invocations (for regex compilation)

arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

Line 1965 in 727106f

// TODO Cache replacer across invocations (for regex compilation)

arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

Line 2218 in 727106f

// TODO cache this once per ExtractRegexOptions

What changes are included in this PR?

Added CachedOptionsWrapper<T> template for kernel state with caching support
Updated MatchSubstringState, ReplaceState, and ExtractRegexState to use caching
Modified Exec() methods to call GetOrCreate<Matcher>() instead of direct Matcher::Make()
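
As a rough illustration of the change list above, here is a minimal, self-contained sketch of a state object that compiles a matcher once and reuses it. It is not the PR's code: `PatternOptions`, `RegexMatcher`, `CachedOptionsState`, and `CountMatches` are hypothetical names, `std::regex` stands in for the real regex engine, and the Arrow `KernelState` base class and `Result` return type are left out.

```cpp
#include <cassert>
#include <memory>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel options struct (not an Arrow type).
struct PatternOptions {
  std::string pattern;
};

// Hypothetical matcher exposing a static Make(), mirroring the Matcher::Make()
// call mentioned above. std::regex stands in for the real regex engine.
struct RegexMatcher {
  explicit RegexMatcher(const std::string& pattern) : re(pattern) {}

  static std::unique_ptr<RegexMatcher> Make(const PatternOptions& options) {
    ++compile_count;  // instrumentation for this sketch only
    return std::make_unique<RegexMatcher>(options.pattern);
  }

  bool Match(const std::string& s) const { return std::regex_search(s, re); }

  static int compile_count;
  std::regex re;
};
int RegexMatcher::compile_count = 0;

// Sketch of a state that owns the options plus a lazily built, type-erased cache.
template <typename OptionsType>
struct CachedOptionsState {
  explicit CachedOptionsState(OptionsType opts) : options(std::move(opts)) {}

  // Build the object on first use, then return the same instance afterwards.
  // The caller must always ask for the same ObjectType; nothing checks this.
  template <typename ObjectType>
  const ObjectType* GetOrCreate() {
    if (!cached_object) {
      cached_object = std::shared_ptr<ObjectType>(ObjectType::Make(options));
    }
    return static_cast<const ObjectType*>(cached_object.get());
  }

  OptionsType options;
  std::shared_ptr<void> cached_object;
};

// Models one Exec() call: fetch the cached matcher and apply it to every value.
int CountMatches(CachedOptionsState<PatternOptions>& state,
                 const std::vector<std::string>& values) {
  const RegexMatcher* matcher = state.GetOrCreate<RegexMatcher>();
  int count = 0;
  for (const auto& value : values) {
    if (matcher->Match(value)) ++count;
  }
  return count;
}
```

Calling `CountMatches` repeatedly on the same state compiles the pattern only on the first call; a new state starts with an empty cache.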

Are these changes tested?

Yes. All existing tests pass. The benchmark demonstrates a measurable performance improvement when the same pattern is used across multiple operations.

(Generated by ChatGPT)

Benchmark:

# Benchmark script: Compare WITH vs WITHOUT caching

# Step 1: Measure WITH caching (current implementation)
cd /.../arrow/cpp/build
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 2: Temporarily remove caching
cd /.../arrow/cpp
git stash push -m "Temp for benchmark" src/arrow/compute/kernels/scalar_string_ascii.cc

# Step 3: Rebuild WITHOUT caching
cd build
touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .

# Step 4: Measure WITHOUT caching (reverted to old TODO code)
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 5: Restore caching
cd ..
git stash pop
cd build && touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .

Results:

╔════════════════════════════════════════════════════════╗
║              BENCHMARK RESULTS                         ║
╠════════════════════════════════════════════════════════╣
║  WITHOUT Caching:      2.52 seconds                    ║
║  WITH Caching:         1.61 seconds                    ║
║  ─────────────────────────────────────                 ║
║  Time Saved:           0.91 seconds                    ║
║  Improvement:          36.1% FASTER                    ║
╚════════════════════════════════════════════════════════╝

Test Configuration:
  • Test: TestStringKernels/0.MatchSubstringRegex
  • Iterations: 200 repetitions
  • Pattern: Complex regex with groups/alternation
  • Per-operation: 12.6ms → 8.05ms (4.55ms saved)

Are there any user-facing changes?

No, this is an optimization.

GitHub Issue: [C++] Cache compiled regex matchers in string kernels #48728

github-actions · 2026-01-06T00:33:10Z

⚠️ GitHub issue #48728 has been automatically assigned in GitHub to PR creator.

pitrou

Thanks for this. This ties the cached object's lifetime to the lifetime of the KernelState, right?

The problem is that a new KernelState is created each time GetFunctionExecutor is called (through KernelState::Init). So in most cases the caching would not be very effective. I suppose the caching may work for chunked array inputs, though?
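
The lifetime concern can be modeled with a small, self-contained sketch. This is a simplified model, not Arrow's executor: `State`, `ExecChunk`, and `CallOnce` are hypothetical names, and the only assumption taken from the comment above is that each top-level call builds a fresh state while the chunks within one call share it.

```cpp
#include <cassert>
#include <memory>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Counts regex compilations; instrumentation for this sketch only.
static int g_compile_count = 0;

// Hypothetical per-call state holding a lazily compiled regex.
struct State {
  explicit State(std::string p) : pattern(std::move(p)) {}

  const std::regex& Matcher() {
    if (!cached) {
      ++g_compile_count;
      cached = std::make_unique<std::regex>(pattern);
    }
    return *cached;
  }

  std::string pattern;
  std::unique_ptr<std::regex> cached;
};

using Chunk = std::vector<std::string>;

// Models one kernel invocation on a single chunk.
int ExecChunk(State& state, const Chunk& chunk) {
  int count = 0;
  for (const auto& value : chunk) {
    if (std::regex_search(value, state.Matcher())) ++count;
  }
  return count;
}

// Models one top-level call: a fresh state is created, then all chunks reuse it.
int CallOnce(const std::string& pattern, const std::vector<Chunk>& chunks) {
  State state(pattern);
  int total = 0;
  for (const auto& chunk : chunks) total += ExecChunk(state, chunk);
  return total;
}
```

In this model the pattern is compiled once per `CallOnce`, however many chunks there are, but again on every new call, which is the limitation the comment describes.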

pitrou · 2026-01-12T17:17:35Z

cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

+// Similar to OptionsWrapper, but caches a compiled object to avoid recompiling on each
+// invocation (e.g., regex matchers). Follows the same pattern as OptionsWrapper.
+template <typename OptionsType>
+struct CachedOptionsWrapper : public KernelState {


Can this perhaps inherit from OptionsWrapper if it can reduce the amount of additional code?

pitrou · 2026-01-12T17:22:39Z

cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

+  // Get or create cached object of a specific type
+  template <typename ObjectType, typename... Args>
+  Result<const ObjectType*> GetOrCreate(Args&&... args) {


Since this is meant to be independent of ObjectType, do you think it's worth making it take a callable, to avoid relying on the existence of ObjectType::Make?

Something like this perhaps:

Suggested change

-  // Get or create cached object of a specific type
-  template <typename ObjectType, typename... Args>
-  Result<const ObjectType*> GetOrCreate(Args&&... args) {
+  // Get or create cached object of a specific type
+  template <typename ObjectType, typename... Args>
+  auto GetOrCreate(Factory&& factory, Args&&... args)
+      -> Result<const std::decay_t<decltype(factory(args...))>*> {
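
A minimal sketch of the callable-based idea, under some assumptions: the suggestion as quoted names `Factory` without declaring it, so this sketch declares `Factory` as a template parameter; it returns a plain pointer instead of an Arrow `Result`; and `CacheHolder`, `MatchesCached`, and `g_factory_calls` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <memory>
#include <regex>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical holder with a type-erased cache; not the PR's class.
struct CacheHolder {
  // Invoke the factory on first use only and cache whatever it returns.
  // The cached type is deduced from the factory, so no ObjectType::Make is needed.
  // Callers must pass a factory with the same return type on every call.
  template <typename Factory, typename... Args>
  auto GetOrCreate(Factory&& factory, Args&&... args)
      -> const std::decay_t<decltype(factory(args...))>* {
    using ObjectType = std::decay_t<decltype(factory(args...))>;
    if (!cached_object) {
      cached_object = std::make_shared<ObjectType>(
          std::forward<Factory>(factory)(std::forward<Args>(args)...));
    }
    return static_cast<const ObjectType*>(cached_object.get());
  }

  std::shared_ptr<void> cached_object;
};

// Counts factory invocations; instrumentation for this sketch only.
int g_factory_calls = 0;

// Compiles the pattern through the holder (at most once) and searches the input.
bool MatchesCached(CacheHolder& holder, const std::string& pattern,
                   const std::string& input) {
  const std::regex* re = holder.GetOrCreate(
      [](const std::string& p) {
        ++g_factory_calls;
        return std::regex(p);
      },
      pattern);
  return std::regex_search(input, *re);
}
```

The factory runs only while the cache is empty, so later calls ignore their arguments; that matches a state whose options never change after construction.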

pitrou · 2026-01-12T17:22:53Z

cpp/src/arrow/compute/kernels/scalar_string_ascii.cc

+  // Type-erased cache for compiled objects (can store any object type)
+  std::shared_ptr<void> cached_object;


Why not std::any?
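
For comparison, a minimal sketch of what a `std::any`-based cache could look like. `AnyCache` and `MatchWithAnyCache` are hypothetical names, and `std::regex` stands in for the cached matcher. Two properties of `std::any` matter here: the stored type must be copy-constructible, and the pointer form of `std::any_cast` returns `nullptr` on a type mismatch instead of silently casting.

```cpp
#include <any>
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical cache built on std::any rather than std::shared_ptr<void>.
struct AnyCache {
  // Store the factory's result on first use, then return a typed pointer.
  // Returns nullptr if the cached object is not an ObjectType.
  template <typename ObjectType, typename Factory>
  const ObjectType* GetOrCreate(Factory&& factory) {
    if (!cached_object.has_value()) {
      cached_object = std::forward<Factory>(factory)();
    }
    return std::any_cast<ObjectType>(&cached_object);
  }

  std::any cached_object;
};

// Searches the input with a regex that is compiled at most once per cache.
bool MatchWithAnyCache(AnyCache& cache, const std::string& input) {
  const std::regex* re =
      cache.GetOrCreate<std::regex>([] { return std::regex("a+b"); });
  return re != nullptr && std::regex_search(input, *re);
}
```

The type check is the main gain over `std::shared_ptr<void>`; the copy-constructibility requirement is the main cost if the cached object cannot be copied.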

github-actions bot added the Component: C++ and awaiting review labels Jan 6, 2026

HyukjinKwon force-pushed the GH-48728 branch from bdf2aaf to a6c7b6e January 6, 2026 00:40

[C++] Cache compiled regex matchers in string kernels

7e216cd

HyukjinKwon force-pushed the GH-48728 branch from a6c7b6e to 7e216cd January 6, 2026 01:51

pitrou reviewed Jan 12, 2026

github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 12, 2026
