-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: optimize CASE WHEN for divide-by-zero protection pattern #19994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: optimize CASE WHEN for divide-by-zero protection pattern #19994
Conversation
267d8df to
334785f
Compare
|
run benchmark tpcds |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Hi @andygrove , |
|
run benchmark tpcds |
|
Also FYI @pepijnve |
|
🤖 |
|
It might be useful to add a microbenchmark to https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/benches/case_when.rs so we can compare before/after. I don't think any of the existing ones will cover this particular pattern. The TCP-DS queries that contain this pattern (based on |
|
🤖: Benchmark completed Details
|
- Adds a specialization for the common pattern: CASE WHEN y > 0 THEN x / y ELSE NULL END - Add EvalMethod::DivideByZeroProtection variant - Add pattern detection in find_best_eval_method() - Implement safe_divide using Arrow kernels - Handle CastExpr wrapping on divisor
334785f to
6479884
Compare
|
Microbenchmark result:
|
|
Hi @pepijnve , thanks for pointing out the need for a microbenchmark. I compared |
| } | ||
| } | ||
|
|
||
| fn benchmark_divide_by_zero_protection(c: &mut Criterion, batch_size: usize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please pull this additional benchmark code code into its own PR we could then use our benchmarking scripts to compare performance of this PR with main?
Thank you 🙏
Which issue does this PR close?
Rationale for this change
The
CaseExprimplementation is expensive. A common usage pattern (particularly in TPC-DS benchmarks) is to protect against divide-by-zero:This entire expression can be replaced with a simpler divide operation that returns NULL when the divisor is zero, avoiding the overhead of full CASE evaluation.
What changes are included in this PR?
Are these changes tested?
Yes, added two new tests:
Are there any user-facing changes?
No. This is an internal optimization that produces the same results but with better performance.