
Conversation

@KKould (Member) commented Jan 20, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Databend’s Python UDFs are currently implemented via in-process calls using PyO3, which introduces several key limitations:

  • GIL constraints: true multi-core parallelism cannot be achieved.
  • Environment conflicts: Python dependencies required by different UDFs on the same node are prone to conflicts.
  • Stability risks: segfaults in the Python layer can directly crash the query process.
  • Resource contention: Python’s memory usage is not well controlled and can interfere with the SQL executor’s buffer usage.

This PR introduces support for remotely and dynamically executing Python UDFs in Databend, organized into three layers:

  • Control Plane (Cloud Control Plane): responsible for resource scheduling, permission validation, and sandbox lifecycle management.
  • Execution Plane (Databend Query): acts as the client and issues computation requests via the Arrow Flight protocol.
  • Compute Plane (Sandbox Workers): lightweight Python environments isolated with gVisor, running the databend-udf service.

Workflow

+------------------------+   ApplyResource   +------------------------+
|   Databend Query       | ----------------> |   Cloud Controller     |
|  (Execution Plane)     | <---------------- |  Resource Manager      |
|  - UDF Planner         |   Endpoint+Token  |  Image Cache/Warm Pool |
+-----------+------------+                   +-----------+------------+
            |   Arrow Flight (DoExchange)                |
            |                                            | Provision
            v                                            v
+------------------------+                   +------------------------+
|   Sandbox Worker Pod   | <---------------  |  K8s + runsc (gVisor)  |
|  (Compute Plane)       |                   +------------------------+
|  databend-udf service  |
+------------------------+
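The handshake in the diagram can be sketched in plain Python. All names here (`apply_resource`, `SandboxLease`, the endpoint and token values) are hypothetical stand-ins; the real flow goes over gRPC and Arrow Flight rather than local calls.

```python
from dataclasses import dataclass

@dataclass
class SandboxLease:
    endpoint: str  # Arrow Flight address of the sandbox worker
    token: str     # short-lived auth token issued by the control plane

def apply_resource(spec: dict) -> SandboxLease:
    """Ask the Cloud Controller for a warm sandbox matching the spec."""
    # In the real flow this is an ApplyUdfResourceRequest over gRPC;
    # here we return a canned lease just to illustrate the shape.
    return SandboxLease(endpoint="grpc://sandbox-0:8815", token="t-abc")

def execute_udf(lease: SandboxLease, batch: list) -> list:
    """Stand-in for the Arrow Flight DoExchange round trip."""
    # The query node streams record batches to lease.endpoint,
    # authenticated with lease.token, and reads results back.
    return [f"udf({x})" for x in batch]

lease = apply_resource({"handler": "gcd", "packages": []})
print(execute_udf(lease, [1, 2]))
```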

Config

Configure the query node to enable cloud Python UDF execution:

[query]
enable_udf_cloud_script = true
cloud_control_grpc_server_address = "http://0.0.0.0:50051"

Set the expiration time of the presigned URL at the session level:

SET udf_cloud_import_presign_expire_secs = 1800;

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jan 20, 2026
@github-actions bot commented Jan 20, 2026

🤖 CI Job Analysis

Workflow: 21282089344

📊 Summary

  • Total Jobs: 85
  • Failed Jobs: 5
  • Retryable: 0
  • Code Issues: 5

NO RETRY NEEDED

All failures appear to be code/test issues requiring manual fixes.

🔍 Job Details

  • linux / test_unit: Not retryable (Code/Test)
  • linux / test_stateless_cluster: Not retryable (Code/Test)
  • linux / test_stateful_standalone: Not retryable (Code/Test)
  • linux / test_stateless_standalone: Not retryable (Code/Test)
  • linux / test_stateful_cluster: Not retryable (Code/Test)

🤖 About

Automated analysis using job annotations to distinguish infrastructure issues (auto-retried) from code/test issues (manual fixes needed).

@KKould KKould requested review from everpcpc and sundy-li January 20, 2026 10:40
@KKould KKould self-assigned this Jan 20, 2026
@KKould KKould changed the title from "feat: cloud python udf" to "feat: Sandbox UDF" Jan 21, 2026
@KKould KKould force-pushed the feat/cloud_python_udf branch from 221f0f3 to 8a495ef on January 21, 2026 07:42

package udfproto;

message UdfImport {
Collaborator: UdfAsset


message ApplyUdfResourceRequest {
  // JSON runtime spec (code/handler/types/packages). Control plane builds Dockerfile.
  string spec = 1;
Collaborator: Why embed JSON in protobuf?

@KKould (Member Author): Following the `build_udf_cloud_spec` method, the spec is passed to the mock server so that it controls the specific construction of the Dockerfile.
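A hypothetical stand-in for `build_udf_cloud_spec`, serializing the runtime spec to the JSON string carried in `ApplyUdfResourceRequest.spec`. The field names follow the comment in the .proto (code/handler/types/packages); the exact schema is defined by the control plane, so this is only a sketch of the shape.

```python
import json

def build_udf_cloud_spec(code: str, handler: str,
                         arg_types: list, return_type: str,
                         packages: list) -> str:
    # Nested runtime description the control plane turns into a Dockerfile.
    spec = {
        "code": code,
        "handler": handler,
        "input_types": arg_types,
        "result_type": return_type,
        "packages": packages,  # pip requirements baked into the image
    }
    return json.dumps(spec)

spec = build_udf_cloud_spec("def gcd(a, b): ...", "gcd",
                            ["INT", "INT"], "INT", ["numpy"])
assert json.loads(spec)["handler"] == "gcd"
```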

@forsaken628 (Collaborator) commented Jan 22, 2026: Why JSON? Protobuf itself has extensive compatibility design; there is no need to embed JSON.

@KKould (Member Author): Docker specs are usually constructed in JSON to represent nested parameters, which makes structural adjustments convenient. Passing nested structures via Protobuf is comparatively cumbersome. @everpcpc, do you think Protobuf would be the better choice?

Member: If the structure is relatively fixed and won’t change much, I would prefer Protobuf. Although JSON is more flexible for nesting and structural adjustments, Protobuf’s strong type definitions are clearer to work with for stable structures.

Collaborator: In practice, how a given JSON document is parsed is underspecified: everything depends on the implementation, and each library can differ in subtle ways. These details create countless bugs; for example, representing an empty list `[]` as `null` is a typical case. Protobuf, on the other hand, is much more explicit about such handling, because it is written into the specification.
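The `null` vs. `[]` ambiguity can be shown with Python's stdlib `json`: a "list" field can arrive as `[]`, `null`, or be absent entirely, and every consumer must pick its own interpretation, whereas a protobuf `repeated` field has exactly one representation (a possibly empty list). The `packages_of` helper is illustrative.

```python
import json

for doc in ('{"packages": []}', '{"packages": null}', '{}'):
    parsed = json.loads(doc)
    # Three distinct shapes that callers usually intend to mean the same thing.
    print(parsed.get("packages"))

# Normalizing defensively is left to each implementation:
def packages_of(doc: str) -> list:
    return json.loads(doc).get("packages") or []

assert packages_of('{"packages": null}') == []
assert packages_of('{}') == []
```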

