Conversation

@sesame0224 (Contributor) commented Jan 8, 2026

What this PR does / why we need it

This PR proposes a new method for enabling direct data communication between Numaflow vertices, and describes the motivation, resource specifications and workflows.

Internal discussions are ongoing, so we will revise the doc as appropriate.

Related issues

#2990
This PR is an initial design derived from the above post.
We would like to discuss this with the community.

Testing

This PR includes only documentation, so no tests were performed.

Special notes for reviewers

Since the base branch was wrong in #3125, I recreated this PR.

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.14%. Comparing base (7f28dc7) to head (d665ada).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3129      +/-   ##
==========================================
+ Coverage   79.73%   80.14%   +0.41%     
==========================================
  Files         291      296       +5     
  Lines       65143    67530    +2387     
==========================================
+ Hits        51941    54122    +2181     
- Misses      12649    12858     +209     
+ Partials      553      550       -3     


@vigith (Member) commented Jan 9, 2026

Thank you for putting this together. I have a couple of basic questions.

  1. How do you replay a message if we rely purely on direct communication? E.g., if a pod in VertexN is talking to a pod in VertexN+1 and the VertexN pod crashes due to an underlying node failure, then the message will be lost.
    Today we use a buffer so that the producer (VertexN) and the next vertex (VertexN+1) are decoupled. We did this because a pod can crash at any time, but the buffer is persistent and can be used by VertexN+1 to read the data.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

@sesame0224 (Contributor, Author) commented Jan 14, 2026

  1. How do you replay a message if we rely purely on direct communication?

As a premise, we assume an object-detection use case based on video inference, so video frames are the input data. In this use case, losing some frames is not critical, and retransmission of lost data is therefore out of scope.
If this functionality is implemented, then, as you pointed out, users will need to choose the type of path depending on their use case.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

First, the user defines the connectivity between Vertices in spec.edges of the manifest file, as before.
When a Pipeline resource is deployed, Pods are deployed in a chained manner by the Vertex controller.

An external controller watches Pod deployments and checks whether the domain name of the MultiNetwork Service corresponding to each Vertex and the Pod’s second NIC IP are registered in CoreDNS. If not, it registers them.

Now, let me move on to application execution.
Currently, the UDF container in each Vertex already has the destination Vertex specified as an environment variable. Therefore, we assume that it can also have the domain name of the corresponding MultiNetwork Service.
By querying CoreDNS using this domain name, we believe that the container can obtain the second NIC IP addresses of candidate destination Pods.
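
As a rough illustration (not part of the current spec), the lookup from inside a UDF container could look like the following Go sketch; the environment variable name and the domain are hypothetical placeholders:

```go
// Hypothetical lookup performed by a UDF container; the environment variable name
// and the example domain are illustrative, not part of the current Numaflow spec.
package udfpeer

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// LookupSecondNICPeers resolves the MultiNetwork Service domain of the next vertex
// and returns the candidate second NIC IPs of its Pods.
func LookupSecondNICPeers(ctx context.Context) ([]net.IP, error) {
	domain := os.Getenv("NEXT_VERTEX_MULTINET_DOMAIN") // e.g. "vertex-n1.rdma.example.org"
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	// The query goes to the cluster DNS (CoreDNS); if second NIC records are
	// registered under this domain, the answer lists the candidate destination Pods.
	ips, err := net.DefaultResolver.LookupIP(ctx, "ip4", domain)
	if err != nil {
		return nil, fmt.Errorf("resolving %s: %w", domain, err)
	}
	return ips, nil
}
```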

Additionally, standardization of the Service for MultiNetwork via the Gateway API is currently under consideration, but the specification is still undecided.

Signed-off-by: Kazuki Yamamoto <[email protected]>
@vigith (Member) left a comment

Could you please provide links to all the important technologies you are referencing? It will help us understand better.


Therefore, we consider introducing a high-speed communication method with low transfer overhead, such as GPUDirect RDMA, for inter-vertex communication. To achieve this, the following elements are required.

1. GPUDirect RDMA, which enables direct device-to-device communication between GPUs, requires RDMA-capable NICs. In other words, it is necessary to introduce a high-speed network by assigning a second NIC for RDMA to each pod, separate from the default network.
Member commented:

GPUDirect RDMA

Could you please provide a link for this?

Contributor Author replied:

I’ve updated the README and embedded the links.

@sesame0224 (Contributor, Author) commented

I will likely add supplementary explanations for the MultiNetwork-related part later.

I will explain the MultiNetwork-related links as a supplement to my response to the previous question (How do you know which pod in VertexN should talk to which pod in VertexN+1?).

As a prerequisite, in order to use GPUDirect RDMA, the application running inside the container must know the IP addresses of the destination Pods.
Therefore, what we essentially want to achieve is to obtain, via a Headless Service, the list of Pod IPs handled by that Service.

The related components currently work as follows:

  • When a Service resource is deployed, the EndpointSlice controller collects the IP addresses of the target Pods and creates EndpointSlice resources.
  • CoreDNS watches Service and EndpointSlice resources and caches the corresponding DNS records internally.
  • When a DNS query is received, CoreDNS refers to this cache and returns the candidate Pod IP addresses.

The problem is that the EndpointSlice controller only collects IP addresses from the default network. As a result, this mechanism does not work for MultiNetwork paths that use a second NIC.
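
To illustrate where the gap is, the following Go sketch (assuming client-go; the namespace and Service names are placeholders) lists the EndpointSlices behind a Service. The addresses it prints come from the Pods' default-network IPs, which is exactly what CoreDNS serves, so second NIC addresses never appear in this path:

```go
// Illustration only: the EndpointSlices behind a (headless) Service carry the Pods'
// default-network IPs, so a second NIC address never shows up through this mechanism.
package endpointcheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PrintServiceEndpoints lists the endpoint addresses CoreDNS would serve for a Service.
func PrintServiceEndpoints(ctx context.Context, cs kubernetes.Interface, ns, svc string) error {
	slices, err := cs.DiscoveryV1().EndpointSlices(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/service-name=" + svc, // standard EndpointSlice label
	})
	if err != nil {
		return err
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			// These addresses come from pod.status.podIPs, i.e. the default network.
			fmt.Println(ep.Addresses)
		}
	}
	return nil
}
```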

To address this issue, we are considering two possible approaches.

The first approach is to use the CoreDNS etcd plugin to register a domain name corresponding to each Vertex, along with the list of second NIC IP addresses assigned to the Pods belonging to that Vertex.
This registration would be performed by a component outside of Numaflow when Pods with assigned second NIC IPs are deployed.
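
As a rough sketch of this registration step, assuming the second NIC is attached via a Multus-style network (so its IP appears in the Pod's k8s.v1.cni.cncf.io/network-status annotation) and CoreDNS uses the etcd plugin with SkyDNS-format records, the external component could do something like the following; the domain layout and key prefix are only examples:

```go
// Sketch of the out-of-Numaflow registration step. Assumptions: the second NIC is
// attached via a Multus-style NetworkAttachmentDefinition (so its IP appears in the
// "k8s.v1.cni.cncf.io/network-status" annotation) and CoreDNS uses the etcd plugin,
// which reads SkyDNS-format records. The domain and key prefix are illustrative.
package rdmadns

import (
	"context"
	"encoding/json"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	corev1 "k8s.io/api/core/v1"
)

type networkStatus struct {
	Name string   `json:"name"`
	IPs  []string `json:"ips"`
}

// RegisterSecondNICIP publishes a Pod's second NIC IP under the Vertex's domain, e.g.
// a record for vertex-n.rdma.example.org stored at /skydns/org/example/rdma/vertex-n/<pod>.
func RegisterSecondNICIP(ctx context.Context, etcdCli *clientv3.Client, pod *corev1.Pod, secondNet, vertexName string) error {
	var statuses []networkStatus
	ann := pod.Annotations["k8s.v1.cni.cncf.io/network-status"]
	if err := json.Unmarshal([]byte(ann), &statuses); err != nil {
		return fmt.Errorf("parsing network-status of %s: %w", pod.Name, err)
	}
	for _, s := range statuses {
		if s.Name != secondNet || len(s.IPs) == 0 {
			continue
		}
		key := fmt.Sprintf("/skydns/org/example/rdma/%s/%s", vertexName, pod.Name)
		val := fmt.Sprintf(`{"host":%q,"ttl":60}`, s.IPs[0])
		_, err := etcdCli.Put(ctx, key, val)
		return err
	}
	return fmt.Errorf("second NIC %s not found on pod %s", secondNet, pod.Name)
}
```

With records like these in place, resolving the Vertex's domain (as in the earlier lookup sketch) would return the second NIC IPs of all candidate destination Pods.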

Given the current Numaflow specification, where each Vertex already has the name of the next destination Vertex as an environment variable, we believe it is also feasible to provide the domain name of the destination Vertex. By querying CoreDNS during application execution, the application can then obtain the candidate destination IP addresses.

The second approach is based on ongoing efforts (GEP-3539) to evolve the Service API for MultiNetwork support using the Gateway API.
Since this work is still in progress, it is not possible to fully rely on it at this stage. Therefore, the idea is to implement our own solution while monitoring the direction of the Gateway API.
In this approach, we would create HeadlessService-like and EndpointSlice-like resources for the Gateway API, along with a CoreDNS plugin that can handle these new resources.
This implementation would be positioned as a reference or sample implementation that could potentially be contributed back to Kubernetes as a standard feature in the future.

At this point, we prefer to start with the first approach because it has a lower implementation cost. However, if it turns out to be infeasible, we plan to move forward with the second approach.

- Change doc structure
  - Add a new chapter (Functionality)
  - Swap the order of Workflow and Resource Specification
- Update Workflow and Resource Specification

Signed-off-by: Kazuki Yamamoto <[email protected]>
@syayi syayi self-requested a review January 26, 2026 02:04
Signed-off-by: Kazuki Yamamoto <[email protected]>