Conversation

@sesame0224 (Contributor) commented Jan 8, 2026

What this PR does / why we need it

This PR proposes a new method for enabling direct data communication between Numaflow vertices, and describes the motivation, resource specifications and workflows.

Internal discussions are ongoing, so we will revise the doc as appropriate.

Related issues

#2990
This PR is an initial design derived from the above post.
We would like to discuss this with the community.

Testing

This PR includes only documentation, so no tests were performed.

Special notes for reviewers

Since the base branch was wrong in #3125, I recreated this PR.

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.14%. Comparing base (7f28dc7) to head (d665ada).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3129      +/-   ##
==========================================
+ Coverage   79.73%   80.14%   +0.41%     
==========================================
  Files         291      296       +5     
  Lines       65143    67530    +2387     
==========================================
+ Hits        51941    54122    +2181     
- Misses      12649    12858     +209     
+ Partials      553      550       -3     


@vigith (Member) commented Jan 9, 2026

Thank you for putting this together. I have a couple of basic questions.

  1. How do you replay a message if we rely purely on direct communication? E.g., if a pod in VertexN is talking to a pod in VertexN+1 and the VertexN pod crashes due to an underlying node failure, then the message will be lost.
    Today we use a buffer so that the producer (VertexN) and the next vertex (VertexN+1) are decoupled. We did this because a pod can crash at any time, but the buffer is persistent and can be used by VertexN+1 to read the data.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

@sesame0224 (Contributor, Author) commented Jan 14, 2026

  1. How do you replay a message if we rely purely on direct communication?

As a premise, we assume an object-detection use case based on video inference, so video frames are the input data. In this use case, losing some frames is not critical, and retransmission of lost data is therefore out of scope.
If this functionality is implemented, then, as you pointed out, users will need to choose the type of path depending on their use case.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

First, the user defines the connectivity between Vertices in spec.edges of the manifest file, as before.
When a Pipeline resource is deployed, Pods are deployed in a chained manner by the Vertex controller.

An external controller watches Pod deployments and checks whether the domain name of the MultiNetwork Service corresponding to each Vertex and the Pod’s second NIC IP are registered in CoreDNS. If not, it registers them.

Now, let me move on to application execution.
Currently, the UDF container in each Vertex already has the destination Vertex specified as an environment variable. Therefore, we assume that it can also have the domain name of the corresponding MultiNetwork Service.
By querying CoreDNS using this domain name, we believe that the container can obtain the second NIC IP addresses of candidate destination Pods.
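
As a rough illustration (not part of the current spec), the lookup from inside a UDF container could look like the following Go sketch; the environment variable name and the domain are hypothetical placeholders:

```go
// Hypothetical lookup performed by a UDF container; the environment variable name
// and the example domain are illustrative, not part of the current Numaflow spec.
package udfpeer

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// LookupSecondNICPeers resolves the MultiNetwork Service domain of the next vertex
// and returns the candidate second NIC IPs of its Pods.
func LookupSecondNICPeers(ctx context.Context) ([]net.IP, error) {
	domain := os.Getenv("NEXT_VERTEX_MULTINET_DOMAIN") // e.g. "vertex-n1.rdma.example.org"
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	// The query goes to the cluster DNS (CoreDNS); if second NIC records are
	// registered under this domain, the answer lists the candidate destination Pods.
	ips, err := net.DefaultResolver.LookupIP(ctx, "ip4", domain)
	if err != nil {
		return nil, fmt.Errorf("resolving %s: %w", domain, err)
	}
	return ips, nil
}
```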

Additionally, standardization of the Service for MultiNetwork via the Gateway API is currently under consideration, but the specification is still undecided.

Signed-off-by: Kazuki Yamamoto <[email protected]>
@vigith (Member) left a comment

Could you please provide links to all the important technologies you are referencing? It will help us understand better.


Therefore, we consider introducing a high-speed communication method with low transfer overhead, such as GPUDirect RDMA, for inter-vertex communication. To achieve this, the following elements are required.

1. GPUDirect RDMA, which enables direct device-to-device communication between GPUs, requires RDMA-capable NICs. In other words, it is necessary to introduce a high-speed network by assigning a second NIC for RDMA to each pod, separate from the default network.
Member commented:

GPUDirect RDMA

Could you please provide a link for this?

Contributor Author replied:

I’ve updated the README and embedded the links.

@sesame0224 (Contributor, Author) commented

I will likely add supplementary explanations for the MultiNetwork-related part later.

I will explain the MultiNetwork-related links as a supplement to my response to the previous question (How do you know which pod in VertexN should talk to which pod in VertexN+1?).

As a prerequisite, in order to use GPUDirect RDMA, the application running inside the container must know the IP addresses of the destination Pods.
Therefore, what we essentially want to achieve is to obtain, via a Headless Service, the list of Pod IPs handled by that Service.

The related components currently work as follows:

  • When a Service resource is deployed, the EndpointSlice controller collects the IP addresses of the target Pods and creates EndpointSlice resources.
  • CoreDNS watches Service and EndpointSlice resources and caches the corresponding DNS records internally.
  • When a DNS query is received, CoreDNS refers to this cache and returns the candidate Pod IP addresses.

The problem is that the EndpointSlice controller only collects IP addresses from the default network. As a result, this mechanism does not work for MultiNetwork paths that use a second NIC.
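
To illustrate where the gap is, the following Go sketch (assuming client-go; the namespace and Service names are placeholders) lists the EndpointSlices behind a Service. The addresses it prints come from the Pods' default-network IPs, which is exactly what CoreDNS serves, so second NIC addresses never appear in this path:

```go
// Illustration only: the EndpointSlices behind a (headless) Service carry the Pods'
// default-network IPs, so a second NIC address never shows up through this mechanism.
package endpointcheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PrintServiceEndpoints lists the endpoint addresses CoreDNS would serve for a Service.
func PrintServiceEndpoints(ctx context.Context, cs kubernetes.Interface, ns, svc string) error {
	slices, err := cs.DiscoveryV1().EndpointSlices(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/service-name=" + svc, // standard EndpointSlice label
	})
	if err != nil {
		return err
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			// These addresses come from pod.status.podIPs, i.e. the default network.
			fmt.Println(ep.Addresses)
		}
	}
	return nil
}
```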

To address this issue, we are considering two possible approaches.

The first approach is to use the CoreDNS etcd plugin to register a domain name corresponding to each Vertex, along with the list of second NIC IP addresses assigned to the Pods belonging to that Vertex.
This registration would be performed by a component outside of Numaflow when Pods with assigned second NIC IPs are deployed.
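
As a rough sketch of this registration step, assuming the second NIC is attached via a Multus-style network (so its IP appears in the Pod's k8s.v1.cni.cncf.io/network-status annotation) and CoreDNS uses the etcd plugin with SkyDNS-format records, the external component could do something like the following; the domain layout and key prefix are only examples:

```go
// Sketch of the out-of-Numaflow registration step. Assumptions: the second NIC is
// attached via a Multus-style NetworkAttachmentDefinition (so its IP appears in the
// "k8s.v1.cni.cncf.io/network-status" annotation) and CoreDNS uses the etcd plugin,
// which reads SkyDNS-format records. The domain and key prefix are illustrative.
package rdmadns

import (
	"context"
	"encoding/json"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	corev1 "k8s.io/api/core/v1"
)

type networkStatus struct {
	Name string   `json:"name"`
	IPs  []string `json:"ips"`
}

// RegisterSecondNICIP publishes a Pod's second NIC IP under the Vertex's domain, e.g.
// a record for vertex-n.rdma.example.org stored at /skydns/org/example/rdma/vertex-n/<pod>.
func RegisterSecondNICIP(ctx context.Context, etcdCli *clientv3.Client, pod *corev1.Pod, secondNet, vertexName string) error {
	var statuses []networkStatus
	ann := pod.Annotations["k8s.v1.cni.cncf.io/network-status"]
	if err := json.Unmarshal([]byte(ann), &statuses); err != nil {
		return fmt.Errorf("parsing network-status of %s: %w", pod.Name, err)
	}
	for _, s := range statuses {
		if s.Name != secondNet || len(s.IPs) == 0 {
			continue
		}
		key := fmt.Sprintf("/skydns/org/example/rdma/%s/%s", vertexName, pod.Name)
		val := fmt.Sprintf(`{"host":%q,"ttl":60}`, s.IPs[0])
		_, err := etcdCli.Put(ctx, key, val)
		return err
	}
	return fmt.Errorf("second NIC %s not found on pod %s", secondNet, pod.Name)
}
```

With records like these in place, resolving the Vertex's domain (as in the earlier lookup sketch) would return the second NIC IPs of all candidate destination Pods.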

Given the current Numaflow specification, where each Vertex already has the name of the next destination Vertex as an environment variable, we believe it is also feasible to provide the domain name of the destination Vertex. By querying CoreDNS during application execution, the application can then obtain the candidate destination IP addresses.

The second approach is based on ongoing efforts (GEP-3539) to evolve the Service API for MultiNetwork support using the Gateway API.
Since this work is still in progress, it is not possible to fully rely on it at this stage. Therefore, the idea is to implement our own solution while monitoring the direction of the Gateway API.
In this approach, we would create HeadlessService-like and EndpointSlice-like resources for the Gateway API, along with a CoreDNS plugin that can handle these new resources.
This implementation would be positioned as a reference or sample implementation that could potentially be contributed back to Kubernetes as a standard feature in the future.

At this point, we prefer to start with the first approach because it has a lower implementation cost. However, if it turns out to be infeasible, we plan to move forward with the second approach.

- Change doc structure
  - Add a new chapter (Functionality)
  - Swap the order of Workflow and Resource Specification
- Update Workflow and Resource Specification

Signed-off-by: Kazuki Yamamoto <[email protected]>
@syayi syayi self-requested a review January 26, 2026 02:04
Signed-off-by: Kazuki Yamamoto <[email protected]>