Description
Describe the bug
When updating a Pipeline CRD, the scheduler incorrectly removes the pipeline from envoy routing after the old pipeline version is deleted, even though the new version was successfully loaded. This leaves the pipeline in a broken state where:
- Pipeline status shows Ready: True
- Actual requests return 503 no healthy upstream
Environment
- Seldon Core version: 2.10.2
- Kubernetes version: 1.28
- Installation method: Helm
- Kafka: Tested with both local Strimzi and Confluent Cloud (same behavior)
To Reproduce
- Deploy a working pipeline
- Verify pipeline is functional (returns 200/400, not 503)
- Apply an update to the pipeline spec (e.g., change stepsJoin: inner to stepsJoin: outer)
- Observe pipeline gateway logs
Expected behavior
Pipeline should remain routable after the update. The old version should be deleted without affecting the new version's routing.
Actual behavior
The pipeline becomes unroutable (503 errors) despite showing Ready: True in the CRD status.
Logs
Scheduler logs showing the issue:
time="2026-01-06T16:14:47Z" level=info msg="Received pipeline status event update:{op:Create pipeline:\"mlserver-example-pipeline\" version:9 ...} success:true reason:\"Pipeline mlserver-example-pipeline loaded\""
time="2026-01-06T16:14:47Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 ready"
time="2026-01-06T16:14:47Z" level=info msg="Adding normal pipeline route mlserver-example-pipeline"
time="2026-01-06T16:14:48Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 terminated"
time="2026-01-06T16:14:49Z" level=info msg="Received pipeline status event update:{op:Delete pipeline:\"mlserver-example-pipeline\" version:8 ...} success:true reason:\"Pipeline mlserver-example-pipeline deleted\""
time="2026-01-06T16:14:49Z" level=info msg="Pipeline mlserver-example-pipeline has been terminated, removing from conflict resolution and envoy"
Pipeline gateway logs:
time="2026-01-06T15:51:16Z" level=info msg="Pipeline mlserver-example-pipeline loaded"
time="2026-01-06T15:51:17Z" level=info msg="Deleted pipeline mlserver-example-pipeline"
time="2026-01-06T15:51:17Z" level=info msg="Pipeline mlserver-example-pipeline deleted"
Key observation: Version 9 is created and loaded successfully, but when version 8 is deleted, the scheduler's GetPipelineStatus function reports 1/1 terminated and removes the pipeline from envoy - even though version 9 is still active.
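To check whether other clusters hit the same pattern, the scheduler log can be scanned for it. The following is a minimal sketch, assuming the log lines have exactly the shape quoted above; the function name find_stale_removals and both regular expressions are illustrative and not part of Seldon Core.

```python
import re

# Patterns are derived from the scheduler log lines quoted in this issue;
# other Seldon Core versions may format these messages differently.
EVENT_RE = re.compile(r'op:(Create|Delete) pipeline:\\?"([^"\\]+)\\?" version:(\d+)')
REMOVED_RE = re.compile(
    r"Pipeline (\S+) has been terminated, removing from conflict resolution and envoy"
)


def find_stale_removals(lines):
    """Return pipelines removed from envoy right after an older version was deleted."""
    latest_created = {}   # pipeline name -> highest version seen in a Create event
    stale_delete = set()  # pipelines whose most recent Delete targeted an older version
    flagged = []
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            op, name, version = event.group(1), event.group(2), int(event.group(3))
            if op == "Create":
                latest_created[name] = max(version, latest_created.get(name, 0))
                stale_delete.discard(name)
            elif version < latest_created.get(name, 0):
                # Delete of an older version while a newer one has been created.
                stale_delete.add(name)
            else:
                stale_delete.discard(name)
            continue
        removed = REMOVED_RE.search(line)
        if removed and removed.group(1) in stale_delete:
            flagged.append(removed.group(1))
    return flagged
```

Feeding it the scheduler log lines from this report flags mlserver-example-pipeline, because the envoy removal follows the delete of version 8 while version 9 was already created.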
Root cause analysis
The bug appears to be in the scheduler's dataflow-conflict-resolution component. When processing the delete event for the old pipeline version, GetPipelineStatus incorrectly counts the pipeline as terminated and triggers removal from envoy, ignoring that a newer version is still loaded.
The sequence is:
- Pipeline v9 created → "1/1 ready" → added to envoy ✓
- Pipeline v8 delete event received
- GetPipelineStatus returns "1/1 terminated" (BUG: should still show ready because v9 exists)
- Pipeline removed from envoy (BUG: v9 is still valid)
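The sequence above can be modelled in a few lines to show the behaviour I would expect. This is only a toy model in Python, not the scheduler's actual Go code: PipelineRouteState, handle_event, and the version_aware flag are hypothetical names, and the non-version-aware branch is my reading of the logs rather than a confirmed description of the implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRouteState:
    """Toy per-pipeline state; not the scheduler's real data structure."""
    loaded_versions: set = field(default_factory=set)
    routed: bool = False


def handle_event(state, op, version, version_aware=True):
    """Apply a Create or Delete status event and update the routing flag."""
    if op == "Create":
        state.loaded_versions.add(version)
        state.routed = True
    elif op == "Delete":
        state.loaded_versions.discard(version)
        if version_aware:
            # Expected: keep the route while any other version is still loaded.
            state.routed = bool(state.loaded_versions)
        else:
            # Behaviour suggested by the logs: any delete tears down the route.
            state.routed = False
    else:
        raise ValueError(f"unknown op: {op}")
    return state
```

Replaying Create v8, Create v9, Delete v8 leaves the route in place when version_aware is True and removes it when it is False, which matches the difference between the expected and the observed behaviour.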
Workaround
Restarting the pipeline gateway pod after any pipeline update resolves the issue:
kubectl rollout restart deployment/seldon-pipelinegateway -n seldon-mesh
This works because the fresh pod connects to the scheduler and loads the current pipeline version without any "old version" delete events to process.
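For CI/CD, the restart can be scripted as a post-deploy step. A minimal sketch, assuming kubectl is on the PATH and the deployment and namespace names match the command above; the helper names are illustrative, and the extra kubectl rollout status call is my addition so the step waits until the new pod is up.

```python
import subprocess


def restart_commands(deployment="seldon-pipelinegateway", namespace="seldon-mesh"):
    """Build the kubectl commands for the workaround (names taken from this issue)."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        # Not part of the original workaround: wait for the rollout to finish.
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def restart_pipeline_gateway(run=subprocess.run):
    """Run the workaround commands in order, failing on the first error."""
    for cmd in restart_commands():
        run(cmd, check=True)
```

Calling restart_pipeline_gateway() after each pipeline update applies the workaround automatically; it does not fix the underlying scheduler behaviour.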
Additional context
- This is 100% reproducible on every pipeline spec update
- Scaling pipeline gateway to multiple replicas does NOT help - all replicas experience the race condition simultaneously
- Initial pipeline creation works fine; only updates trigger the bug
- The bug was present with Strimzi Kafka and persists with Confluent Cloud, ruling out Kafka-specific issues
- PR #6849 (fix(scheduler, dataflow): pipeline loading/unloading on pipeline-gw and dataflow engine topology) addressed related pipeline loading/unloading issues in v2.10.0, but this race condition persists in v2.10.2
Impact
- Requires manual intervention (pod restart) after every pipeline update
- Makes pipelines unsafe for production CI/CD without workarounds
- CRD status is misleading (shows Ready when routing is broken)