
Pipeline update causes race condition - pipeline removed from envoy despite Ready status #7072

@jackson-wright

Describe the bug

When updating a Pipeline CRD, the scheduler incorrectly removes the pipeline from envoy routing after the old pipeline version is deleted, even though the new version was successfully loaded. This leaves the pipeline in a broken state where:

  • Pipeline status shows Ready: True
  • Actual requests return 503 no healthy upstream

Environment

  • Seldon Core version: 2.10.2
  • Kubernetes version: 1.28
  • Installation method: Helm
  • Kafka: Tested with both local Strimzi and Confluent Cloud (same behavior)

To Reproduce

  1. Deploy a working pipeline
  2. Verify the pipeline is functional (returns 200/400, not 503; see the probe sketch after this list)
  3. Apply an update to the pipeline spec (e.g., change stepsJoin: inner to stepsJoin: outer)
  4. Observe pipeline gateway logs
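
For step 2 (and for re-checking after step 4), a small probe makes the broken state unambiguous. Below is a minimal sketch in Go, assuming the Seldon Core v2 inference path and the Seldon-Model header convention for routing to pipelines; the mesh address and request payload are placeholders to adjust for your cluster:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder: substitute the seldon-mesh service IP or ingress address.
	url := "http://MESH_IP/v2/models/mlserver-example-pipeline/infer"

	// An empty/invalid payload is fine for probing: a 400 from the model
	// server still proves envoy has a route, while a 503 means it does not.
	body := bytes.NewBufferString(`{"inputs": []}`)

	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		panic(err)
	}
	// Route to the pipeline (rather than a bare model) via the Seldon-Model header.
	req.Header.Set("Seldon-Model", "mlserver-example-pipeline.pipeline")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusServiceUnavailable {
		fmt.Println("BROKEN: 503 - pipeline route missing from envoy")
	} else {
		fmt.Printf("routable: HTTP %d\n", resp.StatusCode)
	}
}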

Expected behavior

Pipeline should remain routable after the update. The old version should be deleted without affecting the new version's routing.

Actual behavior

The pipeline becomes unroutable (503 errors) despite showing Ready: True in the CRD status.

Logs

Scheduler logs showing the issue:

time="2026-01-06T16:14:47Z" level=info msg="Received pipeline status event update:{op:Create  pipeline:\"mlserver-example-pipeline\"  version:9  ...}  success:true  reason:\"Pipeline mlserver-example-pipeline loaded\""
time="2026-01-06T16:14:47Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 ready"
time="2026-01-06T16:14:47Z" level=info msg="Adding normal pipeline route mlserver-example-pipeline"
time="2026-01-06T16:14:48Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 terminated"
time="2026-01-06T16:14:49Z" level=info msg="Received pipeline status event update:{op:Delete  pipeline:\"mlserver-example-pipeline\"  version:8  ...}  success:true  reason:\"Pipeline mlserver-example-pipeline deleted\""
time="2026-01-06T16:14:49Z" level=info msg="Pipeline mlserver-example-pipeline has been terminated, removing from conflict resolution and envoy"

Pipeline gateway logs:

time="2026-01-06T15:51:16Z" level=info msg="Pipeline mlserver-example-pipeline loaded"
time="2026-01-06T15:51:17Z" level=info msg="Deleted pipeline mlserver-example-pipeline"
time="2026-01-06T15:51:17Z" level=info msg="Pipeline mlserver-example-pipeline deleted"

Key observation: Version 9 is created and loaded successfully, but when version 8 is deleted, the scheduler's GetPipelineStatus function reports 1/1 terminated and removes the pipeline from envoy - even though version 9 is still active.

Root cause analysis

The bug appears to be in the scheduler's dataflow-conflict-resolution component. When processing the delete event for the old pipeline version, GetPipelineStatus incorrectly counts the pipeline as terminated and triggers removal from envoy, ignoring that a newer version is still loaded (see the sketch after the sequence below).

The sequence is:

  1. Pipeline v9 created → "1/1 ready" → added to envoy ✓
  2. Pipeline v8 delete event received
  3. GetPipelineStatus returns "1/1 terminated" (BUG: should still show ready because v9 exists)
  4. Pipeline removed from envoy (BUG: v9 is still valid)
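
To make the suspected failure mode concrete, here is a simplified sketch; this is not the actual Seldon scheduler code, and the types and function names are illustrative only. It contrasts evaluating the delete event in isolation (which yields "1/1 terminated") with a version-aware check that keeps the route while a newer live version exists:

package main

import "fmt"

// PipelineVersionState is an illustrative stand-in for the scheduler's
// per-version pipeline state; the field names are hypothetical.
type PipelineVersionState struct {
	Version    int
	Terminated bool
}

// buggyIsTerminated mirrors the suspected failure mode: the v8 delete event
// is evaluated in isolation, ignoring other live versions of the pipeline.
func buggyIsTerminated(event PipelineVersionState) bool {
	return event.Terminated
}

// versionAwareIsTerminated only treats the pipeline as gone when no newer
// version remains active.
func versionAwareIsTerminated(event PipelineVersionState, all []PipelineVersionState) bool {
	for _, v := range all {
		if !v.Terminated && v.Version > event.Version {
			return false // a newer version (v9) is still serving
		}
	}
	return event.Terminated
}

func main() {
	v8 := PipelineVersionState{Version: 8, Terminated: true}
	v9 := PipelineVersionState{Version: 9, Terminated: false}
	all := []PipelineVersionState{v8, v9}

	fmt.Println("buggy:", buggyIsTerminated(v8))                     // true -> removed from envoy
	fmt.Println("version-aware:", versionAwareIsTerminated(v8, all)) // false -> route kept
}

Under the version-aware check, the v8 Delete event is a no-op for routing, because v9 is still serving.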

Workaround

Restarting the pipeline gateway pod after any pipeline update resolves the issue:

kubectl rollout restart deployment/seldon-pipelinegateway -n seldon-mesh

This works because the fresh pod connects to the scheduler and loads the current pipeline version without any "old version" delete events to process.
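
For CI/CD jobs that need to automate this workaround, kubectl rollout restart can be mimicked by bumping the restartedAt pod-template annotation. A sketch using client-go, with the namespace and deployment name taken from the command above (kubeconfig loading is simplified; not production code):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// kubectl rollout restart works by updating this pod-template annotation,
	// which forces the Deployment to roll new pods.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339),
	)

	_, err = clientset.AppsV1().Deployments("seldon-mesh").Patch(
		context.TODO(),
		"seldon-pipelinegateway",
		types.StrategicMergePatchType,
		[]byte(patch),
		metav1.PatchOptions{},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("pipeline gateway restart triggered")
}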

Additional context

  • This is 100% reproducible on every pipeline spec update
  • Scaling pipeline gateway to multiple replicas does NOT help - all replicas experience the race condition simultaneously
  • Initial pipeline creation works fine; only updates trigger the bug
  • The bug was present with Strimzi Kafka and persists with Confluent Cloud, ruling out Kafka-specific issues
  • PR #6849 (fix(scheduler, dataflow): pipeline loading/unloading on pipeline-gw and dataflow engine topology) addressed related pipeline loading/unloading issues in v2.10.0, but this race condition persists in v2.10.2

Impact

  • Requires manual intervention (pod restart) after every pipeline update
  • Makes pipelines unsafe for production CI/CD without workarounds
  • CRD status is misleading (shows Ready when routing is broken)
