Commit 8e69f67

agent: prevent restarting failed shards during publication cooldown

This addresses an issue from the previous commit, which introduced generalized publication cooldowns. That cooldown interacts poorly with our shard restart behavior: we always restart a shard immediately after each of the first 2 failures of a particular build (each publication produces a new build), and only start backing off after the 3rd consecutive failure. So when we're waiting on the publication cooldown for a materialization, we now do 2 immediate restarts of a materialization that is very likely to fail again.

Restarting failed materialization shards can be quite resource intensive for data planes, so it's worth avoiding those 2 extra restarts. This commit changes the shard restart behavior so that we don't attempt to restart a failed shard while we're awaiting a publication cooldown. It applies to all task types, even though materializations are the most affected, because they commonly have a large number of bindings for collections that use schema inference.
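The decision described above can be sketched roughly as follows. This is only an illustration, not the code from this commit: `RestartDecision`, `decide_restart`, and the backoff schedule (60 seconds per failure past the 2nd, capped at 10 minutes) are all hypothetical names and values chosen for the example.

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome of evaluating a failed shard (not a type from the agent crate).
#[derive(Debug, PartialEq)]
enum RestartDecision {
    /// Restart the shard right away.
    RestartNow,
    /// Wait for the given duration before restarting.
    BackOff(Duration),
    /// A publication cooldown is pending, so do not restart at all for now.
    AwaitCooldown,
}

/// Decide what to do with a failed shard.
///
/// `consecutive_failures` counts failures of the current build.
/// `cooldown_until` is when a pending publication cooldown ends, if there is one.
fn decide_restart(
    consecutive_failures: u32,
    cooldown_until: Option<Instant>,
    now: Instant,
) -> RestartDecision {
    // The new check: skip restarts while a publication cooldown is still pending.
    if let Some(until) = cooldown_until {
        if now < until {
            return RestartDecision::AwaitCooldown;
        }
    }

    // Existing behavior per the commit message: the first 2 failures of a build
    // restart immediately; backoff only starts with the 3rd consecutive failure.
    if consecutive_failures <= 2 {
        RestartDecision::RestartNow
    } else {
        // Assumed schedule, for illustration only.
        let steps = u64::from(consecutive_failures.saturating_sub(2));
        RestartDecision::BackOff(Duration::from_secs((60 * steps).min(600)))
    }
}
```

With this shape, a shard that fails while a cooldown is pending yields `AwaitCooldown` regardless of its failure count, which is what avoids the 2 immediate restarts.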

1 parent 9e19a30 commit 8e69f67

6 files changed, +283 -197 lines changed

crates/agent/src
- controllers
- integration_tests
  - inferred_schemas.rs
