Commit 8e69f67
agent: prevent restarting failed shards during publication cooldown
Addresses an issue from the previous commit, which introduced generalized
publication cooldowns. That cooldown interacts poorly with our shard restart
behavior: we always restart a shard immediately after the first 2 failures of
any particular build (the build changes every time it's published), and only
start backing off after the 3rd consecutive failure. So in scenarios where
we're waiting on the publication cooldown for a materialization, we're now
doing 2 immediate restarts of a materialization that is very likely to fail
again.

Restarting failed materialization shards can be pretty resource intensive for
data planes, so it's worth avoiding those 2 extra restarts. This commit changes
the shard restart behavior so that we won't attempt to restart a failed shard
while we're awaiting a publication cooldown. It does this for all task types,
even though materializations are the main ones impacted, since they commonly
have a large number of bindings for collections that use schema inference.

1 parent 9e19a30
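The restart rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: `RestartDecision`, `decide_restart`, and the doubling 60-second backoff are hypothetical names and values chosen for the example. The commit message specifies only "immediate for the first 2 failures, back off from the 3rd, and never restart while awaiting a publication cooldown".

```rust
use std::time::Duration;

/// Hypothetical outcome of evaluating a failed shard (names are assumptions).
#[derive(Debug, PartialEq)]
enum RestartDecision {
    /// Restart the shard right away.
    RestartNow,
    /// Restart the shard after waiting for the given backoff.
    RestartAfter(Duration),
    /// Do not restart; a publication is pending behind the cooldown.
    AwaitPublication,
}

/// Decide what to do with a failed shard.
///
/// `consecutive_failures` counts failures of the current build, and
/// `awaiting_publication_cooldown` is true while a publication is being
/// held back by the cooldown.
fn decide_restart(
    consecutive_failures: u32,
    awaiting_publication_cooldown: bool,
) -> RestartDecision {
    // New behavior: skip restarts entirely while the cooldown is pending,
    // since the current build is very likely to fail again.
    if awaiting_publication_cooldown {
        return RestartDecision::AwaitPublication;
    }
    // Existing behavior: the first 2 failures of a build restart immediately.
    if consecutive_failures <= 2 {
        return RestartDecision::RestartNow;
    }
    // From the 3rd consecutive failure on, back off. The schedule below
    // (60s doubling, capped at 6 doublings) is an assumption for this sketch.
    let exponent = (consecutive_failures - 3).min(6);
    RestartDecision::RestartAfter(Duration::from_secs(60 * (1u64 << exponent)))
}
```

Checking the cooldown first means the failure count never triggers an immediate restart of a build that is about to be replaced by the pending publication.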
File tree
6 files changed, +283 -197 lines changed
- crates/agent/src
- controllers
- integration_tests