Commit 8e69f67

agent: prevent restarting failed shards during publication cooldown
Addresses an issue from the previous commit, which introduced generalized publication cooldowns. That cooldown interacts poorly with our shard restart behavior. We always restart shards immediately after the first 2 failures of any particular build (the build changes every time it's published), and only start backing off after the 3rd consecutive failure. So in scenarios where we're waiting on the publication cooldown for a materialization, we're now doing 2 immediate restarts of a materialization that is very likely to fail.

Restarting failed materialization shards can be pretty resource intensive for data planes, so it's worth avoiding those 2 extra restarts. This commit changes the shard restart behavior so that we won't attempt to restart a failed shard while we're awaiting a publication cooldown. It does this for all task types, even though materializations are the main ones impacted, because they commonly have a large number of bindings for collections that use schema inference.
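A rough sketch of the restart decision described above, for illustration only. The names `RestartDecision` and `restart_decision` are hypothetical and are not taken from the changed files, and the backoff schedule (1 minute, doubling, capped) is an assumption, since the message only says that backoff starts after the 3rd consecutive failure.

```rust
use std::time::Duration;

/// Hypothetical outcome of evaluating a failed shard; not the agent's real type.
#[derive(Debug, PartialEq)]
enum RestartDecision {
    /// Restart the shard right away.
    Immediate,
    /// Restart the shard after waiting for the given delay.
    Backoff(Duration),
    /// Skip the restart because a publication cooldown is pending.
    AwaitCooldown,
}

/// Decide what to do after a shard of the current build has failed.
/// `consecutive_failures` counts failures of this build, including the latest one.
fn restart_decision(
    consecutive_failures: u32,
    awaiting_publication_cooldown: bool,
) -> RestartDecision {
    // New behavior: never restart while a publication cooldown is pending,
    // regardless of task type.
    if awaiting_publication_cooldown {
        return RestartDecision::AwaitCooldown;
    }
    // Existing behavior: the first 2 failures of a build restart immediately.
    if consecutive_failures <= 2 {
        return RestartDecision::Immediate;
    }
    // Assumed schedule: 1 minute at the 3rd failure, doubling afterwards,
    // with the exponent capped so the delay stops growing.
    let exponent = (consecutive_failures - 3).min(6);
    RestartDecision::Backoff(Duration::from_secs(60 * (1u64 << exponent)))
}
```

Checking the cooldown first keeps the rule independent of the failure count, so a shard that has failed only once is still left alone until the cooldown has passed.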
1 parent 9e19a30 commit 8e69f67

6 files changed, +283 -197 lines changed
