chore: add critical error metric for numaplane analysis template #3154
What this PR does / why we need it
This is for both Pipeline and MonoVertex. The idea is that we want to improve upon our existing AnalysisTemplates for Numaplane assessment.
The problem with our current assessments is that they only check whether any message gets acked. If the first message gets acked but the second one fails to get acked, Numaplane still calls this a success.
If we had a metric that detects failures such as EOT or UDF crashes, then we could fail the assessment whenever any of our new Pipeline vertices or our new MonoVertex reports a count > 1.
Specifically, the idea is to emit a metric from the numa container whenever there is a critical error in the numaflow pipeline/monovertex.
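As a rough illustration of the assessment described above, here is a minimal Python sketch of the pass/fail decision. It is not the actual AnalysisTemplate, which would query the metrics backend. The function names, the sample label values, and the in-memory sample format are all hypothetical; the threshold default simply mirrors the "count > 1" wording above.

```python
from typing import Dict, List, Tuple

# One sample = (label set, counter value). This shape is an assumption made
# for illustration; a real assessment would read values from a metrics query.
Sample = Tuple[Dict[str, str], float]


def failing_series(samples: List[Sample], threshold: float = 1.0) -> List[Dict[str, str]]:
    """Return the label sets whose critical error count exceeds the threshold."""
    return [labels for labels, value in samples if value > threshold]


def assessment_passes(samples: List[Sample], threshold: float = 1.0) -> bool:
    """The assessment passes only if no series exceeds the threshold."""
    return not failing_series(samples, threshold)


# Hypothetical samples for a two-vertex pipeline.
samples: List[Sample] = [
    ({"pipeline": "demo", "vertex": "in", "replica": "0", "reason": "eot"}, 0.0),
    ({"pipeline": "demo", "vertex": "cat", "replica": "0", "reason": "udf_crash"}, 3.0),
]

if __name__ == "__main__":
    print("pass" if assessment_passes(samples) else "fail")
    for labels in failing_series(samples):
        print("critical errors on:", labels)
```

Unlike the ack-based check, this decision looks at every series, so one healthy vertex cannot mask a failing one.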
Pipeline Metric:
forwarder_critical_error_total with labels vertex, pipeline, vertex_type, replica and reason
MonoVertex Metric:
mvtx_critical_error_total with labels mvtx_name, replica and reason
Testing