Commit 8daffb4

Add Alert Rule Classification Mapping enhancement
Signed-off-by: Shirly Radco <sradco@redhat.com>
1 parent ac10097 commit 8daffb4

1 file changed

+195
-0
lines changed

@@ -0,0 +1,195 @@
---
title: alert-rule-classification-mapping
authors:
  - "@sradco"
reviewers:
  - "@jan--f"
  - "@jgbernalp"
approvers:
  - "@jan--f"
  - "@jgbernalp"
api-approvers:
  - TBD
creation-date: 2026-01-07
last-updated: 2026-01-07
tracking-link:
  - ""
---
# Alert Rule Classification Mapping and Layer Defaults

## Summary
This enhancement defines a mapping for alert rule classification in OpenShift monitoring, covering:
- How `component` and `layer` are determined for alerting rules and alerts.
- The fallback behavior when a component cannot be determined by classifiers.
- The allowed `layer` values and how `layer` is derived from the `source` (platform vs user).
- Persistence and override model via ConfigMaps.
- API enrichment and UI alignment for consistent terminology and filtering.

The outcome is a predictable and documented classification that UIs and integrations can rely on. It aligns with the monitoring plugin implementation and prevents empty, ambiguous `layer` values.

Related enhancement: [Alerts UI Management](alerts-ui-management.md)

## Motivation
- Enable computation of component- and cluster-level health to help the UI prioritize clusters needing attention, highlight failing components, and speed up troubleshooting in single- and multi-cluster views.
- Provide a single source of truth for classification behavior across backend and UI.
- Ensure consistent and stable `component` and `layer` values for alerts and rules.
- Eliminate the historical ambiguity where `layer` could be empty by defining defaulting rules.
- Document how user overrides are validated and persisted to enable GitOps-friendly workflows.

## Goals
1. Define allowed `layer` values and defaulting rules.
2. Document classifier-based mapping and fallback behavior.
3. Standardize persistence and override schema in a per-`PrometheusRule` ConfigMap.
4. Specify API enrichment fields for alerts and rules, and expected UI filters/columns.
5. Maintain compatibility with Prometheus/Thanos schemas (additive enrichment only).

## Non‑Goals
- Changing upstream Prometheus/Thanos APIs or schemas.
- Redefining platform vs user source detection beyond what is documented here.
- Enforcing a specific UI; this enhancement only defines the model that UIs should follow.

## Terminology
- component: Logical owner of the alert or rule (e.g., `kube-apiserver`, `etcd`, a namespace, a team).
- layer: Impact scope. Allowed values: `cluster`, `compute`, `namespace`.
- source: Origin of the rule/alert. Either `platform` (cluster monitoring stack) or `user` (user workload monitoring, UWM).
- platform stack: The `openshift-monitoring` stack managed by Red Hat–supported operators.
- user stack (UWM): User workload monitoring stack for application namespaces.

## Mapping Logic
### Primary Mapping (Classifier)
- The backend uses a classifier (CHA-derived matchers) to compute a `(layer, component)` tuple from rule/alert labels.
- Typical mappings:
  - Core control-plane components → `layer=cluster`, `component=<cp-subsystem>`
  - Node/compute-related → `layer=compute`
  - Workload/namespace-level alerts → `layer=namespace`

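As a rough illustration of a matcher-based classifier, the sketch below walks an ordered list of predicates over the labels and returns the first `(layer, component)` hit. The matcher entries, the `compute` component name `node`, and the names `Classification`, `MATCHERS` and `classify` are all hypothetical; the actual CHA-derived matcher list is not reproduced in this document.

```python
from typing import Callable, Mapping, NamedTuple

Labels = Mapping[str, str]


class Classification(NamedTuple):
    layer: str
    component: str


# Hypothetical matchers for illustration only; not the real CHA-derived list.
MATCHERS: list[tuple[Callable[[Labels], bool], Classification]] = [
    (lambda labels: labels.get("namespace") == "openshift-etcd",
     Classification("cluster", "etcd")),
    (lambda labels: labels.get("namespace") == "openshift-kube-apiserver",
     Classification("cluster", "kube-apiserver")),
    (lambda labels: labels.get("alertname", "").startswith("KubeNode"),
     Classification("compute", "node")),
]


def classify(labels: Labels) -> Classification:
    """Return the first matching classification, or an empty one if none match."""
    for matches, result in MATCHERS:
        if matches(labels):
            return result
    # Empty component signals "unknown"; the fallback mapping handles it.
    return Classification("", "")
```

Keeping the matchers in an ordered list means more specific rules can be placed ahead of broader ones without changing the lookup code.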
### Fallback Mapping (When component is unknown)
If the classifier returns an empty component or `Others`:
- `component = <PrometheusRule namespace>`
- `layer` is derived from `source`:
  - `platform` → `cluster`
  - `user` → `namespace`

Notes:
- The backend no longer generates an empty `layer`. Generated values are always one of `cluster|compute|namespace`.
- The `compute` layer is reserved for compute/node-related cases and may expand in future iterations.

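The fallback rules can be sketched as a small function. This is a minimal illustration, not the backend code: the function name `apply_fallback` and its parameters are hypothetical, and the branch that defaults a missing layer for a known component is an assumption made so that the result never carries an empty `layer`.

```python
ALLOWED_LAYERS = ("cluster", "compute", "namespace")
LAYER_BY_SOURCE = {"platform": "cluster", "user": "namespace"}


def apply_fallback(layer: str, component: str, source: str,
                   rule_namespace: str) -> tuple[str, str]:
    """Return an effective (layer, component) pair that never has an empty layer."""
    if component in ("", "Others"):
        # Unknown component: use the PrometheusRule namespace and the source default.
        return LAYER_BY_SOURCE[source], rule_namespace
    if layer not in ALLOWED_LAYERS:
        # Assumption: a known component with a missing layer also gets the source default.
        return LAYER_BY_SOURCE[source], component
    return layer, component
```

For example, `apply_fallback("", "Others", "user", "ns-a")` yields `("namespace", "ns-a")`, matching the user-rule case in the test plan.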
### Source Determination
- For rules: a rule is considered `platform` if it belongs to the cluster monitoring namespace (`openshift-monitoring`). Otherwise it is `user`.
- For alerts: considered `platform` when either:
  - `openshift_io_alert_source == platform`, or
  - `prometheus` label is prefixed with `openshift-monitoring/`.

  Otherwise `user`.

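The source checks for rules and alerts translate almost directly into code. The following is a sketch under the rules stated above; the function names `rule_source` and `alert_source` are illustrative and not taken from the implementation.

```python
from typing import Mapping

PLATFORM_NAMESPACE = "openshift-monitoring"


def rule_source(rule_namespace: str) -> str:
    """A rule is `platform` only when it lives in the cluster monitoring namespace."""
    return "platform" if rule_namespace == PLATFORM_NAMESPACE else "user"


def alert_source(labels: Mapping[str, str]) -> str:
    """An alert is `platform` when either label condition holds, otherwise `user`."""
    if labels.get("openshift_io_alert_source") == "platform":
        return "platform"
    if labels.get("prometheus", "").startswith(PLATFORM_NAMESPACE + "/"):
        return "platform"
    return "user"
```

Note that the prefix check includes the trailing `/`, so a `prometheus` label such as `openshift-monitoring-extra/k8s` is not treated as platform.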
## Persistence and Overrides
### Per‑PrometheusRule ConfigMap
- Name: `alertrule-classification-<prometheusrule-name>`
- Namespace: same namespace as the `PrometheusRule`
- OwnerReference: points to the `PrometheusRule`
- Annotation: a stable signature used for traceability
- Data key: `alert-rule-classification.yaml`
  - Value: YAML map from `alertRuleId` → object:
    - `component: <string>`
    - `layer: <string>`
    - `errors: [ ... ]` (optional: set when validation fails)

Example:
```yaml
<alertRuleIdA>:
  component: kube-apiserver
  layer: cluster
<alertRuleIdB>:
  component: ns-a
  layer: namespace
```

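To make the ConfigMap shape concrete, here is a sketch that assembles the manifest as a plain dictionary. Several details are assumptions: the annotation key `example.openshift.io/classification-signature` is hypothetical (the enhancement does not name it), the signature is taken to be a SHA-256 of the serialized payload, and the payload is serialized as JSON, which YAML 1.2 parsers accept, to keep the example free of third-party dependencies.

```python
import hashlib
import json

DATA_KEY = "alert-rule-classification.yaml"
# Hypothetical annotation key; the enhancement does not specify one.
SIGNATURE_ANNOTATION = "example.openshift.io/classification-signature"


def build_configmap(rule_name: str, rule_namespace: str, rule_uid: str,
                    classifications: dict) -> dict:
    """Build a ConfigMap manifest holding classifications for one PrometheusRule."""
    # Sorted keys make the payload, and therefore the signature, deterministic.
    payload = json.dumps(classifications, sort_keys=True, indent=2)
    # Assumption: the "stable signature" is a hash of the payload.
    signature = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"alertrule-classification-{rule_name}",
            "namespace": rule_namespace,
            "annotations": {SIGNATURE_ANNOTATION: signature},
            "ownerReferences": [{
                "apiVersion": "monitoring.coreos.com/v1",
                "kind": "PrometheusRule",
                "name": rule_name,
                "uid": rule_uid,
            }],
        },
        "data": {DATA_KEY: payload},
    }
```

Because the owner reference points at the `PrometheusRule`, deleting the rule lets Kubernetes garbage-collect the ConfigMap without extra cleanup logic.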
### User Overrides
- Users may override `component` and/or `layer` by editing the same ConfigMap.
- Validation:
  - `component`: non-empty, 1–253 chars, `[A-Za-z0-9._-]`, must start/end alphanumeric.
  - `layer`: one of `cluster|compute|namespace`.
- Invalid overrides are preserved but annotated with `errors` and ignored for effective values.
- Unknown `alertRuleId` entries are ignored.

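The validation rules can be expressed with one regular expression and a membership check. This sketch uses hypothetical names (`validate_override`, `effective_values`) and error strings, and it assumes that an entry with any invalid field is ignored as a whole when computing effective values; the enhancement does not say whether a valid field in the same entry would still apply.

```python
import re

ALLOWED_LAYERS = ("cluster", "compute", "namespace")
# Starts and ends alphanumeric; inner characters may also be '.', '_' or '-'.
COMPONENT_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def validate_override(entry: dict) -> list[str]:
    """Return validation errors for one override entry (empty list if valid)."""
    errors = []
    component = entry.get("component")
    layer = entry.get("layer")
    if component is not None:
        if (not isinstance(component, str)
                or not 1 <= len(component) <= 253
                or not COMPONENT_RE.fullmatch(component)):
            errors.append("invalid component")
    if layer is not None and layer not in ALLOWED_LAYERS:
        errors.append("invalid layer")
    return errors


def effective_values(computed: dict, override: dict) -> dict:
    """Merge a valid override over the computed values; ignore an invalid one."""
    merged = dict(computed)
    if validate_override(override):
        return merged  # assumption: the whole entry is ignored when invalid
    for key in ("component", "layer"):
        if override.get(key) is not None:
            merged[key] = override[key]
    return merged
```

A caller would store the list returned by `validate_override` under the entry's `errors` field so the invalid override stays visible in the ConfigMap.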
## Alerts API Enrichment
- Endpoint aligns with Prometheus `/api/v1/alerts` and adds fields (additive):
  - `alertRuleId`
  - `component`
  - `layer`
- Classification for alerts is computed by correlating alerts to relabeled rules. When correlation fails, the fallback mapping above applies and derives `layer` from `source`.

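Additive enrichment can be illustrated as copying the Prometheus alert object and attaching the three fields. This is only a sketch: correlating by `alertname` through a `rule_index` dictionary is a hypothetical stand-in for the real correlation with relabeled rules, and using the alert's `namespace` label as the fallback component and an empty `alertRuleId` on a failed correlation are assumptions.

```python
def enrich_alert(alert: dict, rule_index: dict) -> dict:
    """Return a copy of a Prometheus-style alert with classification fields added."""
    enriched = dict(alert)  # additive: existing keys are left untouched
    labels = alert.get("labels", {})
    # Hypothetical correlation key; the real backend matches relabeled rules.
    match = rule_index.get(labels.get("alertname", ""))
    if match is not None:
        enriched.update(
            alertRuleId=match["alertRuleId"],
            component=match["component"],
            layer=match["layer"],
        )
        return enriched
    # Correlation failed: derive layer from the alert's source.
    is_platform = (
        labels.get("openshift_io_alert_source") == "platform"
        or labels.get("prometheus", "").startswith("openshift-monitoring/")
    )
    enriched.update(
        alertRuleId="",  # assumption: no rule ID when correlation fails
        component=labels.get("namespace", ""),  # assumption: stand-in for the rule namespace
        layer="cluster" if is_platform else "namespace",
    )
    return enriched
```

Returning a copy rather than mutating the input keeps the original payload byte-for-byte compatible for clients that only expect the upstream schema.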
## UI Alignment
- Columns for both Alerts and Alerting Rules should include `Layer` and `Component`.
- Filters should include `Layer (cluster|compute|namespace)` and `Source (platform|user)`.
- Creation/edit flows should allow choosing `layer` from the allowed set; `component` is free-form (validated).
- An admin-facing “Manage layers” section can describe the meaning of layers:
  - Cluster: control plane, cluster-wide components (API server, etcd, network, …)
  - Namespace: workloads and components scoped to a project/namespace
  - Compute: node/compute-layer (may be phased in as needed)

### Implementation Details/Notes/Constraints
- Classification is computed server-side using CHA-derived matchers and persisted per `PrometheusRule` in a namespaced ConfigMap. Users may override `component`/`layer` in the same ConfigMap with validation.
- Alerts are enriched additively (Prometheus-compatible), correlating to relabeled rules where possible and applying source-based defaults on fallback.
- No new CRDs or aggregated API servers are introduced. Standard RBAC applies.

#### Hypershift / Hosted Control Planes

#### Standalone Clusters

#### Single-node Deployments or MicroShift

#### OpenShift Kubernetes Engine

## Upgrade / Downgrade Strategy
- User overrides remain intact; only invalid values are annotated with `errors`.

## Test Plan (High Level)
- Unit tests:
  - Unknown component fallback for user rules → `layer=namespace`, `component=<rule ns>`.
  - Unknown component fallback for platform rules → `layer=cluster`, `component=<rule ns>`.
  - Valid overrides are merged. Invalid overrides are recorded in `errors` and ignored.
  - Signature annotation stored and updated deterministically.
- Integration/e2e (as available):
  - ConfigMap creation/update on rule changes.
  - Alerts API includes additive fields and respects relabel configs.

### Risks and Mitigations
- Misclassification by classifier: mitigated by clear overrides and validation paths.
- Drift between docs and implementation: mitigated by this enhancement and regular verification in tests.
- Client assumptions about additional `layer` values: mitigated by documenting the allowed set and advising clients to pass through unknown values without interpretation.

### Drawbacks
- Additional reconciliation and ConfigMap writes on rule changes.
- Classifier rules require maintenance as platform components evolve.

## Alternatives (Not Implemented)
- Set the labels with an `AlertRelabelConfig` CR for all alerts, except for operator alerts in UWM.
- Introduce a dedicated classification CRD (adds operational overhead with limited benefit).
- Compute classification only in the UI (duplicates logic, hard to validate).

## Graduation Criteria

### Dev Preview -> Tech Preview
- End-to-end classification (compute, persist, enrich) with unit tests and docs.
- UI consumes `component`/`layer` for display and filtering.

### Tech Preview -> GA
- Full test coverage (upgrade/downgrade/scale).
- Stable defaulting across supported topologies (standalone, Hypershift, SNO/MicroShift).

### Removing a deprecated feature
- If the classifier or persistence format changes, document migration and keep backward compatibility for one minor release.

## Version Skew Strategy
- Server-side enrichment ensures older/newer UIs receive consistent fields. Unknown `layer` values must be passed through and displayed as-is.

## Operational Aspects of API Extensions
- No new API extensions are introduced. OwnerReferences ensure garbage collection of ConfigMaps. Failures surface in controller logs.

## Support Procedures
- Verify `alertrule-classification-<prometheusrule>` ConfigMaps and their `errors` fields.
- Check controller logs for validation failures.
- Confirm alert `prometheus` or `openshift_io_alert_source` labels for source detection.

## Open Questions
