Add Alert Rule Classification Mapping enhancement

sradco · sradco · commit 8daffb49e1d3 · 2026-01-07T14:00:59.000+02:00
Signed-off-by: Shirly Radco <sradco@redhat.com>
diff --git a/enhancements/monitoring/alert-rule-classification-mapping.md b/enhancements/monitoring/alert-rule-classification-mapping.md
@@ -0,0 +1,195 @@
+---
+title: alert-rule-classification-mapping
+authors:
+  - "@sradco"
+reviewers:
+  - "@jan--f"
+  - "@jgbernalp"
+approvers:
+  - "@jan--f"
+  - "@jgbernalp"
+api-approvers:
+  - TBD
+creation-date: 2026-01-07
+last-updated: 2026-01-07
+tracking-link:
+  - ""
+---
+# Alert Rule Classification Mapping and Layer Defaults
+
+## Summary
+This enhancement defines a mapping for alert rule classification in OpenShift monitoring, covering:
+- How `component` and `layer` are determined for alerting rules and alerts.
+- The fallback behavior when a component cannot be determined by classifiers.
+- The allowed `layer` values and how `layer` is derived from the `source` (platform vs user).
+- Persistence and override model via ConfigMaps.
+- API enrichment and UI alignment for consistent terminology and filtering.
+
+The outcome is a predictable and documented classification that UIs and integrations can rely on. It aligns with the monitoring plugin implementation and prevents empty, ambiguous `layer` values.
+
+Related enhancement: [Alerts UI Management](alerts-ui-management.md)
+
+## Motivation
+- Enable computation of component- and cluster-level health to help the UI prioritize clusters needing attention, highlight failing components, and speed troubleshooting in single and multi-cluster views.
+- Provide a single source of truth for classification behavior across backend and UI.
+- Ensure consistent and stable `component` and `layer` values for alerts and rules.
+- Eliminate historical ambiguity where `layer` could be empty, by defining defaulting rules.
+- Document how user overrides are validated and persisted to enable GitOps-friendly workflows.
+
+## Goals
+1. Define allowed `layer` values and defaulting rules.
+2. Document classifier-based mapping and fallback behavior.
+3. Standardize persistence and override schema in a per-`PrometheusRule` ConfigMap.
+4. Specify API enrichment fields for alerts and rules, and expected UI filters/columns.
+5. Maintain compatibility with Prometheus/Thanos schemas (additive enrichment only).
+
+## Non‑Goals
+- Changing upstream Prometheus/Thanos APIs or schemas.
+- Redefining platform vs user source detection beyond what is documented here.
+- Enforcing a specific UI; this enhancement only defines the model that UIs should follow.
+
+## Terminology
+- component: Logical owner of the alert or rule (e.g., `kube-apiserver`, `etcd`, a namespace, a team).
+- layer: Impact scope. Allowed values: `cluster`, `compute`, `namespace`.
+- source: Origin of the rule/alert. Either `platform` (cluster monitoring stack) or `user` (UWM).
+- platform stack: The `openshift-monitoring` stack managed by Red Hat–supported operators.
+- user stack (UWM): User monitoring stack for application namespaces.
+
+## Mapping Logic
+### Primary Mapping (Classifier)
+- The backend uses a classifier (matchers derived from the cluster-health-analyzer, CHA) to compute a `(layer, component)` tuple from rule/alert labels.
+- Typical mappings:
+  - Core control-plane components → `layer=cluster`, `component=<cp-subsystem>`
+  - Node/compute-related → `layer=compute`
+  - Workload/namespace-level alerts → `layer=namespace`
+
+### Fallback Mapping (When component is unknown)
+If the classifier returns an empty component or `Others`:
+- `component = <PrometheusRule namespace>`
+- `layer` is derived from `source`:
+  - `platform` → `cluster`
+  - `user` → `namespace`
+
+Notes:
+- The backend no longer generates an empty `layer`. Generated values are always one of `cluster|compute|namespace`.
+- The `compute` layer is reserved for compute/node-related cases and may expand in future iterations.
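+
+As an illustration of the fallback, the following sketch shows the entries that would result for two rules the classifier cannot place. The rule IDs and the `my-app` namespace are placeholders, not values defined by this enhancement:
+```yaml
+# User rule from a PrometheusRule in namespace "my-app" (hypothetical).
+# The classifier returned an empty component, so source=user gives layer=namespace.
+<alertRuleIdC>:
+  component: my-app
+  layer: namespace
+# Platform rule from a PrometheusRule in "openshift-monitoring".
+# The classifier returned "Others", so source=platform gives layer=cluster.
+<alertRuleIdD>:
+  component: openshift-monitoring
+  layer: cluster
+```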
+
+### Source Determination
+- For rules: a rule is considered `platform` if it belongs to the cluster monitoring namespace (`openshift-monitoring`). Otherwise it is `user`.
+- For alerts: considered `platform` when either:
+  - `openshift_io_alert_source == platform`, or
+  - `prometheus` label is prefixed with `openshift-monitoring/`.
+  Otherwise `user`.
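+
+For example, under these rules the following alert label sets would resolve as shown. This is an illustrative sketch only: the alert names and the `cases`/`expectedSource` keys are made up for the example and are not part of any API:
+```yaml
+cases:
+  - labels:
+      alertname: ExampleAlertA
+      openshift_io_alert_source: platform
+    expectedSource: platform   # explicit source label
+  - labels:
+      alertname: ExampleAlertB
+      prometheus: openshift-monitoring/k8s
+    expectedSource: platform   # prometheus label has the openshift-monitoring/ prefix
+  - labels:
+      alertname: ExampleAlertC
+      prometheus: openshift-user-workload-monitoring/user-workload
+    expectedSource: user       # neither condition holds
+```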
+
+## Persistence and Overrides
+### Per‑PrometheusRule ConfigMap
+- Name: `alertrule-classification-<prometheusrule-name>`
+- Namespace: same namespace as the `PrometheusRule`
+- OwnerReference: points to the `PrometheusRule`
+- Annotation: a stable signature used for traceability
+- Data key: `alert-rule-classification.yaml`
+- Value: YAML map from `alertRuleId` → object:
+  - `component: <string>`
+  - `layer: <string>`
+  - `errors: [ ... ]` (optional: set when validation fails)
+
+Example:
+```yaml
+<alertRuleIdA>:
+  component: kube-apiserver
+  layer: cluster
+<alertRuleIdB>:
+  component: ns-a
+  layer: namespace
+```
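+
+For orientation, a complete ConfigMap carrying this data might look like the sketch below. The `PrometheusRule` name `example-rules`, the namespace `ns-a`, the `uid`, and in particular the annotation key are hypothetical placeholders; this enhancement does not fix the annotation key or the signature format:
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: alertrule-classification-example-rules
+  namespace: ns-a
+  annotations:
+    # Hypothetical key and value; the actual signature annotation is not specified here.
+    example.com/classification-signature: "<signature>"
+  ownerReferences:
+    # Lets garbage collection remove the ConfigMap together with its PrometheusRule.
+    - apiVersion: monitoring.coreos.com/v1
+      kind: PrometheusRule
+      name: example-rules
+      uid: <prometheusrule-uid>
+data:
+  alert-rule-classification.yaml: |
+    <alertRuleIdB>:
+      component: ns-a
+      layer: namespace
+```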
+
+### User Overrides
+- Users may override `component` and/or `layer` by editing the same ConfigMap.
+- Validation:
+  - `component`: non-empty, 1–253 chars, `[A-Za-z0-9._-]`, must start/end alphanumeric.
+  - `layer`: one of `cluster|compute|namespace`.
+- Invalid overrides are preserved but annotated with `errors` and ignored for effective values.
+- Unknown `alertRuleId` entries are ignored.
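+
+A sketch of how a rejected override might appear after validation, assuming `errors` holds human-readable strings. The message text is invented for illustration; this enhancement only states that `errors` is a list:
+```yaml
+<alertRuleIdA>:
+  component: kube-apiserver
+  layer: control-plane   # not one of cluster|compute|namespace, so not used as the effective layer
+  errors:
+    - "layer must be one of cluster, compute, namespace"   # hypothetical message
+```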
+
+## Alerts API Enrichment
+- Endpoint aligns with Prometheus `/api/v1/alerts` and adds fields (additive):
+  - `alertRuleId`
+  - `component`
+  - `layer`
+- Classification for alerts is computed by correlating alerts to relabeled rules. When correlation fails, the fallback mapping above applies and derives `layer` from `source`.
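+
+To make the additive enrichment concrete, one enriched alert could look roughly like the sketch below. The response itself is JSON, as in Prometheus; it is rendered as YAML here to match the other examples. Placing the added fields at the top level of each alert object is an assumption, and the alert name, timestamp, and rule ID are placeholders:
+```yaml
+status: success
+data:
+  alerts:
+    - labels:
+        alertname: ExampleAlertA
+        severity: warning
+        prometheus: openshift-monitoring/k8s
+      annotations: {}
+      state: firing
+      activeAt: "2026-01-07T12:00:00Z"
+      value: "1e+00"
+      # Fields added by the enrichment (placement is an assumption):
+      alertRuleId: <alertRuleIdA>
+      component: kube-apiserver
+      layer: cluster
+```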
+
+## UI Alignment
+- Columns for both Alerts and Alerting Rules should include `Layer` and `Component`.
+- Filters should include `Layer (cluster|namespace|compute)` and `Source (platform|user)`.
+- Creation/edit flows should allow choosing `layer` from the allowed set; `component` is free-form (validated).
+- An admin-facing “Manage layers” section can describe the meaning of layers:
+  - Cluster: control plane, cluster-wide components (API server, etcd, network, …)
+  - Namespace: workloads and components scoped to a project/namespace
+  - Compute: node/compute-layer (may be phased in as needed)
+
+### Implementation Details/Notes/Constraints
+- Classification is computed server-side using CHA-derived matchers and persisted per `PrometheusRule` in a namespaced ConfigMap. Users may override `component`/`layer` in the same ConfigMap with validation.
+- Alerts are enriched additively (Prometheus-compatible), correlating to relabeled rules where possible and applying source-based defaults on fallback.
+- No new CRDs or aggregated API servers are introduced. Standard RBAC applies.
+
+#### Hypershift / Hosted Control Planes
+
+
+#### Standalone Clusters
+
+#### Single-node Deployments or MicroShift
+
+#### OpenShift Kubernetes Engine
+
+## Upgrade / Downgrade Strategy
+- User overrides remain intact; only invalid values are annotated with `errors`.
+
+## Test Plan (High Level)
+- Unit tests:
+  - Unknown component fallback for user rules → `layer=namespace`, `component=<rule ns>`.
+  - Unknown component fallback for platform rules → `layer=cluster`, `component=<rule ns>`.
+  - Valid overrides are merged. Invalid overrides are recorded in `errors` and ignored.
+  - Signature annotation stored and updated deterministically.
+- Integration/e2e (as available):
+  - ConfigMap creation/update on rule changes.
+  - Alerts API includes additive fields and respects relabel configs.
+
+### Risks and Mitigations
+- Misclassification by classifier: mitigated by clear overrides and validation paths.
+- Drift between docs and implementation: mitigated by this enhancement and regular verification in tests.
+- Client assumptions about additional `layer` values: mitigated by documenting the allowed set and advising clients to pass through unknown values without interpretation.
+
+### Drawbacks
+- Additional reconciliation and ConfigMap writes on rule changes.
+- Classifier rules require maintenance as platform components evolve.
+
+## Alternatives (Not Implemented)
+- Set the labels with an `AlertRelabelConfig` CR for all alerts, except for operator alerts in UWM.
+- Introduce a dedicated classification CRD (adds operational overhead with limited benefit).
+- Compute classification only in the UI (duplicates logic, hard to validate).
+
+## Graduation Criteria
+
+### Dev Preview -> Tech Preview
+- End-to-end classification (compute, persist, enrich) with unit tests and docs.
+- UI consumes `component`/`layer` for display and filtering.
+
+### Tech Preview -> GA
+- Full test coverage (upgrade/downgrade/scale).
+- Stable defaulting across supported topologies (standalone, Hypershift, SNO/MicroShift).
+
+### Removing a deprecated feature
+- If the classifier or persistence format changes, document migration and keep backward compatibility for one minor release.
+
+## Version Skew Strategy
+- Server-side enrichment ensures older/newer UIs receive consistent fields. Unknown `layer` values must be passed through and displayed as-is.
+
+## Operational Aspects of API Extensions
+- No new API extensions are introduced. OwnerReferences ensure GC of ConfigMaps. Failures surface in controller logs.
+
+## Support Procedures
+- Verify `alertrule-classification-<prometheusrule>` ConfigMaps and their `errors` fields.
+- Check controller logs for validation failures.
+- Confirm alert `prometheus` or `openshift_io_alert_source` labels for source detection.
+
+## Open Questions
+