| 1 | +--- |
| 2 | +title: alert-rule-classification-mapping |
| 3 | +authors: |
| 4 | + - "@sradco" |
| 5 | +reviewers: |
| 6 | + - "@jan--f" |
| 7 | + - "@jgbernalp" |
| 8 | +approvers: |
| 9 | + - "@jan--f" |
| 10 | + - "@jgbernalp" |
| 11 | +api-approvers: |
| 12 | + - TBD |
| 13 | +creation-date: 2026-01-07 |
| 14 | +last-updated: 2026-01-07 |
| 15 | +tracking-link: |
| 16 | + - "" |
| 17 | +--- |
| 18 | +# Alert Rule Classification Mapping and Layer Defaults |
| 19 | + |
| 20 | +## Summary |
| 21 | +This enhancement defines a mapping for alert rule classification in OpenShift monitoring, covering: |
| 22 | +- How `component` and `layer` are determined for alerting rules and alerts. |
| 23 | +- The fallback behavior when a component cannot be determined by classifiers. |
| 24 | +- The allowed `layer` values and how `layer` is derived from the `source` (platform vs user). |
| 25 | +- Persistence and override model via ConfigMaps. |
| 26 | +- API enrichment and UI alignment for consistent terminology and filtering. |
| 27 | + |
| 28 | +The outcome is a predictable and documented classification that UIs and integrations can rely on. It aligns with the monitoring plugin implementation and prevents empty, ambiguous `layer` values. |
| 29 | + |
| 30 | +Related enhancement: [Alerts UI Management](alerts-ui-management.md) |
| 31 | + |
| 32 | +## Motivation |
| 33 | +- Enable computation of component- and cluster-level health to help the UI prioritize clusters needing attention, highlight failing components, and speed troubleshooting in single and multi-cluster views. |
| 34 | +- Provide a single source of truth for classification behavior across backend and UI. |
| 35 | +- Ensure consistent and stable `component` and `layer` values for alerts and rules. |
| 36 | +- Eliminate historical ambiguity where `layer` could be empty, by defining defaulting rules. |
| 37 | +- Document how user overrides are validated and persisted to enable GitOps-friendly workflows. |
| 38 | + |
| 39 | +## Goals |
| 40 | +1. Define allowed `layer` values and defaulting rules. |
| 41 | +2. Document classifier-based mapping and fallback behavior. |
| 42 | +3. Standardize persistence and override schema in a per-`PrometheusRule` ConfigMap. |
| 43 | +4. Specify API enrichment fields for alerts and rules, and expected UI filters/columns. |
| 44 | +5. Maintain compatibility with Prometheus/Thanos schemas (additive enrichment only). |
| 45 | + |
| 46 | +## Non‑Goals |
| 47 | +- Changing upstream Prometheus/Thanos APIs or schemas. |
| 48 | +- Redefining platform vs user source detection beyond what is documented here. |
| 49 | +- Enforcing a specific UI; this enhancement defines the model that UIs should follow. |
| 50 | + |
| 51 | +## Terminology |
| 52 | +- component: Logical owner of the alert or rule (e.g., `kube-apiserver`, `etcd`, a namespace, a team). |
| 53 | +- layer: Impact scope. Allowed values: `cluster`, `compute`, `namespace`. |
| 54 | +- source: Origin of the rule/alert. Either `platform` (cluster monitoring stack) or `user` (UWM). |
| 55 | +- platform stack: The `openshift-monitoring` stack managed by Red Hat–supported operators. |
| 56 | +- user stack (UWM): User monitoring stack for application namespaces. |
| 57 | + |
| 58 | +## Mapping Logic |
| 59 | +### Primary Mapping (Classifier) |
| 60 | +- The backend uses a classifier (CHA-derived matchers) to compute a `(layer, component)` tuple from rule/alert labels. |
| 61 | +- Typical mappings: |
| 62 | + - Core control-plane components → `layer=cluster`, `component=<cp-subsystem>` |
| 63 | + - Node/compute-related → `layer=compute` |
| 64 | + - Workload/namespace-level alerts → `layer=namespace` |
| 65 | + |
| 66 | +### Fallback Mapping (When component is unknown) |
| 67 | +If the classifier returns an empty component or `Others`: |
| 68 | +- `component = <PrometheusRule namespace>` |
| 69 | +- `layer` is derived from `source`: |
| 70 | + - `platform` → `cluster` |
| 71 | + - `user` → `namespace` |
| 72 | + |
| 73 | +Notes: |
| 74 | +- The backend no longer generates an empty `layer`. Generated values are always one of `cluster|compute|namespace`. |
| 75 | +- The `compute` layer is reserved for compute/node-related cases and may expand in future iterations. |
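The fallback rules above can be sketched as a small function. This is a minimal illustration, not the monitoring plugin's actual code: the function name `apply_fallback` and its parameters are hypothetical, and the sketch assumes the classifier already returns a valid `layer` whenever it returns a known component.

```python
from typing import Tuple

# Default layer per source, as described in the fallback mapping.
DEFAULT_LAYER_BY_SOURCE = {"platform": "cluster", "user": "namespace"}


def apply_fallback(layer: str, component: str, source: str,
                   rule_namespace: str) -> Tuple[str, str]:
    """Return (layer, component), falling back when the component is unknown.

    Hypothetical helper: `layer` and `component` are the classifier output,
    `source` is "platform" or "user", and `rule_namespace` is the namespace
    of the owning PrometheusRule.
    """
    if component and component != "Others":
        # Classifier produced a usable component; keep its result as-is.
        return layer, component
    # Unknown component: use the rule namespace and a source-based layer.
    return DEFAULT_LAYER_BY_SOURCE[source], rule_namespace


print(apply_fallback("", "", "user", "ns-a"))
print(apply_fallback("", "Others", "platform", "openshift-monitoring"))
print(apply_fallback("cluster", "etcd", "platform", "openshift-etcd"))
```

With this shape, an empty `layer` can only come from the classifier itself, never from the fallback path.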
| 76 | + |
| 77 | +### Source Determination |
| 78 | +- For rules: a rule is considered `platform` if it belongs to the cluster monitoring namespace (`openshift-monitoring`). Otherwise it is `user`. |
| 79 | +- For alerts: considered `platform` when either: |
| 80 | + - `openshift_io_alert_source == platform`, or |
| 81 | + - `prometheus` label is prefixed with `openshift-monitoring/`. |
| 82 | + Otherwise `user`. |
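A minimal sketch of the source rules above, assuming labels arrive as a plain string-to-string mapping. The function names `rule_source` and `alert_source` are hypothetical and only illustrate the decision order described in this section.

```python
PLATFORM_NAMESPACE = "openshift-monitoring"


def rule_source(rule_namespace: str) -> str:
    """A rule is `platform` only when it lives in the cluster monitoring namespace."""
    return "platform" if rule_namespace == PLATFORM_NAMESPACE else "user"


def alert_source(labels: dict) -> str:
    """An alert is `platform` if either label condition holds, otherwise `user`."""
    if labels.get("openshift_io_alert_source") == "platform":
        return "platform"
    # The trailing slash keeps other namespaces that merely share a prefix
    # from being treated as platform.
    if labels.get("prometheus", "").startswith(PLATFORM_NAMESPACE + "/"):
        return "platform"
    return "user"


print(rule_source("openshift-monitoring"))
print(alert_source({"prometheus": "openshift-monitoring/k8s"}))
print(alert_source({"prometheus": "openshift-user-workload-monitoring/user-workload"}))
```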
| 83 | + |
| 84 | +## Persistence and Overrides |
| 85 | +### Per‑PrometheusRule ConfigMap |
| 86 | +- Name: `alertrule-classification-<prometheusrule-name>` |
| 87 | +- Namespace: same namespace as the `PrometheusRule` |
| 88 | +- OwnerReference: points to the `PrometheusRule` |
| 89 | +- Annotation: a stable signature used for traceability |
| 90 | +- Data key: `alert-rule-classification.yaml` |
| 91 | +- Value: YAML map from `alertRuleId` → object: |
| 92 | + - `component: <string>` |
| 93 | + - `layer: <string>` |
| 94 | + - `errors: [ ... ]` (optional: set when validation fails) |
| 95 | + |
| 96 | +Example: |
| 97 | +```yaml |
| 98 | +<alertRuleIdA>: |
| 99 | + component: kube-apiserver |
| 100 | + layer: cluster |
| 101 | +<alertRuleIdB>: |
| 102 | + component: ns-a |
| 103 | + layer: namespace |
| 104 | +``` |
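To show how the pieces listed above fit together, the following sketch assembles such a ConfigMap manifest as a plain dictionary. Several details are assumptions made for illustration only: the annotation key `example.com/classification-signature`, the use of a SHA-256 digest of the payload as the signature, and the helper names. The payload is written with JSON-style double-quoted strings, which YAML also accepts, so the output is quoted more heavily than the example above.

```python
import hashlib
import json

# Hypothetical annotation key; the enhancement does not name the real one.
SIGNATURE_ANNOTATION = "example.com/classification-signature"
DATA_KEY = "alert-rule-classification.yaml"


def render_classification(entries: dict) -> str:
    """Render alertRuleId -> {component, layer} as YAML, sorted for determinism."""
    lines = []
    for rule_id in sorted(entries):
        lines.append(f"{json.dumps(rule_id)}:")
        lines.append(f"  component: {json.dumps(entries[rule_id]['component'])}")
        lines.append(f"  layer: {json.dumps(entries[rule_id]['layer'])}")
    return "\n".join(lines) + "\n"


def build_configmap(rule_name: str, rule_namespace: str, rule_uid: str,
                    entries: dict) -> dict:
    """Build the per-PrometheusRule ConfigMap manifest as a dictionary."""
    payload = render_classification(entries)
    # Assumption: the signature is a digest of the rendered payload.
    signature = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"alertrule-classification-{rule_name}",
            "namespace": rule_namespace,
            "annotations": {SIGNATURE_ANNOTATION: signature},
            "ownerReferences": [{
                "apiVersion": "monitoring.coreos.com/v1",
                "kind": "PrometheusRule",
                "name": rule_name,
                "uid": rule_uid,
            }],
        },
        "data": {DATA_KEY: payload},
    }


cm = build_configmap("my-rules", "ns-a", "placeholder-uid", {
    "ruleB": {"component": "ns-a", "layer": "namespace"},
    "ruleA": {"component": "kube-apiserver", "layer": "cluster"},
})
print(cm["data"][DATA_KEY])
```

Sorting the entries before rendering keeps both the payload and the derived signature stable across reconciliations, which is one plausible way to get the deterministic signature updates mentioned in the test plan.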
| 105 | + |
| 106 | +### User Overrides |
| 107 | +- Users may override `component` and/or `layer` by editing the same ConfigMap. |
| 108 | +- Validation: |
| 109 | + - `component`: non-empty, 1–253 chars, `[A-Za-z0-9._-]`, must start/end alphanumeric. |
| 110 | + - `layer`: one of `cluster|compute|namespace`. |
| 111 | +- Invalid overrides are preserved but annotated with `errors` and ignored for effective values. |
| 112 | +- Unknown `alertRuleId` entries are ignored. |
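The validation rules above can be sketched as follows. The function names and error messages are hypothetical. The sketch also assumes that any validation error causes the whole override entry to be ignored; the text does not say whether valid fields in a partly invalid entry would still apply.

```python
import re

ALLOWED_LAYERS = ("cluster", "compute", "namespace")
# Starts and ends alphanumeric; `.`, `_`, and `-` are allowed in between.
COMPONENT_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def validate_override(override: dict) -> list:
    """Return a list of error strings; an empty list means the override is valid."""
    errors = []
    if "component" in override:
        value = override["component"]
        if (not isinstance(value, str) or not 1 <= len(value) <= 253
                or not COMPONENT_RE.fullmatch(value)):
            errors.append("component: invalid value")
    if "layer" in override:
        if override["layer"] not in ALLOWED_LAYERS:
            errors.append("layer: must be one of cluster|compute|namespace")
    return errors


def effective_classification(computed: dict, override: dict):
    """Merge a valid override over computed values; ignore an invalid one."""
    errors = validate_override(override)
    effective = dict(computed)
    if not errors:
        for key in ("component", "layer"):
            if key in override:
                effective[key] = override[key]
    return effective, errors


computed = {"component": "ns-a", "layer": "namespace"}
print(effective_classification(computed, {"layer": "compute"}))
print(effective_classification(computed, {"component": "-bad-", "layer": "cluster"}))
```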
| 113 | + |
| 114 | +## Alerts API Enrichment |
| 115 | +- Endpoint aligns with Prometheus `/api/v1/alerts` and adds fields (additive): |
| 116 | + - `alertRuleId` |
| 117 | + - `component` |
| 118 | + - `layer` |
| 119 | +- Classification for alerts is computed by correlating alerts to relabeled rules. When correlation fails, the fallback mapping above applies and derives `layer` from `source`. |
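A rough sketch of additive enrichment for a single alert object. The correlation step is not specified in this document, so it is represented here by an injected `find_rule_id` callable, which is purely hypothetical. When correlation fails, the sketch uses the alert's `namespace` label as the component and omits `alertRuleId`; both choices are assumptions, since the fallback text above defines the component in terms of the `PrometheusRule` namespace.

```python
from typing import Callable, Optional


def alert_source(labels: dict) -> str:
    """Source detection as described under Source Determination."""
    if labels.get("openshift_io_alert_source") == "platform":
        return "platform"
    if labels.get("prometheus", "").startswith("openshift-monitoring/"):
        return "platform"
    return "user"


def enrich_alert(alert: dict, classifications: dict,
                 find_rule_id: Callable[[dict], Optional[str]]) -> dict:
    """Return a copy of `alert` with classification fields added.

    `classifications` maps alertRuleId -> {"component": ..., "layer": ...}.
    Existing Prometheus fields are left untouched (additive only).
    """
    labels = alert.get("labels", {})
    enriched = dict(alert)  # shallow copy; original fields are preserved
    rule_id = find_rule_id(labels)
    entry = classifications.get(rule_id) if rule_id else None
    if entry:
        enriched["alertRuleId"] = rule_id
        enriched["component"] = entry["component"]
        enriched["layer"] = entry["layer"]
    else:
        # Correlation failed: derive layer from source (assumed component).
        source = alert_source(labels)
        enriched["component"] = labels.get("namespace", "")
        enriched["layer"] = "cluster" if source == "platform" else "namespace"
    return enriched


known = {"rule-1": {"component": "etcd", "layer": "cluster"}}
alert = {"labels": {"alertname": "Example", "namespace": "ns-a"}, "state": "firing"}
print(enrich_alert(alert, known, lambda labels: None))
```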
| 120 | + |
| 121 | +## UI Alignment |
| 122 | +- Columns for both Alerts and Alerting Rules should include `Layer` and `Component`. |
| 123 | +- Filters should include `Layer (cluster|namespace|compute)` and `Source (platform|user)`. |
| 124 | +- Creation/edit flows should allow choosing `layer` from the allowed set; `component` is free-form (validated). |
| 125 | +- An admin-facing “Manage layers” section can describe the meaning of layers: |
| 126 | + - Cluster: control plane, cluster-wide components (API server, etcd, network, …) |
| 127 | + - Namespace: workloads and components scoped to a project/namespace |
| 128 | + - Compute: node/compute-layer (may be phased in as needed) |
| 129 | + |
| 130 | +### Implementation Details/Notes/Constraints |
| 131 | +- Classification is computed server-side using CHA-derived matchers and persisted per `PrometheusRule` in a namespaced ConfigMap. Users may override `component`/`layer` in the same ConfigMap with validation. |
| 132 | +- Alerts are enriched additively (Prometheus-compatible), correlating to relabeled rules where possible and applying source-based defaults on fallback. |
| 133 | +- No new CRDs or aggregated API servers are introduced. Standard RBAC applies. |
| 134 | + |
| 135 | +#### Hypershift / Hosted Control Planes |
| 136 | + |
| 137 | + |
| 138 | +#### Standalone Clusters |
| 139 | + |
| 140 | +#### Single-node Deployments or MicroShift |
| 141 | + |
| 142 | +#### OpenShift Kubernetes Engine |
| 143 | + |
| 144 | +## Upgrade / Downgrade Strategy |
| 145 | +- User overrides remain intact; only invalid values are annotated with `errors`. |
| 146 | + |
| 147 | +## Test Plan (High Level) |
| 148 | +- Unit tests: |
| 149 | + - Unknown component fallback for user rules → `layer=namespace`, `component=<rule ns>`. |
| 150 | + - Unknown component fallback for platform rules → `layer=cluster`, `component=<rule ns>`. |
| 151 | + - Valid overrides are merged. Invalid overrides are recorded in `errors` and ignored. |
| 152 | + - Signature annotation stored and updated deterministically. |
| 153 | +- Integration/e2e (as available): |
| 154 | + - ConfigMap creation/update on rule changes. |
| 155 | + - Alerts API includes additive fields and respects relabel configs. |
| 156 | + |
| 157 | +### Risks and Mitigations |
| 158 | +- Misclassification by classifier: mitigated by clear overrides and validation paths. |
| 159 | +- Drift between docs and implementation: mitigated by this enhancement and regular verification in tests. |
| 160 | +- Client assumptions about additional `layer` values: mitigated by documenting the allowed set and advising clients to pass through unknown values without interpretation. |
| 161 | + |
| 162 | +### Drawbacks |
| 163 | +- Additional reconciliation and ConfigMap writes on rule changes. |
| 164 | +- Classifier rules require maintenance as platform components evolve. |
| 165 | + |
| 166 | +## Alternatives (Not Implemented) |
| 167 | +- Setting the labels with alertRelabelConfig CR for all alerts, except for operator alerts in UWM. |
| 168 | +- Introduce a dedicated classification CRD (adds operational overhead with limited benefit). |
| 169 | +- Compute classification only in the UI (duplicates logic, hard to validate). |
| 170 | + |
| 171 | +## Graduation Criteria |
| 172 | + |
| 173 | +### Dev Preview -> Tech Preview |
| 174 | +- End-to-end classification (compute, persist, enrich) with unit tests and docs. |
| 175 | +- UI consumes `component`/`layer` for display and filtering. |
| 176 | + |
| 177 | +### Tech Preview -> GA |
| 178 | +- Full test coverage (upgrade/downgrade/scale). |
| 179 | +- Stable defaulting across supported topologies (standalone, Hypershift, SNO/MicroShift). |
| 180 | + |
| 181 | +### Removing a deprecated feature |
| 182 | +- If the classifier or persistence format changes, document migration and keep backward compatibility for one minor release. |
| 183 | + |
| 184 | +## Version Skew Strategy |
| 185 | +- Server-side enrichment ensures older/newer UIs receive consistent fields. Unknown `layer` values must be passed through and displayed as-is. |
| 186 | + |
| 187 | +## Operational Aspects of API Extensions |
| 188 | +- No new API extensions are introduced. OwnerReferences ensure GC of ConfigMaps. Failures surface in controller logs. |
| 189 | + |
| 190 | +## Support Procedures |
| 191 | +- Verify `alertrule-classification-<prometheusrule>` ConfigMaps and their `errors` fields. |
| 192 | +- Check controller logs for validation failures. |
| 193 | +- Confirm alert `prometheus` or `openshift_io_alert_source` labels for source detection. |
| 194 | +## Open Questions |
| 195 | + |