
Commit 1141715

OCPEDGE-2084: feat(etcd): add PacemakerCluster CRD for Two-Node OpenShift with Fencing
Introduces etcd.openshift.io/v1alpha1 API group with a PacemakerCluster custom resource. This provides visibility into pacemaker cluster health for Two Node OpenShift with Fencing deployments. This status-only resource reports the health of pacemaker nodes and their managed resources (Kubelet, Etcd) along with fencing agent status. Each node tracks conditions, IP addresses for etcd peer URLs, and per-resource health using positive polarity conditions. The cluster-etcd-operator is responsible for interpreting this status and degrading appropriately when etcd is unhealthy or at risk of being unable to recover automatically in a quorum-loss event. Created with support from Claude Opus 4 (Anthropic)
1 parent 6fb7fda commit 1141715

22 files changed, +8305 −0 lines

etcd/README.md

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# etcd.openshift.io API Group

This API group contains CRDs related to etcd cluster management in Two Node OpenShift with Fencing deployments.

## API Versions

### v1alpha1

Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in Two Node OpenShift with Fencing deployments.

#### PacemakerCluster

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource (must be named "cluster")
- **Resource Path**: `pacemakerclusters.etcd.openshift.io`

The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector.

### Status Subresource Design

This resource uses the standard Kubernetes status subresource pattern (`+kubebuilder:subresource:status`). The status collector creates the resource without status, then immediately populates it via the `/status` endpoint.

**Why not atomic create-with-status?**

We initially explored removing the status subresource so that the resource could be created with its status in a single atomic operation, ensuring it is never observed in an incomplete state. However:

1. The Kubernetes API server strips the `status` field from create requests when a status subresource is enabled.
2. Without the subresource, we cannot use separate RBAC for spec vs. status updates.
3. The OpenShift API test framework assumes the status subresource exists for status update tests.

The status collector therefore performs a two-step operation: create the resource, then immediately update its status (as sketched below). The brief window in which status is empty is acceptable because the healthcheck controller handles a missing status gracefully.
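A minimal sketch of that two-step flow, assuming a controller-runtime client and the generated `etcd/v1alpha1` Go types; the `PacemakerClusterStatus` type name and the pointer-shaped `Status` field follow the description in this document but are otherwise assumptions:

```go
package collector

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	etcdv1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

// publishStatus creates the singleton (if needed) and then writes its status.
func publishStatus(ctx context.Context, c client.Client, status *etcdv1alpha1.PacemakerClusterStatus) error {
	pc := &etcdv1alpha1.PacemakerCluster{ObjectMeta: metav1.ObjectMeta{Name: "cluster"}}

	// Step 1: create the resource. The API server strips .status from create
	// requests because the status subresource is enabled.
	if err := c.Create(ctx, pc); err != nil {
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// Already created on a previous pass: fetch the live object so the
		// status update below carries a valid resourceVersion.
		if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, pc); err != nil {
			return err
		}
	}

	// Step 2: immediately populate status via the /status endpoint.
	pc.Status = status
	return c.Status().Update(ctx, pc)
}
```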

### Pacemaker Resources

A **pacemaker resource** is a unit of work managed by pacemaker. In pacemaker terminology, resources are services or applications that pacemaker monitors, starts, stops, and moves between nodes to maintain high availability.

For Two Node OpenShift with Fencing, we manage three resource types:

- **Kubelet**: The Kubernetes node agent and a prerequisite for etcd
- **Etcd**: The distributed key-value store
- **FencingAgent**: Used to isolate failed nodes during a quorum-loss event (tracked separately)

### Status Structure

```yaml
status:                          # Optional on creation, populated via status subresource
  conditions:                    # Required when status present (min 3 items)
  - type: Healthy
  - type: InService
  - type: NodeCountAsExpected
  lastUpdated: <timestamp>       # Required when status present, cannot decrease
  nodes:                         # Control-plane nodes (0-5, expects 2 for TNF)
  - name: <hostname>             # RFC 1123 subdomain name
    addresses:                   # Required: list of node addresses (1-8 items)
    - type: InternalIP           # Currently only InternalIP is supported
      address: <ip>              # First address used for etcd peer URLs
    conditions:                  # Required: node-level conditions (min 9 items)
    - type: Healthy
    - type: Online
    - type: InService
    - type: Active
    - type: Ready
    - type: Clean
    - type: Member
    - type: FencingAvailable
    - type: FencingHealthy
    resources:                   # Required: pacemaker resources on this node (min 2)
    - name: Kubelet              # Both Kubelet and Etcd must be present
      conditions:                # Required: resource-level conditions (min 8 items)
      - type: Healthy
      - type: InService
      - type: Managed
      - type: Enabled
      - type: Operational
      - type: Active
      - type: Started
      - type: Schedulable
    - name: Etcd
      conditions: [...]          # Same 8 conditions as Kubelet (abbreviated)
    fencingAgents:               # Required: fencing agents for THIS node (1-8)
    - name: <nodename>_<method>  # e.g., "master-0_redfish"
      method: <method>           # Fencing method: redfish, ipmi, fence_aws, etc.
      conditions: [...]          # Same 8 conditions as resources (abbreviated)
```
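As a concrete illustration of the `address` comment above, a consumer might derive an etcd peer URL from the first reported address roughly as follows; the `PacemakerNodeStatus` field names and the peer port and URL scheme are assumptions, not part of this API:

```go
package example

import (
	"fmt"
	"net"

	etcdv1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

// peerURL builds an etcd peer URL from the first address reported for a node.
func peerURL(node etcdv1alpha1.PacemakerNodeStatus) (string, error) {
	if len(node.Addresses) == 0 {
		return "", fmt.Errorf("node %s reports no addresses", node.Name)
	}
	// The first address in the list is the one used for etcd peer URLs.
	host := node.Addresses[0].Address
	// net.JoinHostPort adds brackets around IPv6 literals; 2380 is the
	// conventional etcd peer port (an assumption in this sketch).
	return "https://" + net.JoinHostPort(host, "2380"), nil
}
```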

### Fencing Agents

Fencing agents are STONITH (Shoot The Other Node In The Head) devices used to isolate failed nodes. Unlike regular pacemaker resources (Kubelet, Etcd), fencing agents are tracked separately because:

1. **Mapping by target, not schedule**: Resources are mapped to the node where they are scheduled to run. Fencing agents are mapped to the node they can *fence* (their target), regardless of which node their monitoring operations are scheduled on.

2. **Multiple agents per node**: A node can have multiple fencing agents for redundancy (e.g., both Redfish and IPMI). Expected: 1 per node; supported: up to 8.

3. **Health tracking via two node-level conditions** (aggregated as sketched below):
   - **FencingAvailable**: True if at least one agent is healthy (fencing works); False if all agents are unhealthy (degrades the operator)
   - **FencingHealthy**: True if all agents are healthy (the ideal state); False if any agent is unhealthy (emits warning events)
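A sketch of that aggregation, assuming a `PacemakerFencingAgentStatus` type whose `Conditions` field is a `[]metav1.Condition`:

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"

	etcdv1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

// fencingConditions folds per-agent health into the two node-level conditions:
// FencingAvailable is true when at least one agent is healthy, while
// FencingHealthy is true only when every agent is healthy.
func fencingConditions(agents []etcdv1alpha1.PacemakerFencingAgentStatus) (available, healthy bool) {
	healthy = len(agents) > 0 // validation requires at least one agent per node
	for _, agent := range agents {
		if meta.IsStatusConditionTrue(agent.Conditions, "Healthy") {
			available = true
		} else {
			healthy = false
		}
	}
	return available, healthy
}
```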

### Cluster-Level Conditions

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Cluster is healthy (`ClusterHealthy`) | Cluster has issues (`ClusterUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `NodeCountAsExpected` | Node count is as expected (`AsExpected`) | Wrong count (`InsufficientNodes`, `ExcessiveNodes`) |

### Node-Level Conditions

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Node is healthy (`NodeHealthy`) | Node has issues (`NodeUnhealthy`) |
| `Online` | Node is online (`Online`) | Node is offline (`Offline`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Active` | Node is active (`Active`) | Node is in standby (`Standby`) |
| `Ready` | Node is ready (`Ready`) | Node is pending (`Pending`) |
| `Clean` | Node is clean (`Clean`) | Node is unclean (`Unclean`) |
| `Member` | Node is a member (`Member`) | Not a member (`NotMember`) |
| `FencingAvailable` | At least one agent healthy (`FencingAvailable`) | All agents unhealthy (`FencingUnavailable`) - degrades operator |
| `FencingHealthy` | All agents healthy (`FencingHealthy`) | Some agents unhealthy (`FencingUnhealthy`) - emits warnings |

### Resource-Level Conditions

Each resource in the `resources` array and each fencing agent in the `fencingAgents` array has its own conditions.

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Resource is healthy (`ResourceHealthy`) | Resource has issues (`ResourceUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Managed` | Managed by pacemaker (`Managed`) | Not managed (`Unmanaged`) |
| `Enabled` | Resource is enabled (`Enabled`) | Resource is disabled (`Disabled`) |
| `Operational` | Resource is operational (`Operational`) | Resource has failed (`Failed`) |
| `Active` | Resource is active (`Active`) | Resource is not active (`Inactive`) |
| `Started` | Resource is started (`Started`) | Resource is stopped (`Stopped`) |
| `Schedulable` | Resource is schedulable (`Schedulable`) | Resource is not schedulable (`Unschedulable`) |

### Validation Rules

**Resource naming:**

- The resource must be named "cluster" (singleton)

**Node name validation:**

- Must be a lowercase RFC 1123 subdomain name
- Consists of lowercase alphanumeric characters, '-', or '.'
- Must start and end with an alphanumeric character
- Maximum 253 characters

**Node addresses:**

- Uses the `PacemakerNodeAddress` type (similar to `corev1.NodeAddress` but with IP validation)
- Currently only the `InternalIP` type is supported
- Pacemaker allows multiple addresses for Corosync communication between nodes (1-8 addresses)
- The first address in the list is used for IP-based peer URLs for etcd membership
- IP validation (sketched in Go after this list):
  - Must be a valid global unicast IPv4 or IPv6 address
  - Must be in canonical form (e.g., `192.168.1.1`, not `192.168.001.001`; `2001:db8::1`, not `2001:0db8::1`)
  - Excludes loopback, link-local, and multicast addresses
  - Maximum length is 39 characters (a full IPv6 address)
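The CRD enforces these rules with its own validation; the following standard-library sketch shows roughly the same checks and is illustrative only:

```go
package example

import (
	"fmt"
	"net/netip"
)

// validateNodeAddress mirrors the documented address rules: a canonical,
// global unicast IPv4 or IPv6 address.
func validateNodeAddress(address string) error {
	addr, err := netip.ParseAddr(address)
	if err != nil {
		// Also rejects non-canonical IPv4 such as 192.168.001.001.
		return fmt.Errorf("invalid IP address %q: %w", address, err)
	}
	if addr.String() != address {
		// Catches non-canonical IPv6 such as 2001:0db8::1.
		return fmt.Errorf("address %q is not in canonical form (expected %q)", address, addr.String())
	}
	if !addr.IsGlobalUnicast() {
		// Excludes loopback, link-local, multicast, and unspecified addresses.
		return fmt.Errorf("address %q must be a global unicast address", address)
	}
	return nil
}
```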

**Timestamp validation:**

- `lastUpdated` is required when status is present
- Once set, cannot be set to an earlier timestamp (validation uses `!has(oldSelf.lastUpdated)` to handle initial creation)
- Timestamps must always increase (prevents stale updates from overwriting newer data; see the staleness sketch below)
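On the consumer side, the same timestamp supports a simple staleness check; the threshold and helper below are illustrative, not part of this API:

```go
package example

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isStale reports whether a status report is older than the given threshold,
// e.g. so a silent status collector can be surfaced as a health concern.
func isStale(lastUpdated metav1.Time, threshold time.Duration, now time.Time) bool {
	return now.Sub(lastUpdated.Time) > threshold
}
```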

**Status fields:**

- `status` - Optional on creation (pointer type), populated via the status subresource
- When status is present, all fields within are required:
  - `conditions` - Required array of cluster conditions (min 3 items)
  - `lastUpdated` - Required timestamp for staleness detection
  - `nodes` - Required array of control-plane node statuses (min 0, max 5; empty allowed for catastrophic failures)

**Node fields (when a node is present):**

- `name` - Required, RFC 1123 subdomain
- `addresses` - Required (min 1, max 8 items)
- `conditions` - Required (min 9 items, with specific types enforced via XValidation)
- `resources` - Required (min 2 items: Kubelet and Etcd)
- `fencingAgents` - Required (min 1, max 8 items)

**Conditions validation:**

- Cluster-level: MinItems=3 (Healthy, InService, NodeCountAsExpected)
- Node-level: MinItems=9 (Healthy, Online, InService, Active, Ready, Clean, Member, FencingAvailable, FencingHealthy)
- Resource-level: MinItems=8 (Healthy, InService, Managed, Enabled, Operational, Active, Started, Schedulable)
- Fencing agent-level: MinItems=8 (same conditions as resources)

All condition arrays have XValidation rules to ensure specific condition types are present.
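A consumer-side sketch of the same guarantee, checking that a report carries every node-level condition type listed above before trusting it (the helper and variable names are illustrative):

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// requiredNodeConditions mirrors the node-level MinItems=9 list above.
var requiredNodeConditions = []string{
	"Healthy", "Online", "InService", "Active", "Ready",
	"Clean", "Member", "FencingAvailable", "FencingHealthy",
}

// hasAllConditions reports whether every required condition type is present.
func hasAllConditions(conditions []metav1.Condition, required []string) bool {
	for _, conditionType := range required {
		if meta.FindStatusCondition(conditions, conditionType) == nil {
			return false
		}
	}
	return true
}
```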

**Resource names:**

- Valid values are: `Kubelet`, `Etcd`
- Both resources must be present in each node's `resources` array

**Fencing agent fields:**

- `name`: The pacemaker resource name (e.g., "master-0_redfish"), max 253 characters
- `method`: The fencing method (e.g., "redfish", "ipmi", "fence_aws"), max 63 characters
- `conditions`: Required, same 8 conditions as resources

### Usage

The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on the cluster state. The aggregate `Healthy` conditions at each level (cluster, node, resource) provide a quick way to determine overall health.
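A minimal sketch of that interpretation, assuming the field names from the structure above (`Status`, `Nodes`, `Conditions`) and leaving the actual operator-condition wiring out:

```go
package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"

	etcdv1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

// shouldDegrade decides whether the operator should report Degraded based on
// the aggregate Healthy condition and per-node fencing availability.
func shouldDegrade(pc *etcdv1alpha1.PacemakerCluster) (bool, string) {
	if pc.Status == nil {
		// Status is written via the /status subresource shortly after
		// creation; tolerate the brief empty window instead of degrading.
		return false, ""
	}
	if !meta.IsStatusConditionTrue(pc.Status.Conditions, "Healthy") {
		return true, "pacemaker cluster reports Healthy=False"
	}
	for _, node := range pc.Status.Nodes {
		if !meta.IsStatusConditionTrue(node.Conditions, "FencingAvailable") {
			return true, fmt.Sprintf("node %s has no healthy fencing agent", node.Name)
		}
	}
	return false, ""
}
```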

etcd/install.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

const (
	GroupName = "etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

// Resource returns a GroupResource for the given resource in this group.
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

// Kind returns a GroupKind for the given kind in this group.
func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}
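For reference, a minimal consumer registers the group into a runtime scheme via the `Install` helper above:

```go
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"

	"github.com/openshift/api/etcd"
)

func main() {
	scheme := runtime.NewScheme()
	// Install adds every version of the etcd.openshift.io group to the scheme.
	utilruntime.Must(etcd.Install(scheme))
}
```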

etcd/v1alpha1/Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.PHONY: test
test:
	make -C ../../tests test GINKGO_EXTRA_ARGS=--focus="etcd.openshift.io/v1alpha1"

etcd/v1alpha1/doc.go

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +openshift:featuregated-schema-gen=true
// +groupName=etcd.openshift.io
package v1alpha1

etcd/v1alpha1/register.go

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerCluster{},
		&PacemakerClusterList{},
	)

	return nil
}
