-
Notifications
You must be signed in to change notification settings - Fork 584
OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing #2544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Hello @jaypoulz! Some important instructions when contributing to openshift/api: |
|
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
58218ce to
96e327f
Compare
2ba442d to
29b9fec
Compare
|
@jaypoulz thank you for the PR, do you mind making the CI happy? |
29b9fec to
26f7821
Compare
|
Hi @saschagrunert :) Working on it! :D |
|
A few open questions I have:
That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.
|
b0ff230 to
1b57b09
Compare
b9b727f to
fdd53e9
Compare
|
Yeah, I'll ignore the CI failures for now, running
I'm new to API review, but my gut feeling tells me that a dedicated
You can also try to run it in a container by
Do you mind elaborating on that? Do you mean generating the code for the unions? API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go @jaypoulz is there an OpenShift enhancement available for this change? |
etcd/tnf/v1alpha1/tests/pacemakerstatuses.tnf.etcd.openshift.io/DualReplica.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@etcd/v1alpha1/Makefile`:
- Around line 1-3: The Makefile only defines a test target and is missing
minimal all and clean targets required by checkmake; update the Makefile to add
an all target (e.g., make all depends on test) and a clean target (no-op or
removing generated artifacts), and include both "all" and "clean" in the .PHONY
declaration alongside "test" so .PHONY lists all three targets and the new
targets are implemented (referencing the Makefile's existing "test" target when
wiring "all").
♻️ Duplicate comments (3)
payload-manifests/crds/0000_25_etcd_01_pacemakerclusters-CustomNoUpgrade.crd.yaml (1)
171-174: Fix invalid CEL call for canonical IP check (still present).
Same issue as previously noted:ip.isCanonical(self)is invalid; parse first withip(self). Regenerate from source so all variants stay aligned.🐛 Proposed fix
- rule: isIP(self) && ip.isCanonical(self) && ip(self).isGlobalUnicast() + rule: isIP(self) && ip(self).isCanonical() && ip(self).isGlobalUnicast()openapi/openapi.json (1)
14663-14842: Documented list-size constraints aren’t enforced in the schema.Lines 14663–14842 describe minimum/maximum item counts for several arrays (conditions, resources, fencingAgents, nodes), but the OpenAPI definitions don’t include corresponding
minItems/maxItems. This leaves the constraints unenforced for API consumers and validation tooling. Please add the appropriate kubebuilder validation markers on the source Go types (e.g.,+kubebuilder:validation:MinItems/MaxItems) and regenerate the OpenAPI.payload-manifests/crds/0000_25_etcd_01_pacemakerclusters-TechPreviewNoUpgrade.crd.yaml (1)
171-175: Verify CELip.isCanonicalusage against the target K8s CEL library.The rule uses
ip.isCanonical(self). Please confirm this is supported in the target Kubernetes/OCP version (some documentation and past reviews suggestip(self).isCanonical()instead).Kubernetes CEL IP library ip.isCanonical(self) versus ip(self).isCanonical() supported syntax
🧹 Nitpick comments (2)
tools/go.mod (1)
38-38: Document and time‑box the controller‑tools fork.The replace to a personal fork is a supply‑chain and maintainability risk unless clearly justified and tracked. Please add a short rationale (and upstream issue/rollback plan) near this replace and verify the fork still builds with the Go toolchain used here.
📌 Suggested inline note
-replace sigs.k8s.io/controller-tools => github.com/jaypoulz/controller-tools v0.0.0-20260119185850-e534bd0a65da +// TEMP: forked controller-tools for validation ordering fix; revert after upstream merges <issue/link> +replace sigs.k8s.io/controller-tools => github.com/jaypoulz/controller-tools v0.0.0-20260119185850-e534bd0a65dapayload-manifests/crds/0000_25_etcd_01_pacemakerclusters-TechPreviewNoUpgrade.crd.yaml (1)
401-425: Alignmethodvalidation with the “redfish-only” constraint.
nameis enforced to end with_redfish, butmethodis an unconstrained string. If the API truly only supports redfish today, consider constrainingmethodto match; otherwise, update the description to avoid implying a restriction.♻️ Possible schema tightening
method: description: |- method is the fencing method used by this agent (e.g., "redfish", "ipmi", "fence_aws"). This is extracted from the pacemaker resource agent type. maxLength: 63 minLength: 1 type: string + enum: + - redfish
...lpha1/zz_generated.crd-manifests/0000_10_config-operator_01_backups-CustomNoUpgrade.crd.yaml
Show resolved
Hide resolved
.../zz_generated.crd-manifests/0000_10_config-operator_01_backups-TechPreviewNoUpgrade.crd.yaml
Show resolved
Hide resolved
ae78d1e to
1141715
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@etcd/README.md`:
- Around line 86-88: Update the fencing example in the README to match the
current CRD validation by showing redfish-only naming and method: change the
example name pattern to use the `_redfish` suffix (e.g., "master-0_redfish") and
set method to "redfish" (remove other method examples like ipmi or fence_aws);
reference the example keys `name` and `method` so reviewers can find and align
the README with the CRD's `_redfish` naming constraint.
In `@etcd/v1alpha1/types_pacemakercluster.go`:
- Around line 505-511: The `LastUpdated` field is declared as +required but its
JSON tag includes `omitempty`, which contradicts requiredness; update the
declaration for LastUpdated (metav1.Time) in types_pacemakercluster.go to be
consistent by removing `omitempty` from the
`json:"lastUpdated,omitempty,omitzero"` tag so it becomes
`json:"lastUpdated,omitzero"` (or alternatively change the kubebuilder marker to
+optional if you intend the field to be optional); apply the same pattern used
for other required metav1.Time fields like `LastTransitionTime`.
♻️ Duplicate comments (1)
openapi/generated_openapi/zz_generated.openapi.go (1)
26444-26465: Align documented invariants with enforced schema validations.Several descriptions assert minimum condition/resource counts, a node cap, and monotonic
lastUpdated, but I don’t seeminItems/maxItemsorx-kubernetes-validationsemitted here. If these constraints are intended, please add the corresponding kubebuilder markers in the Go types and regenerate; otherwise consider softening the wording to avoid implying enforcement.#!/bin/bash set -euo pipefail files=$(fd -a 'types_pacemakercluster.go' .) for file in $files; do echo "==> $file" rg -n -C2 'PacemakerCluster(FencingAgentStatus|NodeStatus|ResourceStatus|Status)' "$file" rg -n -C2 'kubebuilder:validation:(MinItems|MaxItems|Enum|XValidation)' "$file" rg -n -C2 'Conditions\s+\[\]|Resources\s+\[\]|Nodes\s+\[\]|LastUpdated' "$file" doneAlso applies to: 26547-26568, 26595-26606, 26655-26676, 26701-26749
🧹 Nitpick comments (3)
etcd/v1alpha1/tests/pacemakerclusters.etcd.openshift.io/DualReplica.yaml (1)
682-1934: Thorough IP address validation test coverage.The IP validation tests cover:
- IPv4 loopback rejection (127.0.0.1)
- Invalid format rejection ("not-an-ip")
- Valid IPv6 acceptance (2001:db8::1)
- IPv6 loopback rejection (::1)
- IPv6 link-local rejection (fe80::1)
This aligns with the
isIP(self) && ip.isCanonical(self) && ip(self).isGlobalUnicast()CEL validation rule onPacemakerNodeAddress.address.Consider adding tests for edge cases if time permits:
- Fencing agent name validation (
_redfishsuffix requirement)- Missing required resources (Kubelet/Etcd)
- Non-canonical IPv6 format (e.g.,
2001:0db8::1)etcd/README.md (1)
18-20: Make the status-only nature explicit.A short, explicit note that this CRD has no spec would prevent confusion for readers expecting a spec section.
📝 Suggested doc tweak
The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector. +It is a status-only resource; spec is intentionally omitted and must remain empty.payload-manifests/crds/0000_25_etcd_01_pacemakerclusters-DevPreviewNoUpgrade.crd.yaml (1)
401-421: Consider validatingmethodto match supported fencing method.
nameis validated to end with_redfish, butmethodaccepts any string. If only redfish is supported, constrainmethodtoo (preferably via Go type validation so the CRD is regenerated consistently).♻️ Suggested schema tightening
method: description: |- method is the fencing method used by this agent (e.g., "redfish", "ipmi", "fence_aws"). This is extracted from the pacemaker resource agent type. maxLength: 63 minLength: 1 + enum: + - redfish type: string
| // lastUpdated is the timestamp when this status was last updated. This is useful for identifying | ||
| // stale status reports. It must be a valid timestamp in RFC3339 format. Once set, this field cannot | ||
| // be removed and cannot be set to an earlier timestamp than the current value. | ||
| // +kubebuilder:validation:Format=date-time | ||
| // +required | ||
| LastUpdated metav1.Time `json:"lastUpdated,omitempty,omitzero"` | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's check if this is in the etcd package mentioned
echo "=== Looking at the specific file mentioned ==="
if [ -f "etcd/v1alpha1/types_pacemakercluster.go" ]; then
sed -n '505,511p' etcd/v1alpha1/types_pacemakercluster.go
else
echo "File not found at specified path, searching for it..."
find . -name "types_pacemakercluster.go" -type f 2>/dev/null
fi
echo ""
echo "=== Searching for metav1.Time fields with both omitempty and omitzero ==="
rg -n "metav1\.Time.*json:.*omitempty.*omitzero|metav1\.Time.*json:.*omitzero.*omitempty" --type=go -B 1 | head -30
echo ""
echo "=== Searching for metav1.Time fields with just omitzero ==="
rg -n "metav1\.Time.*json:.*omitzero" --type=go -B 1 | grep -v "omitempty" | head -30
echo ""
echo "=== Searching for metav1.Time fields with just omitempty ==="
rg -n "metav1\.Time.*json:.*omitempty" --type=go -B 1 | grep -v "omitzero" | head -30Repository: openshift/api
Length of output: 3488
🏁 Script executed:
#!/bin/bash
echo "=== Checking context around all instances of omitempty,omitzero combination ==="
rg -B 2 "metav1\.Time.*omitempty.*omitzero|metav1\.Time.*omitzero.*omitempty" --type=goRepository: openshift/api
Length of output: 929
🏁 Script executed:
#!/bin/bash
echo "=== Searching for documentation or comments about omitzero ==="
rg -i "omitzero" --type=go -B 3 -A 3 | head -40
echo ""
echo "=== Check if there are metav1.Time fields with omitzero but no omitempty ==="
rg "metav1\.Time.*\`json:.*omitzero" --type=go | grep -v "omitempty"
echo ""
echo "=== Check the actual inconsistency: +required with omitempty ==="
rg -B 3 "json:.*omitempty" --type=go | rg -B 3 "\+required"Repository: openshift/api
Length of output: 37285
🏁 Script executed:
#!/bin/bash
echo "=== Checking other +required metav1.Time fields (if any) for comparison ==="
rg -B 3 "metav1\.Time.*json:" --type=go | rg -B 3 "\+required" | head -50Repository: openshift/api
Length of output: 2807
Remove omitempty or reconsider the +required marker.
The LastUpdated field is marked as +required but uses the omitempty JSON tag, which allows the field to be omitted during JSON serialization. This is contradictory. Other similar +required metav1.Time fields in the codebase (e.g., LastTransitionTime in operatorcontrolplane and machineconfiguration packages) omit the omitempty tag entirely. Either remove omitempty to enforce presence in JSON output, or change the marker to +optional if the field is truly optional.
🤖 Prompt for AI Agents
In `@etcd/v1alpha1/types_pacemakercluster.go` around lines 505 - 511, The
`LastUpdated` field is declared as +required but its JSON tag includes
`omitempty`, which contradicts requiredness; update the declaration for
LastUpdated (metav1.Time) in types_pacemakercluster.go to be consistent by
removing `omitempty` from the `json:"lastUpdated,omitempty,omitzero"` tag so it
becomes `json:"lastUpdated,omitzero"` (or alternatively change the kubebuilder
marker to +optional if you intend the field to be optional); apply the same
pattern used for other required metav1.Time fields like `LastTransitionTime`.
JoelSpeed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of nits, but otherwise LGTM
Have reviewed the verify and lint failures and will override the lint failures, verify should be fixed now we've bumped controller-tools separately
be75fd9 to
efa53f8
Compare
| // +required | ||
| Method FencingMethod `json:"method,omitempty"` | ||
|
|
||
| // node is the name of the node that this fencing agent can fence (its target). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not always the same as the node name in PacemakerClusterNodeStatus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes 🙈
| // +required | ||
| Conditions []metav1.Condition `json:"conditions,omitempty"` | ||
|
|
||
| // name is the name of the node. This is expected to match the Kubernetes node's name, which must be a lowercase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if this should include node? Either just as node or nodeName? We have examples of both already in openshift APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to follow to follow apiserver/v1 and operator/v1 and go with nodeName. When I see "node", my first thought is that the field will be a corev1.Node reference.
efa53f8 to
bf02033
Compare
Add PacemakerCluster CRD to monitor pacemaker cluster health in Two Node OpenShift with Fencing deployments. This status-only resource is populated by the cluster-etcd-operator. Features: - Cluster-scoped singleton (must be named 'cluster') - Hierarchical health tracking: cluster -> nodes -> resources/fencing agents - Comprehensive validation for node names, IP addresses, and timestamps Gated behind DualReplica feature gate.
bf02033 to
4b2f7e2
Compare
|
/lgtm |
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JoelSpeed, saschagrunert The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jaypoulz: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/override ci/prow/lint We decided to make conditions required and believe that's the correct direction for this API |
|
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/lint DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/override ci/prow/e2e-aws-serial-techpreview Known issue with the pipeline controller, these do not actually exist |
|
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial, ci/prow/e2e-aws-serial-techpreview DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/verified by https://issues.redhat.com/browse/OCPEDGE-2097?focusedId=28891048&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-28891048 and https://issues.redhat.com/secure/attachment/13596777/healthcheck-validation.zip This was a detailed scenario run-through of 8 scenarios that the PacemakerCluster API should report with the reference PR in CEO: openshift/cluster-etcd-operator#1487 I did find a few bugs with the CEO implementation of the reporting, but the API behaved as expected and had comprehensive coverage for the noted scenarios. |
|
@jaypoulz: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with openshift/cluster-etcd-operator#1487
Depends on kubernetes-sigs/controller-tools#1324, which in turn needs a bump for go 1.25
In the meantime, I've vendored: openshift/kubernetes-sigs-controller-tools#33
Depends on #2657 #2658