Conversation

@maxcao13 (Member)

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 16, 2026
openshift-ci-robot commented Jan 16, 2026

@maxcao13: This pull request references AUTOSCALE-512 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Jan 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@maxcao13 maxcao13 marked this pull request as draft January 16, 2026 23:32
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2026
types that are unsupported by the service provider (e.g., ROSA does not support
`t3.micro`).

Simple approaches like ValidatingAdmissionPolicies or hiding API fields don't
Review comment (Member):

Can you elaborate on why a ValidatingAdmissionPolicy is less preferred, or why it can't be complementary to the reconciliation approach?

Wouldn't something like this be equivalent to what you achieve in your example with the AutoNodePolicy?

object.spec.template.spec.requirements.all(req,
  (req.key == 'node.kubernetes.io/instance-type' && req.operator == 'In')
  ? req.values.all(val, !['m5.4xlarge', 'm5.2xlarge'].has(val))
  : true
)

I assume the policy is best-effort, given the amount of input that is available: https://karpenter.sh/docs/reference/instance-types/

Reply (Member Author):

As a background for anyone else reading this thread and proposal:

Just looking at your link of instance types and the AWS-specific well-known labels that would select those instance types: there are a ton in there, and my thinking is that it would be technically infeasible to cover every pathway with a deny-list during validation. But if we were only to restrict a small subset of the paths that can select instance types, is that worth putting the code in?

For example, let's say ROSA SRE wants to restrict the m5.4xlarge and m5.2xlarge instance types, and we have a CEL rule that tries to stop users from setting those in their NodePool. What exactly would that look like?

If we have what you describe:

object.spec.template.spec.requirements.all(req,
  (req.key == 'node.kubernetes.io/instance-type' && req.operator == 'In')
  ? req.values.all(val, !['m5.4xlarge', 'm5.2xlarge'].has(val))
  : true
)

Users can get around this by just specifying this instead:

  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["m5"]
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values: ["8", "16"]

and if we respond with "oh, we will just restrict that both of those requirements cannot be set at the same time as well", you can see how complicated this gets.

If we only want to include a simple CEL rule like you proposed as complementary, and just ignore all the other ways a user can bypass it, then it wouldn't be so complicated and we could do it. But I wasn't sure that this best-effort policy would be greatly beneficial to us, so let me know if you think otherwise. To me, admission checks are an "all or nothing" sort of thing, and I could make that clearer in the doc.
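To make the bypass concrete, here is a minimal sketch (hypothetical helper names, not from any operator code) of a naive deny-list check that mirrors the CEL rule above: it only inspects the direct instance-type requirement, so requirements that select the same instances via the family/cpu labels slip through.

```python
# Sketch of a naive deny-list validation (hypothetical names).
# Like the CEL rule above, it only looks at the direct
# node.kubernetes.io/instance-type requirement.

DENYLIST = {"m5.4xlarge", "m5.2xlarge"}

def violates_denylist(requirements: list[dict]) -> bool:
    """Return True if a requirement explicitly opts in to a denied type."""
    for req in requirements:
        if (req["key"] == "node.kubernetes.io/instance-type"
                and req["operator"] == "In"
                and any(v in DENYLIST for v in req["values"])):
            return True
    return False

# A direct request for a denied type is caught:
direct = [{"key": "node.kubernetes.io/instance-type",
           "operator": "In", "values": ["m5.4xlarge"]}]
assert violates_denylist(direct)

# ...but selecting the same instances by family and cpu is not:
bypass = [
    {"key": "karpenter.k8s.aws/instance-family", "operator": "In", "values": ["m5"]},
    {"key": "karpenter.k8s.aws/instance-cpu", "operator": "In", "values": ["8", "16"]},
]
assert not violates_denylist(bypass)
```

Closing the family/cpu hole would mean enumerating every label combination that can resolve to a denied type, which is the complexity argument above.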

Reply (Member):

there are a ton in there, and my thinking is that it would be technically infeasible to restrict all pathways to have a deny-list during validation. But if we were only to restrict a small subset of paths that disable instancetypes, is that worth putting the code in?

How is this not true for the policy CR?

Reply (Member Author):

It is not true because the service provider can directly insert their own requirements that will be immutable from the user perspective and cannot be superseded/circumvented by any user-set requirements.

So, for example, if ROSA has a particular region that has disabled m5.4xlarge and m5.2xlarge instance types and wants to make sure all users in the management clusters in those regions are not allowed to spin those up, they can specify in the CR:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: NotIn
    values: ["m5.4xlarge", "m5.2xlarge"]

which will be reflected in the underlying NodePool.

But the trick here is that if the user tries to set something like this in the OpenshiftNodePool:

  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.4xlarge", "m5.2xlarge"]

The resulting underlying NodePool will look like:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: NotIn
    values: ["m5.4xlarge", "m5.2xlarge"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.4xlarge", "m5.2xlarge"]

and Karpenter's core requirements code is designed so that all requirements are combined by intersection when selecting a final instance type. In other words, only instance types that appear in every requirement are considered, so the spec above will not select any m5.4xlarge or m5.2xlarge instances. In fact, Karpenter will not be able to provision anything from this NodePool unless the user specifies another requirement that produces a non-empty intersection of instance types.
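The intersection semantics can be sketched as follows (a simplification with hypothetical names, not Karpenter's actual scheduling code):

```python
# Simplified sketch of Karpenter-style requirement intersection
# over a small hypothetical instance-type universe.

ALL_TYPES = {"m5.large", "m5.xlarge", "m5.2xlarge", "m5.4xlarge"}

def select_types(requirements: list[dict]) -> set[str]:
    """Intersect the instance types allowed by each requirement."""
    allowed = set(ALL_TYPES)
    for req in requirements:
        if req["key"] != "node.kubernetes.io/instance-type":
            continue
        values = set(req["values"])
        if req["operator"] == "In":
            allowed &= values
        elif req["operator"] == "NotIn":
            allowed -= values
    return allowed

merged = [
    # Policy-injected requirement (immutable from the user's view):
    {"key": "node.kubernetes.io/instance-type", "operator": "NotIn",
     "values": ["m5.4xlarge", "m5.2xlarge"]},
    # User-set requirement trying to opt back in:
    {"key": "node.kubernetes.io/instance-type", "operator": "In",
     "values": ["m5.4xlarge", "m5.2xlarge"]},
]

# The NotIn and In requirements intersect to nothing, so this
# NodePool cannot provision any instance type at all.
assert select_types(merged) == set()
```

Because the policy's NotIn entry participates in every intersection, no user-set requirement can re-enable a denied type; at worst the user ends up with an empty set and no provisioning.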


This enhancement is not applicable to standalone OpenShift clusters. AutoNodePolicy
relies on the management/guest cluster separation that only exists in HyperShift
environments. In standalone clusters, administrators have full control over
Review comment (Member):

fwiw there's nothing technically preventing the karpenter operator from running in other form factors

Reply (Member Author):

Thanks, yeah I glossed over this when writing/clauding it, but on OpenShift standalone (at least with karpenter-provider-aws), we probably wouldn't restrict instance types or anything else at all, so AutoNodePolicy would not be applicable there.

1. The Service Provider SRE updates an existing AutoNodePolicy (e.g., adds
a new instance type `m6i.xlarge` to the allowlist).

2. The Karpenter Operator detects the policy change and re-reconciles all
Review comment (Member):

This is only relevant for the NodePools which expressed intent for m6i.xlarge in the first place, correct? Does the policy clobber the values for the upper-level NodePool or only for the lower-level one? Both approaches have trade-offs.

@maxcao13 (Member Author) commented Jan 20, 2026:

I think the policy should clobber on the lower level nodepool.

If a user-configured set of requirements were to yield 0 instance types, the UX feedback would simply be an Event. This currently exists already in Karpenter itself:

5s                      Warning   NoCompatibleInstanceTypes                    NodePool/ondemand-and-spot2           NodePool requirements filtered out all compatible available instance types

I don't think the user should explicitly "know" about these overrides, as I think it would be unnecessary mental overhead. The only thing they really need to know is that these instance types are not enabled by the service provider, and that should easily be known from ROSA/OpenShift documentation.

In a perfect world, the underlying NodePool and NodeClass wouldn't be in the guest cluster so the user wouldn't see all the underlying clobbering, but that is currently not the case, and a Karpenter power user will technically be able to see the clobbering there, but that boat sailed already.
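The clobbering approach discussed here can be sketched as follows (a hypothetical reconcile helper, not the operator's actual code, assuming policy requirements are simply appended to the rendered lower-level NodePool):

```python
def reconcile_nodepool(user_requirements: list[dict],
                       policy_requirements: list[dict]) -> list[dict]:
    """Render the lower-level NodePool: user intent plus policy overrides.

    Policy requirements are appended verbatim; because Karpenter
    intersects all requirements, an appended NotIn entry cannot be
    superseded by anything the user sets in the upper-level object.
    """
    return list(user_requirements) + list(policy_requirements)

user = [{"key": "node.kubernetes.io/instance-type", "operator": "In",
         "values": ["m5.4xlarge", "m5.2xlarge"]}]
policy = [{"key": "node.kubernetes.io/instance-type", "operator": "NotIn",
           "values": ["m5.4xlarge", "m5.2xlarge"]}]

rendered = reconcile_nodepool(user, policy)
assert len(rendered) == 2  # both requirements land in the lower NodePool
```

This matches the clobber-on-the-lower-level choice: the upper-level object keeps the user's intent untouched, while the rendered NodePool carries both entries and intersection does the enforcement.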

immediate rejection.

## Alternatives (Not Implemented)

Review comment (Member):

validating admission policy? webhook?

Reply (Member Author):

I can add more detail on alternatives that were considered, thanks 👍

@maxcao13 (Member Author) commented Jan 21, 2026

This enhancement does not seem viable for the overall ROSA architecture; instead, the ROSA team will implement webhook controllers and ValidatingAdmissionPolicies to prevent restricted instance types from being created.

I'll leave it open for now, but I will likely close it soon. Thanks, Alberto, for the review here 👍

