AUTOSCALE-512: enhancement for AutoNodePolicy #1924
base: master
Conversation
@maxcao13: This pull request references AUTOSCALE-512, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
3447f82 to 40dffba
Signed-off-by: Max Cao <macao@redhat.com>
40dffba to e842425
> types that are unsupported by the service provider (e.g., ROSA does not support
> `t3.micro`).
> Simple approaches like ValidatingAdmissionPolicies or hiding API fields don't
can you elaborate on why validating admission policy is less preferred or why can't it be complementary to the reconciliation approach?
Wouldn't something like this be equivalent to what you achieve in your example with the autoNodePolicy?
```cel
object.spec.template.spec.requirements.all(req,
  (req.key == 'node.kubernetes.io/instance-type' && req.operator == 'In')
    ? req.values.all(val, !['m5.4xlarge', 'm5.2xlarge'].has(val))
    : true
)
```
I assume the policy is best-effort, given the amount of input that is available: https://karpenter.sh/docs/reference/instance-types/
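For readers unfamiliar with how such a CEL rule would be deployed: a best-effort rule like the one above could be embedded in a ValidatingAdmissionPolicy roughly as sketched below. This is a hedged illustration, not part of the proposal; the policy name, the NodePool API group/version, and the deny-list values are assumptions.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-restricted-instance-types   # hypothetical name
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: ["karpenter.sh"]      # assumed group/version for NodePool
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["nodepools"]
  validations:
    - expression: >
        object.spec.template.spec.requirements.all(req,
          (req.key == 'node.kubernetes.io/instance-type' && req.operator == 'In')
            ? req.values.all(val, !['m5.4xlarge', 'm5.2xlarge'].has(val))
            : true)
      message: "requested instance types are not enabled by the service provider"
```

A matching ValidatingAdmissionPolicyBinding would also be needed to put the policy into effect on a cluster.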
As a background for anyone else reading this thread and proposal:
Just looking at your link of instance types and aws-specific well-known labels that would select those instancetypes - there are a ton in there, and my thinking is that it would be technically infeasible to restrict all pathways to have a deny-list during validation. But if we were only to restrict a small subset of paths that disable instancetypes, is that worth putting the code in?
For example, let's say ROSA SRE wants to restrict the m5.4xlarge and m5.2xlarge instance types, and we have cel that tries to restrict users from setting this in their nodepool, what exactly would that look like?
If we have what you describe:
```cel
object.spec.template.spec.requirements.all(req,
  (req.key == 'node.kubernetes.io/instance-type' && req.operator == 'In')
    ? req.values.all(val, !['m5.4xlarge', 'm5.2xlarge'].has(val))
    : true
)
```
Users can get around this by just specifying this instead:
```yaml
requirements:
  - key: "karpenter.k8s.aws/instance-family"
    operator: In
    values: ["m5"]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: In
    values: ["8", "16"]
```
and if we say, "oh we will just restrict that both those requirements cannot get set at the same time as well", then you can see how complicated this can get.
If we only want to include a simple CEL rule like you proposed as complementary, and just ignore all the other ways a user can bypass it, then it wouldn't be so complicated and we could do it; but I wasn't sure that this best-effort policy would be greatly beneficial to us, so let me know if you think otherwise. To me, I imagined admission checks as an "all or nothing" sort of thing, and I could make that clearer in the doc.
> there are a ton in there, and my thinking is that it would be technically infeasible to restrict all pathways to have a deny-list during validation. But if we were only to restrict a small subset of paths that disable instancetypes, is that worth putting the code in?
How is this not true for the policy CR?
It is not true because the service provider can directly insert their own requirements that will be immutable from the user perspective and cannot be superseded/circumvented by any user-set requirements.
So, for example, if ROSA has a particular region that has disabled m5.4xlarge and m5.2xlarge instance types and wants to make sure all users in the management clusters in those regions are not allowed to spin those up, they can specify in the CR:
```yaml
requirements:
  - key: node.kubernetes.io/instance-type
    operator: NotIn
    values: ["m5.4xlarge", "m5.2xlarge"]
```
which will reflect in the underlying NodePool.
But the trick here is that if the user tries to set something like this in the OpenshiftNodePool:
```yaml
- key: node.kubernetes.io/instance-type
  operator: In
  values: ["m5.4xlarge", "m5.2xlarge"]
```
The resulting underlying NodePool will look like:
```yaml
requirements:
  - key: node.kubernetes.io/instance-type
    operator: NotIn
    values: ["m5.4xlarge", "m5.2xlarge"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.4xlarge", "m5.2xlarge"]
```
and Karpenter's core requirements code is designed such that all requirements used to select a final instance type are combined by intersection. In other words, only instance types that appear in all requirements will be considered, so this requirements spec's intersection will not select any m5.4xlarge or m5.2xlarge instances. In fact, Karpenter will not be able to provision anything using this NodePool unless the user specifies another requirement that produces a non-empty intersection of instance types.
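The intersection semantics described above can be illustrated with a small sketch. This is a hypothetical simplification, not Karpenter's actual code: `allowed_values` and the toy `universe` of instance types are illustration-only names, and real Karpenter requirements support more operators.

```python
# Hypothetical sketch (NOT Karpenter's actual implementation) showing why
# a provider-injected NotIn plus a user-set In on the same key yields an
# empty set of schedulable instance types: every requirement narrows the
# candidate set by intersection.

def allowed_values(requirements, key, universe):
    """Return the set of values for `key` permitted by ALL requirements."""
    allowed = set(universe)
    for req in requirements:
        if req["key"] != key:
            continue
        if req["operator"] == "In":
            allowed &= set(req["values"])   # keep only listed values
        elif req["operator"] == "NotIn":
            allowed -= set(req["values"])   # remove listed values
    return allowed

# Toy catalog standing in for the provider's full instance-type list.
universe = ["m5.2xlarge", "m5.4xlarge", "m5.xlarge", "c5.large"]

merged = [
    # Injected by the service provider (immutable from the user's view):
    {"key": "node.kubernetes.io/instance-type", "operator": "NotIn",
     "values": ["m5.4xlarge", "m5.2xlarge"]},
    # Set by the user in the OpenshiftNodePool:
    {"key": "node.kubernetes.io/instance-type", "operator": "In",
     "values": ["m5.4xlarge", "m5.2xlarge"]},
]

print(allowed_values(merged, "node.kubernetes.io/instance-type", universe))
# → set() : the intersection is empty, so nothing can be provisioned
```

Because the provider's NotIn always subtracts the restricted types, no user-supplied requirement can re-add them; at worst the user ends up with an empty intersection and a NoCompatibleInstanceTypes event.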
> This enhancement is not applicable to standalone OpenShift clusters. AutoNodePolicy
> relies on the management/guest cluster separation that only exists in HyperShift
> environments. In standalone clusters, administrators have full control over
fwiw there's nothing technically preventing the karpenter operator from running in other form factors
Thanks, yeah I glossed over this when writing/clauding it, but on OpenShift standalone (at least with karpenter-provider-aws), we probably wouldn't restrict instance types or anything else at all, so AutoNodePolicy would not be applicable there.
> 1. The Service Provider SRE updates an existing AutoNodePolicy (e.g., adds
>    a new instance type `m6i.xlarge` to the allowlist).
>
> 2. The Karpenter Operator detects the policy change and re-reconciles all
This is only relevant for the nodepools which expressed intent for m6i.xlarge in the first place, correct? does the policy clobber the values for the upper level nodepool or only for the lower level one? both approaches have trade offs.
I think the policy should clobber on the lower level nodepool.
If a user-configured set of requirements were to yield 0 instance types, the UX feedback would simply be an Event. This currently exists already in Karpenter itself:
```
5s  Warning  NoCompatibleInstanceTypes  NodePool/ondemand-and-spot2  NodePool requirements filtered out all compatible available instance types
```
I don't think the user should explicitly "know" about these overrides, as I think it would be unnecessary mental overhead. The only thing they really need to know is that these instance types are not enabled by the service provider, and that should be easily known from ROSA/OpenShift documentation.
In a perfect world, the underlying NodePool and NodeClass wouldn't be in the guest cluster, so the user wouldn't see all the underlying clobbering. That is currently not the case, and a Karpenter power user will technically be able to see the clobbering there, but that boat has sailed already.
> immediate rejection.
>
> ## Alternatives (Not Implemented)
validating admission policy? webhook?
I can add more detail on alternatives that were considered, thanks 👍
This enhancement does not seem to be viable for the overall ROSA architecture; instead, the ROSA team will implement webhook controllers and VAPs to prevent restricted instance types from being created. I'll leave it open for now, but it seems like I will close this soon. Thanks, Alberto, for the review here 👍
No description provided.