Skip to content

Conversation

@slawande2
Copy link
Collaborator

@slawande2 slawande2 commented Jan 14, 2026

Which issue this PR addresses:

Fixes ARO-22824

What this PR does / why we need it:

  • Managed identities, cluster service principals, and role assignments are cleaned up as part of e2e cluster deletion, but this doesn't occur if the e2e job times out and doesn't make it to the cleanup step.

  • While running miwi E2E tests in the E2E Build and Release pipeline, we observed that managed identities and cluster service principals are not being cleaned up when the E2E job times out. This PR adds cleanup logic specifically for orphaned resource groups from prod e2e clusters and service principals, by creating a dedicated cleanup job in pkg/util/purge.

Test plan for issue:

  • Added unit tests for orphaned service principals cleanup

Is there any documentation that needs to be updated for this PR?

no

Copy link
Collaborator

@shubhadapaithankar shubhadapaithankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@mociarain mociarain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some thoughts and questions but overall it looks great. This will be a nice change

@mociarain
Copy link
Collaborator

This LGTM. Just wondering if you tested it with the dry-run mode to see if there were any edge cases that appeared?

@slawande2
Copy link
Collaborator Author

slawande2 commented Jan 16, 2026

@mociarain I tested the changes in dryRun mode: https://msazure.visualstudio.com/AzureRedHatOpenShift/_build/results?buildId=149649712&view=logs&j=a4f1910f-c367-5697-edcd-724d81cde23b&t=bc103e47-9b88-5c56-fcad-f3f2f0b2ed91
and the only concern I have is that the purgeTTL is set to 3h for the purge pipeline. This would delete all SPs older than 3h, which could delete SPs mid e2e tests. Should we change it to 48h instead?

Copy link
Contributor

@kimorris27 kimorris27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the pipeline run, and I thought it was interesting how there were exactly 100 mock-msi SPs. I think there are more than 100 and we're only getting 100 because the list response is paginated, and we're not checking the other pages.

Some more info here: https://learn.microsoft.com/en-us/graph/paging?tabs=http

TBH I'm not sure if going through all of the pages is worth the trouble. Since the clean pipeline runs on a schedule, we'll get all of the SPs eventually.

I would be cool merging this without any changes, but I wanted to point this out in case anyone else has other thoughts.

@mociarain
Copy link
Collaborator

the only concern I have is that the purgeTTL is set to 3h for the purge pipeline. This would delete all SPs older than 3h, which could delete SPs mid e2e tests. Should we change it to 48h instead?

Agreed. Bump it to 48hrs

TBH I'm not sure if going through all of the pages is worth the trouble. Since the clean pipeline runs on a schedule, we'll get all of the SPs eventually.

+1 on this. I think we don't need to change it but can you add a comment to it just outlining that this is the case and we left it this way assuming it will eventually clear them out.

@slawande2
Copy link
Collaborator Author

I realized after running the pipeline in dry run mode that using rc.graphClient.Applications().ByApplicationId(objectID).Delete(ctx, nil) would only delete the mock-msi SP and v4-e2e-* SPs, but not Disk Encryption Sets. These need to be deleted using rc.graphClient.ServicePrincipals().ByServicePrincipalId(objectID).Delete(ctx, nil) as no application object exists to delete for DESs. And if we use ServicePrincipals().Delete() for everything, the SPs would get deleted but leave orphaned application registration.

@slawande2 slawande2 force-pushed the slawande/ARO-22824/e2e-cleanup branch from a4d2cef to 1648f58 Compare January 20, 2026 23:03
@slawande2
Copy link
Collaborator Author

I'm still getting an error for deleting disk encryption sets in dry run mode.

@slawande2
Copy link
Collaborator Author

slawande2 commented Jan 23, 2026

For disk encryption sets, Caden suggested deleting the resource groups that contain these DES instead of deleting individual DESs, which makes sense to me. There's already existing logic for cleaning up resourcegroups with aro-v4-e2e and v4-e2e resourceGroupDeletePrefixes which should work but isn't? there are resource groups as old as 2 months that haven't been cleaned up.

@kimorris27
Copy link
Contributor

For disk encryption sets, Caden suggested deleting the resource groups that contain these DES instead of deleting individual DESs, which makes sense to me. There's already existing logic for cleaning up resourcegroups with aro-v4-e2e and v4-e2e resourceGroupDeletePrefixes which should work but isn't? there are resource groups as old as 2 months that haven't been cleaned up.

The existing code does clean up RGs with those prefixes, but it also checks for the purge tag. I looked in the E2E sub just now, and it seems like recent CI E2E RGs don't have the purge tag... but I'm not sure why that would be.

I think we'll be fine once your changes merge though, since we won't be checking that tag anymore.

@slawande2
Copy link
Collaborator Author

/azp run ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chainsaw Pull requests or issues owned by Team Chainsaw ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants