Skip to content

Conversation

@nagilson
Copy link
Member

@nagilson nagilson commented Jan 13, 2026

context

for #52090
copilot failed attempt #52338

Issue: The Workload Manifest Reader can fail with ManifestFromWorkloadSetNotFound when calling GetManifests when package managers clean up manifests from older sdk versions that were still needed by the newer version workloads.

❗ This PR is better reviewed with hiding whitespace changes.

Manual Testing Loop Demo

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload install macos
rm -rf  /workspaces/sdk2/sdk/artifacts/bin/redist/Debug/dotnet/sdk-manifests/10.0.100/microsoft.net.workload.emscripten.current/10.0.102/

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload --info

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload repair
image image

What's this do

I considered removing the workload set entirely during workload manifest search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.

Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.

So, now we have a separate class which tries to repair missing manifests before we start the manifest lookup. The errors for missing manifests are disabled in the recorder because the recorder will fail to initialize if the manifests are corrupted and it doesn't have the data available to repair or recover from that state. We do the sub-manifest repair because the manifests need to be in a good state for the rest of repair to have the correct data to function. The repairer being hooked into the Installer means commands such as repair will get the hook to do so, but commands such as --info never initialize an installer through the factory, so they instead print a prettier error.

Notes

I tested and confirmed:
repair, restore, update are fixed.
--info, workload --info, workload list show the small error.

Plan: (this commit is step 1 of the plan but idk if it works completely yet)

- Workload Manifest Reader will always fail with `ManifestFromWorkloadSetNotFound` when calling GetManifests.

I considered removing the workload set entirely during workload set search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.

Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.

So,

1. Add a function to repair so it can repair missing manifests, because before it called ReinstallWorkloadsBasedOnCurrentManifests which calls install workloads, which calls install packs, which needs to read the manifests to know the packs, so repair would fail.

2. Once repair is able to fix the issue, resolve the other commands to be able to fix the issue by A - detecting corrupt issue, and then calling the repair subcomponent to install missing manifests

3. Make sure repair isn't cyclical and doesn't try to call the sub repair option (should be ok, because we don't get to the missing manifests code until we've already repaired the manifests)

4. Add boolean for manifest reader or whatever to print during info, version or list that there was a corrupt workload set  likely from pkg manager but still allow repairing it (or decide if we don't want to do that.)
workload restore is not setup for the injection so didn't add it there but the code is shared between the two so testing like this should be decent
…rkload command so it can be shared"

This reverts commit a3959ba.
… also fixed

this prevents needing to inherit or modify many command files at once
workload history recorder tries to get manifests and it does this before file baesd installer is created.
…ble to repair the manifests - the repair family commands should be able repair
the install state is empty on disk - this is not safe to rely on. Also if we refresh manifests we will create a dupe manifest
@nagilson nagilson changed the title workload x can recover from corrupt workload sets workload repair can recover from corrupt workload sets Jan 15, 2026
@nagilson nagilson marked this pull request as ready for review January 15, 2026 00:51
@nagilson nagilson requested a review from a team January 15, 2026 00:52
Copy link
Member Author

@nagilson nagilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some polish needed if this isnt urgent:

  1. The error string should likely contain the workload id and not just the version
  2. We should try to dedupe the message if possible, despite separate assemblies
  3. The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
  4. It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
  5. The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer

@nagilson
Copy link
Member Author

The test failure seems to be due to a change in the new, more specific error message presenting in some cases instead of an existing error message.

@marcpopMSFT
Copy link
Member

I tested in codespaces. I can confirm that if I delete any workload manifest (after installing), I will get errors during --info, workload --info, and build. I can also confirm that repair, restore, and update all fix the problem.

The error experiences themselves could probably be cleaned up a bit as they give full exception stacks in 2/3 of those flows (workload --info is the best Ux). But this is sufficient for unblocking the customer. Need to fix the tests and review.

we only searched the manifest  root at dotnetDir provided by the installer factory  but the resolver can search all roots, so the global root and or the user local root.
@nagilson
Copy link
Member Author

/xlf

@github-actions
Copy link
Contributor

▶️ XLF update workflow started. Track progress in this workflow run.

@nagilson
Copy link
Member Author

nagilson commented Jan 21, 2026

There's some polish needed if this isnt urgent:

  1. The error string should likely contain the workload id and not just the version
  2. We should try to dedupe the message if possible, despite separate assemblies
  3. The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
  4. It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
  5. The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer
  1. there is no id, just the set version. I've improved the err message
  2. not easily done as it'd make a circular reference
  3. fixed and realized one only searched 1 manifest root
  4. not investing the time to do that
  5. fixed

@nagilson
Copy link
Member Author

I was still able to confirm the fix worked for me
image

@nagilson nagilson requested a review from dsplaisted January 21, 2026 23:57
@nagilson
Copy link
Member Author

Failures now are due to dotnet-watch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants