-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[Resilience] Integrate Polly for grain placement resilience and add configuration options #9819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR integrates the Polly resilience framework into Orleans' grain placement system to add retry and timeout capabilities for placement operations, making the system more robust against transient failures.
- Adds Polly-based resilience pipeline with configurable retry and timeout policies for grain placement
- Introduces three new configuration options in
SiloMessagingOptions:PlacementTimeout,PlacementMaxRetries, andPlacementRetryBaseDelay - Refactors placement logic to execute through a resilience pipeline with message expiration checks
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| Directory.Packages.props | Adds Polly (8.5.2) and Polly.Extensions (8.6.4) package references |
| src/Orleans.Core/Orleans.Core.csproj | Adds Polly package references to Orleans.Core |
| src/Orleans.Runtime/Orleans.Runtime.csproj | Adds Polly.Extensions package reference to Orleans.Runtime |
| src/Orleans.Runtime/OrleansRuntimeResiliencePolicies.cs | New file defining resilience policies with timeout and retry strategies for placement operations |
| src/Orleans.Runtime/Configuration/Options/SiloMessagingOptions.cs | Adds three new configuration properties for placement resilience with defaults |
| src/Orleans.Runtime/Hosting/DefaultSiloServices.cs | Registers Orleans runtime resilience policies during service configuration |
| src/Orleans.Runtime/Placement/PlacementService.cs | Integrates resilience pipeline into placement workflow with message expiration checks and context struct to avoid allocations |
| builder | ||
| .AddTimeout(new TimeoutStrategyOptions | ||
| { | ||
| Timeout = options.PlacementTimeout, | ||
| OnTimeout = args => | ||
| { | ||
| logger.LogWarning("Grain placement operation timed out after {Timeout}.", options.PlacementTimeout); | ||
| return default; | ||
| } | ||
| }) | ||
| .AddRetry(new RetryStrategyOptions | ||
| { | ||
| MaxRetryAttempts = options.PlacementMaxRetries, | ||
| BackoffType = DelayBackoffType.Exponential, | ||
| Delay = options.PlacementRetryBaseDelay, | ||
| UseJitter = true, | ||
| ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => IsTransientPlacementException(ex)), | ||
| OnRetry = args => | ||
| { | ||
| logger.LogDebug(args.Outcome.Exception, "Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.", args.AttemptNumber, args.RetryDelay); | ||
| return default; | ||
| } | ||
| }); |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resilience pipeline order is incorrect. The timeout is added before the retry strategy, which means the timeout applies to the entire operation including all retries, not to individual retry attempts. This should be reversed: add retry first, then timeout, so each retry attempt has its own timeout.
builder
.AddRetry(new RetryStrategyOptions { ... })
.AddTimeout(new TimeoutStrategyOptions { ... });With the current order, if PlacementTimeout is 30 seconds and retries take 10 seconds each, the operation would fail after 30 seconds even if retries are still within the retry limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's precisely what we want.
| exception switch | ||
| { | ||
| OrleansException => true, | ||
| TimeoutException => true, |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying on TimeoutException creates a logical conflict. When a TimeoutException is thrown by Polly's timeout strategy, retrying it will likely just hit the same timeout again. This exception should either not be retried, or the timeout configuration needs to be reconsidered (see comment about timeout/retry ordering).
If the intent is to handle timeouts from operations within the placement logic (not from Polly's timeout strategy), those should be caught and wrapped in a different exception type to distinguish them from Polly's timeout.
| TimeoutException => true, | |
| TimeoutException => false, |
| if (firstMessage.IsExpired) | ||
| { | ||
| LogMessageExpiredDuringPlacement(_logger, firstMessage.TargetGrain); | ||
| throw new TimeoutException($"Message expired before placement could complete for grain {firstMessage.TargetGrain}."); | ||
| } |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message expiration check throws a TimeoutException, but the resilience policy is configured to retry TimeoutException (line 71 of OrleansRuntimeResiliencePolicies.cs). This means expired messages will be unnecessarily retried, which is wasteful since they will fail again immediately.
Consider using a different exception type for message expiration (e.g., a custom MessageExpiredException or OperationCanceledException) that won't trigger retries, or exclude this specific case from retry logic.
| if (firstMessage.IsExpired) | ||
| { | ||
| LogMessageExpiredDuringPlacement(_logger, targetGrain); | ||
| throw new TimeoutException($"Message expired before placement could complete for grain {targetGrain}."); | ||
| } |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as the first expiration check: this TimeoutException will be retried by the resilience policy even though the message has already expired. See comment on lines 381-385.
| using System; | ||
| using Microsoft.Extensions.DependencyInjection; | ||
| using Microsoft.Extensions.Logging; | ||
| using Microsoft.Extensions.Options; | ||
| using Orleans.Configuration; | ||
| using Orleans.Runtime.Placement; | ||
| using Polly; | ||
| using Polly.Retry; | ||
| using Polly.Timeout; | ||
|
|
||
| namespace Orleans.Runtime; | ||
|
|
||
| /// <summary> | ||
| /// Resilience policies used by the Orleans runtime. | ||
| /// </summary> | ||
| public static class OrleansRuntimeResiliencePolicies | ||
| { | ||
| /// <summary> | ||
| /// The key used to identify the placement resilience pipeline in <see cref="Polly.Registry.ResiliencePipelineProvider{TKey}"/>. | ||
| /// </summary> | ||
| public const string PlacementResiliencePipelineKey = "Orleans.Placement"; | ||
|
|
||
| /// <summary> | ||
| /// Adds all Orleans runtime resilience policies to the service collection. | ||
| /// </summary> | ||
| /// <param name="services">The service collection.</param> | ||
| /// <returns>The service collection for chaining.</returns> | ||
| internal static IServiceCollection AddOrleansRuntimeResiliencePolicies(IServiceCollection services) | ||
| { | ||
| // Placement resilience pipeline | ||
| services.AddResiliencePipeline(PlacementResiliencePipelineKey, static (builder, context) => | ||
| { | ||
| var options = context.ServiceProvider.GetRequiredService<IOptions<SiloMessagingOptions>>().Value; | ||
| var logger = context.ServiceProvider.GetRequiredService<ILogger<PlacementService>>(); | ||
|
|
||
| builder | ||
| .AddTimeout(new TimeoutStrategyOptions | ||
| { | ||
| Timeout = options.PlacementTimeout, | ||
| OnTimeout = args => | ||
| { | ||
| logger.LogWarning("Grain placement operation timed out after {Timeout}.", options.PlacementTimeout); | ||
| return default; | ||
| } | ||
| }) | ||
| .AddRetry(new RetryStrategyOptions | ||
| { | ||
| MaxRetryAttempts = options.PlacementMaxRetries, | ||
| BackoffType = DelayBackoffType.Exponential, | ||
| Delay = options.PlacementRetryBaseDelay, | ||
| UseJitter = true, | ||
| ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => IsTransientPlacementException(ex)), | ||
| OnRetry = args => | ||
| { | ||
| logger.LogDebug(args.Outcome.Exception, "Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.", args.AttemptNumber, args.RetryDelay); | ||
| return default; | ||
| } | ||
| }); | ||
| }); | ||
|
|
||
| return services; | ||
| } | ||
|
|
||
| /// <summary> | ||
| /// Determines whether an exception is transient and should be retried during placement operations. | ||
| /// </summary> | ||
| private static bool IsTransientPlacementException(Exception exception) => | ||
| exception switch | ||
| { | ||
| OrleansException => true, | ||
| TimeoutException => true, | ||
| OperationCanceledException => false, | ||
| _ => false | ||
| }; | ||
| } |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new resilience features for grain placement lack test coverage. Consider adding tests for:
- Retry behavior on transient failures
- Timeout behavior
- Interaction between timeout and retry policies
- Message expiration scenarios during retries
- Configuration options validation
This is especially important given the complexity of the retry/timeout interaction and the bugs identified in the resilience policy configuration.
| /// This timeout applies to the entire placement operation including retries. | ||
| /// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>. |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation states "This timeout applies to the entire placement operation including retries," which accurately describes the current implementation but may not be the intended behavior.
With the current pipeline order (timeout before retry), a single timeout covers all retry attempts. This means if retries take longer than the timeout duration, the operation will fail even if individual attempts would have succeeded.
If this is intentional, the documentation is correct. However, if each retry attempt should have its own timeout (which is more typical), this documentation should be updated to reflect that change when the pipeline order is corrected. See the comment on OrleansRuntimeResiliencePolicies.cs lines 36-58.
| /// This timeout applies to the entire placement operation including retries. | |
| /// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>. | |
| /// This timeout applies to each individual placement attempt. If a single attempt exceeds the timeout, it will fail and the operation will retry up to <see cref="PlacementMaxRetries"/> times. | |
| /// If all attempts fail due to timeout, the placement operation will fail with a <see cref="TimeoutException"/>. |
| <PackageVersion Include="OpenTelemetry.Instrumentation.Http" Version="1.9.0" /> | ||
| <PackageVersion Include="OpenTelemetry.Instrumentation.Runtime" Version="1.9.0" /> | ||
| <PackageVersion Include="Polly" Version="8.5.2" /> | ||
| <PackageVersion Include="Polly.Extensions" Version="8.6.4" /> |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package version for Polly.Extensions is 8.6.4 while Polly is at 8.5.2. This version mismatch could lead to compatibility issues. Both packages should use the same major.minor version to ensure they work correctly together.
| <PackageVersion Include="Polly.Extensions" Version="8.6.4" /> | |
| <PackageVersion Include="Polly.Extensions" Version="8.5.2" /> |
…onfiguration options
d6580ef to
d5457b0
Compare
Microsoft Reviewers: Open in CodeFlow