Skip to content

Conversation

@ReubenBond
Copy link
Member

@ReubenBond ReubenBond commented Nov 25, 2025

Microsoft Reviewers: Open in CodeFlow

Copilot AI review requested due to automatic review settings November 25, 2025 22:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR integrates the Polly resilience framework into Orleans' grain placement system to add retry and timeout capabilities for placement operations, making the system more robust against transient failures.

  • Adds Polly-based resilience pipeline with configurable retry and timeout policies for grain placement
  • Introduces three new configuration options in SiloMessagingOptions: PlacementTimeout, PlacementMaxRetries, and PlacementRetryBaseDelay
  • Refactors placement logic to execute through a resilience pipeline with message expiration checks

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
Directory.Packages.props Adds Polly (8.5.2) and Polly.Extensions (8.6.4) package references
src/Orleans.Core/Orleans.Core.csproj Adds Polly package references to Orleans.Core
src/Orleans.Runtime/Orleans.Runtime.csproj Adds Polly.Extensions package reference to Orleans.Runtime
src/Orleans.Runtime/OrleansRuntimeResiliencePolicies.cs New file defining resilience policies with timeout and retry strategies for placement operations
src/Orleans.Runtime/Configuration/Options/SiloMessagingOptions.cs Adds three new configuration properties for placement resilience with defaults
src/Orleans.Runtime/Hosting/DefaultSiloServices.cs Registers Orleans runtime resilience policies during service configuration
src/Orleans.Runtime/Placement/PlacementService.cs Integrates resilience pipeline into placement workflow with message expiration checks and context struct to avoid allocations

Comment on lines +36 to +58
builder
.AddTimeout(new TimeoutStrategyOptions
{
Timeout = options.PlacementTimeout,
OnTimeout = args =>
{
logger.LogWarning("Grain placement operation timed out after {Timeout}.", options.PlacementTimeout);
return default;
}
})
.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = options.PlacementMaxRetries,
BackoffType = DelayBackoffType.Exponential,
Delay = options.PlacementRetryBaseDelay,
UseJitter = true,
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => IsTransientPlacementException(ex)),
OnRetry = args =>
{
logger.LogDebug(args.Outcome.Exception, "Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.", args.AttemptNumber, args.RetryDelay);
return default;
}
});
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resilience pipeline order is incorrect. The timeout is added before the retry strategy, which means the timeout applies to the entire operation including all retries, not to individual retry attempts. This should be reversed: add retry first, then timeout, so each retry attempt has its own timeout.

builder
    .AddRetry(new RetryStrategyOptions { ... })
    .AddTimeout(new TimeoutStrategyOptions { ... });

With the current order, if PlacementTimeout is 30 seconds and retries take 10 seconds each, the operation would fail after 30 seconds even if retries are still within the retry limit.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's precisely what we want.

exception switch
{
OrleansException => true,
TimeoutException => true,
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying on TimeoutException creates a logical conflict. When a TimeoutException is thrown by Polly's timeout strategy, retrying it will likely just hit the same timeout again. This exception should either not be retried, or the timeout configuration needs to be reconsidered (see comment about timeout/retry ordering).

If the intent is to handle timeouts from operations within the placement logic (not from Polly's timeout strategy), those should be caught and wrapped in a different exception type to distinguish them from Polly's timeout.

Suggested change
TimeoutException => true,
TimeoutException => false,

Copilot uses AI. Check for mistakes.
Comment on lines +381 to +385
if (firstMessage.IsExpired)
{
LogMessageExpiredDuringPlacement(_logger, firstMessage.TargetGrain);
throw new TimeoutException($"Message expired before placement could complete for grain {firstMessage.TargetGrain}.");
}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message expiration check throws a TimeoutException, but the resilience policy is configured to retry TimeoutException (line 71 of OrleansRuntimeResiliencePolicies.cs). This means expired messages will be unnecessarily retried, which is wasteful since they will fail again immediately.

Consider using a different exception type for message expiration (e.g., a custom MessageExpiredException or OperationCanceledException) that won't trigger retries, or exclude this specific case from retry logic.

Copilot uses AI. Check for mistakes.
Comment on lines +408 to +412
if (firstMessage.IsExpired)
{
LogMessageExpiredDuringPlacement(_logger, targetGrain);
throw new TimeoutException($"Message expired before placement could complete for grain {targetGrain}.");
}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as the first expiration check: this TimeoutException will be retried by the resilience policy even though the message has already expired. See comment on lines 381-385.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +75
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Orleans.Configuration;
using Orleans.Runtime.Placement;
using Polly;
using Polly.Retry;
using Polly.Timeout;

namespace Orleans.Runtime;

/// <summary>
/// Resilience policies used by the Orleans runtime.
/// </summary>
public static class OrleansRuntimeResiliencePolicies
{
/// <summary>
/// The key used to identify the placement resilience pipeline in <see cref="Polly.Registry.ResiliencePipelineProvider{TKey}"/>.
/// </summary>
public const string PlacementResiliencePipelineKey = "Orleans.Placement";

/// <summary>
/// Adds all Orleans runtime resilience policies to the service collection.
/// </summary>
/// <param name="services">The service collection.</param>
/// <returns>The service collection for chaining.</returns>
internal static IServiceCollection AddOrleansRuntimeResiliencePolicies(IServiceCollection services)
{
// Placement resilience pipeline
services.AddResiliencePipeline(PlacementResiliencePipelineKey, static (builder, context) =>
{
var options = context.ServiceProvider.GetRequiredService<IOptions<SiloMessagingOptions>>().Value;
var logger = context.ServiceProvider.GetRequiredService<ILogger<PlacementService>>();

builder
.AddTimeout(new TimeoutStrategyOptions
{
Timeout = options.PlacementTimeout,
OnTimeout = args =>
{
logger.LogWarning("Grain placement operation timed out after {Timeout}.", options.PlacementTimeout);
return default;
}
})
.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = options.PlacementMaxRetries,
BackoffType = DelayBackoffType.Exponential,
Delay = options.PlacementRetryBaseDelay,
UseJitter = true,
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => IsTransientPlacementException(ex)),
OnRetry = args =>
{
logger.LogDebug(args.Outcome.Exception, "Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.", args.AttemptNumber, args.RetryDelay);
return default;
}
});
});

return services;
}

/// <summary>
/// Determines whether an exception is transient and should be retried during placement operations.
/// </summary>
private static bool IsTransientPlacementException(Exception exception) =>
exception switch
{
OrleansException => true,
TimeoutException => true,
OperationCanceledException => false,
_ => false
};
}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new resilience features for grain placement lack test coverage. Consider adding tests for:

  • Retry behavior on transient failures
  • Timeout behavior
  • Interaction between timeout and retry policies
  • Message expiration scenarios during retries
  • Configuration options validation

This is especially important given the complexity of the retry/timeout interaction and the bugs identified in the resilience policy configuration.

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +181
/// This timeout applies to the entire placement operation including retries.
/// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>.
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states "This timeout applies to the entire placement operation including retries," which accurately describes the current implementation but may not be the intended behavior.

With the current pipeline order (timeout before retry), a single timeout covers all retry attempts. This means if retries take longer than the timeout duration, the operation will fail even if individual attempts would have succeeded.

If this is intentional, the documentation is correct. However, if each retry attempt should have its own timeout (which is more typical), this documentation should be updated to reflect that change when the pipeline order is corrected. See the comment on OrleansRuntimeResiliencePolicies.cs lines 36-58.

Suggested change
/// This timeout applies to the entire placement operation including retries.
/// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>.
/// This timeout applies to each individual placement attempt. If a single attempt exceeds the timeout, it will fail and the operation will retry up to <see cref="PlacementMaxRetries"/> times.
/// If all attempts fail due to timeout, the placement operation will fail with a <see cref="TimeoutException"/>.

Copilot uses AI. Check for mistakes.
<PackageVersion Include="OpenTelemetry.Instrumentation.Http" Version="1.9.0" />
<PackageVersion Include="OpenTelemetry.Instrumentation.Runtime" Version="1.9.0" />
<PackageVersion Include="Polly" Version="8.5.2" />
<PackageVersion Include="Polly.Extensions" Version="8.6.4" />
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package version for Polly.Extensions is 8.6.4 while Polly is at 8.5.2. This version mismatch could lead to compatibility issues. Both packages should use the same major.minor version to ensure they work correctly together.

Suggested change
<PackageVersion Include="Polly.Extensions" Version="8.6.4" />
<PackageVersion Include="Polly.Extensions" Version="8.5.2" />

Copilot uses AI. Check for mistakes.
@ReubenBond ReubenBond force-pushed the fix/connection-resilience/1 branch from d6580ef to d5457b0 Compare November 26, 2025 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant