Fix OAuth token refresh race condition and message delivery timeout #465
Conversation
Add connection generation tracking to prevent concurrent refresh thrashing when OAuth tokens expire. Multiple threads now check whether another thread has already refreshed the connection, avoiding the redundant refreshes that caused requests to remain stuck in a pending state. Co-authored-by: benhegartysefe <[email protected]>
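A minimal sketch of the generation-check pattern this commit describes, assuming a `SemaphoreSlim`-guarded refresh path; the member names (`_connectionGeneration`, `RecreateConnectionAsync`) are illustrative, not the actual Dorc code:

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class ConnectionRefresher
{
    private readonly SemaphoreSlim _refreshSemaphore = new(1, 1);
    private long _connectionGeneration;

    // Hypothetical hook that closes and rebuilds the RabbitMQ connection.
    private Task RecreateConnectionAsync() => Task.CompletedTask;

    public async Task ForceConnectionRefreshAsync()
    {
        // Capture the generation BEFORE waiting on the semaphore.
        long observedGeneration = Interlocked.Read(ref _connectionGeneration);

        await _refreshSemaphore.WaitAsync();
        try
        {
            // If another thread refreshed while we were waiting, the
            // generation has moved on and our refresh would be redundant.
            if (Interlocked.Read(ref _connectionGeneration) != observedGeneration)
                return;

            await RecreateConnectionAsync();
            Interlocked.Increment(ref _connectionGeneration);
        }
        finally
        {
            _refreshSemaphore.Release();
        }
    }
}
```

The semaphore still serializes the actual refresh; the generation check only lets late arrivals skip work that has already been done.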
@copilot Noticed odd behaviour; here is what I see in the monitor logs: [2026-01-15 16:44:33.429] [INFOR Dorc.Monitor.ComponentProcessor:] [T:15] - Processing of the PowerShell component '135 - Start Spread & Endur' failed.
Reorder operations to create consumer before publishing lock message. This prevents timeouts when messages are published before consumers exist. The consumer must exist when the message is published to ensure immediate delivery on quorum queues. Co-authored-by: benhegartysefe <[email protected]>
Fixed in commit 09e071a. The timeout issue was caused by publishing lock messages before setting up the consumer. Reordered the operations to create the consumer first, then publish; this ensures messages are delivered immediately on quorum queues.
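For illustration, a sketch of the reordered flow using the RabbitMQ.Client 6.x API; the queue name and payload are placeholders, not the service's real lock protocol:

```csharp
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

public static class LockFlowSketch
{
    public static void AcquireLock(IModel channel, string lockQueue)
    {
        // 1. Register the consumer FIRST, so the broker has a delivery target.
        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, _) =>
        {
            // Lock message received; acquisition proceeds from here.
            // (Ack is deferred until the lock is released - not shown.)
        };
        channel.BasicConsume(queue: lockQueue, autoAck: false, consumer: consumer);

        // 2. Only then publish the lock message. On quorum queues this means
        //    the message is delivered immediately instead of sitting in the
        //    queue while the waiter's timeout runs down.
        var body = Encoding.UTF8.GetBytes("lock");
        channel.BasicPublish(exchange: "", routingKey: lockQueue,
                             basicProperties: null, body: body);
    }
}
```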
@copilot What I got on restart yesterday:
MonitorNonProd_20260119_DORC_UT2_1st_node.log MonitorNonProd_20260119_DORC_UT2_2nd_node.log
There are token timeouts and lock acquisition errors in the logs.
@copilot [2026-01-19 17:34:33.648] [WARNI Dorc.Monitor.RequestProcessors.PendingRequestProcessor:] [T:16] - Cannot create deployment result since duplicate component with id '542' is detected.
The log excerpt shows normal deployment processing with a duplicate component warning, which is expected behavior when a component appears multiple times in a build definition. This warning doesn't prevent deployment completion. The "Pending" status in the screenshot appears to be a different issue unrelated to the OAuth/lock fixes. Could you share the specific token timeout or lock acquisition error messages you're seeing? I need to see errors like "ACCESS_REFUSED", "OAuth token may have expired", or "Timeout waiting for lock message" to diagnose issues with my changes.
…oken refresh
- Fix inverted logic in RemoveCancellationTokenSource that was incorrectly logging successful removals as errors (changed to debug level for the not-found case)
- Add proactive OAuth token refresh 5 minutes before expiry to prevent ACCESS_REFUSED errors during lock acquisition
- Track token expiry time and check it before each lock acquisition attempt

Co-Authored-By: Claude Opus 4.5 <[email protected]>
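A hedged sketch of the proactive-refresh check this commit describes; the token source and the one-hour lifetime are assumptions for illustration, not the real Dorc OAuth client:

```csharp
using System;
using System.Threading.Tasks;

public sealed class OAuthTokenGuard
{
    private static readonly TimeSpan RefreshMargin = TimeSpan.FromMinutes(5);
    private DateTimeOffset _tokenExpiresAt;

    // Hypothetical token acquisition; in reality this would call the
    // OAuth endpoint and update the RabbitMQ connection credentials.
    private Task RefreshTokenAsync()
    {
        _tokenExpiresAt = DateTimeOffset.UtcNow.AddHours(1);
        return Task.CompletedTask;
    }

    // Called before each lock acquisition attempt: refresh proactively
    // once the token is within five minutes of expiry, so the broker
    // never answers with ACCESS_REFUSED mid-acquisition.
    public async Task EnsureFreshTokenAsync()
    {
        if (DateTimeOffset.UtcNow >= _tokenExpiresAt - RefreshMargin)
            await RefreshTokenAsync();
    }
}
```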
Update test mocks to include the new IMonitorConfiguration dependency added to DeploymentEngine constructor. Co-Authored-By: Claude Opus 4.5 <[email protected]>
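For context, a minimal sketch of what such a mock update typically looks like with Moq and xUnit; the single-argument `DeploymentEngine` constructor and the stand-in interface exist only to keep the sketch self-contained, as the real constructor takes additional dependencies:

```csharp
using Moq;
using Xunit;

// Stand-in so the sketch compiles on its own.
public interface IMonitorConfiguration { }

public class DeploymentEngine
{
    public DeploymentEngine(IMonitorConfiguration config) { }
}

public class DeploymentEngineTests
{
    [Fact]
    public void Constructor_Accepts_MonitorConfiguration()
    {
        // New dependency added by this PR: every existing test fixture
        // now supplies an IMonitorConfiguration mock as well.
        var monitorConfig = new Mock<IMonitorConfiguration>();

        var engine = new DeploymentEngine(monitorConfig.Object);

        Assert.NotNull(engine);
    }
}
```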

When OAuth tokens expire, concurrent lock acquisition failures trigger simultaneous connection refresh attempts. Each waiting thread was blindly closing and recreating the connection after acquiring the semaphore, causing connection thrashing and preventing requests from processing.
Additionally, lock messages were being published before consumers were set up, causing continuous timeout warnings when queues are recreated after deletion.
Changes
Added connection generation tracking to `RabbitMqDistributedLockService`:
- `connectionGeneration` counter tracks refresh cycles
- `ForceConnectionRefreshAsync` skips the refresh if the generation changed (another thread already refreshed)

Reordered lock acquisition operations to prevent message delivery timeouts:
- OAuth refresh flow: a thread that fails lock acquisition captures the current generation, waits on the refresh semaphore, and refreshes the connection only if no other thread has advanced the generation in the meantime
- Message delivery flow: the consumer is created first, then the lock message is published, so quorum queues can deliver it immediately
The generation check prevents redundant refreshes while preserving thread safety via the existing semaphore. The reordered operations ensure reliable message delivery on quorum queues.
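Taken together, the acquisition path might look like the following sketch; all member names are illustrative, and each stub stands in for the behaviour described above:

```csharp
using System;
using System.Threading.Tasks;

public sealed class LockAcquisitionSketch
{
    // Illustrative stand-ins for the pieces discussed in this PR.
    private Task EnsureFreshTokenAsync() => Task.CompletedTask;        // proactive OAuth refresh
    private Task CreateConsumerAsync(string queue) => Task.CompletedTask;
    private Task PublishLockMessageAsync(string queue) => Task.CompletedTask;
    private Task<bool> WaitForLockMessageAsync(TimeSpan timeout) => Task.FromResult(true);
    private Task ForceConnectionRefreshAsync() => Task.CompletedTask;  // generation-checked

    public async Task<bool> TryAcquireLockAsync(string queue, TimeSpan timeout)
    {
        // Refresh the token before it can expire mid-acquisition.
        await EnsureFreshTokenAsync();

        // Consumer first, then publish, so delivery is immediate.
        await CreateConsumerAsync(queue);
        await PublishLockMessageAsync(queue);

        if (await WaitForLockMessageAsync(timeout))
            return true;

        // On failure (e.g. ACCESS_REFUSED after token expiry), refresh the
        // connection; the generation check inside makes concurrent callers
        // coalesce into a single refresh instead of thrashing.
        await ForceConnectionRefreshAsync();
        return false;
    }
}
```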