Closed Bug 1604422 Opened 6 years ago Closed 6 years ago

internal server failures causing task failures, e.g. Docker configuration could not be created. This may indicate an authentication error when validating scopes necessary for running the task.

Categories

(Cloud Services :: Operations: Taskcluster, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: brian)

Details

I was paged on an elevated 500 rate in firefoxci. Sentry (https://sentry.prod.mozaws.net/operations/taskcluster-firefox-ci/) points at queue, but the queue service shows no anomalous resource usage in the GCP console. Checking Azure, the firefoxci subscription indicates that "This storage account was throttled because it exceeded Azure Storage partition request per second, partition bandwidth, or IP scalability limits." Those with access can view this event at https://portal.azure.com/#@taskclusteraccountsmozilla.onmicrosoft.com/resource/subscriptions/f198c408-dcad-42bd-b3c9-fc1c8d7b3db6/resourceGroups/firefoxcitc/providers/Microsoft.Storage/storageAccounts/firefoxcitc/troubleshoot.

Next steps: investigate why our request rate spiked, and contact Azure support if appropriate.

Summary: trees closed for internal server failures causing task failures, e.g. ocker configuration could not be created. This may indicate an authentication error when validating scopes necessary for running the task. → trees closed for internal server failures causing task failures, e.g. Docker configuration could not be created. This may indicate an authentication error when validating scopes necessary for running the task.

The cause of the request rate spike remains uncertain. Speculation: Taskcluster may not be properly respecting the Retry-After and remaining-resource headers of Azure's responses (based on a look at https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshooting-throttling-errors). Tom notes that neither of those headers is mentioned in the Taskcluster source, which is consistent with this assumption and with the behavior we're seeing.
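For illustration, a minimal sketch of the kind of Retry-After handling that appears to be missing; this is not actual Taskcluster code, and the names (`request_with_backoff`, `send`, `MAX_RETRIES`) are hypothetical. The idea is that on a throttling response (429/503) the client waits however long the server asks via the Retry-After header instead of immediately retrying:

```python
import time

MAX_RETRIES = 5

def request_with_backoff(send, sleep=time.sleep):
    """Call send() until it returns a non-throttled status.

    send() is assumed to return (status, headers, body). On a 429/503
    throttling response, wait for the server-requested Retry-After
    delay (falling back to exponential backoff) before retrying.
    """
    for attempt in range(MAX_RETRIES):
        status, headers, body = send()
        if status not in (429, 503):
            return status, body
        # Prefer the server-provided delay; fall back to exponential backoff.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        sleep(delay)
    raise RuntimeError("still throttled after %d attempts" % MAX_RETRIES)
```

Without this, every client retries immediately on throttled requests, which keeps the storage account pinned over its partition limits and makes the outage self-sustaining.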

About 15 minutes ago, I scaled the queue-web service to 0 replicas, waited 5 minutes, and scaled it back to its prior replica count of 25. Although Azure's relevant rate limits are per hour, I expect that a restart might interrupt a retry loop if this is symptomatic of that kind of bug in TC. Regardless of the root cause, a window with 0 requests to Azure may treat the symptom of the throttling problem (https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits#subscription-and-tenant-limits).

Azure indicates that no throttling is present at this time. I just redeployed the firefoxci cluster (Jenkins build #36) to rule out bad state in the services as a factor in the continuing build failures. As far as I can tell, services should be in the same state they were in before the throttling outage. According to a conversation with Aryx in Slack, retriggering a previously successful build succeeds, but some new builds are failing. This leads me to believe that the failures experienced by the new builds may be related to their contents, which would warrant investigation by Taskcluster developers.

Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/d9958362b9bd No bug - Fix typo to get build and test coverage after previous push lacked them due to infra issue (b 1604422). CLOSED TREE
Assignee: nobody → bpitts
Status: NEW → ASSIGNED

Edunham is on PTO and is handing this off to me now.

After the Azure throttling disappeared, gecko decision tasks were still failing. The backout of https://hg.mozilla.org/ci/ci-configuration/rev/bec92f4b290245448bcb438f9e566c7215ac7c1b fixed them, and trees were reopened at 10:42 UTC (they had been closed at 08:08 UTC).

Severity: blocker → normal
Summary: trees closed for internal server failures causing task failures, e.g. Docker configuration could not be created. This may indicate an authentication error when validating scopes necessary for running the task. → internal server failures causing task failures, e.g. Docker configuration could not be created. This may indicate an authentication error when validating scopes necessary for running the task.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 73
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: Firefox 73 → ---

Why was this reopened?

Because it was closed by a commit whose only purpose was to get build and test coverage and which didn't functionally change anything.

I don't think there is any action for me to take now, so I am going to close this again. For follow-up items see bug 1604649.

Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED