Internal server failures are causing task failures, e.g. "Docker configuration could not be created". This may indicate an authentication error when validating the scopes necessary for running the task.
Categories: (Cloud Services :: Operations: Taskcluster, defect)
Tracking: (Not tracked)
People: (Reporter: aryx, Assigned: brian)
Details
Tasks are failing due to internal server failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&selectedJob=281518160&resultStatus=superseded%2Cretry%2Ctestfailed%2Cbusted%2Cexception&revision=90cf0ce6c916dc56b5b5c0577f468fc5c4e89745
This also prevents gecko decision tasks from getting scheduled; trees are closed.
edunham is investigating.
I was paged on an elevated 500 rate in firefoxci. Sentry (https://sentry.prod.mozaws.net/operations/taskcluster-firefox-ci/) points at the queue, but the queue service shows no anomalous resource usage in the GCP console. Checking Azure, the firefoxci subscription indicates: "This storage account was throttled because it exceeded Azure Storage partition request per second, partition bandwidth, or IP scalability limits." Those with access can view this event at https://portal.azure.com/#@taskclusteraccountsmozilla.onmicrosoft.com/resource/subscriptions/f198c408-dcad-42bd-b3c9-fc1c8d7b3db6/resourceGroups/firefoxcitc/providers/Microsoft.Storage/storageAccounts/firefoxcitc/troubleshoot.
Next steps: investigate why our request rate spiked, and contact Azure support if appropriate.
Updated•6 years ago
The cause of the request-rate spike remains uncertain. Speculation: Taskcluster may not be properly respecting the Retry-After and remaining-resource headers of Azure's responses (see https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshooting-throttling-errors). Tom notes that neither of those headers is mentioned in the Taskcluster source, which is consistent with this assumption and with the behavior we're seeing.
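To illustrate the suspected gap, a client that honors Azure's Retry-After header might retry along the lines of the sketch below. This is not Taskcluster's actual client code; `send` is a hypothetical zero-argument callable standing in for an Azure table-storage request, and the status codes chosen for retry are assumptions.

```python
import time

def request_with_backoff(send, max_retries=5):
    """Retry a request, honoring the Retry-After header when throttled.

    `send` is a hypothetical callable returning (status, headers, body);
    it stands in for one Azure storage request, not Taskcluster's real API.
    """
    for attempt in range(max_retries):
        status, headers, body = send()
        if status not in (429, 500, 503):
            return status, body
        # Azure includes Retry-After (in seconds) on throttled responses;
        # fall back to exponential backoff when the header is absent.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("still throttled after %d retries" % max_retries)
```

A retry loop that ignores Retry-After and retries immediately would instead sustain the elevated request rate, which matches the throttling behavior observed here.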
About 15 minutes ago, I scaled the queue-web service to 0 replicas, waited 5 minutes, and scaled it back to its prior replica count of 25. Although Azure's relevant rate limits are per hour, I expect that a restart might interrupt a retry loop, if the throttling is symptomatic of that kind of bug in Taskcluster. Regardless of the root cause, a bit of time with zero requests to Azure may treat the symptom of the throttling problem (https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits#subscription-and-tenant-limits).
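The scale-down/scale-up described above would look roughly like the following, assuming the service runs as a Kubernetes deployment named `queue-web`. The deployment and namespace names are assumptions for illustration, not the exact commands that were run.

```shell
# Scale queue-web to zero so no requests reach Azure (deployment and
# namespace names are hypothetical), wait five minutes, then restore
# the prior replica count.
kubectl scale deployment/queue-web --replicas=0 -n taskcluster
sleep 300
kubectl scale deployment/queue-web --replicas=25 -n taskcluster
```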
Azure indicates that no throttling is present at this time. I just redeployed the firefoxci cluster (Jenkins build #36) to rule out bad state in the services as a factor in the continuing build failures. As far as I can tell, the services should be in the same state they were in before the throttling outage. According to a conversation with Aryx in Slack, rebuilding a previously successful build succeeds, but some new builds are failing. This leads me to believe that the failures experienced by the new builds may be related to their contents, which would warrant investigation by Taskcluster developers.
Updated•6 years ago (Assignee)

Comment 5•6 years ago (Assignee)
Edunham is on PTO and is handing this off to me now.
Comment 6•6 years ago (Reporter)
After the Azure throttling disappeared, gecko decision tasks were still failing. The backout of https://hg.mozilla.org/ci/ci-configuration/rev/bec92f4b290245448bcb438f9e566c7215ac7c1b fixed this, and trees were reopened at 10:42 UTC (they had been closed at 08:08 UTC).
Updated•6 years ago (Reporter)
Comment 7•6 years ago (bugherder)
Updated•6 years ago (Reporter)
Comment 8•6 years ago (Assignee)
Why was this reopened?
Comment 9•6 years ago (Reporter)
Because it was closed by a commit whose only purpose was to get build and test coverage; it didn't functionally change anything.
Comment 10•6 years ago (Assignee)
I don't think there is any action for me to take now, so I am going to close this again. For follow-up items, see bug 1604649.