Closed Bug 1593896 · Opened 6 years ago · Closed 6 years ago

Nov 4, 2019 Outage of firefox-ci and community clusters

Categories

(Taskcluster :: Operations and Service Requests, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bstack, Assigned: bstack)

Details

We're getting lots of azure errors, specifically OperationTimedOutError and ServerBusyError. This started at 2:34 UTC tonight. We are investigating more.

Assignee: nobody → bstack
Status: NEW → ASSIGNED
Priority: -- → P2

About an hour later, everything started working again for no apparent reason. We should investigate this further once we have access to the Azure dashboard, but for now I think we are OK.

It has started flaking again. Let's pick this up in the morning though.

Per https://status.azure.com/en-us/status: "Starting at 02:33 UTC on 05 Nov 2019 a subset of customers using Virtual Machines in West US 2 may experience connection failures when trying to access some Virtual Machines hosted in the region. These Virtual Machines may have also restarted unexpectedly. Engineers are aware of this issue and are actively investigating. The next update will be provided in 60 minutes, or as events warrant.

This message was last updated at 4:33 AM UTC on November 5, 2019."

If these issues persist while Azure reports that all its services are OK, then there's a chance the problem is on our side and is something we can fix. But based on the reported timings, I strongly suspect this is a symptom of the known issues Azure is experiencing.

Azure is all green checkmarks again, and both deployments are working again. So, yep, this was an upstream outage.

That Azure under-reported the severity of the issue, did not report it in a timely fashion, and perhaps didn't even realize that this VM-related issue affected their storage APIs does not surprise me, given our experience with Azure.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

(Looks like this resolved around 10:15am UTC, for a total of about 7.75 hours)
