status.mozilla.org didn't accurately reflect "Service Disruption" for SUMO/https://support.mozilla.org

RESOLVED FIXED

Status

--
major
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: stephend, Assigned: ashish)

Attachments

(2 attachments)

Problem: 

tl;dr: the monitoring/alerting powering the status for SUMO on status.mozilla.org needs investigation.

Yesterday, June 10th, https://status.mozilla.org didn't reflect the proper status of "Service Disruption" for https://support.mozilla.org -- it remained green ("Service is operating normally") throughout the HTTP 503 errors.

(Around 2:05pm PDT, according to users in #sumodev, the site started intermittently displaying HTTP 503 - Service Unavailable errors, which lasted until the all clear around 3:15 or so.)

It looks like New Relic picked up the error and alerted, #moc was engaged, Nagios fired, etc. -- so I want to scope this bug to just the disconnect between the "all clear" that https://status.mozilla.org/ displayed and the reality of the site's disruption for almost an hour.

New Relic: https://rpm.newrelic.com/accounts/263620/applications/2779374/downtime (might not be around forever, so I'll attach a few screenshots for posterity).
Created attachment 8621139 [details]
CHK_s4tUIAEsLRC.png_large.png

1) status.mozilla.org status showing green for SUMO
2) site returning HTTP 503 - Service Temporarily Unavailable
3) New Relic showing HTTP 503 errors w/duration
Flags: needinfo?(lypulong)
(Assignee)

Updated

3 years ago
Assignee: nobody → ashish
Status: NEW → ASSIGNED
Flags: needinfo?(lypulong)
(Assignee)

Comment 2

3 years ago
Created attachment 8621191 [details]
pingdom incident list

For support.mozilla.org alerts, Pingdom notifies oncall (via PagerDuty) after 5 minutes of downtime and statushub after 10 minutes. The longest outage was 9 minutes, so Pingdom never alerted statushub.
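To make the thresholds above concrete, here is a minimal sketch of the escalation rule as described in this comment. It is not Pingdom's actual configuration or API; the dictionary and function names are made up for illustration, and only the 5 and 10 minute delays come from the comment.

```python
from datetime import timedelta
from typing import List

# Delays as stated in this comment; the keys are illustrative labels only.
NOTIFY_AFTER = {
    "oncall": timedelta(minutes=5),
    "statushub": timedelta(minutes=10),
}

def channels_notified(outage: timedelta) -> List[str]:
    """Return the channels whose delay has been reached by one continuous outage."""
    return [name for name, delay in NOTIFY_AFTER.items() if outage >= delay]

# The longest continuous outage in this incident was 9 minutes.
print(channels_notified(timedelta(minutes=9)))  # ['oncall']
```

With a 9-minute outage only the oncall delay is reached, which matches what happened: oncall was paged, but statushub was never told.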

The alerting policy seems to be in line with other websites that share the same "criticality".
I agree status.mozilla.org should have reported this. However, I'd like to note that the outage of the site was not complete. Throughout the entire window, the site would sometimes load and sometimes not. This may be why status.mozilla.org was never triggered.
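A rough illustration of why an intermittent outage could slip past a threshold on continuous downtime: if each successful check resets the timer (an assumption here, not something confirmed about Pingdom), what matters is the longest unbroken run of failures, not the total. The helper name and the sample sequence below are invented for this sketch and are not data from the incident.

```python
from itertools import groupby
from typing import Iterable

def longest_down_run(checks: Iterable[bool]) -> int:
    """Length of the longest run of consecutive failed checks (False = failed)."""
    return max(
        (sum(1 for _ in run) for ok, run in groupby(checks) if not ok),
        default=0,
    )

# Invented sample: 60 one-minute checks where the site loads once every 10th minute.
samples = [minute % 10 == 9 for minute in range(60)]

print(samples.count(False))       # 54 failed checks in total
print(longest_down_run(samples))  # 9, which is below a 10-minute threshold
```

In this made-up sequence the site fails 54 of 60 checks, yet no single run reaches 10 minutes, so a rule that waits for 10 minutes of continuous downtime would stay quiet.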
(Assignee)

Comment 4

3 years ago
Yeah, it was just the nature of the incident that made it dodge statusmo. I'm inclined to call this a one-off. I've checked the alerting pipeline and there's nothing wrong with it. I don't see a need to change the alerting policy either.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(In reply to Ashish Vijayaram [:ashish] from comment #4)
> Yeah, it was just the nature of the incident that made it dodge statusmo.
> I'm inclined to call this a one-off. I've checked the alerting pipeline and
> there's nothing wrong with that. I don't find a need to change the alerting
> policy as well.

So if https://www.mozilla.org were down for 9 minutes, intermittently, would status.mozilla.org exhibit the same "all clear/green" status?
Flags: needinfo?(ashish)
(Assignee)

Comment 6

3 years ago
That is correct. statusmo will not change unless the outage is at least 10 mins long.
Flags: needinfo?(ashish)