Closed Bug 1497031 Opened Last year Closed 11 months ago

MOC PagerDuty Alerts (wiki.mozilla.org violated Check Failure)

Categories

(Infrastructure & Operations :: SRE, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ingg, Unassigned)

Details

We received a page stating that the wiki.mozilla.org site went down, which it appears to have done. The site will not load; the server is responding to pings, however, attempting to ssh to it hangs:

client-10-48-242-2:~ ieleucu$ ping wiki.mozilla.org
PING wiki-prod-1394614349.us-west-2.elb.amazonaws.com (52.89.171.193): 56 data bytes
64 bytes from 52.89.171.193: icmp_seq=0 ttl=237 time=22.878 ms

client-10-48-242-2:~ ieleucu$ ssh wiki-prod-1394614349.us-west-2.elb.amazonaws.com
ssh: connect to host wiki-prod-1394614349.us-west-2.elb.amazonaws.com port 22: Operation timed out
client-10-48-242-2:~ ieleucu$ ssh wiki.mozilla.org
ssh: connect to host wiki.mozilla.org port 22: Operation timed out
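
A ping reply plus an SSH timeout does not say much about the web service itself, since a load balancer hostname may simply have no listener on port 22. A minimal sketch of a per-port TCP reachability check follows; the function name `tcp_port_open`, the 3-second timeout, and the choice of ports are illustrative assumptions, not part of the runbook.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example (hypothetical usage, requires network access):
#   for port in (22, 443):
#       print(port, tcp_port_open("wiki.mozilla.org", port))
```

Checking the port the service actually listens on (443 is assumed here) separates "host unreachable" from "SSH is not exposed".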

Per runbook https://mana.mozilla.org/wiki/display/MOC/wiki.mozilla.org+Runbook, a Bug has been opened using this template. Proceeding with next steps detailed in the runbook.
Per Daniel, the problem appears to be related to a cert expiration for a backend service. Daniel and Gozer are investigating further.
Timeline of events: 

Saturday, October 6 2018

21:46 - Initial page from New Relic Synthetics for wiki.mozilla.org failing the monitoring check
22:22 - Bug generated, IT CloudOps engaged 
22:32 - danielh responds to escalation
23:44 - danielh attempts fix, requests a re-page/escalation for rest of team, no response

Sunday, October 7 2018

00:11 - danielh potentially able to loop in limed
00:27 - Issue is suspected to be related to an expired SSL cert on backend host
01:16 - EU oncall reports in for status check on outage
02:00 - US2 -> EU oncall shift handoff
02:12 - Re-pinged danielh and limed for updates in #webops slack channel
02:15 - danielh re-requests escalation to gozer, no response from PD, phone, SMS
02:26 - Escalated to Josh Howard to ensure visibility as this is actively blocking work for CI team
02:27 - limed attempts another fix, no favorable result
02:53 - limed informs us that wiki.mozilla.org is reachable, still needs to verify that the issue is fixed entirely
04:00 - limed confirms service is restored

:limed will add details of root cause and fix on Monday, October 9 in the morning. NI'd for reminder. Thanks again :limed!
Info on what happened

Background
-----------
We use consul as a KV store and coordination layer on a nubis account. Consul has a gossip protocol that uses a TLS cert. On top of that, we use confd with consul to generate the config (database connection info, etc.) for wikimo.

What happened
--------------
The consul gossip protocol uses a TLS cert that is generated via terraform and set to expire annually, per security requirement. Generally this cert gets renewed within a certain number of days, and in conjunction with our update cadence for the AWS account, it was always refreshed whenever we updated nubis on an account. However, since we changed our stance on nubis, there have been no updates to the nubis platform, which caused the cert to lapse. The second outage was caused by our automatic rolling of the consul nodes, which wiped all data in the KV store, so we had to do a restore again.

The fix
------------
The fix on our end was to run the account upgrade process to generate a new cert for consul, then kill the consul nodes so that the ASG would roll them. We also had to restore the consul data from backups. Once that was done, we killed the wiki node as well so that the ASG would roll the wiki nodes and consul would be reconfigured on them, which is how we fixed it.
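
As a rough illustration of inspecting a KV backup before restoring it, the sketch below decodes JSON in the shape produced by `consul kv export` (an array of objects with `key`, `flags`, and a base64-encoded `value`). Whether the backups used here were in that format is not stated, so treat the format, the function name `decode_kv_export`, and the sample key as assumptions.

```python
import base64
import json
from typing import Dict


def decode_kv_export(export_json: str) -> Dict[str, str]:
    """Map each exported key to its decoded UTF-8 value."""
    decoded = {}
    for entry in json.loads(export_json):
        # Values are base64-encoded; an empty or null value decodes to "".
        raw = entry.get("value") or ""
        decoded[entry["key"]] = base64.b64decode(raw).decode("utf-8")
    return decoded


# Hypothetical sample entry, not taken from the real cluster.
sample = json.dumps(
    [{"key": "wiki/db/host", "flags": 0,
      "value": base64.b64encode(b"db.example.internal").decode("ascii")}]
)
restored = decode_kv_export(sample)
```

Decoding the export first makes it possible to confirm that the expected keys are present before feeding the file back to the cluster.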

Future improvement
-------------------
We will need to start monitoring our certificates and alerting when the gossip certs are about to expire. We will also need to figure out a better way to restore data on our consul cluster.
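
One possible shape for the expiry alert is sketched below. It assumes the cert's `notAfter` value is already available as a string in the form Python's `ssl` module uses (for example from `getpeercert()`); how that value is read from the consul nodes is left out. The names `days_until_expiry` and `should_alert` and the 30-day threshold are hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter string like 'Oct  6 00:00:00 2018 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def should_alert(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the cert is within warn_days of expiry, or already expired."""
    return days_until_expiry(not_after, now) <= warn_days
```

Passing `now` explicitly keeps the check easy to test; a scheduled job would call it with `datetime.now(timezone.utc)`.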

Stale alert bug

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED