Closed Bug 1497031 Opened Last year Closed 11 months ago

MOC PagerDuty Alerts ( violated Check Failure)


(Infrastructure & Operations :: SRE, task)






(Reporter: ingg, Unassigned)


We received a page stating that the site went down, which it appears to have. The site will not resolve; the server responds to pings, but attempts to ssh to it time out:

client-10-48-242-2:~ ieleucu$ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=237 time=22.878 ms

client-10-48-242-2:~ ieleucu$ ssh
ssh: connect to host port 22: Operation timed out
client-10-48-242-2:~ ieleucu$ ssh
ssh: connect to host port 22: Operation timed out

Per runbook, a Bug has been opened using this template. Proceeding with next steps detailed in the runbook.
Per Daniel, the problem appears to be related to a cert expiration for a backend service. Daniel and Gozer are investigating further.
Timeline of events: 

Saturday, October 6 2018

21:46 - Initial page from New Relic Synthetics for failing the monitoring check
22:22 - Bug generated, IT CloudOps engaged 
22:32 - danielh responds to escalation
23:44 - danielh attempts fix, requests a re-page/escalation for rest of team, no response

Sunday, October 7 2018

00:11 - danielh potentially able to loop in limed
00:27 - Issue is suspected to be related to an expired SSL cert on backend host
01:16 - EU oncall reports in for status check on outage
02:00 - US2 -> EU oncall shift handoff
02:12 - Re-pinged danielh and limed for updates in #webops slack channel
02:15 - danielh re-requests escalation to gozer, no response from PD, phone, SMS
02:26 - Escalated to Josh Howard to ensure visibility as this is actively blocking work for CI team
02:27 - limed attempts another fix, no favorable result
02:53 - limed informs us that the site is reachable, still needs to verify that it is fixed entirely
04:00 - limed confirms service is restored

:limed will add details of the root cause and fix on Monday, October 8, in the morning. NI'd for reminder. Thanks again :limed!
Info on what happened

We use consul as a KV store and coordination layer on a nubis account. Consul has a gossip protocol that uses a TLS cert. On top of that, we use confd with consul to generate the config (database connection info, etc.) for wikimo.

What happened
The consul gossip protocol uses a TLS cert that is generated via terraform and is set to expire annually per security requirement. Normally the cert is renewed once it is within a certain number of days of expiry, and because that renewal happens whenever we update nubis on an AWS account, our update cadence always kept it current. However, since we changed our stance on nubis there have been no updates to the nubis platform, which caused the cert to lapse. The second outage was caused by our automatic rolling of the consul nodes, which wiped all data in the KV store, so we had to do a restore again.
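
To illustrate the failure mode described above, here is a minimal sketch of a cert that is only renewed when an update run happens to fall inside its renewal window. The 60-day window, the function name and the dates are assumptions for illustration; the report only says "a certain number of days" and does not describe how nubis implements renewal.

```python
from datetime import date, timedelta
from typing import Iterable

VALIDITY = timedelta(days=365)     # "set to expire annually" per the report
RENEW_WINDOW = timedelta(days=60)  # hypothetical; the real window is not stated


def cert_lapses(issued: date, update_runs: Iterable[date], until: date) -> bool:
    """Return True if the cert has expired by `until`.

    An update run renews the cert only when it happens before expiry
    and inside the renewal window that precedes expiry.
    """
    expires = issued + VALIDITY
    for run in sorted(update_runs):
        if run >= expires:
            break  # the cert had already expired before this run
        if run >= expires - RENEW_WINDOW:
            expires = run + VALIDITY  # renewed as a side effect of the update
    return expires <= until


# Updates stop months before the window opens, so nothing renews the cert.
stalled = [date(2017, 12, 1), date(2018, 3, 1)]
# Same history plus one update inside the window.
regular = stalled + [date(2018, 9, 1)]

print(cert_lapses(date(2017, 10, 6), stalled, date(2018, 10, 7)))
print(cert_lapses(date(2017, 10, 6), regular, date(2018, 10, 7)))
```

The point of the sketch is that renewal here is a side effect of the update cadence, so stopping updates silently removes the only thing that refreshes the cert.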

The fix
The fix on our end was to run the account upgrade process to generate a new cert for consul, then kill the consul nodes so that the ASG would roll them; we also had to restore the consul data from backups. Once that was done, we killed the wiki node as well so that the ASG would roll the wiki nodes and reconfigure consul on them, which is how we fixed it.

Future improvement
We need to start monitoring our certificates and alert when the gossip certs are about to expire. We also need to figure out a better way to restore data on our consul cluster.
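
A minimal sketch of the kind of expiry check this could start from, in Python. It assumes the cert's notAfter timestamp has already been read from wherever the gossip cert lives (not shown, since the report does not say where that is). The 30-day threshold and the function names are hypothetical.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical alert threshold, not taken from the report


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days left before a cert's notAfter timestamp.

    `not_after` uses the format found in decoded certificates,
    e.g. "Oct 20 12:00:00 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def should_alert(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the cert has expired or expires within `warn_days` days."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    not_after = "Oct 20 12:00:00 2018 GMT"  # made-up example value
    now = datetime(2018, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(days_until_expiry(not_after, now), should_alert(not_after, now))
```

Run on a schedule and wired to the existing paging path, a check like this would fire well before the cert lapses rather than after the site goes down.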

Stale alert bug

Closed: 11 months ago
Resolution: --- → FIXED