Bug 1497031 (Closed): opened last year, closed 11 months ago
Duty Alerts (wiki.mozilla.org violated Check Failure)
We received a page stating that the wiki.mozilla.org site went down, which it appears to have. The site will not resolve; the server responds to pings, but attempting to ssh to it hangs:

client-10-48-242-2:~ ieleucu$ ping wiki.mozilla.org
PING wiki-prod-1394614349.us-west-2.elb.amazonaws.com (126.96.36.199): 56 data bytes
64 bytes from 188.8.131.52: icmp_seq=0 ttl=237 time=22.878 ms
client-10-48-242-2:~ ieleucu$ ssh wiki-prod-1394614349.us-west-2.elb.amazonaws.com
ssh: connect to host wiki-prod-1394614349.us-west-2.elb.amazonaws.com port 22: Operation timed out
client-10-48-242-2:~ ieleucu$ ssh wiki.mozilla.org
ssh: connect to host wiki.mozilla.org port 22: Operation timed out

Per the runbook at https://mana.mozilla.org/wiki/display/MOC/wiki.mozilla.org+Runbook, a bug has been opened using this template. Proceeding with the next steps detailed in the runbook.
Per Daniel, the problem appears to be related to a cert expiration for a backend service. Daniel and Gozer are investigating further.
Timeline of events:

Saturday, October 6 2018
21:46 - Initial page from New Relic Synthetics for wiki.mozilla.org failing the monitoring check
22:22 - Bug generated, IT CloudOps engaged
22:32 - danielh responds to escalation
23:44 - danielh attempts a fix, requests a re-page/escalation for the rest of the team, no response

Sunday, October 7 2018
00:11 - danielh able to loop in limed
00:27 - Issue is suspected to be related to an expired SSL cert on a backend host
01:16 - EU oncall reports in for a status check on the outage
02:00 - US2 -> EU oncall shift handoff
02:12 - Re-pinged danielh and limed for updates in the #webops Slack channel
02:15 - danielh re-requests escalation to gozer; no response via PagerDuty, phone, or SMS
02:26 - Escalated to Josh Howard to ensure visibility, as this is actively blocking work for the CI team
02:27 - limed attempts another fix, no favorable result
02:53 - limed informs us that wiki.mozilla.org is reachable; needs to investigate whether this is fixed entirely
04:00 - limed confirms service is restored

:limed will add details of the root cause and fix on Monday, October 8, in the morning. NI'd for a reminder. Thanks again :limed!
This happened again and triggered an alert: https://mozilla.pagerduty.com/incidents/PEDGGDY
Info on what happened

Background
-----------
We use Consul as a KV store and coordination layer on a Nubis account; Consul has a gossip protocol that uses a TLS cert. On top of that we use confd with Consul to generate the config (database connection info, etc.) for wikimo.

What happened
--------------
The Consul gossip protocol uses a TLS cert that is generated via Terraform and is set to expire annually per security requirements. Generally this cert gets renewed within a certain number of days, and in conjunction with our update cadence for the AWS account it was always refreshed when we updated Nubis on an account. However, since we changed our stance on Nubis there have been no updates to the Nubis platform, which allowed the cert to lapse. The second outage was caused by our auto-rolling of the Consul nodes, which wiped all data in the KV store, so we had to do a restore here again.

The fix
------------
The fix on our end was to run the account upgrade process to generate a new cert for Consul, then kill the nodes so that the ASG could roll them, and restore the Consul data from backups. Once that was done, we killed the wiki node as well to get the ASG to roll the wiki nodes, so that we could reconfigure Consul on them; that is how we fixed it.

Future improvement
-------------------
We need to start monitoring our certificates and alert when the gossip certs are about to expire, and we need to figure out a better way to restore data on our Consul cluster.
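A minimal sketch of the kind of certificate-expiry check the "Future improvement" section calls for. This is not the actual monitoring in place; the host, port, threshold, and `page_oncall` hook are illustrative assumptions:

```python
import ssl
import socket
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Parse an OpenSSL-style notAfter string (e.g. 'Oct 6 21:46:00 2019 GMT')
    and return the number of whole days remaining until that moment."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch the peer certificate's notAfter field via a TLS handshake."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

# Cron-driven usage sketch (threshold and alert hook are hypothetical):
# if days_until_expiry(fetch_not_after("wiki.mozilla.org")) < 30:
#     page_oncall("gossip cert nearing expiry")
```

Checking a self-signed gossip cert (rather than a public one) would need a context that trusts the internal CA, but the expiry arithmetic is the same.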
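The recovery flow described in "The fix" can be sketched as a runbook. This is a hedged outline, not the actual procedure: the Terraform target, instance IDs, and snapshot path are placeholders, and the real account upgrade process wraps more than a single apply:

```shell
#!/bin/sh
# Sketch of the recovery flow under stated assumptions; all names are illustrative.

# 1. Regenerate the expired gossip cert. The account upgrade process runs
#    Terraform, which owns the cert resource (target name is hypothetical).
terraform apply -target=tls_self_signed_cert.consul_gossip

# 2. Terminate a Consul node so the ASG replaces it with an instance that
#    picks up the new cert (instance ID is a placeholder; repeat per node).
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --no-should-decrement-desired-capacity

# 3. Restore the wiped KV data from the most recent backup.
consul snapshot restore /backups/consul-latest.snap

# 4. Roll the wiki node the same way so confd re-renders its config
#    against the restored Consul cluster.
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0fedcba9876543210 \
    --no-should-decrement-desired-capacity
```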
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED