Bug 1289986 (Closed) · Opened 8 years ago · Closed 8 years ago

crash-stats.mozilla.org and crash-stats.mozilla.com are resolving with errors

Categories

(Socorro :: Webapp, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: achavez, Unassigned)

Details

Received these 2 alerts from Pingdom; filing this for tracking:

Pingdom Alert: Incident #26016 for crash-stats.mozilla.org (https://crash-stats.mozilla.org/home/products/Firefox), has been assigned to you.

Pingdom Alert: Incident #26015 for crash-stats.mozilla.com (https://crash-stats.mozilla.com/home/products/Firefox), has been assigned to you.

Contacted lonnen & jp in #breakpad:

9:18 PM <ashlee> 
crash-stats.mozilla.org and crash-stats.mozilla.com are resolving with errors: There was an error processing the request: Status: error Error: INTERNAL SERVER ERROR

9:23 PM <lonnen>
confirmed, I am getting alerts from pingdom

9:24 PM <lonnen>  
DeleteNetworkInterface triggered on EC2 by root/mozilla-webeng failed

9:26 PM <lonnen> 
ashlee: good news is that the back end is fine, collection is proceeding, there is no data loss

9:44 PM <ashlee> 
lonnen: what does "DeleteNetworkInterface triggered on EC2 by root/mozilla-webeng failed" mean exactly

9:44 PM <lonnen> 
ashlee: that was a false signal from about 8 hours ago

9:44 PM <lonnen> 
I woke up JP and we're investigating

9:45 PM <lonnen> 
we're going to start rolling reboots of our web servers

9:45 PM <jp> 
ashlee:  that is a standard occurrence, and nothing to be alarmed about.  i'm investigating the problems on the webapp and working to resolve now.

9:45 PM <lonnen> 
well, full VM destruction and reconstruction

9:57 PM <lonnen> 
ashlee: so we've isolated it to calls involving ES

9:59 PM <lonnen> 
and we see a huge increase in latency for writing to Elasticsearch at about the same time as we see a wild swing in ELB latency, which implies ES was slow to return or not returning

11:23 PM <jp>
ashlee:  we are working to activate some of our elasticsearch experts, but it will be in about three hours
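
For context on the correlation described in the 9:59 PM message above, here is a minimal sketch of how one might pull the ELB Latency metric from CloudWatch to line it up against Elasticsearch write timings. The load balancer name, region, and time window are placeholders, not the actual Socorro configuration:

    # Hedged sketch: fetch average ELB latency from CloudWatch for a recent window.
    # "socorro-webapp-elb" and the region are placeholders, not the real deployment values.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=8)

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "socorro-webapp-elb"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Average", "Maximum"],
    )

    # Print datapoints in time order so spikes can be matched against ES write latency.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 3), round(point["Maximum"], 3))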
Thanks for updating, Ashlee! We're going to get up in a couple of hours and work with some folks to try to resolve our ES issues, which are the root cause of the issues being reported in the webapp.

Looks like they're both back up. Can you please verify on your end?
1:58 AM <adrian> 
ashlee: re bug 1289986, the websites seem to be responding correctly, but I am still seeing issues in our cluster. I'm investigating.
I am taking notes about what I'm finding here: https://docs.google.com/a/mozilla.com/document/d/1jTlv7-sIpFv3b0jUFN3kYa7e7Eo0B6ekonyfK2whocc/edit

There are various problems, the most important being that the nodes appear unable to elect a master node. The other is that the 2 most recently added data nodes are not reachable from the admin box.
The failing master node has been stopped, and the cluster elected a new master. Since then, shards have been reallocating; the cluster is yellow and responding, and processing is proceeding. Things are not yet back to normal as of this writing, but they are on their way.

JP will continue investigating the root cause of the issue and update this bug accordingly.
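
As a reference for the kind of check described in the comment above, here is a minimal sketch (placeholder host; not Socorro's actual tooling) of querying cluster health and the elected master over Elasticsearch's REST API:

    # Hedged sketch: check Elasticsearch cluster health, elected master, and
    # unassigned shards while the cluster recovers. ES_HOST is a placeholder.
    import requests

    ES_HOST = "http://localhost:9200"

    health = requests.get(f"{ES_HOST}/_cluster/health", timeout=10).json()
    print("status:", health["status"])                      # green / yellow / red
    print("unassigned shards:", health["unassigned_shards"])
    print("relocating shards:", health["relocating_shards"])

    # _cat/master reports which node currently holds the elected master role.
    master = requests.get(f"{ES_HOST}/_cat/master?format=json", timeout=10).json()
    print("elected master:", master[0]["node"] if master else "none")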
Any updates?
Flags: needinfo?(adrian)
We have identified an issue where new data nodes came up with insufficient memory allocation, and we resolved that in the immediate term on the long-running nodes.

The root cause is that these are in fact long-running nodes, and updates to Puppet did not get applied to the nodes that came up. This led to those nodes throwing JVM errors, and the master stopped its shard management on them. We elected a new master, the cluster stayed yellow indefinitely, and we then noticed the memory issue on the other nodes.
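
As an illustration of the memory check described above, a minimal sketch (placeholder host; not Socorro's actual tooling) of listing each node's configured JVM heap via the nodes stats API, which is one way to spot nodes that came up undersized:

    # Hedged sketch: compare JVM heap limits across Elasticsearch nodes to spot
    # any that booted with an unexpectedly small allocation. ES_HOST is a placeholder.
    import requests

    ES_HOST = "http://localhost:9200"

    stats = requests.get(f"{ES_HOST}/_nodes/stats/jvm", timeout=10).json()

    for node_id, node in stats["nodes"].items():
        heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
        print(f"{node['name']}: heap_max={heap_max_gb:.1f} GiB")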
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Flags: needinfo?(adrian)