Bug 1289986 (Closed) · Opened 8 years ago · Closed 8 years ago

crash-stats.mozilla.org and crash-stats.mozilla.com are resolving with errors

Categories

(Socorro :: Webapp, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: achavez, Unassigned)

Details

Received these 2 alerts from Pingdom; filing this for tracking:

Pingdom Alert: Incident #26016 for crash-stats.mozilla.org (https://crash-stats.mozilla.org/home/products/Firefox), has been assigned to you.

Pingdom Alert: Incident #26015 for crash-stats.mozilla.com (https://crash-stats.mozilla.com/home/products/Firefox), has been assigned to you.

Contacted lonnen & jp in #breakpad:

9:18 PM <ashlee> 
crash-stats.mozilla.org and crash-stats.mozilla.com are resolving with errors: There was an error processing the request: Status: error Error: INTERNAL SERVER ERROR

9:23 PM <lonnen>
confirmed, I am getting alerts from pingdom

9:24 PM <lonnen>  
DeleteNetworkInterface triggered on EC2 by root/mozilla-webeng failed

9:26 PM <lonnen> 
ashlee: good news is that the back end is fine, collection is proceeding, there is no data loss

9:44 PM <ashlee> 
lonnen: what does "DeleteNetworkInterface triggered on EC2 by root/mozilla-webeng failed" mean exactly

9:44 PM <lonnen> 
ashlee: that was a false signal from about 8 hours ago

9:44 PM <lonnen> 
I woke up JP and we're investigating

9:45 PM <lonnen> 
we're going to start rolling reboots of our web servers

9:45 PM <jp> 
ashlee:  that is a standard occurrence, and nothing to be alarmed about.  i'm investigating the problems on the webapp and working to resolve now.

9:45 PM <lonnen> 
well, full VM destruction and reconstruction

9:57 PM <lonnen> 
ashlee: so we've isolated it to calls involving ES

9:59 PM <lonnen> 
and we see a huge increase in latency for writing to Elasticsearch at about the same time as we see a wild swing in ELB latency, which implies ES was slow to return or not returning

11:23 PM <jp>
ashlee:  we are working to activate some of our elasticsearch experts, but it will be in about three hours
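
For context on the correlation described in the 9:59 PM message above, here is a minimal sketch of how one might pull the ELB Latency metric from CloudWatch to line it up against Elasticsearch write timings. The load balancer name, region, and time window are placeholders, not the actual Socorro configuration:

    # Hedged sketch: fetch average ELB latency from CloudWatch for a recent window.
    # "socorro-webapp-elb" and the region are placeholders, not the real deployment values.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=8)

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "socorro-webapp-elb"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Average", "Maximum"],
    )

    # Print datapoints in time order so spikes can be matched against ES write latency.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 3), round(point["Maximum"], 3))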
Thanks for updating, Ashlee! We're going to get up in a couple of hours and work with some folks to try to resolve our ES issues, which are the root cause of the issues being reported in the webapp.

Looks like they're both back up. Can you please verify on your end?
1:58 AM <adrian> 
ashlee: re bug 1289986, the websites seem to be responding correctly, but I am still seeing issues in our cluster. I'm investigating.
I am taking notes about what I'm finding here: https://docs.google.com/a/mozilla.com/document/d/1jTlv7-sIpFv3b0jUFN3kYa7e7Eo0B6ekonyfK2whocc/edit

There are various problems, the most important being that the nodes appear unable to elect a master node. The other is that the 2 most recently added data nodes are not reachable from the admin box.
The failing master node has been stopped, and the cluster elected a new master. Since then, shards have been reallocating; the cluster is yellow and responding, and processing is proceeding. Things are not yet back to normal as of this writing, but they are on their way.

JP will continue investigating the root cause of the issue and update this bug accordingly.
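
As a reference for the kind of check described in the comment above, here is a minimal sketch (placeholder host; not Socorro's actual tooling) of querying cluster health and the elected master over Elasticsearch's REST API:

    # Hedged sketch: check Elasticsearch cluster health, elected master, and
    # unassigned shards while the cluster recovers. ES_HOST is a placeholder.
    import requests

    ES_HOST = "http://localhost:9200"

    health = requests.get(f"{ES_HOST}/_cluster/health", timeout=10).json()
    print("status:", health["status"])                      # green / yellow / red
    print("unassigned shards:", health["unassigned_shards"])
    print("relocating shards:", health["relocating_shards"])

    # _cat/master reports which node currently holds the elected master role.
    master = requests.get(f"{ES_HOST}/_cat/master?format=json", timeout=10).json()
    print("elected master:", master[0]["node"] if master else "none")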
Any updates?
Flags: needinfo?(adrian)
We have identified an issue where new data nodes came up with insufficient memory allocation, and we resolved that in the immediate term on the long-running nodes.

The root cause is that these are in fact long-running nodes, and updates to Puppet did not get applied to the nodes that came up. This led to those nodes throwing JVM errors, and the master stopped its shard management on them. We elected a new master, the cluster stayed yellow indefinitely, and we then noticed the memory issue on the other nodes.
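
As an illustration of the memory check described above, a minimal sketch (placeholder host; not Socorro's actual tooling) of listing each node's configured JVM heap via the nodes stats API, which is one way to spot nodes that came up undersized:

    # Hedged sketch: compare JVM heap limits across Elasticsearch nodes to spot
    # any that booted with an unexpectedly small allocation. ES_HOST is a placeholder.
    import requests

    ES_HOST = "http://localhost:9200"

    stats = requests.get(f"{ES_HOST}/_nodes/stats/jvm", timeout=10).json()

    for node_id, node in stats["nodes"].items():
        heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
        print(f"{node['name']}: heap_max={heap_max_gb:.1f} GiB")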
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Flags: needinfo?(adrian)