Closed Bug 837545 Opened 11 years ago Closed 11 years ago

[AirMozilla] way too many OperationalErrors in Air Mozilla

Tracking

(Not tracked)

Status:

RESOLVED INCOMPLETE

People

(Reporter: peterbe, Assigned: bburton)

Details

Peter Bengtsson [:peterbe]

Reporter

Description

11 years ago

One of the events on Air Mozilla got featured on Hacker News on Friday (1 Feb 2013). This event: https://air.mozilla.org/higgs-jit/
This probably resulted in a fair amount of traffic. 

(Unfortunately, we don't have Google Analytics on Air Mozilla, but that's another story.)

What I was greeted with today (Sunday) was this: http://cl.ly/Mct0
They're all OperationalErrors related to connecting to MySQL. That's well over 80 errors, which is far too many.

It's mainly this one::

  OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")

But also, sometimes this one::
 
  OperationalError: (2003, "Can't connect to MySQL server on 'generic-rw-zeus' (110)")

In https://bugzilla.mozilla.org/show_bug.cgi?id=834516 I made some improvements so that if this happens instead of getting a 500 error you get a 503 error (https://github.com/mozilla/airmozilla/blob/master/airmozilla/base/templates/503.html) that politely asks the user to reload the page in 10 seconds since it's probably just a temporary glitch. 
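For illustration, the 500-to-503 idea described above can be sketched as a small WSGI wrapper. This is not the code from bug 834516 or from the airmozilla repository; `OperationalError` here is a hypothetical stand-in for the driver's exception (in a Django project it would presumably be `django.db.utils.OperationalError`), and `service_unavailable_middleware` and `RETRY_AFTER_SECONDS` are names made up for this sketch.

```python
# Hypothetical stand-in for the database driver's OperationalError.
class OperationalError(Exception):
    pass


# Assumed value, chosen to match the "reload in 10 seconds" wording.
RETRY_AFTER_SECONDS = 10


def service_unavailable_middleware(app):
    """Wrap a WSGI app so a database connection error becomes a 503.

    Only errors raised while the app is being called are caught; errors
    raised later, while a lazy response body is iterated, are not.
    """
    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except OperationalError:
            body = (b"Temporarily unavailable. "
                    b"Please reload the page in 10 seconds.")
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                    ("Retry-After", str(RETRY_AFTER_SECONDS)),
                ],
            )
            return [body]
    return wrapped
```

The `Retry-After` header is an addition of this sketch: it gives well-behaved clients and crawlers the same hint that the 503 page gives human visitors.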

Can we do something to reduce this number? It's happening a bit too often.

If nothing can be done ops-wise, I'd be happy to try to implement a solution in Django that involves some sort of sleep and re-attempt. That won't reduce the number of OperationalErrors, but it will improve the experience for our visitors.
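A minimal sketch of what such a sleep-and-re-attempt could look like, written as a plain Python decorator so it runs without Django. Everything named here is an assumption for illustration: `OperationalError` is a stand-in for the driver's exception, and `retry_on_operational_error`, `attempts`, and `delay` are hypothetical names and values, not anything that exists in Air Mozilla.

```python
import functools
import time


class OperationalError(Exception):
    """Hypothetical stand-in for the database driver's OperationalError."""


def retry_on_operational_error(attempts=3, delay=0.5, sleep=time.sleep):
    """Retry the wrapped callable when it raises OperationalError.

    attempts: total number of tries (assumed value, tune as needed).
    delay: seconds to wait between tries (assumed value).
    sleep: injectable so tests do not have to wait for real.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except OperationalError:
                    if attempt == attempts:
                        # Out of tries: re-raise so the 503 handling applies.
                        raise
                    sleep(delay)
        return wrapper
    return decorator
```

Note the trade-off: each retry keeps a web worker busy while it sleeps, so during a long outage (rather than a brief glitch) this would tie up workers without helping anyone. Small values for `attempts` and `delay` keep that cost bounded.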

Richard A Milewski[:richard]

Comment 1

11 years ago

Stats for that event are at: http://vidly.vm1.labs.sjc1.mozilla.com/stats.php?myLink=3p9l6e&filter=&debug=

Sheeri Cabral [:sheeri]

Comment 2

11 years ago

Those errors all look like they're on Saturday between 16:19 and 16:23. Is that accurate? I believe :solarce was working on changing the backend of the generic database on Saturday from the public zlb cluster to the private zlb cluster (zlb1/2)

Can you expand a bit more on the timing? If it's only that blip of time, it's probably the load balancer move.

Sheeri Cabral [:sheeri]

Comment 3

11 years ago

More details on timing:
wall clock time was 6:00pm-6:26pm

Peter Bengtsson [:peterbe]

Reporter

Comment 4

11 years ago

The first of the error emails came at:
Sun, 03 Feb 2013 02:02:47 -0000         

And last of them:
Sun, 03 Feb 2013 02:23:18 -0000       

02:00 GMT == 18:00 PST, which I guess is 21:00 EST
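The conversion above can be double-checked with a few lines of Python. This sketch uses fixed UTC offsets (PST = UTC-8, EST = UTC-5), which is safe for early February because no daylight saving time is in effect; the variable and function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, no DST in February
EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no DST in February

# Timestamps of the first and last error emails, as quoted above.
first_error = datetime(2013, 2, 3, 2, 2, 47, tzinfo=UTC)
last_error = datetime(2013, 2, 3, 2, 23, 18, tzinfo=UTC)


def fmt(stamp, tz):
    """Render a timestamp in the given zone, including the calendar date."""
    return stamp.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z")


for label, stamp in (("first", first_error), ("last", last_error)):
    print(label, "|", fmt(stamp, PST), "|", fmt(stamp, EST))
```

The point worth noticing is the date rollover: 02:02:47 on Sunday 3 February in GMT is 18:02:47 on Saturday 2 February in PST, so the errors fall on Saturday evening Pacific time.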

:sheeri, I don't know what 16:00 wall clock time you're referring to.

Sheeri Cabral [:sheeri]

Comment 5

11 years ago

6 pm = 18:00 wall clock, and that's for what I was talking about in comment 2 - the maintenance on Saturday evening. 

It was in response to the timeouts you reported and linked to with http://cl.ly/Mct0, which I read as occurring on Saturday 2/2 between 16:19 and 16:23.

Peter Bengtsson [:peterbe]

Reporter

Comment 6

11 years ago

/me blond moment. I saw "6:00pm" but for some reason read it as "16:00". 

So, what do we do? Just write this off as my mistake in planning?

Brandon Burton [:solarce]

Assignee

Comment 7

11 years ago

Those errors were due to planned database maintenance, in which we moved the load balancer VIPs from the external Zeus cluster in PHX1 (now meant for web-facing traffic) to the internal Zeus cluster in PHX1 (meant for internal-only services).

Per https://bugzilla.mozilla.org/show_bug.cgi?id=762373#c9, this was planned to take less than 10 minutes. However, a configuration option was wrong, and troubleshooting made it seem like the external cluster may have been holding onto the IPs. Resolving the issue took approximately 20 minutes, and the outage (sites returning "unable to connect to DB" style errors) lasted 26 minutes in total.

This is not a recurring issue and should not require further work.

Assignee: server-ops-webops → bburton

Status: NEW → RESOLVED

Closed: 11 years ago

Priority: -- → P5

Resolution: --- → INCOMPLETE

Updated

11 years ago

Component: Server Operations: Web Operations → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P5)
