Closed Bug 837545 Opened 11 years ago Closed 11 years ago

[AirMozilla] way too many OperationalErrors in Air Mozilla

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P5)

x86
macOS

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: peterbe, Assigned: bburton)

Details

One of the events on Air Mozilla got featured on Hacker News on Friday (1 Feb 2013). This event: https://air.mozilla.org/higgs-jit/
This probably resulted in a fair amount of traffic. 

(Unfortunately we don't have Google Analytics on Air Mozilla, but that's another story.)

What I was greeted with today on Sunday was this: http://cl.ly/Mct0
They're all OperationalErrors related to connecting to MySQL. That's well over 80 errors, which is way too many.

It's mainly this one::

  OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")

But also, sometimes this one::
 
  OperationalError: (2003, "Can't connect to MySQL server on 'generic-rw-zeus' (110)")

In https://bugzilla.mozilla.org/show_bug.cgi?id=834516 I made some improvements so that when this happens, instead of a 500 error you get a 503 error (https://github.com/mozilla/airmozilla/blob/master/airmozilla/base/templates/503.html) that politely asks the user to reload the page in 10 seconds, since it's probably just a temporary glitch.
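
For illustration only, here is a minimal sketch of the 500-to-503 idea as a plain WSGI wrapper. It is not the code from bug 834516 (that lives in the Django app). The `OperationalError` class is a stand-in for the database driver's exception, and `service_unavailable_middleware` and `retry_after` are hypothetical names.

```python
import sys


class OperationalError(Exception):
    """Stand-in for the database driver's OperationalError (assumption)."""


def service_unavailable_middleware(app, retry_after=10):
    """Wrap a WSGI app so database connection errors become a 503 response.

    Hypothetical helper: only errors raised before the wrapped app returns
    its response iterable are caught here.
    """
    body = (
        "Temporarily unavailable. Please reload the page in %d seconds."
        % retry_after
    ).encode("utf-8")

    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except OperationalError:
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                # Hint to clients and proxies when a retry is reasonable.
                ("Retry-After", str(retry_after)),
            ]
            # exc_info lets the server replace headers the app may have set.
            start_response("503 Service Unavailable", headers, sys.exc_info())
            return [body]

    return wrapped
```

The `Retry-After` header mirrors the "reload in 10 seconds" message, so well-behaved clients get the same hint as human visitors.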

Can we do something to reduce this number? It's happening a bit too often.

If nothing can be done ops-wise, I'd be happy to try to implement a solution in Django that involves some sort of sleep and re-attempt. That won't reduce the number of OperationalErrors, but it will improve the experience for our visitors.
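
A minimal sketch of what such a sleep-and-re-attempt could look like, written as a plain Python decorator rather than actual Django code. The `OperationalError` class is a stand-in for the real database exception, and `retry_on_operational_error`, `attempts`, `delay` and `sleep` are hypothetical names chosen for this example.

```python
import functools
import time


class OperationalError(Exception):
    """Stand-in for the database driver's OperationalError (assumption)."""


def retry_on_operational_error(attempts=3, delay=1.0, sleep=time.sleep):
    """Retry the wrapped callable when it raises OperationalError.

    Hypothetical helper: `attempts` is the total number of tries and
    `delay` is the pause in seconds between them.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except OperationalError:
                    if attempt == attempts:
                        # Out of attempts: re-raise so the 503 page is shown.
                        raise
                    sleep(delay)
        return wrapper
    return decorator
```

Each failed attempt still raises an OperationalError internally, so the error count in the logs would not drop. The visitor only sees the 503 page if every attempt fails.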
Those errors all look like they're on Saturday between 16:19 and 16:23. Is that accurate? I believe :solarce was working on changing the backend of the generic database on Saturday from the public zlb cluster to the private zlb cluster (zlb1/2)

Can you expand a bit more on the timing? If it's only that blip of time, it's probably the load balancer move.
More details on timing:
wall clock time was 6:00pm-6:26pm
First of the error emails came at 
Sun, 03 Feb 2013 02:02:47 -0000         

And last of them:
Sun, 03 Feb 2013 02:23:18 -0000       

02:00 GMT == 18:00 PST, which I guess is 21:00 EST
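
The conversion can be double-checked with a short script. This is only a sketch: it uses fixed UTC offsets for PST (-8) and EST (-5), which is an assumption that holds in early February when no daylight saving time applies.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Fixed winter offsets (assumption: no daylight saving time in February).
PST = timezone(timedelta(hours=-8), "PST")
EST = timezone(timedelta(hours=-5), "EST")

# Timestamps of the first and last error emails, as quoted above.
first = datetime(2013, 2, 3, 2, 2, 47, tzinfo=UTC)
last = datetime(2013, 2, 3, 2, 23, 18, tzinfo=UTC)


def show(moment, tz):
    """Format a timezone-aware datetime in the given zone."""
    return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z")


for label, moment in (("first", first), ("last", last)):
    print(label, "|", show(moment, PST), "|", show(moment, EST))

print("span between first and last email:", last - first)
```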

:sheeri, I don't know what 16:00 wall clock time you're referring to.
6 pm = 18:00 wall clock, and that's for what I was talking about in comment 2 - the maintenance on Saturday evening. 

It was in response to the timeouts you reported and linked to with http://cl.ly/Mct0 - which I see as on Saturday 2/2 between 16:19 and 16:23.
/me blond moment. I saw "6:00pm" but for some reason read it as "16:00". 

So, what do we do? Just write this off as my mistake in planning?
Those errors were due to planned database maintenance where we moved the load balancer VIPs from the external Zeus cluster (meant for web-facing traffic now) in PHX1 to the internal Zeus cluster (meant for internal-only services) in PHX1.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=762373#c9 this was planned to take less than 10 minutes. However, a configuration option was wrong, and troubleshooting made it seem like the external cluster may have been holding onto the IPs, so resolving the issue took approx 20 minutes and the outage (sites returning "unable to connect to DB" style errors) lasted for 26 minutes total.

This is not a recurring issue and should not require further work.
Assignee: server-ops-webops → bburton
Status: NEW → RESOLVED
Closed: 11 years ago
Priority: -- → P5
Resolution: --- → INCOMPLETE
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard