Closed Bug 701049 Opened 13 years ago Closed 13 years ago

AMO returning "Service Unavailable" -- Zeus errors

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: tofumatt, Assigned: oremj)

References

()

Details

Attachments

(2 files)

Attached image Screenshot of error
clouserw mentioned there were some Zeus errors on AMO yesterday. Today, whilst trying to log in I was getting REALLY long-running requests that would eventually time out and show the error page attached. After I managed to log in, I'd get it on various pages around the site.

My account on the site is under my email "tofumatt@mozilla.com" if that's in the logs anyplace and will help. This has been happening for a good third of the requests I've made to AMO prod this afternoon.
Username isn't recorded in the logs. Can you give us the exact time frame?
Assignee: server-ops → oremj
Started today (Nov. 9) around 12:30pm Atlantic Time and continued until at least 1:16:04 PM AST. (We're GMT-4.)

I had a fair number of login problems at ~12:35pm.
There have been multiple reports this morning in #amo, but no specific time frames. You can see some traffic dips in Ganglia: https://addons-dev.allizom.org/z/services/graphite/addons

Looks like when we get above 2500 concurrent sessions the graphs start to get shaky.
Assignee: oremj → server-ops
Assignee: server-ops → oremj
There is a pretty steady stream of these errors on AMO prod:

OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (110)")
We're also getting tracebacks on Mozillians prod to the tune of:

OperationalError: (2003, "Can't connect to MySQL server on 'generic-rw-zeus' (111)")
(In reply to Matthew Riley MacPherson [:tofumatt] from comment #5)
> We're also getting tracebacks on Mozillians prod to the tune of:
> 
> OperationalError: (2003, "Can't connect to MySQL server on 'generic-rw-zeus'
> (111)")

Bug 701049 has the Mozillians traceback information attached to it.
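(For reference, and not from the bug itself: on Linux, errno 110 is ETIMEDOUT and errno 111 is ECONNREFUSED, so the 'db-amo-ro' failures look like connections being dropped or timing out, while the 'generic-rw-zeus' ones look like the VIP actively refusing connections. Below is a minimal Python sketch, assuming the MySQLdb driver these Django tracebacks come from, of how error 2003 surfaces and how an app might retry around a flaky load balancer; the credentials and database names are hypothetical.)

import time

import MySQLdb  # driver that raises OperationalError (2003, ...)

def connect_with_retry(host, user, passwd, db, attempts=3, timeout=5):
    """Try to connect, backing off briefly between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            # connect_timeout keeps a dead VIP from hanging the request
            return MySQLdb.connect(host=host, user=user, passwd=passwd,
                                   db=db, connect_timeout=timeout)
        except MySQLdb.OperationalError as exc:
            # e.g. (2003, "Can't connect to MySQL server on 'db-amo-ro' (110)")
            if exc.args[0] != 2003 or attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # 2s, 4s, ... before the next try

# conn = connect_with_retry('db-amo-ro', 'amo_user', 'secret', 'addons')

Retrying only papers over the symptom, of course; the underlying Zeus issue still needs fixing, as discussed below.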
quick update:

We had Zeus devs looking at our cluster in phx1 for most of yesterday. They found some issues in their code that might cause slowness, but I don't think they have identified a root cause yet.

In the meantime, we are working on procuring new servers to test Zeus on; they would be hosted outside of the blade environment and have more bandwidth available. Unfortunately, some components are on backorder (~3 weeks) due to the flooding overseas, so this 'solution' is a ways off.
Whiteboard: Waiting on Zeus support.
I don't think these severe problems are still happening.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Verified FIXED, as best I can tell.
Status: RESOLVED → VERIFIED
Whiteboard: Waiting on Zeus support.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard