Closed Bug 743976 Opened 12 years ago Closed 12 years ago

"Lost connection to MySQL server during query" errors on some buildbot masters

Categories

(Data & BI Services Team :: DB: MySQL, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: cshields)

Details

(Whiteboard: [buildduty][buildmasters])

While doing some normal work, I tried to load three buildbot masters (buildbot-master07, 08, and 12). All three of them hung for about 10 minutes, and then came back with a long traceback ending in:
<class '_mysql_exceptions.OperationalError'>: (2013, 'Lost connection to MySQL server during query')

It's hard to tell what impact this is having, since I can't load the buildbot master webpages, but it could be a tree closing event.
More debugging info:

The pattern seems to be that the buildbot servers *outside* of scl3 are the ones having the issue. All the servers are configured to use tm-b01-master01.mozilla.org. The ones in sjc1 and scl1 started showing errors at 03:00 PDT. Telnetting to the MySQL port from a machine throwing errors actually connects.
buildbot-master10 started showing a different error at 03:07:

_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
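
The telnet check mentioned above can be scripted so it is easy to repeat from each affected master. This is only a minimal sketch: `can_connect` is a hypothetical helper name, and port 3306 in the usage comment is the MySQL default rather than something stated in this bug. A successful TCP connect only shows that the port is reachable; it does not rule out sessions being dropped later in the path.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the "with" block closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example (assumes the default MySQL port, which the bug does not state):
# can_connect("tm-b01-master01.mozilla.org", 3306)
```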
Given the error pattern, the only guess I have is that maybe we're hitting maximum connections per host or something? bm07/bm08/bm12 are all build masters, which are probably doing l10n nightlies right now - which is one of our busiest periods of the day.
I haven't been able to repro the web interface symptom at the moment on two of the scl1 test masters, bm04 and bm06, but I see some errors on bm06 in the last hour.
There was an existing 5-minute session timeout in Zeus that appeared to be tripped here. I took out all timeouts and all per-client limitations in Zeus, and these problems have gone away.
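
For illustration only: the fix applied here was on the Zeus side, but a client sitting behind a proxy with a session timeout could also protect itself by reopening connections that have sat unused for too long. The sketch below assumes the timeout applies to idle sessions, which this bug does not confirm. `RecyclingConnection`, its `connect` factory argument, and the 240-second threshold are hypothetical names and values, not buildbot or MySQLdb API.

```python
import time


class RecyclingConnection:
    """Hand out a connection, reopening it once it has been idle too long."""

    def __init__(self, connect, max_idle_seconds, clock=time.monotonic):
        self._connect = connect            # hypothetical zero-argument factory
        self._max_idle = max_idle_seconds  # keep this below the proxy timeout
        self._clock = clock                # injectable so it can be tested
        self._conn = None
        self._last_used = None

    def get(self):
        now = self._clock()
        stale = (
            self._conn is None
            or now - self._last_used >= self._max_idle
        )
        if stale:
            if self._conn is not None:
                try:
                    self._conn.close()
                except Exception:
                    pass  # the proxy may already have dropped the session
            self._conn = self._connect()
        self._last_used = now
        return self._conn


# Example (hypothetical factory; 240 s stays under a 5-minute timeout):
# pool = RecyclingConnection(lambda: MySQLdb.connect(host="..."), 240)
# conn = pool.get()
```

This only covers connections that go stale between queries. A query that itself runs longer than the proxy timeout would still need the timeout raised or removed, as was done here.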
Assignee: server-ops-database → cshields
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
Whiteboard: [buildduty][buildmasters]
reducing to major. I haven't seen the error yet; will mark as confirmed if that holds true for the next 30 or so minutes
Severity: critical → major
bah - did not realize that bugzilla reset the resolved flag - fixing

and it's confirmed also :)
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team