Closed
Bug 555880
Opened 14 years ago
Closed 14 years ago
[AMO] Allow more connections to database servers
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: clouserw, Assigned: tellis)
References
Details
I've been getting tracebacks every day or two complaining that mysql is refusing connections because there are "too many connections." I don't know whether it's master or slave, but what are the current settings on all the boxes? Also, I looked in munin at tm-amo01-master01 but all the mysql charts are blank (NaN stuff). Have you guys been getting paged about this lately?
Comment 1•14 years ago
Is "too many connections" definitely coming from tm-c01-master01? Usually bad things are happening if max connections has been reached. Dave, do we need to add more capacity? I haven't been watching the dbs.
Comment 3•14 years ago
Nope, no pages lately. Most likely the problem only happens briefly and goes away faster than nagios can page about it. Do you get a flurry of them at once and then they're done?

The AMO database servers are all set at max_connections=1200 across the board. There are three slaves, so that's 3600 slave connections and 1200 master. Given that we now have 26 webheads for AMO, an average of 46 connections to the master from each webhead is enough to hit that limit, which doesn't sound implausible.

Not sure what the best way to scale this is. max_connections was set to its current value after monitoring memory usage and figuring out where the server would fall over under normal load. On the other hand, we've done a lot of fine-tuning since then to the types of queries that get run and to the way we're using the hardware caches on the RAID cards, so there's probably some room to experiment again. I'm also realizing that because of cached connections and whatnot, this problem will only get worse once we have an additional 20 webheads in Phoenix hitting the same master server.
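For reference, the limit and current usage described above can be checked directly on each MySQL server. A diagnostic sketch (these are standard MySQL 5.x statements, not taken from the actual AMO runbooks):

```
-- Configured connection ceiling on this server (1200 here, per the above)
SHOW VARIABLES LIKE 'max_connections';

-- Connections currently open, and the high-water mark since last restart
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
```

Comparing Max_used_connections against max_connections over time shows how close a server has actually come to refusing connections.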
Comment 4•14 years ago
(In reply to comment #1)
> Is "too many connections" definitely coming from tm-c01-master01? Usually bad
> things are happening if max connections has been reached. Dave, do we need to
> add more capacity? I haven't been watching the dbs.

We're seeing it happen in cron jobs connecting to master. There have also been tales of blank pages in the devcp. I wonder if that's also a too-many-connections issue?
Reporter
Comment 5•14 years ago
The reason I filed this bug is that the Paypal connector failed last night because of this, but it's true it's becoming more common. We've had at least 3 failures already today.
Comment 6•14 years ago
Just caught it in progress; nagios actually paged for it this time. 80% of the 1200 queries in progress were:

SELECT `Config`.`key`, `Config`.`value` FROM `config` AS `Config` WHERE 1 = 1

All of them had been running for over 4 minutes and had started within about 20 seconds of each other. The oldest queries were updates to the CAKE session table. Since most of the queries had been running for multiple minutes, I'm guessing the only reason it backed up is that it was running slow. It might be some specific query that locked it all up, but with most of the tables being InnoDB I wouldn't expect lock contention...
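A sketch of how a pile-up like this can be inspected while it's happening. `SHOW FULL PROCESSLIST;` gives the raw dump; on MySQL 5.1+ the same data can be filtered through INFORMATION_SCHEMA (query shown here is an illustration, not what was actually run):

```
-- Everything that has been running longer than 60 seconds, oldest first
SELECT id, user, host, db, time, state, info
FROM information_schema.processlist
WHERE command <> 'Sleep' AND time > 60
ORDER BY time DESC;
```

Sorting by TIME makes the oldest queries (the likely culprits blocking everything behind them) surface at the top.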
Comment 7•14 years ago
I did (at runtime, the config still shows 1200 if it ever restarts) change the max_connections to 1500, but it all cleared up about the time I did it anyway, and it's running about 90ish connections now.
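For the record, the runtime bump described above would look something like this. Note that on pre-8.0 servers SET GLOBAL does not survive a restart, which is why the config file still says 1200 (a sketch of the standard MySQL mechanism, not the exact commands used):

```
-- Takes effect immediately for new connections; lost on restart
SET GLOBAL max_connections = 1500;

-- Verify the running value
SHOW VARIABLES LIKE 'max_connections';
```

To make the change permanent, max_connections also has to be raised in my.cnf under [mysqld].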
Reporter
Comment 8•14 years ago
Just got 4 "too many connections" errors at 16:01.
Comment 9•14 years ago
So, this has happened twice in the last two days, with a long enough outage for nagios to page about it. The slow query log shows a lot of suspicious queries that should probably be running against the slaves rather than the master. I dropped a copy of the slow query log onto khan; jbalogh has been poking at it a little already.
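The slow query log mentioned here is enabled in my.cnf along these lines (option names changed between MySQL 5.0 and 5.1; this is a sketch, not the actual production config):

```
[mysqld]
# MySQL 5.1+: log queries that take longer than 2 seconds
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2
# Optionally also log queries that do full scans without using an index
log-queries-not-using-indexes
```

The resulting log is what gets analyzed (e.g. with mysqldumpslow) to spot queries that belong on the slave pool instead of the master.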
Comment 10•14 years ago
I'm out for vacation until April 12th, tagging Tim for this in the interest of someone continuing to keep an eye on it. Feel free to toss it back to me when I get back if it hasn't been resolved by then. My gut feeling at the moment (considering that even at normal peak traffic the master usually only has about 90 connections open until something backs it up in a rather sudden way and triggers this) is that the application is probably sending some major query to the master that ought to be going to the slave pool instead. The necessary clues to figure out what are probably in that log file I tossed at jbalogh the other day.
Assignee: justdave → tellis
Assignee
Comment 11•14 years ago
My recollection is this got fixed. Reopen if you feel otherwise.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Reporter
Comment 12•14 years ago
We're still having the troubles (less often), but if it sits around 90 and skyrockets to 1500, that sounds like something we should fix in the app. I expect when bug 563727 lands on prod tomorrow (and the subsequent bugs for the rest of the cron jobs soon after), we'll stop running into these problems. Thanks.
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard