Closed Bug 555880 Opened 14 years ago Closed 14 years ago

[AMO] Allow more connections to database servers

Categories: Infrastructure & Operations Graveyard :: WebOps: Other (task)
Hardware: All
OS: Other
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: clouserw; Assigned: tellis

I've been getting tracebacks every day or two complaining that mysql is refusing connections because there are "too many connections."

I don't know whether it's master or slave, but what are the current settings on all the boxes?  Also, I looked in munin at tm-amo01-master01 but all the mysql charts are blank (NaN stuff).  Have you guys been getting paged about this lately?
Is "too many connections" definitely coming from tm-c01-master01? Usually bad things are happening if max connections has been reached.  Dave, do we need to add more capacity? I haven't been watching the dbs.
Over to the DB folks
Assignee: server-ops → justdave
Nope, no pages lately.  Most likely the problem only happens briefly and goes away faster than nagios can page about it.  Do you get a flurry of them at once and then they're done?

The AMO database servers are all set to max_connections=1200 across the board.  There are three slaves, so that's 3600 slave connections and 1200 to the master.  Given that we now have 26 webheads for AMO, hitting that limit only takes an average of 46 connections to the master per webhead, which isn't implausible.
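
A quick way to see where any one of those boxes sits relative to that limit (a generic sketch, assuming a stock MySQL 5.x server) would be something like:

-- configured ceiling, high-water mark since startup, and what's open right now
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';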

Not sure what the best way to scale this is.  That max_connections value was arrived at by monitoring memory usage and figuring out where the server would die under normal load.  On the other hand, we've done a lot of fine-tuning since then to the types of queries that get run and to the way we're using the hardware caches on the RAID cards, so there's probably some room to experiment again.

I'm also realizing that, because of cached connections and whatnot, this problem will only get worse once we have an additional 20 webheads in Phoenix hitting the same master server.
(In reply to comment #1)
> Is "too many connections" definitely coming from tm-c01-master01? Usually bad
> things are happening if max connections has been reached.  Dave, do we need to
> add more capacity? I haven't been watching the dbs.

We're seeing it happen in cron jobs connecting to master.

There have also been tales of blank pages in the devcp.  I wonder if that's also a too-many-connections issue?
The reason I filed this bug is that the PayPal connector failed last night because of this, but it's true that it's becoming more common.  We've had at least 3 failures already today.
Just caught it in progress; nagios actually paged for it this time.

80% of the 1200 queries in progress were:

SELECT `Config`.`key`, `Config`.`value` FROM `config` AS `Config`   WHERE 1 = 1

All of them had been running for over 4 minutes and had started within about 20 seconds of each other.  The oldest queries were updates to the CAKE session table.  Since most of the queries had been running for multiple minutes, I'm guessing the only reason it backed up is that everything was running slow.  It might be some specific query that locked it all up, but with most of the tables being InnoDB I wouldn't expect lock contention...
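
(For reference, that kind of pile-up can be spotted with a query along these lines; this is illustrative only and assumes MySQL 5.1+, where information_schema.processlist exists. SHOW FULL PROCESSLIST does the same job on older servers.)

-- non-idle connections, longest-running first, with the start of each query text
SELECT id, user, db, time, state, LEFT(info, 80) AS query_start
  FROM information_schema.processlist
 WHERE command <> 'Sleep'
 ORDER BY time DESC;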
I did change max_connections to 1500 at runtime (the config file still says 1200, so it will revert if mysqld ever restarts), but it all cleared up about the time I did it anyway, and it's sitting at around 90 connections now.
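
(The runtime bump would presumably have been along these lines; it only lasts until the next restart unless my.cnf is edited as well:)

-- raise the ceiling on the running server; edit my.cnf separately to make it stick
SET GLOBAL max_connections = 1500;
-- confirm the new value took effect
SHOW VARIABLES LIKE 'max_connections';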
Just got 4 "too many connections" errors at 16:01.
So, this has happened twice in the last two days with an outage long enough for nagios to page about it.  The slow query log shows a lot of suspicious queries that should probably be running against the slaves rather than the master.  I dropped a copy of the slow query log onto khan; jbalogh has been poking at it a little already.
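
(For anyone else digging through that log, the server-side knobs controlling what lands in it look roughly like this; the variable names are the MySQL 5.0/5.1-era ones, so treat the exact names as an assumption about this setup:)

-- check whether slow query logging is on and what the time threshold is
SHOW VARIABLES LIKE 'log_slow_queries';
SHOW VARIABLES LIKE 'long_query_time';
-- optionally lower the threshold to catch more of the suspicious queries (applies to new connections)
SET GLOBAL long_query_time = 2;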
I'm out for vacation until April 12th, tagging Tim for this in the interest of someone continuing to keep an eye on it.  Feel free to toss it back to me when I get back if it hasn't been resolved by then.

My gut feeling at the moment (considering that even at normal peak traffic the master usually only has about 90 connections open until something backs it up rather suddenly and triggers this) is that the application is probably sending some major query to the master that ought to be going to the slave pool instead.  The clues needed to figure out which one are probably in that log file I tossed at jbalogh the other day.
Assignee: justdave → tellis
Depends on: 557100
My recollection is this got fixed. Reopen if you feel otherwise.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
We're still having the trouble (less often), but if it sits around 90 and skyrockets to 1500, that sounds like something we should fix in the app.  I expect that when bug 563727 lands on prod tomorrow (with the follow-up bugs for the rest of the cron jobs soon after), we'll stop running into these problems.

Thanks.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard