Closed Bug 555880 Opened 14 years ago Closed 14 years ago

[AMO] Allow more connections to database servers

Categories: Infrastructure & Operations Graveyard :: WebOps: Other (task)
Hardware: All
OS: Other
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: clouserw; Assigned: tellis

I've been getting tracebacks every day or two complaining that mysql is refusing connections because there are "too many connections."

I don't know whether it's master or slave, but what are the current settings on all the boxes?  Also, I looked in munin at tm-amo01-master01 but all the mysql charts are blank (NaN stuff).  Have you guys been getting paged about this lately?
Is "too many connections" definitely coming from tm-c01-master01? Usually bad things are happening if max connections has been reached.  Dave, do we need to add more capacity? I haven't been watching the dbs.
Over to the DB folks
Assignee: server-ops → justdave
Nope, no pages lately.  Most likely the problem only happens briefly and goes away faster than nagios can page about it.  Do you get a flurry of them at once and then they're done?

The AMO database servers are all set to max_connections=1200 across the board.  There are three slaves, so that's 3600 slave connections and 1200 to the master.  Given that we now have 26 webheads for AMO, hitting that limit only takes an average of 46 connections to the master per webhead, which isn't implausible.
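
A quick way to see where any one of those boxes sits relative to that limit (a generic sketch, assuming a stock MySQL 5.x server) would be something like:

-- configured ceiling, high-water mark since startup, and what's open right now
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';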

Not sure what the best way to scale this is.  That max_connections value was arrived at by monitoring memory usage and figuring out where the server would die under normal load.  On the other hand, we've done a lot of fine-tuning since then to the types of queries that get run and to the way we're using the hardware caches on the RAID cards, so there's probably some room to experiment again.

I'm also realizing that, because of cached connections and whatnot, this problem will only get worse once we have an additional 20 webheads in Phoenix hitting the same master server.
(In reply to comment #1)
> Is "too many connections" definitely coming from tm-c01-master01? Usually bad
> things are happening if max connections has been reached.  Dave, do we need to
> add more capacity? I haven't been watching the dbs.

We're seeing it happen in cron jobs connecting to master.

There have also been tales of blank pages in the devcp.  I wonder if that's also a too-many-connections issue?
The reason I filed this bug is that the PayPal connector failed last night because of this, but it's true that it's becoming more common.  We've had at least 3 failures already today.
Just caught it in progress; nagios actually paged for it this time.

80% of the 1200 queries in progress were:

SELECT `Config`.`key`, `Config`.`value` FROM `config` AS `Config`   WHERE 1 = 1

All of them had been running for over 4 minutes and had started within about 20 seconds of each other.  The oldest queries were updates to the CAKE session table.  Since most of the queries had been running for multiple minutes, I'm guessing the only reason it backed up is that everything was running slow.  It might be some specific query that locked it all up, but with most of the tables being InnoDB I wouldn't expect lock contention...
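
(For reference, that kind of pile-up can be spotted with a query along these lines; this is illustrative only and assumes MySQL 5.1+, where information_schema.processlist exists. SHOW FULL PROCESSLIST does the same job on older servers.)

-- non-idle connections, longest-running first, with the start of each query text
SELECT id, user, db, time, state, LEFT(info, 80) AS query_start
  FROM information_schema.processlist
 WHERE command <> 'Sleep'
 ORDER BY time DESC;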
I did change max_connections to 1500 at runtime (the config file still says 1200, so it will revert if mysqld ever restarts), but it all cleared up about the time I did it anyway, and it's sitting at around 90 connections now.
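
(The runtime bump would presumably have been along these lines; it only lasts until the next restart unless my.cnf is edited as well:)

-- raise the ceiling on the running server; edit my.cnf separately to make it stick
SET GLOBAL max_connections = 1500;
-- confirm the new value took effect
SHOW VARIABLES LIKE 'max_connections';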
Just got 4 "too many connections" errors at 16:01.
So, this has happened twice in the last two days with an outage long enough for nagios to page about it.  The slow query log shows a lot of suspicious queries that should probably be running against the slaves rather than the master.  I dropped a copy of the slow query log onto khan; jbalogh has been poking at it a little already.
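
(For anyone else digging through that log, the server-side knobs controlling what lands in it look roughly like this; the variable names are the MySQL 5.0/5.1-era ones, so treat the exact names as an assumption about this setup:)

-- check whether slow query logging is on and what the time threshold is
SHOW VARIABLES LIKE 'log_slow_queries';
SHOW VARIABLES LIKE 'long_query_time';
-- optionally lower the threshold to catch more of the suspicious queries (applies to new connections)
SET GLOBAL long_query_time = 2;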
I'm out for vacation until April 12th, tagging Tim for this in the interest of someone continuing to keep an eye on it.  Feel free to toss it back to me when I get back if it hasn't been resolved by then.

My gut feeling at the moment (considering that even at normal peak traffic the master usually only has about 90 connections open until something backs it up rather suddenly and triggers this) is that the application is probably sending some major query to the master that ought to be going to the slave pool instead.  The clues needed to figure out which one are probably in that log file I tossed at jbalogh the other day.
Assignee: justdave → tellis
Depends on: 557100
My recollection is this got fixed. Reopen if you feel otherwise.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
We're still having the trouble (less often), but if it sits around 90 and skyrockets to 1500, that sounds like something we should fix in the app.  I expect that when bug 563727 lands on prod tomorrow (with the follow-up bugs for the rest of the cron jobs soon after), we'll stop running into these problems.

Thanks.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard