Bug 565336 (Closed) - Opened 14 years ago, Closed 14 years ago

The return of Too Many Connections

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: tellis)

Details

This bug is the sequel to bug 555880.

We're still seeing this problem, although far less often than before.  I don't see it with cron jobs anymore since our last push moved those to celery.  We're seeing it now with just general page loading every few hours, mostly in the early mornings.

From our talks about this, it sounds like nagios isn't paging because it recovers too quickly.  Dave says we average 90 connections and the limit on all boxes is 2000 - that's a pretty huge leap to make and recover from within a couple of minutes.  I'd like to find a way to track down what queries are running and who is running them when that happens.

If I give you a time, can you give me a dump of all the queries run within +/- 5 minutes from the logs?
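In the meantime, something like this would snapshot who's connected and what's running (just a sketch, assuming shell access to the db box and the mysql CLI; information_schema.processlist needs MySQL 5.1+, older servers would use SHOW FULL PROCESSLIST):

# Sketch: snapshot active queries with user/host.
mysql -N -e "SELECT id, user, host, db, time, state, LEFT(info, 120)
             FROM information_schema.processlist
             WHERE command <> 'Sleep'
             ORDER BY time DESC;"

# Sketch: how close are we to the 2000-connection limit right now?
mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"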
Over to the db lads.
Assignee: server-ops → tellis
This was more common than usual last night; it would be great to get ideas on this.
Severity: normal → major
Severity: major → critical
What server gets all the connections when this problem manifests itself? I see both tm-b01-master01 and tm-amo01-master01 mentioned. Any slaves? And is it really BOTH of those masters?
I'm talking about tm-amo01-master01.  I don't know what b01 is.
I have started a SQL query monitor on the machine. It logs every SQL query that happens, with a timestamp and runtime. So once we see the problem again, I can look at the logs and give you an idea of what's happening. I will have logs for anything after 2:20pm Pacific on Thursday.
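Roughly, the monitor boils down to a polling loop. A hypothetical sketch of the idea (the real dbmon script may work differently, e.g. via the general query log):

# Hypothetical sketch of what a monitor like this boils down to; the
# actual dbmon script may differ in format and mechanism.
while true; do
  mysql -N -e "SHOW FULL PROCESSLIST;" \
    | awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '{print ts "\t" $0}' \
    >> /var/log/dbmon-localhost.log
  sleep 1
done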
It's only happened once, very briefly at 4:41am today.
This is what we talked about in IRC:

timellis: At 4:41am a whole shed-tonne of queries came into amo01.	2:49
Many queries with a giant query text like this:	2:49
SELECT `Addon`.`id`, `Addon`.`guid`, IFNULL(`tr_name`.localized_string, `fb_name`.localized_string) AS `name`, `Addon`.`defaultlocale`, `Addon`.`addontyp..........nl') LEFT JOIN translations AS `fb_summary` ON (`Addon`.`summary` = `fb_summary`.id AND `fb_summary`.locale=`Addon`.`defaultlocale`)  WHERE `Addon`.`id` =  13907   LIMIT 1
jbalogh: that's remora pulling an add-on and its translations	2:51
timellis	‣	So I believe that's the root cause of the "too many connections" problem. Well, there is another query in here.	2:56
timellis	‣	much smaller and it EXPLAINs nicer, too: SELECT `AddonCategory`.`id`, `AddonCategory`.`addon_id`, `AddonCategory`.`category_id`, `AddonCategory`.`feature`, `AddonCategory`.`feature_locales` FROM `addons_categories` AS `AddonCategory`   WHERE `AddonCategory`.`addon_id` IN (161617)	2:57
timellis	‣	But still looks like there are many of them.	2:57
timellis	‣	Are all those queries issued in parallel?	2:57

jbalogh	‣	timellis: that addoncategory shows up a ton in the slow query logs	3:05
jbalogh	‣	I don't really know why it's slow though	3:05
jbalogh	‣	that's probably someone opening up an uncached listing page, so it's pulling a bunch of add-ons in one request	3:06
timellis	‣	probably only slow cuz a bazillion of them run at once.	3:15
timellis	‣	they explain pretty quick/simple.
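
For the record, checking the plan on the smaller query is just an EXPLAIN. A sketch, with "addons" as a placeholder database name:

# Sketch: EXPLAIN the AddonCategory query from the paste above.
# "addons" is a placeholder database name, not necessarily the real one.
mysql -e "EXPLAIN SELECT id, addon_id, category_id, feature, feature_locales
          FROM addons_categories
          WHERE addon_id IN (161617);" addons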

I'm still trying to prove it's a bunch of little queries running at once, rather than, say, a bunch of queries locked on one table, or a giant query using a lot of disk I/O (and thus causing contention). But so far my best guess is that it's many small queries issued in quick succession.
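One quick way to tell those cases apart while it's happening is to group the connections by state. A sketch:

# Sketch: group connections by state. Lots of "Locked" / lock-wait
# states points at contention; lots of threads in ordinary states
# with small time values points at many small queries at once.
mysql -N -e "SELECT state, COUNT(*) AS n
             FROM information_schema.processlist
             GROUP BY state
             ORDER BY n DESC;"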
Wil, when is the most recent time this has happened? I want to look at the logs and compare against the last run to see if the queries are at least the same. If so, we can characterise that aspect of the problem.
jbalogh: How can we test the hypothesis that it's "probably someone opening up an uncached listing page, so it's pulling a bunch of add-ons in one request"?

If it is that, then we have a clearer path to the solution, yes?
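One possible test (just a sketch; the URL is an example listing page, and the counter delta is only meaningful on a quiet or staging box, since it includes everyone else's traffic): hit the page with a cold cache and count how many queries one request issues:

# Sketch: global query counter before/after one uncached page view.
# The URL is an example; substitute a real listing page.
before=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions';" | awk '{print $2}')
curl -s 'https://addons.mozilla.org/en-US/firefox/extensions/' > /dev/null
after=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions';" | awk '{print $2}')
echo "queries issued by that request: $((after - before))"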
A single time at 12:37am today.
Yesterday a few times at 11:07pm, and a few more times at 2:07pm.
May 17th at 10:59am and 9:13pm.

This is happening less and less, I assume because we are moving pages from remora -> zamboni and easing the load on the db.
(In reply to comment #9)
> jbalogh: How can we test the hypothesis that it's "probably someone opening up
> an uncached listing page, so it's pulling a bunch of add-ons in one request"?

We might be able to divine something from your complete query log from around that time.  If you drop that on khan I can take a look.

I'm starting to suspect it's a bunch of someones hitting the same uncached page and dogpiling.  We don't have an immediate solution for that, but if we can figure out what page is getting pounded we can force it to the slave dbs. That doesn't always happen automatically with remora.

(Slave dbs will be selected automatically in the glorious zamboni future and the master will feel sad and lonely.)
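If it is dogpiling, the pounded page should stand out in the web logs around a spike. A sketch, assuming Apache combined-format access logs where field 7 is the request path:

# Sketch: top requested paths in an access log slice around the spike.
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -20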
It's worth noting that I don't know for sure if the error is from a slave or the master.  That would be good to log on our end.
(In reply to comment #12)
> It's worth noting that I don't know for sure if the error is from a slave or
> the master.  That would be good to log on our end.

It's the master.  The tracebacks always start from <object>.save(), which has to write to the master.
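For what it's worth, comparing connection high-water marks across the boxes would also confirm which server is hitting the limit. A sketch (tm-amo01-slave01 is an example hostname; substitute the real slaves):

# Sketch: whichever box's Max_used_connections is at or near its
# max_connections is the one throwing "Too many connections".
for h in tm-amo01-master01 tm-amo01-slave01; do
  echo "== $h =="
  mysql -h "$h" -N -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW VARIABLES LIKE 'max_connections';"
done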
In /tmp on khan you'll find three files:

dbmon-localhost.err.0.gz
dbmon-localhost.log.0.gz
dbmon-localhost.sql.0.gz

The .log file has a hash of each SQL query plus extra info, like when it started and when it finished. If you just grep for "Finish", you can work back to when a query started from its runtime.

The .sql file is just hash-to-SQL pairs, so if you want to know what SQL a logfile entry ran, you can do "grep asd98f7sda7s9d7f9a87d87987a *.sql | head -1" to get one line with the query that has that hash.
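To see how bursty a given window was, you can count finished queries per second straight from the .log. A sketch, assuming the timestamp (to the second) is the first field on each "Finish" line:

# Sketch: queries finishing per second around an incident. Adjust the
# field number if the timestamp sits elsewhere on the line.
zcat dbmon-localhost.log.0.gz | grep Finish | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head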
Just noting that this is at the top of my list of bugs to work on. Is there anything else I can help with? Is this still my bug?
I haven't seen a "Too Many Connections" error in a few days, although I have been seeing a lot of "Lost connection to MySQL" and other failures to connect lately.  A quick glance says that is happening about 6-12 times an hour, all day long.  I can file a different bug if you don't want to confuse this one.
Got it. Closing this bug. Please open a new bug for the too-many-connections / lost-connection errors.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard