The return of Too Many Connections

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
--
critical
RESOLVED FIXED
8 years ago
3 years ago

People

(Reporter: clouserw, Assigned: timellis)

Tracking

Details

(Reporter)

Description

8 years ago
This bug is the sequel to bug 555880.

We're still seeing this problem, although far less often than before.  I don't see it with cron jobs anymore since our last push moved those to celery.  We're seeing it now with just general page loading every few hours, mostly in the early mornings.

From our talks about this, it sounds like nagios isn't paging because it recovers too quickly.  Dave says we average 90 connections and the limits on all boxes is 2000 - that's a pretty huge leap to make and recover from within a couple minutes.  I'd like to find a way to track down what queries are running and who is running them when that happens.

If I give you a time can you give me a dump of all the queries run within +/- 5 minutes from the logs?
Over to the db lads.
Assignee: server-ops → tellis
(Reporter)

Comment 2

8 years ago
This was more common than usual last night, it would be great to get ideas on this.
Severity: normal → major
(Reporter)

Updated

8 years ago
Severity: major → critical
(Assignee)

Comment 3

8 years ago
What server gets all the connections when this problem manifests itself? I see both tm-b01-master01 and tm-amo01-master01 mentioned. Any slaves? And is it really BOTH of those masters?
(Reporter)

Comment 4

8 years ago
I'm talking about tm-amo01-master01.  I don't know what b01 is.
(Assignee)

Comment 5

8 years ago
I have started a SQL query monitor on the machine. It logs every SQL query that happens with a timestamp and runtime. So once we see the problem again, I can look at the logs and give you an idea of what's happening. Anytime after 2:20pm Pacific Thursday I will have logs for.
(Reporter)

Comment 6

8 years ago
It's only happened once, very briefly at 4:41am today.
(Assignee)

Comment 7

8 years ago
This is what we're talking about in IRC.

timellis: At 4:41am a whole shed-tonne of queries came into amo01.	2:49
Many queries with a giant query text like this:	2:49
SELECT `Addon`.`id`, `Addon`.`guid`, IFNULL(`tr_name`.localized_string, `fb_name`.localized_string) AS `name`, `Addon`.`defaultlocale`, `Addon`.`addontyp..........nl') LEFT JOIN translations AS `fb_summary` ON (`Addon`.`summary` = `fb_summary`.id AND `fb_summary`.locale=`Addon`.`defaultlocale`)  WHERE `Addon`.`id` =  13907   LIMIT 1
jbalogh: that's remora pulling an add-on and its translations	2:51
timellis	‣	So I believe that's the root cause of the "too many connections" problem. Well, there is another query in here.	2:56
timellis	‣	much smaller and it EXPLAINs nicer, too: SELECT `AddonCategory`.`id`, `AddonCategory`.`addon_id`, `AddonCategory`.`category_id`, `AddonCategory`.`feature`, `AddonCategory`.`feature_locales` FROM `addons_categories` AS `AddonCategory`   WHERE `AddonCategory`.`addon_id` IN (161617)	2:57
timellis	‣	But still looks like there are many of them.	2:57
timellis	‣	Are all those queries issued in parallel?	2:57

jbalogh	‣	timellis: that addoncategory shows up a ton in the slow query logs	3:05
jbalogh	‣	I don't really know why it's slow though	3:05
jbalogh	‣	that's probably someone opening up an uncached listing page, so it's pulling a bunch of add-ons in one request	3:06
<!>	‣	stephend is now known as stephend|busy	3:08
timellis	‣	probably only slow cuz a bazillion of them run at once.	3:15
timellis	‣	they explain pretty quick/simple.

I'm still trying to prove it's a bunch of little queries running at once, rather than say a bunch of queries locked on one table or a giant query using a lot of disk I/O (and thus causing contention). But so far my best guess is it's caused by many small queries issued in quick succession.
(Assignee)

Comment 8

8 years ago
Wil, when is the most recent time this has happened? I want to look at logs and compare against the last run and see if the queries are the same at least. If so, we can at least characterise that aspect of the problem.
(Assignee)

Comment 9

8 years ago
jbalogh: How can we test the hypothesis that it's "probably someone opening up an uncached listing page, so it's pulling a bunch of add-ons in one request"?

If it is that, then we have a cleaer path to the solution, yes?
(Reporter)

Comment 10

8 years ago
A single time at 12:37am today.
Yesterday a few times at 11:07pm, and a few more times at 2:07pm.
May 17th at 10:59am and 9:13pm.

This is happening less and less, I assume because we are moving pages from remora -> zamboni and easing the load on the db.
(In reply to comment #9)
> jbalogh: How can we test the hypothesis that it's "probably someone opening up
> an uncached listing page, so it's pulling a bunch of add-ons in one request"?

We might be able to divine something from your complete query log from around that time.  If you drop that on khan I can take a look.

I'm starting to suspect it's a bunch of someone's hitting the same uncached page and dogpiling.  We don't have an immediate solution for that, but if we can figure out what page is getting pounded we can force it to the slave dbs. That doesn't always happen automatically with remora.

(Slave dbs will be selected automatically in the glorious zamboni future and the master will feel sad and lonely.)
(Reporter)

Comment 12

8 years ago
It's worth noting that I don't know for sure if the error is from a slave or the master.  That would be good to log on our end.
(In reply to comment #12)
> It's worth noting that I don't know for sure if the error is from a slave or
> the master.  That would be good to log on our end.

It's the master.  The tracebacks are always starting from <object>.save(), which has to write to the master.
(Assignee)

Comment 14

8 years ago
Find on khan in /tmp three files:

dbmon-localhost.err.0.gz
dbmon-localhost.log.0.gz
dbmon-localhost.sql.0.gz

The .log is just a hash of the SQL query with extra info like when it started, when it finished. If you just grep for "Finish" then you can assume when it started from the runtime.

The .sql is just a hash, and SQL, so if you want to know what SQL was run from the logfile, you can do "grep asd98f7sda7s9d7f9a87d87987a *.sql | head -1" to get one line with the query that has that hash.
(Assignee)

Comment 15

8 years ago
Just noting this is top in my priority of bugs to work on. Is there anything else I can help with? Is this still my bug?
(Reporter)

Comment 16

8 years ago
I haven't seen a "Too Many Connections" error in a few days, although I have been seeing a lot of "Lost connection to Mysql" and other failures to connect lately.  A quick glance says that is happening about 6-12 times an hour, all day long.  I can file a different bug if you don't want to confuse this one.
(Assignee)

Comment 17

8 years ago
Got it. Closing this bug. Open bug with toomanyconnections/lostconnection errors.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.