Closed
Bug 1132792
Opened 9 years ago
Closed 9 years ago
Large backlog of linux64 tests
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: jlund)
Details
Attachments
(6 files)
At time of filing: 5000 pending, 300 running and declining. Pending started climbing at 2-3pm Pacific. Trees are all closed. aws_watch_pending seems to be launching instances like mad, but the masters aren't handing out work all that fast. I restarted bm67 and it's much improved (lots of idle slaves on /buildslaves?no_builders=1). We're seeing lines like this in the master twistd.log:

2015-02-12 20:04:07-0800 [-] Failed to claim everything (got 0, expected 1)
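As a minimal sketch of how one might triage this symptom, the "Failed to claim everything" lines can be counted from a twistd.log excerpt. The regex is based only on the line quoted above; the exact wording may differ across buildbot versions, and the function name is ours, not buildbot's.

```python
import re

# Pattern for the "Failed to claim everything" warning quoted above;
# the exact wording may vary across buildbot versions.
CLAIM_RE = re.compile(r"Failed to claim everything \(got (\d+), expected (\d+)\)")

def count_claim_failures(log_lines):
    """Return (failure_count, total_unclaimed_requests) for a log excerpt."""
    failures = 0
    unclaimed = 0
    for line in log_lines:
        m = CLAIM_RE.search(line)
        if m:
            failures += 1
            unclaimed += int(m.group(2)) - int(m.group(1))
    return failures, unclaimed
```

A steadily growing `unclaimed` count across masters would point at the claim path (i.e. the database) rather than any one master.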
Assignee
Comment 1•9 years ago
Assignee
Comment 2•9 years ago
Somewhere between 22:00-23:00 the ship started to correct itself. nthomas noticed that on at least one linux AWS test master, bm67, twistd.log was only showing 'finished prioritization' very sporadically throughout the tree closure, then picked back up again around 22:40: http://pastebin.mozilla.org/8711253. Normally it should finish at least once every minute.
Reporter
Comment 3•9 years ago
Some things we noticed along the way:

* This seemed to only affect AWS masters/jobs. The smokeping graphs don't indicate any network issues on the tunnels back to scl3.
* We saw a lot of repeated instance launching:

2015-02-12 17:29:40,800 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 17:44:23,159 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 17:59:23,120 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 18:15:47,217 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 18:31:17,187 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 18:42:31,820 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 18:51:44,973 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 19:04:51,406 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
2015-02-12 19:11:58,927 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)

When I did get on this slave, runner had run fine and buildbot had started, then the slave sat idle for 45 min.
* Looking at <master>/buildslaves?no_builders=1 pages, it was common to see tens of idle slaves and only a handful of running jobs; only some 200-400 linux64 tests running in total despite plenty of pending.
* Possible db overloading:
** This simple (indexed!) query on buildbot_schedulers:

select id from buildrequests order by id desc limit 10;

took at least 0.6s, often multiple seconds, sometimes 8-9s, on both vips: r/w (buildbot1) from bm67, and r/o (buildbot2 & 3) from cruncher (~2100 Pacific).
** Now that we're recovering it's < 0.1s.
* Attached is a grep on 'prioritizeBuilders.*finished prioritization' from bm67 in use1, as a proxy measure for how well the master was working. Note that it runs at least every minute until about 14:55, then sporadically until 22:51, then resumes normal operation. By comparison, bm103 in scl3 doesn't miss a beat.
* gcox and sheeri increased the RAM on buildbot3 to help replication stay in sync, ~1940 Pacific. There may be more info in newrelic on the effect of that. I wonder if buildbot1 needs to be sized a little bigger than 6G RAM to cope with replicating to two slaves; it certainly seems to be reading a lot of data off the disk, but color me clueless about db caching.
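The repeated spot-request pattern above (the same hostname requested every ~15 minutes without ever taking work) is easy to surface mechanically. A hedged sketch, assuming aws_watch_pending log lines shaped like those quoted; the function name and threshold are ours:

```python
import re
from collections import Counter

# Matches aws_watch_pending spot-request lines like those quoted above.
SPOT_RE = re.compile(r"Spot request for (\S+) \(([\d.]+)\)")

def repeated_spot_requests(log_lines, threshold=3):
    """Return hosts spot-requested at least `threshold` times -- a hint
    that instances are being launched repeatedly but never doing work."""
    counts = Counter()
    for line in log_lines:
        m = SPOT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return {host: n for host, n in counts.items() if n >= threshold}
```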
Reporter
Comment 4•9 years ago
Lowering to major as we're back in business, rest of trees reopened. Leaving open for investigation/comment/monitoring.
Severity: blocker → major
Comment 5•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #4)
> Lowering to major as we're back in business, rest of trees reopened. Leaving
> open for investigation/comment/monitoring.

And we are closed again, backlog is building up again :(
Severity: major → blocker
Reporter
Comment 6•9 years ago
Currently 2400 pending linux64 tests, only 519 running and on the way down. Seeing idle slaves on bm67 that are not being given work. From twistd.log again:

2015-02-13 01:10:46-0800 [-] prioritizeBuilders: 0.38s finished prioritization
2015-02-13 01:10:52-0800 [-] prioritizeBuilders: 0.55s finished prioritization
2015-02-13 01:10:57-0800 [-] prioritizeBuilders: 0.36s finished prioritization
2015-02-13 01:17:56-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:19:39-0800 [-] prioritizeBuilders: 1.01s finished prioritization
2015-02-13 01:22:57-0800 [-] prioritizeBuilders: 1.98s finished prioritization
2015-02-13 01:24:06-0800 [-] prioritizeBuilders: 0.96s finished prioritization
2015-02-13 01:24:22-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:24:35-0800 [-] prioritizeBuilders: 0.58s finished prioritization
2015-02-13 01:24:44-0800 [-] prioritizeBuilders: 0.69s finished prioritization
2015-02-13 01:24:58-0800 [-] prioritizeBuilders: 0.51s finished prioritization
2015-02-13 01:25:16-0800 [-] prioritizeBuilders: 0.54s finished prioritization
2015-02-13 01:25:33-0800 [-] prioritizeBuilders: 3.79s finished prioritization
2015-02-13 01:25:48-0800 [-] prioritizeBuilders: 3.04s finished prioritization
2015-02-13 01:31:34-0800 [-] prioritizeBuilders: 1.25s finished prioritization
2015-02-13 01:41:59-0800 [-] prioritizeBuilders: 0.95s finished prioritization
2015-02-13 01:46:11-0800 [-] prioritizeBuilders: 4.27s finished prioritization
2015-02-13 02:02:04-0800 [-] prioritizeBuilders: 2.39s finished prioritization
2015-02-13 02:07:51-0800 [-] prioritizeBuilders: 2.18s finished prioritization
2015-02-13 02:13:03-0800 [-] prioritizeBuilders: 2.34s finished prioritization
2015-02-13 02:17:57-0800 [-] prioritizeBuilders: 1.39s finished prioritization
2015-02-13 02:35:17-0800 [-] prioritizeBuilders: 4.63s finished prioritization
2015-02-13 02:41:02-0800 [-] prioritizeBuilders: 1.05s finished prioritization
2015-02-13 02:46:51-0800 [-] prioritizeBuilders: 1.03s finished prioritization
2015-02-13 02:49:58-0800 [-] prioritizeBuilders: 1.78s finished prioritization

So bm67 was fine until ~01:10, then laggier than it should be. (FWIW, prioritizeBuilders runs when deciding what work should be started.) In newrelic, the I/O is low until about the same time, then returns to the high levels we had during the last tree closure. So we need to find out what the queries are, whether they are unusual, or whether something else is not optimal. If not, switching to the lower-spec buildbot1 for rw access seems to be causing the problems.
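The useful signal in the excerpt above is not the per-run duration but the gaps: prioritizeBuilders should finish at least once a minute. A minimal sketch (our own helper, assuming twistd.log timestamps shaped like those above) that flags stalls longer than a threshold:

```python
from datetime import datetime

def prioritization_gaps(log_lines, max_gap=60):
    """Flag gaps where prioritizeBuilders did not finish for over
    `max_gap` seconds. Returns (start, end, seconds) tuples."""
    stamps = []
    for line in log_lines:
        if "finished prioritization" in line:
            # e.g. "2015-02-13 01:10:46-0800 [-] prioritizeBuilders: ..."
            # -> join date and time tokens, strip the UTC offset suffix.
            stamp = " ".join(line.split()[:2]).rsplit("-", 1)[0]
            stamps.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > max_gap:
            gaps.append((prev, cur, delta))
    return gaps
```

Run against the excerpt above, this would surface the ~7-minute stall between 01:10:57 and 01:17:56 that corresponds to the idle slaves.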
Reporter
Comment 7•9 years ago
Running 'show processlist' on buildbot-rw-vip yields many tens of connections in 'Waiting for table level lock', and usually one or two in 'Sending data' for queries like:

SELECT br.id FROM buildrequests AS br, buildsets AS bs WHERE br.buildername='Windows 7 32-bit fx-tea...

It changes a bit as you run processlist a few times, but that's the basic behaviour. It doesn't matter where the buildbot-master is; e.g. bm118 in scl3 also sees laggy prioritizeBuilders.
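Eyeballing processlist under load is error-prone; tallying the State column makes the lock pileup obvious. A hedged sketch, assuming tab-separated output as produced by `mysql -e 'show processlist'` with the standard column order (Id, User, Host, db, Command, Time, State, Info); the helper name is ours:

```python
from collections import Counter

def processlist_states(output_lines):
    """Tally the State column of tab-separated `show processlist` output.
    Assumes standard column order: Id, User, Host, db, Command, Time,
    State, Info. Dozens of 'Waiting for table level lock' entries point
    at MyISAM table locking, since InnoDB locks at row granularity."""
    states = Counter()
    for line in output_lines[1:]:  # skip the header row
        cols = line.split("\t")
        if len(cols) >= 7:
            states[cols[6]] += 1
    return states
```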
Reporter
Comment 8•9 years ago
prioritizeBuilders is running regularly again with no DB locking; pending is on its way down and running is going up. We need to resolve the underlying issue though; time for a DBA, I think.
Comment 9•9 years ago
Paged coop to get help here and hopefully some open trees at some point.
Comment 10•9 years ago
(In reply to Carsten Book [:Tomcat] from comment #9)
> paged coop to get help here and hopefully some open trees at some point.

And now also catlee:

04:42 #buildduty: < nagios-releng> Tomcat|sheriffduty: catlee has been paged with the message "bug 1132792 - all trees closed again - seems we need dba/buildbot help(Tomcat|sheriffduty)"
Comment 11•9 years ago
To sum up the discussions from #releng this morning:

* Yesterday we failed over the db master from a 24G RAM server with a 4G innodb buffer pool and a 512M myisam key cache, to a 6G RAM server with a 4G myisam key cache and a 512M innodb buffer pool.
* Between 22:21 and 22:39 PT last night, we increased one of the slaves from 6G RAM to 8G RAM.
* Pending job counts climbed past 2000; jobs were not being taken.
* catlee found that we weren't reliably updating build_requests here: http://hg.mozilla.org/build/buildbot/file/1a520b4537fb/master/buildbot/db/connector.py#l865

We decided to revert the db master change, and started that process around 07:30 PT.
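The failover described above didn't just shrink total RAM; it inverted which storage engine got the big cache. As a hedged illustration only (sizes taken from this comment; `innodb_buffer_pool_size` and `key_buffer_size` are the standard MySQL option names, but the actual my.cnf files were not posted in this bug):

```ini
# Original master (24G RAM) -- InnoDB gets the large cache
[mysqld]
innodb_buffer_pool_size = 4G
key_buffer_size         = 512M   # MyISAM key cache

# Failed-over master (6G RAM) -- allocation inverted
[mysqld]
innodb_buffer_pool_size = 512M
key_buffer_size         = 4G
```

Which of the two buffers matters depends on the storage engine of the hot buildbot tables, so an inverted allocation could plausibly explain the heavy disk reads and slow queries seen earlier, consistent with reverting the failover fixing things.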
Comment 12•9 years ago
We seem to be starting new jobs OK again. I've reopened Try. Doing some quick merges and other cleanup on the other branches, then we can start ramping things up.
Comment 13•9 years ago
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/95f5dfa2292e has been on inbound for 12min now and still no jobs requested :(
Comment 14•9 years ago
Still no sign of jobs on that push. Try re-closed.
Comment 15•9 years ago
OK, I've been given the go-ahead to reopen everything and have done so.
Reporter
Comment 16•9 years ago
So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and now we're OK ?
Comment 17•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #16)
> So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and
> now we're OK ?

Seems so.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee
Comment 18•9 years ago
For lack of a better place: we are seeing a similar spike in the linux64 backlog. A couple of ideas:

- Masters may have had out-of-sync tools repo revisions: we stood up so many masters last week at different times, so they would have cloned tools against tip at different times. There may have been contradictory tools revs that required matching buildbot repo patches.
- During the reconfig I saw: "[buildbot-master01.bb.releng.use1.mozilla.com] out: _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'buildbot-rw-*****.mozilla.com' (2)")" so we may have DNS or connection issues.
- We are still experiencing db lag.
Assignee: nobody → jlund
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 19•9 years ago
Callek fixed up the regex used to start instances in https://github.com/mozilla/build-cloud-tools/pull/38. There were a lot of jobs for which we were either not launching instances at all (mostly builds) or launching the wrong type of instance (some tests). This should address comments from sheriffs about laggy builds, as well as the big pending counts for tests. It also explains why, with tests running, we were never getting that close to the instance limit.
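To illustrate the failure mode behind that fix: aws_watch_pending decides what to launch by matching pending builder names against regexes, and a pattern that fails to match (or matches the wrong entry first) silently launches nothing or the wrong type. This is a hypothetical sketch; the patterns and instance-type names below are illustrative, not the actual build-cloud-tools configuration:

```python
import re

# Hypothetical builder-name -> instance-type table; first match wins.
# Order matters: a too-broad pattern earlier in the list shadows later ones.
BUILDER_PATTERNS = [
    (re.compile(r"^b-2008-"), "win64-builder"),
    (re.compile(r"linux64"), "tst-linux64-spot"),
    (re.compile(r"linux(32)?\b"), "tst-linux32-spot"),
]

def instance_type_for(builder_name):
    """Return the instance type to launch for a pending builder, or None
    if no pattern matches -- the silent failure mode in this bug, where
    unmatched jobs simply never got an instance."""
    for pattern, instance_type in BUILDER_PATTERNS:
        if pattern.search(builder_name):
            return instance_type
    return None
```

Logging (or alerting on) the `None` case would have surfaced the unlaunched builds much earlier than a climbing pending count did.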
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard