Bug 1132792 (Closed): Opened 9 years ago, Closed 9 years ago

Large backlog of linux64 tests

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: jlund)

Details

Attachments

(6 files)

Attached image Pending graph for tests
At time of filing: 5000 pending, 300 running and declining. Pending started climbing at 2-3pm Pacific. All trees are closed.

We seem to be launching instances like mad in aws_watch_pending, but the masters aren't handing out work all that fast. I restarted bm67 and it's much improved (lots of idle slaves on /buildslaves?no_builders=1).

We're seeing lines like this in the master twistd.log:
  2015-02-12 20:04:07-0800 [-] Failed to claim everything (got 0, expected 1)
Somewhere between 22:00 and 23:00 the ship started to correct itself.

nthomas noticed that on at least one linux AWS test master, bm67, twistd.log was only showing 'finished prioritization' very sporadically throughout the tree closure, and then picked back up again around 22:40: http://pastebin.mozilla.org/8711253

Normally it should finish at least once every minute.
Some things we noticed along the way:
* this seemed to only affect AWS masters/jobs; the smokeping graphs don't indicate any network issues on the tunnels back to scl3
* we saw a lot of repeated instance launching
 2015-02-12 17:29:40,800 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 17:44:23,159 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 17:59:23,120 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:15:47,217 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:31:17,187 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:42:31,820 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:51:44,973 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 19:04:51,406 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 19:11:58,927 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
  When I did get on this slave, runner had run fine and buildbot had started, but the slave then sat idle for 45 minutes.
* looking at <master>/buildslaves?no_builders=1 pages, it was common to see tens of idle slaves and only a handful of running jobs; only some 200-400 linux64 tests were running in total despite plenty of pending
* possible db overloading (see the timing sketch after this list)
** this simple (indexed!) query against the buildbot_schedulers db
     select id from buildrequests order by id desc limit 10;
   took at least 0.6s, often multiple seconds, sometimes 8-9s, on both vips: r/w (buildbot1) from bm67, and
   r/o (buildbot2 & 3) from cruncher (~2100 Pacific)
** now that we're recovering it's < 0.1s
* attached is a grep on 'prioritizeBuilders.*finished prioritization' from bm67 in use1, as a proxy measure for how well the master was working. Note that it runs at least every minute until about 14:55, then sporadically until 22:51, then resumes normal operation. By comparison bm103 in scl3 doesn't miss a beat

* gcox and sheeri increased the RAM on buildbot3 to help replication stay in sync, ~1940 Pacific. There may be more info in New Relic on the effect of that. I wonder if buildbot1 needs to be sized a little bigger than 6G RAM to cope with replication to two slaves; it certainly seems to be reading a lot of data off the disk, but color me clueless about db caching
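For reference, this is roughly the latency check from the db-overloading bullet above, expressed as a small script. Hostnames, credentials and user are placeholders, not the real production values.

  # Time the simple indexed buildrequests query against both VIPs.
  import time
  import MySQLdb   # MySQL-python, the same driver buildbot uses

  VIPS = [
      "buildbot-rw-vip.example.mozilla.com",   # r/w (buildbot1)
      "buildbot-ro-vip.example.mozilla.com",   # r/o (buildbot2 & 3)
  ]

  for host in VIPS:
      conn = MySQLdb.connect(host=host, user="readonly", passwd="secret",
                             db="buildbot_schedulers")
      cur = conn.cursor()
      start = time.time()
      cur.execute("SELECT id FROM buildrequests ORDER BY id DESC LIMIT 10")
      cur.fetchall()
      # This indexed query should be well under 0.1s on a healthy db; during
      # the closure it was taking 0.6s to 8-9s.
      print "%-40s %.2fs" % (host, time.time() - start)
      conn.close()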
Lowering to major as we're back in business, rest of trees reopened. Leaving open for investigation/comment/monitoring.
Severity: blocker → major
(In reply to Nick Thomas [:nthomas] from comment #4)
> Lowering to major as we're back in business, rest of trees reopened. Leaving
> open for investigation/comment/monitoring.

and we are closed again; the backlog is building up again :(
Severity: major → blocker
Currently 2400 pending linux64 tests, only 519 running and on the way down. Seeing idle slaves on bm67 that are not being given work. From twistd.log again:
2015-02-13 01:10:46-0800 [-] prioritizeBuilders: 0.38s finished prioritization
2015-02-13 01:10:52-0800 [-] prioritizeBuilders: 0.55s finished prioritization
2015-02-13 01:10:57-0800 [-] prioritizeBuilders: 0.36s finished prioritization
2015-02-13 01:17:56-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:19:39-0800 [-] prioritizeBuilders: 1.01s finished prioritization
2015-02-13 01:22:57-0800 [-] prioritizeBuilders: 1.98s finished prioritization
2015-02-13 01:24:06-0800 [-] prioritizeBuilders: 0.96s finished prioritization
2015-02-13 01:24:22-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:24:35-0800 [-] prioritizeBuilders: 0.58s finished prioritization
2015-02-13 01:24:44-0800 [-] prioritizeBuilders: 0.69s finished prioritization
2015-02-13 01:24:58-0800 [-] prioritizeBuilders: 0.51s finished prioritization
2015-02-13 01:25:16-0800 [-] prioritizeBuilders: 0.54s finished prioritization
2015-02-13 01:25:33-0800 [-] prioritizeBuilders: 3.79s finished prioritization
2015-02-13 01:25:48-0800 [-] prioritizeBuilders: 3.04s finished prioritization
2015-02-13 01:31:34-0800 [-] prioritizeBuilders: 1.25s finished prioritization
2015-02-13 01:41:59-0800 [-] prioritizeBuilders: 0.95s finished prioritization
2015-02-13 01:46:11-0800 [-] prioritizeBuilders: 4.27s finished prioritization
2015-02-13 02:02:04-0800 [-] prioritizeBuilders: 2.39s finished prioritization
2015-02-13 02:07:51-0800 [-] prioritizeBuilders: 2.18s finished prioritization
2015-02-13 02:13:03-0800 [-] prioritizeBuilders: 2.34s finished prioritization
2015-02-13 02:17:57-0800 [-] prioritizeBuilders: 1.39s finished prioritization
2015-02-13 02:35:17-0800 [-] prioritizeBuilders: 4.63s finished prioritization
2015-02-13 02:41:02-0800 [-] prioritizeBuilders: 1.05s finished prioritization
2015-02-13 02:46:51-0800 [-] prioritizeBuilders: 1.03s finished prioritization
2015-02-13 02:49:58-0800 [-] prioritizeBuilders: 1.78s finished prioritization

So bm67 was fine until ~01:10, then laggier than it should be. (FWIW, prioritizeBuilders runs when deciding what work should be started.) In New Relic, the I/O is low until about the same time, then returns to the high levels we had during the last tree closure.

So we need to find out what the queries are and whether they are unusual, or whether something else is not optimal. If not, switching to the lower-spec buildbot1 for r/w access seems to be causing problems.
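A minimal sketch of how the gaps in the prioritization log above can be spotted: scan twistd.log for 'finished prioritization' lines and report any interval over a minute. The log filename and the 60s threshold are assumptions.

  import re
  from datetime import datetime

  TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
  prev = None
  for line in open("twistd.log"):
      if "finished prioritization" not in line:
          continue
      m = TS.match(line)
      if not m:
          continue
      ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
      if prev is not None:
          gap = int((ts - prev).total_seconds())
          if gap > 60:
              # prioritizeBuilders normally finishes at least once a minute.
              print "%ds gap ending at %s" % (gap, ts)
      prev = ts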
Running 'show processlist' on buildbot-rw-vip yields many tens of threads in 'Waiting for table level lock', and usually one or two in 'Sending data' for queries like:
 SELECT br.id FROM buildrequests AS br, buildsets AS bs WHERE br.buildername='Windows 7 32-bit fx-tea....

It changes a bit as you run processlist a few times, but that's the basic behaviour. It doesn't matter where the buildbot master is; e.g. bm118 in scl3 also sees laggy prioritizeBuilders. A quick way to summarize the thread states is sketched below.
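A sketch only; the host and credentials are placeholders. It counts thread states from SHOW FULL PROCESSLIST rather than eyeballing the raw output.

  from collections import Counter
  import MySQLdb

  conn = MySQLdb.connect(host="buildbot-rw-vip.example.mozilla.com",
                         user="admin", passwd="secret")
  cur = conn.cursor()
  cur.execute("SHOW FULL PROCESSLIST")
  # Columns: Id, User, Host, db, Command, Time, State, Info
  states = Counter(row[6] or "(none)" for row in cur.fetchall())
  for state, count in states.most_common():
      print "%4d  %s" % (count, state)
  # During the incident this showed tens of threads in
  # 'Waiting for table level lock' and one or two in 'Sending data'.
  conn.close()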
prioritizeBuilders is running regularly again and there's no DB locking; pending is on its way down and running is going up. We still need to resolve the underlying issue, though. Time for a DBA, I think.
paged coop to get help here and hopefully some open trees at some point.
(In reply to Carsten Book [:Tomcat] from comment #9)
> paged coop to get help here and hopefully some open trees at some point.

and now also catlee 

04:42 #buildduty: < nagios-releng> Tomcat|sheriffduty: catlee has been paged with the message "bug 1132792 - all trees closed again - seems we need dba/buildbot help(Tomcat|sheriffduty)"
To sum up the discussions from #releng this morning:

* yesterday we failed over the db master from a 24G RAM server with a 4G innodb buffer pool and a 512M myisam key cache, to a 6G RAM server with a 4G myisam key cache and a 512M innodb buffer pool (a sketch for verifying these settings follows this list)
* between 22:21 and 22:39 PT last night, we increased one of the slaves from 6G RAM to 8G RAM
* pending job counts climbed past 2000, jobs were not being taken
* catlee found that we weren't reliably updating build_requests here: http://hg.mozilla.org/build/buildbot/file/1a520b4537fb/master/buildbot/db/connector.py#l865
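For the first bullet, a sketch for confirming the buffer pool / key cache sizes on each db host; hostnames and credentials below are placeholders.

  import MySQLdb

  HOSTS = ["buildbot1.example.mozilla.com", "buildbot2.example.mozilla.com"]
  for host in HOSTS:
      conn = MySQLdb.connect(host=host, user="admin", passwd="secret")
      cur = conn.cursor()
      cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                  "('innodb_buffer_pool_size', 'key_buffer_size')")
      for name, value in cur.fetchall():
          # Report each cache size in gigabytes.
          print "%s  %-25s %.1fG" % (host, name, int(value) / 1024.0 ** 3)
      conn.close()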

We decided to revert the db master change, and started that process around 07:30 PT.
We seem to be starting new jobs OK again. I've reopened Try. Doing some quick merges and other cleanup on the other branches, then we can start ramping things up.
Still no sign of jobs on that push. Try re-closed.
OK, I've been given the go-ahead to reopen everything and have done so.
So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and now we're OK ?
(In reply to Nick Thomas [:nthomas] from comment #16)
> So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and
> now we're OK ?

Seems so.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
For lack of a better place: we are seeing a similar spike in the linux64 backlog.

A couple of ideas:

   - masters may have had out-of-sync tools repo revs: since we stood up so many masters last week at different times, they would have cloned tools against tip at different times. There may have been contradictory tools revs that required matching buildbot repo patches (a sketch for comparing revs across masters follows this list)

   - during the reconfig I saw: "[buildbot-master01.bb.releng.use1.mozilla.com] out:     _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'buildbot-rw-*****.mozilla.com' (2)")" so we may have dns or connection issues


    - we are still experiencing db lag
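A sketch for the first idea, comparing the tools repo revision across masters; the hostnames and checkout path below are assumptions, not the real values.

  import subprocess

  MASTERS = [
      "buildbot-master01.bb.releng.use1.mozilla.com",
      "buildbot-master67.bb.releng.use1.mozilla.com",
  ]
  TOOLS = "/builds/buildbot/tools"

  revs = {}
  for host in MASTERS:
      # hg identify -i prints the working copy's revision hash.
      rev = subprocess.check_output(
          ["ssh", host, "hg", "-R", TOOLS, "identify", "-i"]).strip()
      revs[host] = rev
      print "%s  %s" % (host, rev)

  if len(set(revs.values())) > 1:
      print "WARNING: masters are running different tools revisions"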
Assignee: nobody → jlund
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Callek fixed up the regex used to start instances in https://github.com/mozilla/build-cloud-tools/pull/38. There were a lot of jobs for which we were either not launching instances at all (mostly builds) or launching the wrong type of instance (some tests). This should address the comments from sheriffs about laggy builds, as well as the big pending counts for tests; it also explains why running tests never got close to the instance limit.
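The actual change is in the pull request above; purely to illustrate the class of bug, the mapping from pending job names to instance types is roughly of this shape (the names and patterns below are made up, not the code from the PR):

  import re

  RULES = [
      (re.compile(r"^tst-linux64-spot"), "tst-linux64-spot"),
      (re.compile(r"^bld-linux64-spot"), "bld-linux64-spot"),
  ]

  def instance_type_for(name):
      for pattern, instance_type in RULES:
          if pattern.search(name):
              return instance_type
      # A pattern that is too narrow (or plain wrong) means no instance, or
      # the wrong type of instance, is launched for a pending job -- the
      # failure mode described in this bug.
      return None

  print instance_type_for("tst-linux64-spot-123")   # tst-linux64-spot
  print instance_type_for("b-2008-spot-001")        # None: no rule matches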
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard