Bug 1132792 (Closed): Opened 9 years ago, Closed 9 years ago

Large backlog of linux64 tests

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: jlund)

Details

Attachments

(6 files)

Attached image Pending graph for tests
At time of filing: 5000 pending, 300 running and declining. Pending started climbing at 2-3pm Pacific. All trees are closed.

We seem to be launching instances like mad in aws_watch_pending, but the masters aren't handing out work all that fast. I restarted bm67 and it's much improved (lots of idle slaves on /buildslaves?no_builders=1).

We're seeing lines like this in the master twistd.log:
  2015-02-12 20:04:07-0800 [-] Failed to claim everything (got 0, expected 1)
Somewhere between 22:00 and 23:00 the ship started to correct itself.

nthomas noticed that on at least one linux AWS test master, bm67, twistd.log was only showing 'finished prioritization' very sporadically throughout the tree closure, and then picked back up again around 22:40: http://pastebin.mozilla.org/8711253

Normally it should finish at least once every minute.
Some things we noticed along the way:
* this seemed to only affect AWS masters/jobs; the smokeping graphs don't indicate any network issues on the tunnels back to scl3
* we saw a lot of repeated instance launching
 2015-02-12 17:29:40,800 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 17:44:23,159 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 17:59:23,120 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:15:47,217 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:31:17,187 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:42:31,820 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 18:51:44,973 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 19:04:51,406 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
 2015-02-12 19:11:58,927 - Spot request for tst-linux64-spot-123.test.releng.use1.mozilla.com (0.07)
  When I did get on this slave, runner had run fine and buildbot had started, but the slave then sat idle for 45 minutes.
* looking at <master>/buildslaves?no_builders=1 pages, it was common to see tens of idle slaves and only a handful of running jobs; only some 200-400 linux64 tests were running in total despite plenty of pending
* possible db overloading (see the timing sketch after this list)
** this simple (indexed!) query against the buildbot_schedulers db
     select id from buildrequests order by id desc limit 10;
   took at least 0.6s, often multiple seconds, sometimes 8-9s, on both vips: r/w (buildbot1) from bm67, and
   r/o (buildbot2 & 3) from cruncher (~2100 Pacific)
** now that we're recovering it's < 0.1s
* attached is a grep on 'prioritizeBuilders.*finished prioritization' from bm67 in use1, as a proxy measure for how well the master was working. Note that it runs at least every minute until about 14:55, then sporadically until 22:51, then resumes normal operation. By comparison bm103 in scl3 doesn't miss a beat

* gcox and sheeri increased the RAM on buildbot3 to help replication stay in sync, ~1940 Pacific. There may be more info in New Relic on the effect of that. I wonder if buildbot1 needs to be sized a little bigger than 6G RAM to cope with replication to two slaves; it certainly seems to be reading a lot of data off the disk, but color me clueless about db caching
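For reference, this is roughly the latency check from the db-overloading bullet above, expressed as a small script. Hostnames, credentials and user are placeholders, not the real production values.

  # Time the simple indexed buildrequests query against both VIPs.
  import time
  import MySQLdb   # MySQL-python, the same driver buildbot uses

  VIPS = [
      "buildbot-rw-vip.example.mozilla.com",   # r/w (buildbot1)
      "buildbot-ro-vip.example.mozilla.com",   # r/o (buildbot2 & 3)
  ]

  for host in VIPS:
      conn = MySQLdb.connect(host=host, user="readonly", passwd="secret",
                             db="buildbot_schedulers")
      cur = conn.cursor()
      start = time.time()
      cur.execute("SELECT id FROM buildrequests ORDER BY id DESC LIMIT 10")
      cur.fetchall()
      # This indexed query should be well under 0.1s on a healthy db; during
      # the closure it was taking 0.6s to 8-9s.
      print "%-40s %.2fs" % (host, time.time() - start)
      conn.close()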
Lowering to major as we're back in business, rest of trees reopened. Leaving open for investigation/comment/monitoring.
Severity: blocker → major
(In reply to Nick Thomas [:nthomas] from comment #4)
> Lowering to major as we're back in business, rest of trees reopened. Leaving
> open for investigation/comment/monitoring.

and we are closed again; the backlog is building up again :(
Severity: major → blocker
Currently 2400 pending linux64 tests, only 519 running and on the way down. Seeing idle slaves on bm67 that are not being given work. From twistd.log again:
2015-02-13 01:10:46-0800 [-] prioritizeBuilders: 0.38s finished prioritization
2015-02-13 01:10:52-0800 [-] prioritizeBuilders: 0.55s finished prioritization
2015-02-13 01:10:57-0800 [-] prioritizeBuilders: 0.36s finished prioritization
2015-02-13 01:17:56-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:19:39-0800 [-] prioritizeBuilders: 1.01s finished prioritization
2015-02-13 01:22:57-0800 [-] prioritizeBuilders: 1.98s finished prioritization
2015-02-13 01:24:06-0800 [-] prioritizeBuilders: 0.96s finished prioritization
2015-02-13 01:24:22-0800 [-] prioritizeBuilders: 0.48s finished prioritization
2015-02-13 01:24:35-0800 [-] prioritizeBuilders: 0.58s finished prioritization
2015-02-13 01:24:44-0800 [-] prioritizeBuilders: 0.69s finished prioritization
2015-02-13 01:24:58-0800 [-] prioritizeBuilders: 0.51s finished prioritization
2015-02-13 01:25:16-0800 [-] prioritizeBuilders: 0.54s finished prioritization
2015-02-13 01:25:33-0800 [-] prioritizeBuilders: 3.79s finished prioritization
2015-02-13 01:25:48-0800 [-] prioritizeBuilders: 3.04s finished prioritization
2015-02-13 01:31:34-0800 [-] prioritizeBuilders: 1.25s finished prioritization
2015-02-13 01:41:59-0800 [-] prioritizeBuilders: 0.95s finished prioritization
2015-02-13 01:46:11-0800 [-] prioritizeBuilders: 4.27s finished prioritization
2015-02-13 02:02:04-0800 [-] prioritizeBuilders: 2.39s finished prioritization
2015-02-13 02:07:51-0800 [-] prioritizeBuilders: 2.18s finished prioritization
2015-02-13 02:13:03-0800 [-] prioritizeBuilders: 2.34s finished prioritization
2015-02-13 02:17:57-0800 [-] prioritizeBuilders: 1.39s finished prioritization
2015-02-13 02:35:17-0800 [-] prioritizeBuilders: 4.63s finished prioritization
2015-02-13 02:41:02-0800 [-] prioritizeBuilders: 1.05s finished prioritization
2015-02-13 02:46:51-0800 [-] prioritizeBuilders: 1.03s finished prioritization
2015-02-13 02:49:58-0800 [-] prioritizeBuilders: 1.78s finished prioritization

So bm67 was fine until ~01:10, then laggier than it should be. (FWIW, prioritizeBuilders runs when deciding what work should be started.) In New Relic, the I/O is low until about the same time, then returns to the high levels we had during the last tree closure.

So we need to find out what the queries are and whether they are unusual, or whether something else is not optimal. If not, switching to the lower-spec buildbot1 for r/w access seems to be causing problems.
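A minimal sketch of how the gaps in the prioritization log above can be spotted: scan twistd.log for 'finished prioritization' lines and report any interval over a minute. The log filename and the 60s threshold are assumptions.

  import re
  from datetime import datetime

  TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
  prev = None
  for line in open("twistd.log"):
      if "finished prioritization" not in line:
          continue
      m = TS.match(line)
      if not m:
          continue
      ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
      if prev is not None:
          gap = int((ts - prev).total_seconds())
          if gap > 60:
              # prioritizeBuilders normally finishes at least once a minute.
              print "%ds gap ending at %s" % (gap, ts)
      prev = ts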
Running 'show processlist' on buildbot-rw-vip yields many tens of threads in 'Waiting for table level lock', and usually one or two in 'Sending data' for queries like:
 SELECT br.id FROM buildrequests AS br, buildsets AS bs WHERE br.buildername='Windows 7 32-bit fx-tea....

It changes a bit as you run processlist a few times, but that's the basic behaviour. It doesn't matter where the buildbot master is; e.g. bm118 in scl3 also sees laggy prioritizeBuilders. A quick way to summarize the thread states is sketched below.
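A sketch only; the host and credentials are placeholders. It counts thread states from SHOW FULL PROCESSLIST rather than eyeballing the raw output.

  from collections import Counter
  import MySQLdb

  conn = MySQLdb.connect(host="buildbot-rw-vip.example.mozilla.com",
                         user="admin", passwd="secret")
  cur = conn.cursor()
  cur.execute("SHOW FULL PROCESSLIST")
  # Columns: Id, User, Host, db, Command, Time, State, Info
  states = Counter(row[6] or "(none)" for row in cur.fetchall())
  for state, count in states.most_common():
      print "%4d  %s" % (count, state)
  # During the incident this showed tens of threads in
  # 'Waiting for table level lock' and one or two in 'Sending data'.
  conn.close()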
prioritizeBuilders is running regularly again and there's no DB locking; pending is on its way down and running is going up. We still need to resolve the underlying issue, though. Time for a DBA, I think.
paged coop to get help here and hopefully some open trees at some point.
(In reply to Carsten Book [:Tomcat] from comment #9)
> paged coop to get help here and hopefully some open trees at some point.

and now also catlee 

04:42 #buildduty: < nagios-releng> Tomcat|sheriffduty: catlee has been paged with the message "bug 1132792 - all trees closed again - seems we need dba/buildbot help(Tomcat|sheriffduty)"
To sum up the discussions from #releng this morning:

* yesterday we failed over the db master from a 24G RAM server with a 4G innodb buffer pool and a 512M myisam key cache, to a 6G RAM server with a 4G myisam key cache and a 512M innodb buffer pool (a sketch for verifying these settings follows this list)
* between 22:21 and 22:39 PT last night, we increased one of the slaves from 6G RAM to 8G RAM
* pending job counts climbed past 2000, jobs were not being taken
* catlee found that we weren't reliably updating build_requests here: http://hg.mozilla.org/build/buildbot/file/1a520b4537fb/master/buildbot/db/connector.py#l865
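For the first bullet, a sketch for confirming the buffer pool / key cache sizes on each db host; hostnames and credentials below are placeholders.

  import MySQLdb

  HOSTS = ["buildbot1.example.mozilla.com", "buildbot2.example.mozilla.com"]
  for host in HOSTS:
      conn = MySQLdb.connect(host=host, user="admin", passwd="secret")
      cur = conn.cursor()
      cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                  "('innodb_buffer_pool_size', 'key_buffer_size')")
      for name, value in cur.fetchall():
          # Report each cache size in gigabytes.
          print "%s  %-25s %.1fG" % (host, name, int(value) / 1024.0 ** 3)
      conn.close()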

We decided to revert the db master change, and started that process around 07:30 PT.
We seem to be starting new jobs OK again. I've reopened Try. Doing some quick merges and other cleanup on the other branches, then we can start ramping things up.
Still no sign of jobs on that push. Try re-closed.
OK, I've been given the go-ahead to reopen everything and have done so.
So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and now we're OK ?
(In reply to Nick Thomas [:nthomas] from comment #16)
> So we swapped buildbot-rw-vip back to buildbot2, then hit a few bumps, and
> now we're OK ?

Seems so.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
For lack of a better place: we are seeing a similar spike in the linux64 backlog.

A couple of ideas:

   - masters may have had out-of-sync tools repo revs: since we stood up so many masters last week at different times, they would have cloned tools against tip at different times. There may have been contradictory tools revs that required matching buildbot repo patches (a sketch for comparing revs across masters follows this list)

   - during the reconfig I saw: "[buildbot-master01.bb.releng.use1.mozilla.com] out:     _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'buildbot-rw-*****.mozilla.com' (2)")" so we may have dns or connection issues


    - we are still experiencing db lag
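A sketch for the first idea, comparing the tools repo revision across masters; the hostnames and checkout path below are assumptions, not the real values.

  import subprocess

  MASTERS = [
      "buildbot-master01.bb.releng.use1.mozilla.com",
      "buildbot-master67.bb.releng.use1.mozilla.com",
  ]
  TOOLS = "/builds/buildbot/tools"

  revs = {}
  for host in MASTERS:
      # hg identify -i prints the working copy's revision hash.
      rev = subprocess.check_output(
          ["ssh", host, "hg", "-R", TOOLS, "identify", "-i"]).strip()
      revs[host] = rev
      print "%s  %s" % (host, rev)

  if len(set(revs.values())) > 1:
      print "WARNING: masters are running different tools revisions"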
Assignee: nobody → jlund
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Callek fixed up the regex used to start instances in https://github.com/mozilla/build-cloud-tools/pull/38. There were a lot of jobs for which we were either not launching instances at all (mostly builds) or launching the wrong type of instance (some tests). This should address the comments from sheriffs about laggy builds, as well as the big pending counts for tests; it also explains why running tests never got close to the instance limit.
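The actual change is in the pull request above; purely to illustrate the class of bug, the mapping from pending job names to instance types is roughly of this shape (the names and patterns below are made up, not the code from the PR):

  import re

  RULES = [
      (re.compile(r"^tst-linux64-spot"), "tst-linux64-spot"),
      (re.compile(r"^bld-linux64-spot"), "bld-linux64-spot"),
  ]

  def instance_type_for(name):
      for pattern, instance_type in RULES:
          if pattern.search(name):
              return instance_type
      # A pattern that is too narrow (or plain wrong) means no instance, or
      # the wrong type of instance, is launched for a pending job -- the
      # failure mode described in this bug.
      return None

  print instance_type_for("tst-linux64-spot-123")   # tst-linux64-spot
  print instance_type_for("b-2008-spot-001")        # None: no rule matches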
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard