Trees closed due to Windows test backlog

RESOLVED FIXED

Status

Release Engineering
Buildduty
--
blocker
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: KWierso, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Reporter)

Description

3 years ago
I'll be closing the trunk trees due to a 3-5 hour backlog on starting up Windows test jobs.
(Reporter)

Comment 1

3 years ago
At 4:30PM, Inbound/Fx-team closed, m-c/b2g-inbound set to approval required
(Reporter)

Comment 2

3 years ago
Closed try at 2014-08-05T17:12:18
Severity: normal → blocker
(Reporter)

Comment 3

3 years ago
Reopened try, but left a note in the "reason" field warning of the Windows test delays.
Created attachment 8468169 [details]
Master memory use

* we have 4 masters assigned to handle the windows test slaves - bm109 through to bm112, all in scl3.
* bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
* bm109 through bm112 are all swapping a bit and have been running the buildbot process since May 6, see attached information
* I'm going to try a rolling graceful restart of bm109 to bm111, starting with bm111. Doing this manually rather than via fabric
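
The rolling graceful restart described above can be sketched roughly as follows. This is only an illustration of the idea (drain one master, restart it, then move to the next), not the commands that were actually run. The three callables `start_graceful_shutdown`, `running_jobs`, and `start_master` are hypothetical placeholders for whatever tooling is used, and the sketch is strictly serial.

```python
import time
from typing import Callable, Iterable, List


def rolling_graceful_restart(
    masters: Iterable[str],
    start_graceful_shutdown: Callable[[str], None],  # hypothetical hook
    running_jobs: Callable[[str], int],              # hypothetical hook
    start_master: Callable[[str], None],             # hypothetical hook
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart masters one at a time, letting each finish its running jobs."""
    restarted = []
    for master in masters:
        # Stop the master from accepting new work.
        start_graceful_shutdown(master)
        # Wait until the jobs it already started have finished.
        while running_jobs(master) > 0:
            sleep(poll_seconds)
        # Bring the master back before touching the next one.
        start_master(master)
        restarted.append(master)
    return restarted
```

Restarting one master at a time keeps the others available to take work while a master drains.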
bm111 is done, on to bm110.
(In reply to Nick Thomas [:nthomas] from comment #4)
> Created attachment 8468169 [details]
> Master memory use
> 
> * we have 4 masters assigned to handle the windows test slaves - bm109
> through to bm112, all in scl3.
> * bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
Bug 1005133 c#7 was where this looks like it was disabled.

> * bm109 through bm112 are all swapping a bit and have been running the
> buildbot process since May 6, see attached information
> * I'm going to try a rolling graceful restart of bm109 to bm111, starting
> with bm111. Doing this manually rather than via fabric


Some interesting numbers to correlate to windows jobs taking longer overall, and thus the backlog:

(data taken from https://secure.pub.build.mozilla.org/buildapi/reports/builders; warning: loading that report can kill buildapi if you're not very careful)

On July 16->July 18

Job Type:                |   Avg Runtime   | Min  Runtime    | Max Runtime
win7-ix opt "unittest"   |    18:35        |   7:13          |     37:48
win7-ix opt "talos"      |    22:58        |  10:28          |   1:02:10
win7-ix debug "unittest" |    32:35        |   8:35          |   1:25:59
win8 opt "unittest"      |    18:51        |   6:28          |   1:01:34
win8 opt "talos"         |    25:14        |  13:31          |   1:02:25
win8 debug "unittest"    |    32:06        |   7:50          |   1:31:26
wXP  opt "unittest"      |    16:15        |   3:59          |     38:49
wXP  opt "talos"         |    22:15        |  10:57          |   1:03:08
wXP  debug "unittest"    |    30:52        |   7:31          |   1:15:12
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest&starttime=1405483200&endtime=1405569600
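
For reference, the `starttime` and `endtime` values in the URL above look like Unix timestamps. A minimal sketch for decoding them, assuming they are seconds since the epoch in UTC (the `report_window` helper is made up for illustration and is not part of buildapi):

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

REPORT_URL = (
    "https://secure.pub.build.mozilla.org/buildapi/reports/builders"
    "?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix"
    "&build_type=debug,opt&job_type=build,repack,talos,unittest"
    "&starttime=1405483200&endtime=1405569600"
)


def report_window(url):
    """Return (start, end) as UTC datetimes parsed from the report URL."""
    query = parse_qs(urlparse(url).query)
    # Assumption: both parameters are epoch seconds.
    start = datetime.fromtimestamp(int(query["starttime"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(query["endtime"][0]), tz=timezone.utc)
    return start, end


start, end = report_window(REPORT_URL)
print(start.isoformat(), "->", end.isoformat(), "span:", end - start)
```

Under that assumption the linked report covers a 24-hour window starting 2014-07-16 04:00 UTC.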


For ~ the past day (Aug/04->Aug/05)

Job Type:                |   Avg Runtime   | Min  Runtime    | Max Runtime
win7-ix opt "unittest"   |    25:28        |   8:16          |   1:07:57
win7-ix opt "talos"      |    27:41        |  10:41          |   1:18:26
win7-ix debug "unittest" |    42:56        |  10:26          |   1:59:19
win8 opt "unittest"      |    24:51        |   8:54          |     46:27
win8 opt "talos"         |    28:25        |  14:12          |     49:52
win8 debug "unittest"    |    41:39        |   8:53          |   1:37:24
wXP  opt "unittest"      |    22:18        |   5:35          |   1:10:13
wXP  opt "talos"         |    28:18        |  11:48          |     55:55
wXP  debug "unittest"    |    40:15        |   8:15          |   1:42:35
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest


Now the differences
Job Type:                |   Avg Runtime Difference
win7-ix opt "unittest"   |    + 6:53   ->  + 37%
win7-ix opt "talos"      |    + 4:43   ->  + 20%
win7-ix debug "unittest" |    +10:21   ->  + 31%
win8 opt "unittest"      |    + 6:00   ->  + 31%
win8 opt "talos"         |    + 3:11   ->  + 12%
win8 debug "unittest"    |    + 9:33   ->  + 29%
wXP  opt "unittest"      |    + 6:03   ->  + 37%
wXP  opt "talos"         |    + 6:03   ->  + 27%
wXP  debug "unittest"    |    + 9:23   ->  + 30%
--------------------------------------------------------------------------
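
The differences can be recomputed from the two average-runtime columns above. A small sketch, assuming each percentage is the change in average runtime relative to the July average (the helper names are only for illustration):

```python
def to_seconds(hms):
    """Convert 'M:SS' or 'H:MM:SS' to a number of seconds."""
    secs = 0
    for part in hms.split(":"):
        secs = secs * 60 + int(part)
    return secs


def fmt(secs):
    """Format seconds as M:SS."""
    minutes, seconds = divmod(secs, 60)
    return f"{minutes}:{seconds:02d}"


# Average runtimes copied from the two tables above.
JULY = {
    "win7-ix opt unittest": "18:35", "win7-ix opt talos": "22:58",
    "win7-ix debug unittest": "32:35", "win8 opt unittest": "18:51",
    "win8 opt talos": "25:14", "win8 debug unittest": "32:06",
    "wXP opt unittest": "16:15", "wXP opt talos": "22:15",
    "wXP debug unittest": "30:52",
}
AUGUST = {
    "win7-ix opt unittest": "25:28", "win7-ix opt talos": "27:41",
    "win7-ix debug unittest": "42:56", "win8 opt unittest": "24:51",
    "win8 opt talos": "28:25", "win8 debug unittest": "41:39",
    "wXP opt unittest": "22:18", "wXP opt talos": "28:18",
    "wXP debug unittest": "40:15",
}


def runtime_deltas(before, after):
    """Map job type -> (delta in seconds, percent change vs. before)."""
    deltas = {}
    for job, old in before.items():
        old_s = to_seconds(old)
        delta = to_seconds(after[job]) - old_s
        deltas[job] = (delta, 100.0 * delta / old_s)
    return deltas


for job, (delta, pct) in runtime_deltas(JULY, AUGUST).items():
    print(f"{job:<24} | +{fmt(delta):>5} -> +{pct:.1f}%")
```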


To *add* to that mess, Nick saw what looked like 30 minutes or so of extra overhead *in between* jobs:

(from IRC) <nthomas> currently 2s between slave attach and job start on bm112, saw more than 30m on one of the others
bm111 and bm112 are coping well, so started graceful shutdown on bm109 too; bm110 in progress.
bm110 is done, just waiting on bm109 to finish 50 or so jobs that are still running.
nigelb reopened the trees at this point, we'd cleared all the non-try backlog.
Created attachment 8468208 [details]
Memory trend for bm109, from graphite

bm109 is done too, so we're finished bar the post mortem.

Here's a graph of its memory usage over time. Restarting after 1 month would be a good idea, definitely no more than 2 months.

Updated

3 years ago
Attachment #8468208 - Attachment description: Memory trend for bm190, from graphite → Memory trend for bm109, from graphite

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
See Also: → bug 1050594

Updated

3 years ago
See Also: → bug 1056348