Closed Bug 1049227 Opened 10 years ago Closed 10 years ago

Trees closed due to Windows test backlog

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Windows 8.1
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: KWierso, Unassigned)

Attachments

(2 files)

I'll be closing the trunk trees due to a 3-5 hour backlog on starting up Windows test jobs.
At 4:30PM, Inbound/Fx-team closed, m-c/b2g-inbound set to approval required
Closed try at 2014-08-05T17:12:18
Severity: normal → blocker
Reopened try, but left a note in the "reason" field warning of the Windows test delays.
Attached file Master memory use
* we have 4 masters assigned to handle the windows test slaves - bm109 through to bm112, all in scl3.
* bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
* bm109 through bm112 are all swapping a bit and have been running the buildbot process since May 6, see attached information
* I'm going to try a rolling graceful restart of bm109 to bm111, starting with bm111. Doing this manually rather than via fabric
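
A rolling graceful restart like the one described in the last bullet can be sketched as a loop that drains and restarts one master at a time. This is only an illustration: the three callables (`request_graceful_shutdown`, `running_jobs`, `start_master`) are hypothetical placeholders for whatever is actually done by hand or via fabric, and are not buildbot APIs.

```python
import time
from typing import Callable, Iterable, List


def rolling_graceful_restart(
    masters: Iterable[str],
    request_graceful_shutdown: Callable[[str], None],  # hypothetical hook
    running_jobs: Callable[[str], int],                # hypothetical hook
    start_master: Callable[[str], None],               # hypothetical hook
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart masters strictly one at a time, draining each one first."""
    restarted = []
    for master in masters:
        # Ask the master to stop taking new jobs.
        request_graceful_shutdown(master)
        # Wait until its in-flight jobs have finished.
        while running_jobs(master) > 0:
            sleep(poll_seconds)
        # Only then bring it back, and move on to the next master.
        start_master(master)
        restarted.append(master)
    return restarted
```

Processing the masters sequentially keeps the rest of the pool taking work while one drains; overlapping two drains is faster but leaves fewer masters available.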
bm111 is done, on to bm110.
(In reply to Nick Thomas [:nthomas] from comment #4)
> Created attachment 8468169 [details]
> Master memory use
> 
> * we have 4 masters assigned to handle the windows test slaves - bm109
> through to bm112, all in scl3.
> * bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
It looks like this was disabled in Bug 1005133 c#7.

> * bm109 through bm112 are all swapping a bit and have been running the
> buildbot process since May 6, see attached information
> * I'm going to try a rolling graceful restart of bm109 to bm111, starting
> with bm111. Doing this manually rather than via fabric


Some interesting numbers to correlate to windows jobs taking longer overall, and thus the backlog:

(data taken from https://secure.pub.build.mozilla.org/buildapi/reports/builders --- warning: loading this can kill buildapi if you're not very, very careful)

On July 16->July 18

Job Type:                |   Avg Runtime   |  Min Runtime    | Max Runtime
win7-ix opt "unittest"   |    18:35        |   7:13          |     37:48
win7-ix opt "talos"      |    22:58        |  10:28          |   1:02:10
win7-ix debug "unittest" |    32:35        |   8:35          |   1:25:59
win8 opt "unittest"      |    18:51        |   6:28          |   1:01:34
win8 opt "talos"         |    25:14        |  13:31          |   1:02:25
win8 debug "unittest"    |    32:06        |   7:50          |   1:31:26
wXP  opt "unittest"      |    16:15        |   3:59          |     38:49
wXP  opt "talos"         |    22:15        |  10:57          |   1:03:08
wXP  debug "unittest"    |    30:52        |   7:31          |   1:15:12
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest&starttime=1405483200&endtime=1405569600
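
For anyone wanting to pull the same report for a different window, here is a minimal sketch of how a URL like the one above can be assembled. The query parameter names and values are copied from that URL. The function name `build_report_url` is made up for illustration, and the code only builds the string; it does not fetch anything (see the warning above about loading the report).

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://secure.pub.build.mozilla.org/buildapi/reports/builders"


def build_report_url(start: datetime, end: datetime) -> str:
    """Build a report URL for the window [start, end).

    Both datetimes must be timezone-aware; they are sent as Unix epoch seconds.
    """
    params = {
        "detail_level": "job_type",
        "platform": "win7,win7-ix,win764,win8,xp,xp-ix",
        "build_type": "debug,opt",
        "job_type": "build,repack,talos,unittest",
        "starttime": int(start.timestamp()),
        "endtime": int(end.timestamp()),
    }
    # safe="," keeps the comma-separated lists readable instead of %2C.
    return BASE_URL + "?" + urlencode(params, safe=",")


url = build_report_url(
    datetime(2014, 7, 16, 4, 0, tzinfo=timezone.utc),
    datetime(2014, 7, 17, 4, 0, tzinfo=timezone.utc),
)
```

The two timestamps in the example correspond to the `starttime` and `endtime` values in the URL above.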


For ~ the past day (Aug/04->Aug/05)

Job Type:                |   Avg Runtime   |  Min Runtime    | Max Runtime
win7-ix opt "unittest"   |    25:28        |   8:16          |   1:07:57
win7-ix opt "talos"      |    27:41        |  10:41          |   1:18:26
win7-ix debug "unittest" |    42:56        |  10:26          |   1:59:19
win8 opt "unittest"      |    24:51        |   8:54          |     46:27
win8 opt "talos"         |    28:25        |  14:12          |     49:52
win8 debug "unittest"    |    41:39        |   8:53          |   1:37:24
wXP  opt "unittest"      |    22:18        |   5:35          |   1:10:13
wXP  opt "talos"         |    28:18        |  11:48          |     55:55
wXP  debug "unittest"    |    40:15        |   8:15          |   1:42:35
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest


Now the differences
Job Type:                |   Avg Runtime Difference
win7-ix opt "unittest"   |    + 6:53   ->  + 37%
win7-ix opt "talos"      |    + 4:43   ->  + 20%
win7-ix debug "unittest" |    +10:21   ->  + 31%
win8 opt "unittest"      |    + 6:00   ->  + 31%
win8 opt "talos"         |    + 3:11   ->  + 12%
win8 debug "unittest"    |    + 9:33   ->  + 29%
wXP  opt "unittest"      |    + 6:03   ->  + 37%
wXP  opt "talos"         |    + 6:03   ->  + 27%
wXP  debug "unittest"    |    + 9:23   ->  + 30%
--------------------------------------------------------------------------
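
The difference column can be recomputed from the two "Avg Runtime" columns. The sketch below assumes the percentage is the increase relative to the July average and that the newer runtime is not shorter than the older one; the helper names are made up for illustration.

```python
from typing import Tuple


def to_seconds(runtime: str) -> int:
    """Convert 'MM:SS' or 'H:MM:SS' to seconds."""
    seconds = 0
    for part in runtime.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def format_mmss(seconds: int) -> str:
    """Format a non-negative number of seconds as 'M:SS'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def runtime_change(before: str, after: str) -> Tuple[str, float]:
    """Return (absolute difference as M:SS, percent increase over `before`)."""
    before_s = to_seconds(before)
    delta = to_seconds(after) - before_s
    return format_mmss(delta), 100.0 * delta / before_s


# win7-ix opt "unittest": 18:35 in July vs 25:28 in August
diff, pct = runtime_change("18:35", "25:28")
print(diff, f"{pct:.1f}%")  # 6:53 and roughly 37%
```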


To *add* to that mess, Nick saw overhead of what looked like 30 minutes or so of extra time *in between* jobs:

(from IRC) <nthomas> currently 2s between slave attach and job start on bm112, saw more than 30m on one of the others
bm111 and bm112 are coping well, so started graceful shutdown on bm109 too; bm110 in progress.
bm110 is done, just waiting on bm109 to finish 50 or so jobs that are still running.
nigelb reopened the trees at this point, we'd cleared all the non-try backlog.
bm109 is done too, so we're finished bar the post mortem.

Here's a graph of bm109's memory usage over time. Restarting after 1 month would be a good idea, and definitely after no more than 2 months.
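
As a rough illustration of that restart policy, a check along these lines could flag masters whose process has been running too long. The 60-day limit follows the "no more than 2 months" suggestion; the function names are hypothetical, and the year 2014 for the May 6 start date is an assumption based on the dates elsewhere in this bug.

```python
from datetime import date


def process_age_days(started: date, today: date) -> int:
    """Number of whole days the process has been running."""
    return (today - started).days


def needs_restart(started: date, today: date, max_age_days: int = 60) -> bool:
    """True when the process is older than the allowed maximum age."""
    return process_age_days(started, today) > max_age_days


# Masters running since May 6 (assumed 2014), checked on 2014-08-05.
age = process_age_days(date(2014, 5, 6), date(2014, 8, 5))
print(age, needs_restart(date(2014, 5, 6), date(2014, 8, 5)))  # 91 True
```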
Attachment #8468208 - Attachment description: Memory trend for bm190, from graphite → Memory trend for bm109, from graphite
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
See Also: → 1050594
See Also: → 1056348
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard