Bug 1049227 - Trees closed due to Windows test backlog
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Buildduty
Version: unspecified
Platform: x86_64 Windows 8.1
Importance: -- blocker
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
: Justin Wood (:Callek)
: Chris AtLee [:catlee]
Mentors:
Depends on:
Blocks:
Reported: 2014-08-05 16:29 PDT by Wes Kocher (:KWierso)
Modified: 2014-08-20 14:49 PDT
CC List: 5 users
See Also:
Crash Signature:
Machine State: ---
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Master memory use (1.44 KB, text/plain)
2014-08-05 19:14 PDT, Nick Thomas [:nthomas]
Memory trend for bm109, from graphite (283.30 KB, image/png)
2014-08-05 21:17 PDT, Nick Thomas [:nthomas]

Description Wes Kocher (:KWierso) 2014-08-05 16:29:59 PDT
I'll be closing the trunk trees due to a 3-5 hour backlog on starting up Windows test jobs.
Comment 1 Wes Kocher (:KWierso) 2014-08-05 16:32:47 PDT
At 4:30PM, Inbound/Fx-team closed, m-c/b2g-inbound set to approval required
Comment 2 Wes Kocher (:KWierso) 2014-08-05 17:12:50 PDT
Closed try at 2014-08-05T17:12:18
Comment 3 Wes Kocher (:KWierso) 2014-08-05 18:06:23 PDT
Reopened try, but left a note in the "reason" field warning of the Windows test delays.
Comment 4 Nick Thomas [:nthomas] 2014-08-05 19:14:20 PDT
Created attachment 8468169 [details]
Master memory use

* we have 4 masters assigned to handle the windows test slaves - bm109 through to bm112, all in scl3.
* bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
* bm109 through bm112 are all swapping a bit and have been running the buildbot process since May 6, see attached information
* I'm going to try a rolling graceful restart of bm109 to bm111, starting with bm111. Doing this manually rather than via fabric
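A rolling restart takes the masters down one at a time so the rest keep accepting work. The sketch below only illustrates that ordering, not the procedure actually used, which was done by hand. `graceful_restart` and `is_healthy` are hypothetical callbacks supplied by the caller, not Buildbot or fabric APIs.

```python
def rolling_restart(masters, graceful_restart, is_healthy):
    """Restart masters one at a time, stopping at the first unhealthy one.

    graceful_restart(master) is a hypothetical callback that lets running
    jobs finish and then restarts the buildbot process on that master.
    is_healthy(master) is a hypothetical callback that reports whether the
    master came back and is taking work again.
    Returns the list of masters that were restarted successfully.
    """
    restarted = []
    for master in masters:
        graceful_restart(master)
        if not is_healthy(master):
            # Leave the remaining masters untouched so capacity is not lost.
            break
        restarted.append(master)
    return restarted


# Demo with stand-in callbacks, using the order described above.
log = []
done = rolling_restart(
    ["bm111", "bm110", "bm109"],
    graceful_restart=log.append,
    is_healthy=lambda master: True,
)
print(done)
```

Stopping at the first unhealthy master is a design choice of this sketch: with only four masters in the pool, losing a second one to a failed restart would make the backlog worse.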
Comment 5 Nick Thomas [:nthomas] 2014-08-05 20:02:13 PDT
bm111 is done, on to bm110.
Comment 6 Justin Wood (:Callek) 2014-08-05 20:11:43 PDT
(In reply to Nick Thomas [:nthomas] from comment #4)
> Created attachment 8468169 [details]
> Master memory use
> 
> * we have 4 masters assigned to handle the windows test slaves - bm109
> through to bm112, all in scl3.
> * bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
Bug 1005133 c#7 was where this looks like it was disabled.

> * bm109 through bm112 are all swapping a bit and have been running the
> buildbot process since May 6, see attached information
> * I'm going to try a rolling graceful restart of bm109 to bm111, starting
> with bm111. Doing this manually rather than via fabric


Some interesting numbers that correlate with Windows jobs taking longer overall, and thus with the backlog:

Data taken from https://secure.pub.build.mozilla.org/buildapi/reports/builders (warning: loading this report can kill buildapi if you're not very, very careful).

On July 16->July 17

Job Type:                |   Avg Runtime   | Min  Runtime    | Max Runtime
win7-ix opt "unittest"   |    18:35        |   7:13          |     37:48
win7-ix opt "talos"      |    22:58        |  10:28          |   1:02:10
win7-ix debug "unittest" |    32:35        |   8:35          |   1:25:59
win8 opt "unittest"      |    18:51        |   6:28          |   1:01:34
win8 opt "talos"         |    25:14        |  13:31          |   1:02:25
win8 debug "unittest"    |    32:06        |   7:50          |   1:31:26
wXP  opt "unittest"      |    16:15        |   3:59          |     38:49
wXP  opt "talos"         |    22:15        |  10:57          |   1:03:08
wXP  debug "unittest"    |    30:52        |   7:31          |   1:15:12
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest&starttime=1405483200&endtime=1405569600
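The `starttime` and `endtime` query parameters in the URL above look like Unix epoch seconds. That is an assumption, since the thread does not document the buildapi parameters. Under that assumption, a minimal sketch for decoding the report window:

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# URL copied from the comment above.
REPORT_URL = (
    "https://secure.pub.build.mozilla.org/buildapi/reports/builders"
    "?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix"
    "&build_type=debug,opt&job_type=build,repack,talos,unittest"
    "&starttime=1405483200&endtime=1405569600"
)


def report_window(url):
    """Return (start, end) of the report window as UTC datetimes.

    Assumes starttime/endtime are Unix epoch seconds.
    """
    query = parse_qs(urlparse(url).query)
    start = datetime.fromtimestamp(int(query["starttime"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(query["endtime"][0]), tz=timezone.utc)
    return start, end


start, end = report_window(REPORT_URL)
print(start.isoformat(), "->", end.isoformat(), "| span:", end - start)
```

Read this way, the window is exactly 24 hours, starting at 2014-07-16 04:00 UTC (2014-07-15 21:00 PDT).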


For roughly the past day (Aug 04->Aug 05)

Job Type:                |   Avg Runtime   | Min  Runtime    | Max Runtime
win7-ix opt "unittest"   |    25:28        |   8:16          |   1:07:57
win7-ix opt "talos"      |    27:41        |  10:41          |   1:18:26
win7-ix debug "unittest" |    42:56        |  10:26          |   1:59:19
win8 opt "unittest"      |    24:51        |   8:54          |     46:27
win8 opt "talos"         |    28:25        |  14:12          |     49:52
win8 debug "unittest"    |    41:39        |   8:53          |   1:37:24
wXP  opt "unittest"      |    22:18        |   5:35          |   1:10:13
wXP  opt "talos"         |    28:18        |  11:48          |     55:55
wXP  debug "unittest"    |    40:15        |   8:15          |   1:42:35
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest


Now the differences (later average minus earlier average, percentages truncated):
Job Type:                |   Avg Runtime Difference
win7-ix opt "unittest"   |    + 6:53   ->  + 37%
win7-ix opt "talos"      |    + 4:43   ->  + 20%
win7-ix debug "unittest" |    +10:21   ->  + 31%
win8 opt "unittest"      |    + 6:00   ->  + 31%
win8 opt "talos"         |    + 3:11   ->  + 12%
win8 debug "unittest"    |    + 9:33   ->  + 29%
wXP  opt "unittest"      |    + 6:03   ->  + 37%
wXP  opt "talos"         |    + 6:03   ->  + 27%
wXP  debug "unittest"    |    + 9:23   ->  + 30%
--------------------------------------------------------------------------
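The differences can be recomputed from the "Avg Runtime" columns of the two tables above. This is a small sketch rather than the method used to build the table. The helper names are illustrative, and the percentage is truncated toward zero, which is an assumption about how the figures were rounded.

```python
def to_seconds(clock):
    """Convert 'MM:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_clock(seconds):
    """Format a number of seconds as 'M:SS'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def runtime_change(before, after):
    """Return (absolute difference as 'M:SS', truncated percent increase)."""
    old, new = to_seconds(before), to_seconds(after)
    delta = new - old
    return to_clock(delta), delta * 100 // old


# (July average, August average) per job type, copied from the tables above.
AVG_RUNTIMES = {
    'win7-ix opt "unittest"': ("18:35", "25:28"),
    'win7-ix opt "talos"': ("22:58", "27:41"),
    'win7-ix debug "unittest"': ("32:35", "42:56"),
    'win8 opt "unittest"': ("18:51", "24:51"),
    'win8 opt "talos"': ("25:14", "28:25"),
    'win8 debug "unittest"': ("32:06", "41:39"),
    'wXP opt "unittest"': ("16:15", "22:18"),
    'wXP opt "talos"': ("22:15", "28:18"),
    'wXP debug "unittest"': ("30:52", "40:15"),
}

for job, (before, after) in AVG_RUNTIMES.items():
    diff, percent = runtime_change(before, after)
    print(f"{job:<25}| +{diff:>5} -> +{percent}%")
```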


To *add* to that mess, Nick saw overhead that looked like 30 min or so of extra time *in between* jobs:

(from IRC) <nthomas> currently 2s between slave attach and job start on bm112, saw more than 30m on one of the others
Comment 7 Nick Thomas [:nthomas] 2014-08-05 20:26:49 PDT
bm111 and bm112 are coping well, so started graceful shutdown on bm109 too; bm110 in progress.
Comment 8 Nick Thomas [:nthomas] 2014-08-05 20:53:57 PDT
bm110 is done, just waiting on bm109 to finish 50 or so jobs that are still running.
Comment 9 Nick Thomas [:nthomas] 2014-08-05 20:56:26 PDT
nigelb reopened the trees at this point; we'd cleared all the non-try backlog.
Comment 10 Nick Thomas [:nthomas] 2014-08-05 21:17:22 PDT
Created attachment 8468208 [details]
Memory trend for bm109, from graphite

bm109 is done too, so we're finished bar the post mortem.

Here's a graph of its memory usage over time. Restarting the masters after 1 month of uptime would be a good idea, and definitely after no more than 2 months.
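As a rough illustration of that restart guideline, the sketch below classifies a master by how long its buildbot process has been running. The 30-day and 60-day thresholds are assumptions standing in for "1 month" and "2 months". The May 6 start date comes from comment 4, with the year 2014 assumed.

```python
from datetime import date

RECOMMENDED_DAYS = 30  # assumption: "1 month" taken as 30 days
MAXIMUM_DAYS = 60      # assumption: "2 months" taken as 60 days


def restart_status(started, today):
    """Return (uptime in days, status) for a buildbot process."""
    uptime = (today - started).days
    if uptime > MAXIMUM_DAYS:
        return uptime, "overdue"
    if uptime > RECOMMENDED_DAYS:
        return uptime, "restart recommended"
    return uptime, "ok"


# Processes running since May 6, checked on the day of this bug.
print(restart_status(date(2014, 5, 6), date(2014, 8, 5)))
```

By this measure the masters had been up for 91 days when the backlog hit, well past the suggested upper bound.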
