Closed
Bug 1049227
Opened 10 years ago
Closed 10 years ago
Trees closed due to Windows test backlog
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: KWierso, Unassigned)
References
Details
Attachments
(2 files)
I'll be closing the trunk trees due to a 3-5 hour backlog on starting up Windows test jobs.
Reporter
Comment 1•10 years ago
At 4:30PM, Inbound/Fx-team closed, m-c/b2g-inbound set to approval required
Reporter
Comment 2•10 years ago
Closed try at 2014-08-05T17:12:18
Updated•10 years ago
Severity: normal → blocker
Reporter
Comment 3•10 years ago
Reopened try, but left a note in the "reason" field warning of the Windows test delays.
Comment 4•10 years ago
* we have 4 masters assigned to handle the windows test slaves - bm109 through to bm112, all in scl3.
* bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work
* bm109 through bm112 are all swapping a bit and have been running the buildbot process since May 6, see attached information
* I'm going to try a rolling graceful restart of bm109 to bm111, starting with bm111. Doing this manually rather than via fabric
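(For reference, a rolling restart like the one described above could be sketched as below. This is a dry-run illustration only: the bug does not say how the graceful shutdown was actually triggered on these masters, so the restart step is an explicitly hypothetical placeholder, and the host list is just the masters named in this comment.)

```shell
#!/bin/sh
# Dry-run sketch of a rolling graceful restart of the Windows test
# masters. The actual restart mechanism (web "clean shutdown", a
# wrangler script, etc.) is NOT specified in this bug -- the body of
# restart_master is a placeholder.
DRY_RUN=${DRY_RUN:-1}

restart_master() {
    master="$1"
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would gracefully restart ${master}"
    else
        # Hypothetical placeholder, e.g.:
        #   ssh "${master}" '/path/to/graceful_restart'
        :
    fi
}

# One master at a time, so the remaining masters keep taking work;
# bm112 was only just re-enabled in slavealloc, so it is left up.
for m in bm111 bm110 bm109; do
    restart_master "$m"
done
```

Doing the masters serially is the point of "rolling": at any moment three of the four masters stay available to the Windows test pool.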
Comment 5•10 years ago
bm111 is done, on to bm110.
Comment 6•10 years ago
(In reply to Nick Thomas [:nthomas] from comment #4)
> Created attachment 8468169 [details]
> Master memory use
>
> * we have 4 masters assigned to handle the windows test slaves - bm109
> through to bm112, all in scl3.
> * bm112 wasn't enabled in slavealloc, so I've fixed that and it's taking work

Bug 1005133 c#7 was where this looks like it was disabled.

> * bm109 through bm112 are all swapping a bit and have been running the
> buildbot process since May 6, see attached information
> * I'm going to try a rolling graceful restart of bm109 to bm111, starting
> with bm111. Doing this manually rather than via fabric

Some interesting numbers to correlate to windows jobs taking longer overall, and thus the backlog:

(data taken from https://secure.pub.build.mozilla.org/buildapi/reports/builders --- warning: loading can kill buildapi if you're not very, very careful)

On July 16->July 18:

Job Type:                | Avg Runtime | Min Runtime | Max Runtime
win7-ix opt "unittest"   |       18:35 |        7:13 |       37:48
win7-ix opt "talos"      |       22:58 |       10:28 |     1:02:10
win7-ix debug "unittest" |       32:35 |        8:35 |     1:25:59
win8 opt "unittest"      |       18:51 |        6:28 |     1:01:34
win8 opt "talos"         |       25:14 |       13:31 |     1:02:25
win8 debug "unittest"    |       32:06 |        7:50 |     1:31:26
wXP opt "unittest"       |       16:15 |        3:59 |       38:49
wXP opt "talos"          |       22:15 |       10:57 |     1:03:08
wXP debug "unittest"     |       30:52 |        7:31 |     1:15:12
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest&starttime=1405483200&endtime=1405569600

For ~ the past day (Aug/04->Aug/05):

Job Type:                | Avg Runtime | Min Runtime | Max Runtime
win7-ix opt "unittest"   |       25:28 |        8:16 |     1:07:57
win7-ix opt "talos"      |       27:41 |       10:41 |     1:18:26
win7-ix debug "unittest" |       42:56 |       10:26 |     1:59:19
win8 opt "unittest"      |       24:51 |        8:54 |       46:27
win8 opt "talos"         |       28:25 |       14:12 |       49:52
win8 debug "unittest"    |       41:39 |        8:53 |     1:37:24
wXP opt "unittest"       |       22:18 |        5:35 |     1:10:13
wXP opt "talos"          |       28:18 |       11:48 |       55:55
wXP debug "unittest"     |       40:15 |        8:15 |     1:42:35
--------------------------------------------------------------------------
https://secure.pub.build.mozilla.org/buildapi/reports/builders?detail_level=job_type&platform=win7,win7-ix,win764,win8,xp,xp-ix&build_type=debug,opt&job_type=build,repack,talos,unittest

Now the differences:

Job Type:                | Avg Runtime Difference
win7-ix opt "unittest"   | + 6:53 -> + 37%
win7-ix opt "talos"      | + 4:43 -> + 20%
win7-ix debug "unittest" | +10:21 -> + 31%
win8 opt "unittest"      | + 6:00 -> + 53%
win8 opt "talos"         | + 3:11 -> + 12%
win8 debug "unittest"    | + 9:33 -> + 29%
wXP opt "unittest"       | + 6:03 -> + 61%
wXP opt "talos"          | + 6:03 -> + 45%
wXP debug "unittest"     | + 9:23 -> + 30%
--------------------------------------------------------------------------

To *add* to that mess, nick saw overhead that looked like 30 min or so extra time *in between* jobs:

(from IRC)
<nthomas> currently 2s between slave attach and job start on bm112, saw more than 30m on one of the others
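(The "Avg Runtime Difference" column above is just (new avg - old avg) expressed as a fraction of the old avg. A minimal standalone sketch of that arithmetic follows; it is not part of buildapi, the sample rows are copied from the two snapshots above, and truncation to whole percent is an assumption about the rounding used.)

```python
def to_seconds(t):
    """Parse a 'MM:SS' or 'H:MM:SS' runtime string into seconds."""
    secs = 0
    for part in t.split(":"):
        secs = secs * 60 + int(part)
    return secs

def fmt(secs):
    """Format a number of seconds back into M:SS."""
    return f"{secs // 60}:{secs % 60:02d}"

# Avg runtimes copied from the Jul 16-18 and Aug 04-05 tables above.
before_after = {
    'win7-ix opt "unittest"':   ("18:35", "25:28"),
    'win7-ix debug "unittest"': ("32:35", "42:56"),
    'wXP debug "unittest"':     ("30:52", "40:15"),
}

for job, (old, new) in before_after.items():
    delta = to_seconds(new) - to_seconds(old)
    pct = 100 * delta // to_seconds(old)   # truncate to whole percent
    print(f'{job:26s} | +{fmt(delta)} -> +{pct}%')
```

For win7-ix opt "unittest" this reproduces the +6:53 -> +37% row in the table above.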
Comment 7•10 years ago
bm111 and bm112 are coping well, so I started a graceful shutdown on bm109 too; bm110 is still in progress.
Comment 8•10 years ago
bm110 is done, just waiting on bm109 to finish 50 or so jobs that are still running.
Comment 9•10 years ago
nigelb reopened the trees at this point; we'd cleared all the non-try backlog.
Comment 10•10 years ago
bm109 is done too, so we're finished bar the post-mortem. Here's a graph of its memory usage over time. Restarting after 1 month would be a good idea; definitely no more than 2 months.
Updated•10 years ago
Attachment #8468208 -
Attachment description: Memory trend for bm190, from graphite → Memory trend for bm109, from graphite
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard