Windows builders can't keep up with mozilla-inbound load

RESOLVED FIXED

Status

5 years ago
7 months ago

People

(Reporter: justin.lebar+bug, Unassigned)

Tracking

Details

mozilla-inbound has been closed a lot lately because Windows builders can't keep up with load.

Any developer who caused inbound to be closed a few times a week for a few hours would have sheriffs banging on his door, so I hope release engineering will consider this a high-priority issue.

I know someone has a calendar with the reasons for tree closures; that would be a nice datapoint to have here.

RyanVM says that the tree closures have been happening for different reasons ("usually l10n," but not always), so let's use this as a metabug tracking progress on the larger problem.
jlebar: can you give specific time range when this last happened? That'd help us figure out what might be going on.

fyi: There's work already in progress for moving the win32 l10n repack jobs from win32 to win64, but I'd not heard of that being a tree-closing event... we do those nightly repacks every night!
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: armenzg
Aside from the tree being closed now, I don't recall the exact date of the last time the tree was closed because of Windows load.  I think it was twice last week, but that's just a guess.  We need to find this calendar of tree closures I saw.

I'm actually surprised you haven't seen the calendar, because it contains a lot of information about how and when our infrastructure is failing to keep up with load!

Comment 3

5 years ago
(In reply to John O'Duinn [:joduinn] from comment #1)
> 
> fyi: There's work already in progress for moving the win32 l10n repack jobs
> from win32 to win64, but I'd not heard that being a tree-closing event... we
> do those nightly repacks every night!

Moving l10n repacks from win32 to win64 actually makes things worse, since developer builds are done in the win64 pool.
In fact, that work has already been done.

On the other hand, we have a bunch of iX Linux machines being reimaged as Win64 build machines. That would improve our situation.
Depends on: 894398
Yes, although I saw someone visualize this in a calendar-like thing.

Note that "bustage" doesn't necessarily mean infra is off the hook -- bustage often closes the tree because coalescing and/or the test backlog means that we're not getting the tests we need in time.
The full logs for closures can be fetched from https://treestatus.mozilla.org/mozilla-inbound/logs if you want to do some number crunching.
(In reply to Ted Mielczarek [:ted.mielczarek] (post-vacation backlog) from comment #4)
> Presumably this is the record you want?
> https://treestatus.mozilla.org/mozilla-inbound

As far as I can tell, this shows "only" 3 tree closures caused by Windows builder backlog, specifically:

...
2013-07-29 09:45:52 open 		
2013-07-29 08:03:59 closed Waiting for Windows builders to catch up 	backlog 
...
2013-07-22 15:25:34 open 
2013-07-22 15:25:31 closed jit-test orange 	
2013-07-22 12:47:23 closed Windows build backlog 	backlog 
...
2013-07-19 07:07:59 open 		
2013-07-19 07:05:26 closed Waiting for Windows builders to catch up 	backlog 
...


Obviously, three closures over a two-week period is worse than zero closures, which is what we'd all want here. But to be fair, I note that in the same period there were a total of 59 closures of mozilla-inbound. Of those 59 closures, 3 were because of Windows builder backlog, and 1 more was for an unspecified machine backlog.
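
For anyone who wants to redo this count from the treestatus logs, here is a minimal Python sketch. It assumes the log text has the same shape as the excerpt above (timestamp, state, free-text reason, then an optional tab-separated tag such as "backlog"); the function names and the SAMPLE string are hypothetical, and the real page may need different parsing.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "<date> <time> <open|closed> <reason> \t<tag>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(open|closed)\b(.*)$")

def parse_closures(log_text):
    """Return (timestamp, reason, tag) for every 'closed' line."""
    closures = []
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m or m.group(2) != "closed":
            continue  # skip 'open' lines, '...' separators, blanks
        fields = [f.strip() for f in m.group(3).split("\t") if f.strip()]
        reason = fields[0] if fields else ""
        tag = fields[1] if len(fields) > 1 else ""
        closures.append((m.group(1), reason, tag))
    return closures

def windows_backlog(closures):
    """Keep closures tagged 'backlog' whose reason mentions Windows."""
    return [c for c in closures if c[2] == "backlog" and "windows" in c[1].lower()]

# The lines quoted above, with tabs written out explicitly.
SAMPLE = (
    "2013-07-29 09:45:52 open \t\t\n"
    "2013-07-29 08:03:59 closed Waiting for Windows builders to catch up \tbacklog \n"
    "2013-07-22 15:25:34 open \n"
    "2013-07-22 15:25:31 closed jit-test orange \t\n"
    "2013-07-22 12:47:23 closed Windows build backlog \tbacklog \n"
    "2013-07-19 07:07:59 open \t\t\n"
    "2013-07-19 07:05:26 closed Waiting for Windows builders to catch up \tbacklog \n"
)

closures = parse_closures(SAMPLE)
print(len(closures), "closures,", len(windows_backlog(closures)), "for Windows backlog")
```

Filtering on the tag plus the word "Windows" in the reason is only a heuristic; closures that were caused by backlog but logged under another reason (for example bustage) would not be counted.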

From our wait-times reports in dev.tree-management, here are the wait times across the day on those dates. Note: 1st col == wait time in minutes (start of a 15-minute bucket); 2nd col == number of jobs started in that bucket; 3rd col == % of that day's jobs for that OS. For example, on 29jul, 237 win2k3 builds (93.68% of the day's builds for that OS) started within 0-14 minutes:

29jul: 
======
win2k3: 253
  0:      237    93.68%
 15:        8     3.16%
 30:        2     0.79%
 45:        0     0.00%
 60:        2     0.79%
 75:        0     0.00%
 90+:        4     1.58%
win64: 7
  0:        7   100.00%


22jul:
======
win2k3: 176
  0:      141    80.11%
 15:        9     5.11%
 30:        9     5.11%
 45:        8     4.55%
 60:        0     0.00%
 75:        3     1.70%
 90+:        6     3.41%
win64: 10
  0:        6    60.00%
 15:        1    10.00%
 30:        1    10.00%
 45:        1    10.00%
 60:        0     0.00%
 75:        0     0.00%
 90+:        1    10.00%


19jul:
======
win2k3: 210
  0:      207    98.57%
 15:        3     1.43%
win64: 9
  0:        9   100.00%
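
To make the column layout concrete, here is a small Python sketch of how such a table could be derived from raw per-job wait times. This is an illustration under assumptions, not how the dev.tree-management report is actually generated: the function name, the 15-minute bucket width, and the 90+ cap are inferred from the tables above.

```python
from collections import Counter

def wait_histogram(wait_minutes, bucket=15, cap=90):
    """Bucket wait times (in minutes) into rows of (label, jobs, percent)."""
    total = len(wait_minutes)
    if total == 0:
        return []
    # Floor each wait to the start of its bucket; everything >= cap is "90+".
    counts = Counter(min(int(w) // bucket * bucket, cap) for w in wait_minutes)
    rows = []
    for start in range(0, cap + 1, bucket):
        n = counts.get(start, 0)
        label = f"{start}+" if start == cap else str(start)
        rows.append((label, n, round(100.0 * n / total, 2)))
    return rows

# Synthetic waits chosen to reproduce the 29jul win2k3 counts (253 jobs).
waits = [5] * 237 + [20] * 8 + [35] * 2 + [65] * 2 + [120] * 4
for label, n, pct in wait_histogram(waits):
    print(f"{label:>4}: {n:8d} {pct:8.2f}%")
```

Unlike the tables above, this sketch prints every bucket, including empty trailing ones.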



Hope all that data helps; I think it's important to put concrete data on a problem like this, since it helps make sure the fix is actually aimed at the right thing.



We're reimaging some Linux machines to be additional Windows builders; see bug#894398 for details. Let's keep this open for now, and see if this specific windows-builder-backlog-causing-inbound-closures bug is resolved when bug#894398 is closed.
> From looking at our wait-times reports in dev.tree-management, here are the wait times across the 
> day I see on those dates

It's clear that our machines didn't keep up with load during those days, if we had to close the tree.  So if these numbers look good to you, we're either collecting the wrong data or interpreting it incorrectly.  Or otherwise the machines were keeping up with load and we didn't close the tree for the stated reason.

But more importantly, the numbers here neglect the effects of coalescing, which is a main concern.

Coalescing is a major cause of tree closures.  I don't have a good idea as to whether Windows coalesces more than other trees; someone else might be able to speak to that, or maybe we have some data.

IMO our CI capacity is inadequate until we can disable coalescing on all trees.  Until that point, the fact that the tree is being closed so often is probably your best indicator that we have insufficient capacity.  Because coalescing causes us to run fewer jobs when we need them most, it's going to make the load numbers above look much better than they actually are.
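
A toy Python model of that last point, with loudly stated assumptions: one builder, a fixed build time, and a scheduler that, when coalescing is on, merges every pending push into a single job and measures that job's wait from the newest merged push. None of this is taken from the real buildbot scheduler or from how the wait-time reports are computed; the function and its parameters are hypothetical.

```python
def simulate(arrivals, build_minutes, coalesce):
    """Return (jobs_run, max_wait) for one builder serving pushes in order.

    arrivals: sorted push times in minutes.
    coalesce: if True, all pushes pending when the builder frees up are
              merged into one job (assumed behaviour, see lead-in).
    """
    free_at = 0
    waits = []
    i = 0
    n = len(arrivals)
    while i < n:
        start = max(free_at, arrivals[i])
        if coalesce:
            # Advance to the newest push already pending at `start`;
            # older pending pushes never get a job of their own.
            j = i
            while j + 1 < n and arrivals[j + 1] <= start:
                j += 1
            waits.append(start - arrivals[j])
            i = j + 1
        else:
            waits.append(start - arrivals[i])
            i += 1
        free_at = start + build_minutes
    return len(waits), (max(waits) if waits else 0)

pushes = [0, 10, 20, 30, 40, 50]  # one push every 10 minutes
print("no coalescing:", simulate(pushes, 30, coalesce=False))
print("coalescing:   ", simulate(pushes, 30, coalesce=True))
```

In this model the builder is equally oversubscribed in both runs, but the coalesced run reports fewer jobs and a much shorter worst-case wait, because the pushes that were merged away never show up in the numbers.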

I thought Windows would be a good place to start in this respect -- I see six tree closures related to "windows".  (There are, in fact, more closures related to Mac, but none of them call out backlog explicitly.)  But you're the expert here; maybe you have in mind other lower-hanging fruit to fix the fact that our infrastructure is dramatically oversubscribed.  I've already tried other things, like suggesting tests that could be turned off so long as we can continue to be able to run them on try, and it's been explained that this is too hard.  So I wanted to try another tack.

Maybe we should close this bug, track bug 894398, and file a new bug that tracks being able to disable coalescing.  It sounds like that's what I really care about.  :)
Product: mozilla.org → Release Engineering

Comment 9

5 years ago
Win64 build wait times have been good after adding that many more hosts back in August.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

7 months ago
Product: Release Engineering → Infrastructure & Operations