Closed Bug 772458 Opened 8 years ago Closed 8 years ago

Try is extremely backed up

Categories

(Release Engineering :: General, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: khuey, Unassigned)

References

Details

(Keywords: sheriffing-P1, Whiteboard: [tryserver][capacity])

Wait times for tests are closing in on 24 hrs.

Seems that Android and Windows are more backed up than the other platforms, but that's just anecdotal.
Severity: normal → major
Do we have any idea of what is happening? Do people push more to try these days than before? Do we simply need more slaves?

It seems to be a critical issue for engineering: pushing to try takes such a ridiculous amount of time that it will either reduce productivity or people will just push to m-i without waiting for full results.
(In reply to Mounir Lamouri (:mounir) from comment #1)
> or people
> will just push to m-i without waiting for full results.

Which has already happened several times this week, with the ensuing layers of bustage made worse by high infra load on non-try trees and the coalescing that brings :-(
(In reply to Mounir Lamouri (:mounir) from comment #1)
> Do we have any idea of what is happening? Do people push more to try these
> days than before? Do we simply need more slaves?
> 
> It seems to be a critical issue for engineering: pushing to try takes such a
> ridiculous amount of time that it will either reduce productivity or people
> will just push to m-i without waiting for full results.

One of the issues with test pool capacity is that *all* current tests run on Mac minis of various vintages. This is due to a historical notion that we wanted to be able to compare test results between different platforms/OSes on the same hardware. Apple's hardware rev cycle is aggressive, so we simply can't buy these older-rev minis any more. The existing pool capacity is static, modulo attrition via hardware failure. We can create some extra capacity on one platform only at the expense of another, e.g. stopping tests on 10.5 (bug 773120).

We no longer think these inter-platform comparisons are meaningful. Releng is extremely resource-constrained at present for setting up new hardware. We have an effort underway to refresh our test pool to newer non-Mac hardware (for non-Mac OSes), but this is blocked by getting test coverage set up for Win8 (bug 731280) and Mountain Lion (10.8) (bug 731278), platforms where we currently have no coverage at all.
Depends on: 764713, 775149, 773120, 750285
Depends on: 773331, 602949
Depends on: 725362
Depends on: 771508
Depends on: 775744
We've just had 38 consecutive pushes (81 changesets) of bustage on inbound, because Try results are taking so long to come back that people are pushing regardless.

Are there any other quick wins that we can do here? E.g. disabling platforms/tests on twigs that could do without them, or re-balancing the try vs non-try buildpool?

Looking at http://build.mozilla.org/builds/pending/pending.html shows that the Try linux compile pending count is always an order of magnitude higher than the others. Can we spare some more non-try linux builds for Try? The graph for non-try would imply there is capacity going unused that could be switched over perhaps?
Depends on: 774799
No longer depends on: 725362
Component: Release Engineering → Release Engineering: Developer Tools
QA Contact: lsblakk
Whiteboard: [tryserver][buildduty][capacity]
bug#750285, bug#777037 track disabling a bunch of android builds/unittest/talos jobs which will help reduce android load. This is an interim solution while we wait for additional tegras to come online.
Depends on: 774424, 777037
Depends on: 777273
If this bug and its dependents were resolved as fixed, what would the expected turnaround time for try be?

Try used to be a tremendously useful development tool. Now it's pretty much just a pain.
Depends on: 767456
Whiteboard: [tryserver][buildduty][capacity] → [tryserver][buildduty][capacity][sheriff-want]
No longer depends on: 777521
No longer depends on: 765830
We can turn off tests for the UX branch (https://tbpl.mozilla.org/?tree=UX). I've been maintaining the branch for the past N months (doing daily merges between m-c and ux).

Devs sending their patch to the UX branch can just run it through try server first, and in total that will save some build resources, since we have vastly more merges from m-c to ux than we have checkins to ux.

It should also be noted that ux is a dead-end branch, which doesn't feed anywhere but is used for functional testing of new ux features.
Depends on: 779419
Depends on: 737661
Depends on: 779784
No longer depends on: 779784
No longer depends on: 602949
Depends on: 779921
QA Contact: lsblakk → hwine
Depends on: 782627
Now that we're running android/b2g/nativefennec builds over on AWS, we're freeing up cycles on our linux32/linux64 machines. 

bug#784891 tracks converting a bunch of existing linux32/linux64/win32 physical ix builders into win64 builders. This will improve wait times for windows builds in both the production build pool and try build pool.
Depends on: 784891
(In reply to Jared Wein [:jaws] from comment #7)
> We can turn off tests for the UX branch (https://tbpl.mozilla.org/?tree=UX).

Done in bug 779419.
This isn't an acute issue, and thus not a buildduty concern.
Whiteboard: [tryserver][buildduty][capacity][sheriff-want] → [tryserver][capacity][sheriff-want]
Keywords: sheriffing-P1
Whiteboard: [tryserver][capacity][sheriff-want] → [tryserver][capacity]
Depends on: toodamnhigh!
Depends on: 691177
Depends on: 847868
How are try turnaround times these days?
:khuey: per dev.tree-management, we've been hitting consistently great wait times on builds and tests, across the board including Try. This is thanks to moving more jobs to AWS, reshuffling existing hardware in-house, and turning off broken builds/tests.

Any objections to closing this as FIXED?
Flags: needinfo?(khuey)
Yeah I think we're doing pretty well these days.  Someone else can file a new bug if they have current issues.

Good job folks.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(khuey)
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General