Closed Bug 1035304 Opened 10 years ago Closed 10 years ago

Give gaia-try try priority, rather than the default integration branch priority

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Assigned: emorley)

References

Details

(Whiteboard: [capacity])

Attachments

(1 file)

Right now, gaia-try gets job priority from http://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/master_common.py#36 at the same level as mozilla-inbound and b2g-inbound and fx-team. If we actually intended that, which I hope we didn't, it should be explicitly listed there, but I think instead it should be a 5 like the other try branches.
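
For anyone without that file handy, the mapping in question is shaped roughly like this - a from-memory sketch, so exact names and values may differ - the point being that anything not listed explicitly falls through to the integration-branch default:

    # Rough from-memory sketch of mozilla-tests/master_common.py, not a
    # verbatim copy; names and values are approximate.
    BRANCH_PRIORITIES = {
        'mozilla-release': 0,
        'mozilla-beta': 2,
        'mozilla-central': 3,
        'mozilla-aurora': 3,
        'try': 5,   # try and the twigs sit at the back of the queue
        # gaia-try is not listed, so it falls through to the default below
    }

    # Anything without an explicit entry (the integration branches - and,
    # today, gaia-try) gets this default priority.
    DEFAULT_BRANCH_PRIORITY = 4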
Unless we can prioritize pull requests separately from branch pushes, this is not really OK from the Gaia world.


If you're concerned about backlog this morning specifically, that was caused by a backup in triggering and represents very unusual load.
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #1)
> Unless we can prioritize pull requests separately from branch pushes, this
> is not really OK from the Gaia world.

I'm with Phil on this, a try tree should not take priority over gecko trunk and release branches. 

We should lower the priority now - and then if needs be split out the branch pushes to another gaia repo that can be prioritised accordingly.
I believe our current plan was to rely on gaia-try as the main indicator of gaia branch health (and not Travis). Without knowing what kind of delay we are looking at for branch pushes, it's hard to say what kind of impact this would have. Having things run at > try though seems like it could be painful for us.

I'm not too familiar with the priority system, but are we talking about a typical delay here when pushing? If so, would this be in minutes, or hours?
Oh, I see, I was thrown off by the name, and gaia-replacement-for-travis doesn't actually have any relationship whatsoever to "try" as Mozilla knew it before this, does it?

Backlog of hours - the current priority 5 (try/twig) backlog for the slave pool you use is a bit over 8 hours, while priority 4 (integration branches and gaia-try) is 30 minutes.

Probably wontfix (or, better, fixed by giving it an explicit priority 4 so that other people don't think it was an accidentally over-prioritized try branch because it's misleadingly named as one). Were we to have a debate, I'd have no preference over which position I was assigned, the "it should have a newly created priority 4.5, lower than integration but higher than try" or "it should have a newly created 3.5 priority, higher than integration" or "it really should be exactly priority 4, slowing down the recovery from capacity emergencies because it cannot be usefully closed, but not entirely stopping the recovery."

Some things that you need to keep in mind while choosing your debating position:

* we will not "just have enough capacity to run the jobs we need to run," because we intentionally use coalescing, the way that during busy times integration branches will just run one set of tests for every up-to-four pushes rather than on every push, as a cost-saving measure

* we deal with capacity emergencies, when even with coalescing things have gotten too far behind, by closing integration trees, but that would only work for gaia-replacement-for-travis if there was a switch to both close the gaia repo and to close the ability to create PRs - otherwise closing gaia-try just puts off the time when the pushbot will ram through several hundred separate jobs all at once, and since those jobs don't have to wait for builds and don't coalesce, they can immediately saturate the pool and put us right back where we were

* making it 3.5 makes it possible to add just (roughly just, to the extent that anyone can agree on an appropriate SLA for gaia-replacement-for-travis) the amount of extra capacity it requires to not get too backlogged, while 4 or 4.5 requires adding more capacity because some unknowable amount will be soaked up by the way we'll wind up doing less coalescing, running more jobs, and spending more money on integration branches if we have more capacity that isn't used up first by things with higher priority
(In reply to Phil Ringnalda (:philor) from comment #4)
> Oh, I see, I was thrown off by the name, and gaia-replacement-for-travis
> doesn't actually have any relationship whatsoever to "try" as Mozilla knew
> it before this, does it?

I agree, the name is not ideal.  I think it was originally set up to emulate try server methods using releng infra, but we decided to use it instead of travis.  Our pull requests work sorta like try, but we can't (aiui) have the buildbot prioritization stuff deal with PR commits differently than branch commits because there's nothing but the commit message and commit body to distinguish the two types.

> Backlog of hours - the current priority 5 (try/twig) backlog for the slave
> pool you use is a bit over 8 hours, while priority 4 (integration branches
> and gaia-try) is 30 minutes.

If backlog == wait times, that's untenable.  These are only test jobs, no massive 8-hour builds.

> Probably wontfix (or, better, fixed by giving it an explicit priority 4 so
> that other people don't think it was an accidentally over-prioritized try
> branch because it's misleadingly named as one). Were we to have a debate,
> I'd have no preference over which position I was assigned, the "it should
> have a newly created priority 4.5, lower than integration but higher than
> try" or "it should have a newly created 3.5 priority, higher than
> integration" or "it really should be exactly priority 4, slowing down the
> recovery from capacity emergencies because it cannot be usefully closed, but
> not entirely stopping the recovery."

I would agree with WONTFIXing this bug or just adding an explicit priority that equals the current priority.  Even within pull requests the stakes vary: a PR could be a 10-person project testing a patch, or a one-line spelling fix.  In Gaia everything is either a branch push or a pull request.

> Some things that you need to keep in mind while choosing your debating
> position:
> 
> * we will not "just have enough capacity to run the jobs we need to run,"
> because we intentionally use coalescing, the way that during busy times
> integration branches will just run one set of tests for every up-to-four
> pushes rather than on every push, as a cost-saving measure

Coalescing in the current form is an invalid thing to do on this branch.  We could theoretically have a master branch commit followed by a v1.3 commit followed by a random PR, and so on.  Each commit stomps on the contents of the previous commit.  As long as we have a single branch to push to for all branch commits and all pull requests, we cannot coalesce anything.

See bug 1026246 comment 3 for background on why setting up, e.g., gaia-master, gaia-v2.0, is non-trivial.  If that were done, I'd be happy to push to respective branches.

I'm going to eventually have automatic job canceling for pull requests such that if I force push a commit to a pull request, it'll cancel outstanding jobs for previous pushes to the pull request.
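
The rough shape of that would be something like the following - hypothetical helper names only, nothing is wired up yet:

    # Rough sketch only; the job store and the cancel_job() callable are
    # hypothetical placeholders, not existing buildbot APIs.
    outstanding_jobs = {}  # pr_number -> job ids still pending/running

    def on_pull_request_push(pr_number, new_job_ids, cancel_job):
        # A force-push to a PR makes whatever was queued for it obsolete,
        # so cancel those jobs before recording the new ones.
        for job_id in outstanding_jobs.get(pr_number, []):
            cancel_job(job_id)
        outstanding_jobs[pr_number] = list(new_job_ids)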

> * we deal with capacity emergencies, when even with coalescing things have
> gotten too far behind, by closing integration trees, but that would only
> work for gaia-replacement-for-travis if there was a switch to both close the
> gaia repo and to close the ability to create PRs - otherwise closing
> gaia-try just puts off the time when the pushbot will ram through several
> hundred separate jobs all at once, and since those jobs don't have to wait
> for builds and don't coalesce, they can immediately saturate the pool and
> put us right back where we were

Again, if we had a way to discriminate between PRs and Pushes in the prioritization, we could do this more intelligently.

> * making it 3.5 makes it possible to add just (roughly just, to the extent
> that anyone can agree on an appropriate SLA for gaia-replacement-for-travis)
> the amount of extra capacity it requires to not get too backlogged, while 4
> or 4.5 requires adding more capacity because some unknowable amount will be
> soaked up by the way we'll wind up doing less coalescing, running more jobs,
> and spending more money on integration branches if we have more capacity
> that isn't used up first by things with higher priority

I'm not entirely sure what you mean by this, could you explain differently?
Simpler version: say you require that tests always start running within 30 minutes of a push. If you have higher priority than integration branches, you add slaves until at peak load your tests always start running within 30 minutes of a push. If you have the same priority as integration branches, you add slaves until at peak load your tests *and* all tests on integration branches that use the same pool always start running within 30 minutes of a push. Why shouldn't we just add more AWS instances until the wait for tests to start on every integration branch is no more than the time that you're willing to wait for gaia-replacement-for-travis? Because that means actually running more tests than we are right now, by removing opportunities to coalesce jobs.
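
To put completely made-up numbers on it:

    # Completely made-up numbers, purely to show the shape of the argument.
    hours_per_slave_per_day = 24
    gaia_demand = 200             # slave-hours/day of gaia jobs
    integration_coalesced = 2000  # integration load with today's coalescing
    integration_full = 2600       # integration load if we coalesce less

    # gaia above integration: size the pool for gaia plus the coalesced load.
    slaves_if_gaia_higher = (gaia_demand + integration_coalesced) / hours_per_slave_per_day

    # gaia equal to integration: hitting the same SLA everywhere means less
    # coalescing, so the integration load itself grows.
    slaves_if_equal = (gaia_demand + integration_full) / hours_per_slave_per_day

    print(slaves_if_gaia_higher, slaves_if_equal)  # ~91.7 vs ~116.7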

And unless there is even more to being replacement-for-travis than I've guessed so far, I'm not sure I buy that branch pushes are sacred and must run right away while longer delays are okay for PRs. Sure, in the Travis era when tests ran differently on Travis than they did on releng slaves, and what mattered was whether they passed on Travis, but now there are two sets of tests run on absolutely identical slaves: the set on replacement-for-travis which runs against some random Gecko off some other tree that it isn't going to land directly on which doesn't matter whether or not it passes, and the real b2g-inbound run which runs on the actual Gecko which it has to pass against. I understand the point and priority of them on Travis, but I sort of think maybe the only point to them on replacement-for-travis might be to show that the reason later PR runs failed was because a branch push landed which failed, which then calls for investigating whether that means it was then backed out for failing on b2g-inbound, or just that it required a Gecko change which also landed there, making runs against some random build from a different branch meaningless until that Gecko change is merged around. If that's the case, then maybe the entire tree really is priority 4.5 or even 5, and 2 hours or 8 hours is reasonable.
Depends on: 1037005
Few things that stand out from this bug:
1) It's really really not helpful to have all jobs lumped into the same tree. Filed bug 1037005.
2) Try jobs (in the not-misused sense of the word) absolutely should have lower priority - the same as we do for gecko's try.

With bug 1037005, this bug becomes easy to fix.

In the meantime, I'll unfortunately just have to close the tree if we end up short of capacity due to gaia-try stealing it from the gecko jobs.
gaia-try is currently being given the same priority as the integration repos and so stealing capacity. The added complication is that it is being mis-used for both try job types and master/branch job types - however that will be fixed by bug 1037005. In the meantime (and long term, given that "gaia-try" is [one would hope] the name that will be given to the try part of the split repos), this will save us giving try-type jobs too high a priority.
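
The change itself is tiny - roughly the following, though the exact hunk context in master_common.py may differ:

    # Roughly what the attached patch does (illustrative, not the literal
    # diff): give gaia-try an explicit try-level priority rather than
    # letting it fall through to the integration-branch default of 4.
    BRANCH_PRIORITIES['gaia-try'] = 5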
Attachment #8453853 - Flags: review?(catlee)
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Comment on attachment 8453853 [details] [diff] [review]
Give gaia-try try priority, rather than the default integration branch priority

No, please do not do this.

The name gaia-try is not accurate.  Please think of this as all gaia testing.

Would we set mozilla-central to priority 5?  This patch sets gaia's mozilla-central equivalent to priority 5.
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #9)
> The name gaia-try is not accurate.  Please think of this as all gaia testing.

Yeah I've read the comments in this bug and that's why I filed bug 1037005 - so I understand this - it's just a really bad decision given what else it impacts.
(In reply to Ed Morley (Away 12th-20th July) from comment #10)
> Yeah I've read the comments in this bug and that's why I filed bug 1037005 -
> so I understand this - it's just a really bad decision given what else it
> impacts.

To be clearer - I realise using a PR based approach limits a few things - but we should have either used a gaia clone on github or else faked the repo name in buildbotcustom.
Attachment #8453853 - Flags: review?(catlee) → review+
The problem here is not on the Gaia side.  We don't need any more or fewer clones of Gaia on github.  What we need is automation branches in Releng infra.  The problem is that the special buildbot scheduler only works with a single, hardcoded branch.  Creating the second branch is a huge cost, but creating branches after that should be low cost.

The git stuff is handled all in mozharness, so what we need is repos like:

http://hg.mozilla.org/integration/gaia-pullrequests
http://hg.mozilla.org/integration/gaia-master
http://hg.mozilla.org/integration/gaia-v2.0

The *contents* of those repositories always points to github.com:mozilla-b2g/gaia.git.
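
In other words each push to one of those repos would just carry a pointer at the git state to test - something like this, purely illustrative, since the exact file name and schema are not decided in this bug:

    # Purely illustrative; the real file name/schema for these pushes is
    # not decided in this bug.
    push_payload = {
        "git_url": "https://github.com/mozilla-b2g/gaia",
        "git_revision": "<sha to test>",
        "pull_request": 12345,  # omitted for branch pushes
    }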
So we moved a bunch of tests that used to run on Travis over to our infrastructure and we're now experiencing higher load because of it. Seems that would be the expected outcome here?

Outside of OSX Gu (which is pretty crappy given our current 10.8 situation as-is), why can't we make use of AWS' scalability and just add the capacity we need to properly take these extra jobs on?
(In reply to Phil Ringnalda (:philor) from comment #4)
> Backlog of hours - the current priority 5 (try/twig) backlog for the slave
> pool you use is a bit over 8 hours, while priority 4 (integration branches
> and gaia-try) is 30 minutes.

So I'm trying to follow-along with this thread, and if I'm reading this right - could it potentially take 8 hours instead of 30 minutes to even *start* the tests? Currently our longest test runs for 1.5 hours (perhaps we should split that up), which would make it 9.5 hours to complete a pull request? I don't think this is acceptable for gaia developers. Is this what gecko developers are used to? Our travis runs on forked branches which are free, generally complete in < 1 hour, and start instantly. 


(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #13)
> Outside of OSX Gu (which is pretty crappy given our current 10.8 situation
> as-is), why can't we make use of AWS' scalability and just add the capacity
> we need to properly take these extra jobs on?

+1. It does seem like we're getting rid of our paid plan with travis, so if it's an issue with money, surely spending that money on AWS could get us the same capacity?

Sorry for my noobness, but could someone point me to what the bottleneck is in the 30min vs 8hr figure?
It seems like we're discussing multiple issues here:

1) The name "gaia-try" is confusing (given how we are currently using it), and it should really be "gaia-ci" or something. However the churn of this probably isn't worth it given #2.

2) We should split up the conflicting purposes of gaia-try (filed as bug 1037005; and it sounds like jhford has a plan for this in comment 12) in order to:
  * facilitate later sheriffing of just the master/branch pieces (hard if they are muddled with the rest).
  * allow for configurable priority based on master vs branch vs try.
  * allow for saving resources/speeding up jobs on !try, by not clobbering each time and also allowing coalescing.

3) Currently "gaia-try" doesn't have a priority entry in master_common.py (see patch), so it takes the default priority (4), which is non-obvious. Even if we _don't_ want to change the priority right now, we should explicitly set it in the file to 4, so it's clear what is being used.

4) Regardless of our current wait times [1], IMO gaia "try"-type jobs should not have equal priority to those on master/branches (the same as for gecko's try). It's much harder to sheriff those trees if master gets backed up and by the time a failure is seen 20 more things have landed on top. Yes, developers shouldn't be made to wait too long, but that's orthogonal; see #5.

5) We need to raise the AWS machine limits and/or reduce other bottlenecks so that we have the capacity to run both master/branch jobs and also try jobs within whatever timeframe is deemed appropriate. Given that we've saved money via not using Travis, we should surely be able to justify raising the AWS budget, right?


[1] time taken for a job to actually start, once it has been scheduled, due to lack of resources

(In reply to Kevin Grandon :kgrandon from comment #14)
> So I'm trying to follow-along with this thread, and if I'm reading this
> right - could it potentially take 8 hours instead of 30 minutes to even
> *start* the tests? Currently our longest test runs for 1.5 hours (perhaps we
> should split that up), which would make it 9.5 hours to complete a pull
> request? I don't think this is acceptable for gaia developers. Is this what
> gecko developers are used to? Our travis runs on forked branches which are
> free, generally complete in < 1 hour, and start instantly. 

In busy times, tests run on try type repos may not start for roughly that amount of time. Yes not ideal, and yes this is what gecko devs are used to. The way we handle constrained resources is by having try being a lower priority than the main repos, so we don't have to close all trees as often.

> +1. It does seem like we're getting rid of our paid plan with travis, so if
> it's an issue with money, surely spending that money on AWS could get us the
> same capacity?
> 
> Sorry my noobness, but could someone point me to what the bottleneck is in
> the 30min vs 8hr figure?

I'm not sure if this is just a case of "raise the hardcoded cap on the number of concurrent AWS instances", or whether there is more to it. #releng will know.
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #12)
> The git stuff is handled all in mozharness, so what we need is repos like:
> 
> http://hg.mozilla.org/integration/gaia-pullrequests
> http://hg.mozilla.org/integration/gaia-master
> http://hg.mozilla.org/integration/gaia-v2.0

To be extra clear - once we have the repos above thanks to bug 1037005, I would envisage something like:
(lower number means the job is closer to the front of the queue):

    'mozilla-central': 3,
    'mozilla-aurora': 3,
    'gaia-v2.0': 3,
    'mozilla-inbound': 4,
    'b2g-inbound': 4,
    'fx-team': 4,
    'gaia-master': 4,
    'try': 5,
    'gaia-pullrequests': 5,
The vast majority of jobs running on gaia-try _are_ try jobs and post bug 1037005 those that are not, will be on their own repo, which will have their own repos per comment 16, so landing this (if anything just to see how it affects the numbers; looking at recent pending graphs for try shows the wait times for the last 24 hours aren't as bad for EC2 as they have been in the past):
remote:   https://hg.mozilla.org/build/buildbot-configs/rev/5cb6b2f7e268
(In reply to Ed Morley [:edmorley] from comment #17)
> The vast majority of jobs running on gaia-try _are_ try jobs and post bug
> 1037005 those that are not, will be on their own repo, which will have their
> own repos per comment 16, so landing this (if anything just to see how it
> affects the numbers; looking at recent pending graphs for try shows the wait
> times for the last 24 hours aren't as bad for EC2 as they have been in the
> past):
> remote:   https://hg.mozilla.org/build/buildbot-configs/rev/5cb6b2f7e268

Please revert this change.  Until non-prs are split out, we need to ensure they are tested in a timely fashion.
(In reply to Ed Morley [:edmorley] from comment #16)
> To be extra clear - once we have the repos above thanks to bug 1037005, I
> would envisage something like:
> (lower number means the job is closer to the front of the queue):
> 
>     'mozilla-central': 3,
>     'mozilla-aurora': 3,
>     'gaia-v2.0': 3,
>     'mozilla-inbound': 4,
>     'b2g-inbound': 4,
>     'fx-team': 4,
>     'gaia-master': 4,

I disagree.  Gaia-master is our Mozilla-Central and should be prioritized at least as high as Mozilla-Central.

>     'try': 5,
>     'gaia-pullrequests': 5,

I still disagree with putting PRs at the absolute lowest priority given that testing a PR is *way* less machine time than even a single platform for Gecko.
I think this bug needs a meeting and consensus between the teams.  Jordan's going to back out the patch until that happens.
Comment on attachment 8453853 [details] [diff] [review]
Give gaia-try try priority, rather than the default integration branch priority

backed out: https://hg.mozilla.org/build/buildbot-configs/rev/aef4bf97d6b4
Attachment #8453853 - Flags: checked-in-
I understand that gaia-try is poorly named and the way we are using it does not fit the established pattern of the other trees. However, it is also the lifeblood of the Gaia development process right now. Until we have something like bug 1037005 in place, we need to maintain the priority of this tree, as we will be unable to deliver on FxOS if we see significant increases in wait times for tests.

Developers are calling loudly for this move to TBPL; reducing our priority could render fruitless a lot of the work we've been doing to make this change. Can we look into adding capacity instead?

If the sticking point is the OSX tests, we are open to reducing the priority/SLA of those tests in some way or perhaps turning them off if required as those are less valuable to the day-to-day cycle for gaia devs.
Ed, are you and the sheriffs OK with the status quo for now?  At some point in the not-too-distant future (Taras is aiming for 3 weeks), they'd like to move gaia-try from buildbot to TaskCluster, at which point it won't affect slave load in buildbot at all.

If you're not ok with this, I'll set up a meeting so we can try to arrive at a consensus.
Flags: needinfo?(emorley)
(In reply to Jonathan Griffin (:jgriffin) from comment #23)
> Ed, are you and the sheriffs OK with the status quo for now?  At some point
> in the not-too-distant future (Taras is aiming for 3 weeks), they'd like to
> move gaia-try from buildbot to TaskCluster, at which point it won't affect
> slave load in buildbot at all.

That sounds great - I'm happy to wait until then - thank you :-)

(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #19)
> >     'try': 5,
> >     'gaia-pullrequests': 5,
> 
> I still disagree with putting PRs at the absolute lowest priority given that
> testing a PR is *way* less machine time than even a single platform for
> Gecko.

The amount of machine time shouldn't come into it - B2G developers can't just walk to the front of the queue ahead of gecko devs, that's just not reasonable. In the example above (which is in a post bug 1037005 world), with both priorities set at 5 - gaia-pullrequests would have equal priority with try, not lower (which is how I read your 'absolute lowest priority', though not sure if you meant equal to try).
Flags: needinfo?(emorley)
(In reply to Phil Ringnalda (:philor) from comment #0)
> If we actually intended that, which I hope we didn't, it should be
> explicitly listed there

I've filed bug 1043415 for explicitly listing (but not changing) gaia-try's priority, to make it clearer when reading the priority list (since it currently hits the default case, so is non-obvious).
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Component: General Automation → General