Closed Bug 787447 Opened 12 years ago Closed 11 years ago

Stop coalescing Talos test runs on mozilla-inbound

Categories

(Release Engineering :: General, defect, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ehsan.akhgari, Assigned: catlee)

Details

Please see <https://groups.google.com/d/msg/mozilla.dev.platform/gX7Wpqj6Ldw/9viYgEc_w9AJ> and the recent discussion on dev.platform.  This is really hurting our ability to deal with the regressions on Talos.
I worry about our ability to ever catch up to load if we do this. Can we instead say "don't coalesce more than N test runs"? And then we can tweak N depending on our need.
I don't know why Talos gets special treatment here, tbh.  If we can have N for Talos, can we have N' for mochitests?
(In reply to comment #2)
> I don't know why Talos gets special treatment here, tbh.  If we can have N for
> Talos, can we have N' for mochitests?

My idea was to start small, but I would definitely like to see this happen for all tests if RelEng agrees.
Sure, we can set this to anything at the cost of increased wait times.
(In reply to comment #4)
> Sure, we can set this to anything at the cost of increased wait times.

It would probably be helpful if we decide on a very conservative starting value for N, and then try to see how much we can decrease it without seriously hurting the wait times.
One difference from Mochitests is that latency matters less, because patches are rarely if ever backed out the same day for Talos regressions.
Talos burns. It burns like Android talos burned this morning; it burns when something makes it impossible for the talos code to close the browser; it burns when you introduce an uncaught JS exception that winds up in the log with "Error" in it. So if, rather than getting one or two talos runs on every other push, we are getting every single one on every other push twenty pushes back, it'll matter.
In terms of implementation, I'm not sure what the best way to implement the "coalesce up to N" changes would be. The coalescing in buildbot happens here:
http://hg.mozilla.org/build/buildbot/file/e632e2e38fec/master/buildbot/process/builder.py#l530

We already use the shouldMergeRequests function to completely disable merging on some branches. It doesn't look like there's a built-in way to count merged requests in there.

One idea for fixing this is to have our shouldMergeRequests function implement the counting itself. Currently it is always called with the first build request as the first parameter, and each of the other requests to be merged is passed in as the second parameter across repeated calls. I'm pretty sure this function is only called from the main thread, so it will see a series of calls with the same first build request and varying second build requests. A small cache of breq => number of merged requests should let the merge function return False after the appropriate number of calls.
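For illustration only, a rough Python sketch of that counting idea, assuming a mergeRequests-style callable that receives (builder, req1, req2) and returns True to allow the merge; MAX_MERGED, the cache name, and the req1.id attribute access are assumptions made for this sketch, not existing code:

    # Hypothetical "coalesce at most N" hook, not the real function.
    MAX_MERGED = 3      # N: how many extra requests req1 may absorb
    _merge_counts = {}  # anchor request id -> number of requests merged so far
                        # (a real version would also need to prune old entries)

    def shouldMergeRequests(builder, req1, req2):
        key = req1.id   # assumes the request object exposes a stable id
        merged = _merge_counts.get(key, 0)
        if merged >= MAX_MERGED:
            return False            # stop coalescing once N requests are merged
        # ...existing per-branch checks that disable merging entirely go here...
        _merge_counts[key] = merged + 1
        return True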
Assignee: nobody → catlee
Priority: -- → P3
So, instead of limiting the amount of coalescing that happens, how about we make it possible to actually run jobs that have been coalesced if we need to?

self-serve could be made to expose the coalescing information, and then tbpl could give you the ability to re-trigger build requests that were merged into others. Together with bug 690672, this should make it easier to track down bustage.
There are two problems with that solution: the first is that the reported problem does not exist, and the second is that the proposed solution already exists.

According to comment 0's link, the problem is that "It's crazy to waste developer time bisecting Talos locally when we don't run it on every push." Agreed, but... that's doing it the crazy-hard way. It took me quite a bit of looking to find a regression email blaming more than one push (other than the stupid ones about merges, which are just totally useless noise) because they really aren't all that frequent, but once I did find https://groups.google.com/d/topic/mozilla.dev.tree-management/GyEQkL20coE/discussion none of my problems involved a need to run it locally.

Since there's nothing to tell you that tbpl does fromchange/tochange ranges, it's not obvious that you can just take the regression email's http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7a86d43be76f&tochange=77c30c1388ee link and turn it into https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=7a86d43be76f&tochange=77c30c1388ee but once you do, you have self-serve links for any push in the range that didn't get a run of the talos job in question.
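The URL rewriting described above is mechanical; as a small sketch in Python, assuming only the fromchange/tochange query parameters shown in those links:

    from urllib.parse import urlparse, parse_qs, urlencode

    def pushlog_to_tbpl(pushlog_url, tree="Mozilla-Inbound"):
        # Pull the revision range off a pushloghtml URL and build the tbpl URL.
        qs = parse_qs(urlparse(pushlog_url).query)
        return "https://tbpl.mozilla.org/?" + urlencode([
            ("tree", tree),
            ("fromchange", qs["fromchange"][0]),
            ("tochange", qs["tochange"][0]),
        ])

    # pushlog_to_tbpl("http://hg.mozilla.org/integration/mozilla-inbound/"
    #                 "pushloghtml?fromchange=7a86d43be76f&tochange=77c30c1388ee")
    # -> "https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=7a86d43be76f&tochange=77c30c1388ee"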

It's not obvious what platform or what job "XP Mozilla-Inbound-Non-PGO" and "Dromaeo (DOM)" would be, but despite the first one being coalesced away, https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/078f3af11d13 did include a "Rev3 WINNT 5.1 mozilla-inbound talos dromaeojs" in the Builds section, and when I hit the "rebuild" button, it did retrigger on 078f3af11d13 rather than on 77c30c1388ee where it had been coalesced before. Because both my retriggers came in at (roughly) the (probably noise) regressed value, we sent off a fresh one-push-blamed regression email in https://groups.google.com/d/topic/mozilla.dev.tree-management/GyEQkL20coE/discussion.

Then things go downhill - graphserver puts the retriggers at today's date, which makes the graph pretty senseless. The low value on 078f3af11d13 is low because the 30 runs before it and the 5 runs after it say it was low, but those 30 runs before might be on the previous 30 pushes (they aren't), or they might be on the previous push and 29 pushes 15 hours before that, or any other random distribution, and the same for the 5 runs after. So if 30 and 5 are magic numbers, then I want the 30 previous to be 5 runs on each of the previous 6 pushes, and the 5 after to be all on the presumed regressing push (or perhaps some other more sensible distribution). I already know graphserver's going to make a mess of that, and I don't have any idea whether or not the regression script will do the right thing, but you know what? I don't want to be the one doing it anyway - the regression script is the one that thinks it saw something, so let it do the retriggering: first to see which push, within the range that got coalesced together, was the actual drop, and then again to make sure it wasn't seeing noise from too wide a range.
The graphserver bug is bug 727828.
The regression script sorts changes by their push time, not by when the test ended up being run. How much do you trust the regression script to retrigger jobs?
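To illustrate that ordering point (made-up data, not the actual regression script): because results are keyed to the push they tested rather than to when the job ran, a retrigger that runs hours later still lands next to its push in the analysis.

    # Made-up records: (revision, push_time, run_time, value)
    results = [
        ("7a86d43be76f", 1346457600, 1346461200, 215.0),
        ("078f3af11d13", 1346458200, 1346540000, 230.0),  # retriggered much later
        ("77c30c1388ee", 1346458800, 1346462400, 231.0),
    ]
    # Sorting on push_time (index 1), not run_time, keeps the retrigger in place.
    ordered = sorted(results, key=lambda r: r[1])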
(In reply to comment #10)
> There are two problems with that solution: the first is that the reported
> problem does not exist, and the second is that the proposed solution already
> exists.

I'm not sure why you think the reported problem does not exist.  The rest of your comment clearly demonstrates that it does exist.

So, why would comment 9 be a better idea than reducing the coalescing which happens on inbound?
(In reply to Ehsan Akhgari [:ehsan] from comment #13)
> (In reply to comment #10)
> > There are two problems with that solution: the first is that the reported
> > problem does not exist, and the second is that the proposed solution already
> > exists.
> 
> I'm not sure why you think the reported problem does not exist.  The rest of
> your comment clearly demonstrates that it does exist.
> 
> So, why would comment 9 be a better idea than reducing the coalescing which
> happens on inbound?

Our concern is that by limiting coalescing, we'll get drastically worse wait times, which means that you won't have *any* test data from builds that finished several hours previously.

In any case, as philor points out, we already have the tools required in tbpl/self-serve to retrigger coalesced jobs, although the UI could be improved. The regression script should be doing the Right Thing, and so the only annoying thing left is how graph server is showing the results.
(In reply to comment #14)
> (In reply to Ehsan Akhgari [:ehsan] from comment #13)
> > (In reply to comment #10)
> > > There are two problems with that solution: the first is that the reported
> > > problem does not exist, and the second is that the proposed solution already
> > > exists.
> > 
> > I'm not sure why you think the reported problem does not exist.  The rest of
> > your comment clearly demonstrates that it does exist.
> > 
> > So, why would comment 9 be a better idea than reducing the coalescing which
> > happens on inbound?
> 
> Our concern is that by limiting coalescing, we'll get drastically worse wait
> times, which means that you won't have *any* test data from builds that
> finished several hours previously.

Hmm, well, I thought the plan was to do that in small pieces to see where we get near that threshold?
Could you please describe the actual problem you are trying to solve? The reported problem, that you have to run talos locally when a regression occurs within coalesced jobs, does not exist. So, we're solving some other problem here, but we have to actually know what that problem is before we can say what might solve it, and whether doing something has solved it.

One thing you might be trying to solve is "if a regression is reported on three coalesced pushes, then the people who pushed will still feel sort of responsible and one of them might retrigger the skipped jobs, but if it's four or more pushes, nobody will feel responsible and nobody will retrigger the skipped jobs, they'll just ignore it." Limiting the maximum number of coalesced pushes would solve that, but I'm pretty sure that the number there is not three, it is not two, it is one, which is not currently an option.

Or, you might be trying to minimize the number of times when a regression happens within coalesced jobs. I very much doubt that limiting the size of coalescings would do that, and in fact the opposite is probably true. If we mostly coalesce during the US day (we do) and many people intentionally push their risky things during quiet times instead (they do), then by converting two runs which each coalesce nine pushes into six runs that each coalesce three pushes, and then tacking on another one or two coalesced jobs after those because now you've used up four more slaves and expanded the range of the busy part of the day, you may very well have increased the chances of having a regression within coalesced jobs rather than decreased it.
I don't think we have consensus here.

Please re-open if/when we decide this is a good idea. Bugs aren't a good place to have these discussions.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering