Closed Bug 1299377 Opened 8 years ago Closed 8 years ago

Fix SETA scheduling to not panic when it doesn't get a full set of tests building right away

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: philor, Unassigned)

AIUI, the meaning of https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py#43 is that if we are pushing quickly, we would get a full set of WinXP tests on every 14th push, but if we don't get to the 14th push within 2 hours, we get a full set for the 2 hour gap.
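
A minimal sketch of that reading, for illustration only: a full set on every Nth sendchange, or on the first sendchange after the time limit has elapsed since the last full set. The class name, field names, and the choice to check the timeout only when a sendchange arrives are assumptions; this is not the actual scheduler code from buildbotcustom, and the values 14 and 2 hours are simply the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class EveryNthSketch:
    """Hypothetical model of 'full set every Nth sendchange or after a time gap'."""

    n: int = 14                        # full set on every 14th sendchange (value from the comment)
    idle_timeout: float = 2 * 60 * 60  # 2 hours, in seconds (value from the comment)
    skipped: int = 0                   # sendchanges that got only the small set since the last full set
    last_full: float = 0.0             # timestamp of the last full set

    def on_sendchange(self, now: float) -> str:
        """Return 'full' or 'small' for a sendchange arriving at time `now` (seconds)."""
        hit_count = self.skipped + 1 >= self.n
        hit_timeout = now - self.last_full >= self.idle_timeout
        if hit_count or hit_timeout:
            # Don't Panic: reset both the counter and the clock after one full set.
            self.skipped = 0
            self.last_full = now
            return "full"
        self.skipped += 1
        return "small"
```

In this model a single full set resets both the counter and the clock, so the next sendchange goes back to a small set.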

That would be fine, except that what we see in practice is that if we've pushed enough to overload the slave pool, so that we don't get the full set running after 2 hours, then SETA starts scheduling a full set on every single push, and we actually run those full sets, making things exponentially worse.

Either of two alternatives would be acceptable: Don't Panic, that is, schedule a full set after the 2 hour gap, then trust that it will run and go back to scheduling small sets; or let the coalescing, which worked just fine for handling backlogs, back into the mix, so that after SETA panics and thinks it is scheduling a full set on each of eight pushes (well, currently 3, because coalescing is excessively limited right now), we only actually run the tests on the tipmost of those.
(In reply to Phil Ringnalda (:philor) from comment #0)
> That would be fine, except that what we see in practice is that if we've
> pushed enough to overload the slave pool, so that we don't get the full set
> running after 2 hours, then SETA starts scheduling a full set on every
> single push, and we actually run those full sets, making things
> exponentially worse.

Would you mind rewording?
I think this is the issue I have been asking about over and over again in IRC.
I haven't had a chance to dig into this, but I'd like just to clarify a few things about how SETA works in buildbot:

1) The schedulers don't count pushes, they count builds finishing and running sendchange. So a mobile-only push wouldn't be counted for the purposes of SETA.

2) When the schedulers decide to create jobs (due to the job count, or time), it creates jobs for all previous builds that were skipped.

3) There's nothing in place that will ensure that tests run on the same build across suites.

One thing that's confusing to me is "we don't get the full set running after 2 hours, then SETA starts scheduling a full set on every single push, and we actually run those full sets, making things exponentially worse". That's definitely not supposed to be happening, but I haven't yet caught it happening while looking in the right place, so I haven't been able to figure out why.

Links to builds in buildbot while this is going on would be a big help. TH links are ok, but harder to map to things on specific buildbot masters.
I'm not sure "while it's going on" is even possible, since there's so much weirdness and lies in what treeherder gets told/displays. https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=windows%208%20x64%20opt&fromchange=6c382a30453a97024d3c8357d8f6a266486a88b1&group_state=expanded&tochange=fabfb2ff761eace61d0433e4d6e3d74e0cba193e shows (from the bottom) a completed small set, a completed full set, then 8 pending full sets in a row, though the oldest 4 of those 8 have actually completed the small set, so the pending on them may or may not be real. The only thing I can say for sure comes from looking back at completed jobs: we (eventually, six or eight hours later) ran every test on every push, at a time when the pushes weren't more than two hours apart.
(In reply to Chris AtLee [:catlee] from comment #3)
> 2) When the schedulers decide to create jobs (due to the job count, or
> time), it creates jobs for all previous builds that were skipped.

Mmm, so we have builds A, B, C, D, E, F, G. A sendchanges, we schedule 10 tests on it. B, C, D, E, and F do the same, we do the same. G sendchanges, that's the seventh push/build/sendchange/whatever, so we schedule all 60 tests on it, *and* we schedule the missing 50 tests on A, B, C, D, E, and F? And then, so the theory goes, coalesce them on the tests on G?
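
To put numbers on that example, here is a small sketch that counts the test requests under the assumption stated in the comment: 10 tests at sendchange time for each of A through F, then all 60 on G plus the missing 50 backfilled on each earlier build. The function name and dictionary keys are made up for illustration and do not correspond to anything in buildbot.

```python
def requests_after_trigger(builds, small=10, full=60):
    """Count test requests per build once the last build triggers the full set.

    Hypothetical model: earlier builds got `small` tests at sendchange time
    and get the remaining `full - small` tests backfilled at trigger time.
    """
    requests = {}
    for build in builds[:-1]:
        requests[build] = {"at_sendchange": small, "backfilled": full - small}
    # The triggering build gets the full set directly; nothing to backfill.
    requests[builds[-1]] = {"at_sendchange": full, "backfilled": 0}
    return requests


plan = requests_after_trigger(list("ABCDEFG"))
backfilled = sum(r["backfilled"] for r in plan.values())
total = sum(r["at_sendchange"] + r["backfilled"] for r in plan.values())
print(backfilled, total)  # 300 420
```

So the seventh sendchange adds 360 new requests at once (60 on G plus 300 backfilled), and how many of them actually run depends entirely on coalescing.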
(In reply to Phil Ringnalda (:philor) from comment #5)
> (In reply to Chris AtLee [:catlee] from comment #3)
> > 2) When the schedulers decide to create jobs (due to the job count, or
> > time), it creates jobs for all previous builds that were skipped.
> 
> Mmm, so we have builds A, B, C, D, E, F, G. A sendchanges, we schedule 10
> tests on it. B, C, D, E, and F do the same, we do the same. G sendchanges,
> that's the seventh push/build/sendchange/whatever, so we schedule all 60
> tests on it, *and* we schedule the missing 50 tests on A, B, C, D, E, and F?
> And then, so the theory goes, coalesce them on the tests on G?

Yup, that's the theory, subject to "3) There's nothing in place that will ensure that tests run on the same build across suites." So when G sendchanges, we'll have pending jobs for A, B, C, D, E, F, G for all suites, on top of whatever else was pending at the time. When a worker comes to take a job, it should grab oldest to newest, up to 7 (or whatever the SETA configs say).
Up to 7 or up to an hour, the hour being based on the time of the original sendchange, which leads to a lot of the confusion while looking at how often tests actually run.

Okay, this bug is invalid, the product of fevered minds without an understanding of how SETA scheduling actually works.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Is there some way we could make SETA more understandable, or behave better under load?
When you say SETA, I assume you are referring to the integration with buildbot? SETA doesn't do anything dynamically per commit, although in taskcluster SETA will have a more active role with per-commit decisions.
(In reply to Phil Ringnalda (:philor) from comment #7)
> Up to 7 or up to an hour, the hour being based on the time of the original
> sendchange, which leads to a lot of the confusion while looking at how often
> tests actually run.

Actually, I was still wrong there: the time limit only affects when EveryNthScheduler will unleash the full set of tests, and which ones we actually run depends on mergeRequests, which only knows about counts.

But, I think that while I was wrong about what was broken, I was right that something was broken.

If you look in the logs at a time when we are backlogged and trying to catch up, do you see lines for "mergeRequests: some-win8-test: exceeded limit 7" or do you see "mergeRequests: some-win8-test: exceeded limit 3"?

I think, if I finally understand what we intend correctly, that we think https://hg.mozilla.org/build/buildbotcustom/annotate/515987988fd4b9855661c5733708fa8fb24aa0b9/misc.py#l2228 will set Win8 tests to merge up to 7, but what we actually run looks a whole lot more like they are getting the default 3 (or, once bug 1299378 hits prod, will be getting 8).
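
A simplified model of why the limit matters, assuming coalescing works roughly as described earlier in this bug: a worker takes pending requests for one builder oldest first, merges up to the limit, and runs the tests once on the newest build in the batch. The function below is a hypothetical illustration, not buildbot's mergeRequests implementation.

```python
def coalesce(pending, limit):
    """Return the builds that actually get a test run for one builder.

    `pending` is ordered oldest to newest. Each run merges at most `limit`
    requests and executes on the newest build in that batch.
    (Hypothetical model; real merging is decided per pair of requests.)
    """
    runs = []
    for start in range(0, len(pending), limit):
        batch = pending[start:start + limit]
        runs.append(batch[-1])
    return runs


pending = list("ABCDEFG")
runs_at_7 = coalesce(pending, 7)  # one run, on the tipmost build
runs_at_3 = coalesce(pending, 3)  # three runs for the same backlog
```

With seven pending requests, a limit of 7 collapses them into a single run on G, while a limit of 3 yields runs on C, F, and G. Under this model, a builder stuck at the default limit does three times the work per suite while backlogged.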
Taking that back to bug 1296329, having gotten confirmation that we are in fact doing "mergeRequests: Windows 8 64-bit mozilla-inbound opt test mochitest-media: exceeded limit (3)".
Component: General Automation → General