Closed Bug 1054247 Opened 10 years ago Closed 9 years ago

Investigate chunk-by-manifest strategy for mochitest

Categories

(Testing :: Mochitest, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Unassigned)

References

(Blocks 1 open bug)

Details

In bug 1053916 RyanVM attempted to enable chunk-by-dir for mobile, but found it hard to balance the chunks.

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2)
> The end result of that Try run was massively-imbalanced chunks, leading to
> some pushing 2+hr runtime and others timing out. Clearly we'll need
> something better.

One possibility is to implement a 'chunk-by-manifest' option in which we would explicitly list which tests go into which chunks via the manifest.

I'm picturing something like this:

manifest1.ini:
[DEFAULT]
chunk-group = 'media-content'

[test_foo.js]
[test_bar.js]


manifest2.ini:
[DEFAULT]
chunk-group = 'dom'

[test_baz.js]
[random_media_related_test.js]
chunk-group = 'media-content'


manifest3.ini:
[test_fleem.js]
[test_xyz.js]


Here every unique 'chunk-group' corresponds to a separate chunk. Ideally there would be more chunks than chunk-groups, but we could run them together if that's not the case. Manifest2 has a test that overrides its default chunk-group. Manifest3 has tests without any 'chunk-group'. When this happens, the leftover tests could get split evenly across however many chunks are remaining. E.g say we have 6 chunks and 3 chunk-groups, the tests without a group would get split across chunks 4, 5, and 6.

This lets us explicitly separate tests however we want. For example if we ever wanted to implement Clint's quarantine idea, we could just add individual tests to the 'quarantine' chunk-group (and figure out how to get treeherder to display the chunk-group name instead of a number).

Anyway, good chance this a dumb idea, but figured it might be worth looking into.
We talked about this a bit on IRC, and our thought was that we could add a "weight" of some sort to tests, something like:
[test_foo.html]
weight = 2

would say "this test counts for the weight of two tests", so the chunking would split things up into chunks of roughly equal weight. Our thought with this was that the harness could output timing information after the run for each chunk, and we could use that to feed back into the manifests to set the proper weights.
(It might also be nice to build the chunking logic into manifestparser rather than each harness.)
The ability to do fine grained chunking is a precursor to several of the high level goals for 2015 including normalized chunks and supporting some of the intermittent and sheriffing goals. I'm looking into what will be involved in getting this implemented for a harness like mochitest. Here is what I've found so far (random thoughts, no particular order):

* +1 to comment 2. Chunking algorithms should live in manifestparser and use the strategy pattern (be swappable). Manifestparser should tell the harness *what* needs to get run based on the chunking parameters. The harness should be responsible for *how* things get run. This means things like --run-by-dir need to continue to live in mochitest.

* manifestparser needs to get split up a bit. There is a lot of special case logic in manifestparser.py and I don't think chunking will be the the last of it. I'd like to avoid tacking on yet another feature without first figuring out how to make sure it's maintainable. We should maybe think of things like subtest and chunking as "plugin-like".

* I prefer the solution in comment 0 over the weight one in comment 1. Achieving normalized chunks is possible with both (albeit the algorithm to do so is more complicated with the comment 0 solution), but the comment 0 solution allows very fine grained control of what tests show up in what chunks. I think this fine grained control will be needed by some of the intermittent orange things we may work on in the future.

* To avoid buildbot dependencies, calling the chunk-groups '1', '2', '3', etc.. to start out with would likely be simpler.

* mochitest chunking is very fragmented at the moment. Currently we have --total-chunks, --this-chunk, --chunk-by-dir, --run-by-dir and --bisect-chunk. Chunking happens in JS unless --run-by-dir or --bisect-chunk is passed in, then it happens in python. Desktop and b2g both use mochitest.ini, but android uses json files except for robocop which also uses mochitest.ini. So we can't get rid of JS based chunking until android is fully converted to mochitest.ini (bug 1083347, apparently this is close). There seems to be agreement that moving chunking out of JS and into python is a good thing. Afaict, the chunking seems to be the same across mochitest flavours.

In conclusion, implementing a good maintainable solution and getting all of mochitest to use it is going to be a lot of work, probably more than a quarter's worth. I think there exist some well defined deliverables that get us most of the way there though.
FWIW, if we wanted a feedback loop that measured test runtime and then updated manifests to try to keep the overall runtime of chunks equal, it wouldn't be that hard to implement that using your proposal in comment 0. We'd just have a script that took the test runtimes and sorted things into chunk-groups as appropriate.
I think the proposal in comment 0 should be modified a bit. Instead of:
[test_foo.py]
chunk-group = bar

It should be:
[test_foo.py]
chunk-groups = bar,baz

This way a test can belong to more than one group. For example take the mochitest.ini for indexedDB tests:

[DEFAULT]
chunk-groups = dom,indexedDB,1

This means these tests will appear in chunk 1 when run in automation via buildbot. But will also be run if you pass in --this-chunk dom (which presumably covers all dom related tests) or --this-chunk indexedDB.
isn't there a danger of running the tests twice if we are not super careful with our configs and parameters?
Yes, that's true. But I realized the single chunk-group solution might not be flexible enough since the total tests run can vary drastically by configuration. E.g We will likely want to chunk very differently between mochitest-plain and mochitest-plain-e10s, so we'd need something like:

chunk-groups = plain-4,plain-e10s-2

I agree this will be hard to manage by hand, so step 2 would be a tool that can normalize chunks and update manifests automatically.

Also, I'm open to less complicated solutions.. but I'm having trouble thinking of anything else.
so why do we want to hand select tests to be in a specific chunk?  Can we step back and outline a problem or set of problems we are trying to solve here?  I am getting concerned that we will be checking large manifest changes into the tree on a regular basis just to manage these chunk-groups.
FWIW I think that both solutions have merit here.

To enable chunking of full test runs, we should use as input a list of all tests and the expected duration (weight), the total chunks, and the chunk number. To work out how many chunks are needed we can run this in an alternate mode where we ask "given the tests we have, what number of chunks are required so that each takes N minutes?" (in a more sophisticated system this step would happen at runtime, but unless we can make that work with buildbot it seems impractical at the moment).

This will allow deterministic production of a list of tests to run in that specific chunk with no possibility of overlap (assuming the software that does the chunking isn't buggy). It won't require frequent changes to actual manifest files but the test durations will have to be regularly updated based on the observed runtimes, and the number of chunks for each platform may also require updates.

The ability to select specific tests to run is useful when trying to e.g. verify if new tests are stable (at least when run outside of the context of a chunk) or green up known unstable tests. For these situations being able to commit a manifest and say "run exactly these tests" seems like a very useful ability.
(In reply to Joel Maher (:jmaher) from comment #8)
> so why do we want to hand select tests to be in a specific chunk?  Can we
> step back and outline a problem or set of problems we are trying to solve
> here?  I am getting concerned that we will be checking large manifest
> changes into the tree on a regular basis just to manage these chunk-groups.

Fine grained test selection is explicitly listed in our 2015 planning document and lists some of the projects it supports:
https://docs.google.com/a/mozilla.com/document/d/19gY2WRyvFNwbxYXAacaJWyBdSQPCzHINfOfTQl9cvBk/edit#

Things that stand out to me:
* normalizing test chunks - a goal in the 2015 plan
* scheduling certain tests periodically - supports releng goal for smarter scheduling
* allows test quarantine for intermittents - supports intermittent goal

Do you have other ideas on how to implement fine grained test selection?
I am under the assumption that test selection is completely independent of deciding which chunk each test will fall into; and I believe this is what James is suggesting.

On the issue of specifying test selection, I do not know how it is done now: If tests are selected by directory or by some predefined group, or by some common attribute.  If tests are selected by chunk now (for example "plain-e10s-2" has "2" which I assume indicates chunk 2), then that should only be indicating the tests to run and no longer have any say about the chunk they will be run in.  I would expect chunk-based test selection will be quickly depreciated in favor of logical groupings (or specific tests) only

I am assuming every test will have an average-run-time statistic, and that statistic will be used as the weight to fill the chunks with an equal amount of work.  Upon requesting a set of tests, the chunking algorithm takes the tests' statistics, the number of desired chunks, and returns a fine-grained list of every selected test, and the chuck it will be run in.

If I am wrong about how I imagine chunking will work, or there are constraints I am missing, please elaborate:  This may impact the data being gathered.
Flags: needinfo?(ahalberstadt)
There was a fairly productive irc chat that just happened. I'll try to summarize it and make some sort of a conclusion. See if you all agree with me :)

First some clarifications:

1) My "chunk-group" proposal is very similar to the currently existing subsuites (though not quite the same, more on this later).
2) I'm renaming "chunk-group" to "label" which I think has clearer meaning.
2) I was conflating labels with chunking, when really they should be two different things.
3) Labels or no, tests should still have weights so that it is easier to normalize chunks across a variety of platforms/configurations.

Problem statement:

Manifestparser is ultimately responsible for outputting a list of tests based on a wide variety of different inputs. E.g one input is the filter values used by skip-if, another is the subsuite, a third will be thisChunk and totalChunks, a fourth will be a weighting representing the runtime of a test, a fifth might be a logical label that groups tests. Several or all the inputs can be used in conjunction with one another to inform the final list of tests. For example we can have subsuite=devtools and still perform normalized chunking using weights on top of that.

The problem we're trying to solve is figuring out all the different ways we need to slice and dice tests for all of the various use cases we have. Then we need to figure out how to combine those inputs to achieve the desired list.

First, how might we want to group tests?
1) All
2) By subsuite, e.g devtools
3) By label, e.g dom, indexedDB [1]

Second, how might we want to split up that group?
1) split all tests uniformly across chunks - current default case
2) split all tests up across chunks based on their runtime - future default case
3) split tests up across chunks based on some commonality - e.g by directory, by module, by manifest etc. [2]

Proposal:

The ideal chunking algorithm of the future is one that normalizes total runtime per chunk and also ensures tests in the same manifest (and/or logical label) are run together.

To do this each test is assigned a weight based on its runtime [3] by an external process that gets run periodically. These weights are used in conjuction with totalChunks, thisChunk and the full set of manifests (or labels) to group manifests (or labels) together in such a way that each chunk is as close to the same total runtime as possible.

Side notes:

[1] The difference between subsuite and label is that being part of a subsuite excludes a test from "All", whereas a label wouldn't. It would be nice not to have both of them but I think they solve two different things. Maybe label is unnecessary and we can just go by manifest for now.

[2] In our case manifest and directory are the same for the most part. Thinking in terms of manifests seems more useful to me, but it probably makes little practical difference.

[3] Unfortunately weights need to be per-test. If weights were per-manifest, we wouldn't be able to figure out the updated total weight if one of the tests gets skipped on a specific platform. This means the manifests could potentially get very ugly :(
Flags: needinfo?(ahalberstadt)
Since weights have little to no value for human consumption and are intended to be read and written by automated processes, we could create a separate mochitest-runtimes.json file similar to all-tests.json instead of storing this data in the manifests. I guess in this case the weight could also just be the actual average runtime.
Yeah, that's probably fair, it's just clutter in the manifests and we're not expecting humans to consume the data. It'll likely wind up being updated as a single blob by an automated process at some point derived from current runtime data.

If we're going to go that route maybe we just write a json file per-platform with weights for that platform, since it's almost certain that these will vary by platform? We could optimize this a bit by having a threshold below which we don't include tests (I assume most tests will be fairly quick) and only include those tests that are above the threshold. Here we would probably want to use the weight concept, where the threshold is "1" and we give the weight as this test's execution time as a multiple of the threshold value.

I'm sure there's some fancy stats terms to describe what I'm talking about here, maybe Kyle can chime in and give me a clue. :)

Let me repeat Ted, so I can be corrected if I'm wrong:

If the current average runtimes (weight) are available for each test on each platform, then we could write a json file with that information.  That file could be much smaller if we excluded all the fast running (with weight below some threshold) tests.  Calculating the optimal arrangement of tests into chucks is minimally impacted by this reduced information.  

I have used similar strategies in the past, but I am not aware it has a formal name.  If we choose this route, an analysis on the expected error should be done:  We should ask ourselves how tolerant we are of imbalanced chunks (phrased as "optimal chunk run time + N minutes"), which can be used to determine a threshold that would reasonably give us that behaviour.  We may find that the threshold can set quite high [1], especially if test arrangement for "optimal chunk run time" turns out to have high variance.

Is there a reason why a json file of these average runtimes is required?  Would it be better for the test runner to get this information via service call? 


[1] Resulting in a very small json file
(In reply to Kyle Lahnakoski [:ekyle] from comment #15)
> 
> Let me repeat Ted, so I can be corrected if I'm wrong:
> 
> If the current average runtimes (weight) are available for each test on each
> platform, then we could write a json file with that information.  That file
> could be much smaller if we excluded all the fast running (with weight below
> some threshold) tests.  Calculating the optimal arrangement of tests into
> chucks is minimally impacted by this reduced information.  

My understanding is that we will not aim purely for optimal chunks, but will impose additional constraints. Notably I expect us to impose the constraint that tests always run in the same order in order to increase the determinism of the chunking process. I might be wrong about this, however.

> I have used similar strategies in the past, but I am not aware it has a
> formal name.  If we choose this route, an analysis on the expected error
> should be done:  We should ask ourselves how tolerant we are of imbalanced
> chunks (phrased as "optimal chunk run time + N minutes"), which can be used
> to determine a threshold that would reasonably give us that behaviour.  We
> may find that the threshold can set quite high [1], especially if test
> arrangement for "optimal chunk run time" turns out to have high variance.

So I think that this error analysis is quite complicated to get right in detail because each test's runtimes probably doesn't have a simple distribution. In particular it's likely that the failure case has a different mean runtime than the pass case, and certain error cases may peg the runtime at the test timeout (for suites that have such a thing). Assuming that tests are grouped by feature and breakage of tests is correlated by feature, this means that where a build breaks some tests the batch runtime might be substantially different from the prediction you would make from a model that considers only passing runs.

Of course, if you only care about fully passing runs, a model where test runtimes have some simple (e.g. Gaussian, although again I expect this is not really true) distribution seems more reasonable.

> Is there a reason why a json file of these average runtimes is required? 
> Would it be better for the test runner to get this information via service
> call? 

Depends on the latency of getting data from the service, I guess. And perhaps how reproducible we want each run to be.
(In reply to Kyle Lahnakoski [:ekyle] from comment #15)
> Is there a reason why a json file of these average runtimes is required? 
> Would it be better for the test runner to get this information via service
> call? 

This information needs to be in the tree for reproducibility. The chunking should be deterministic for a particular changeset so that retriggering jobs or doing bisection produces consistent results.
Ok, I think I have a good understanding of what needs to get done next quarter. There seems to be two separate problems discussed in this bug:

1) group by runtime
2) group by label

I think 1) is more closely aligned with our immediate goals, so I'll focus on that first. At the same time I'll focus on implementing classical chunking in manifestparser rather than the harnesses (though switching harnesses to use this new manifestparser based chunking will be another matter).

One problem we'll run into whenever "re-normalizing" the chunks will be new order based failures. So we may want to try to normalize the chunks while also making sure not to split up directories (i.e a combination of chunk-by-dir and runtimes).

I feel this was thoroughly "investigated", so I'll close it out and file new bugs for the specific tasks, though don't let that stop you from chiming in if you still have more to discuss.

Thanks everyone!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1123763
Blocks: chunking
You need to log in before you can comment on or make changes to this bug.