Chunks (no matter what algorithm) are a really bad way of dividing up tests:
1) Test suites span multiple jobs (hard to find result of a specific test)
2) Individual tests can run in different jobs from one push to the next
3) No logical grouping of tests, each chunk runs tests that are unrelated
4) Places burden for dealing with this on developers
Chunks came into existence because our job scheduling capabilities were not flexible enough to do anything better. But with taskcluster and jobs moving to AWS, this is no longer true.
Instead, I propose that we move away from the notion of "chunks". Each "suite" corresponds to a single symbol on treeherder, and all tests in the suite get run under that symbol. Now obviously we still need to split the job up across multiple test instances for parallelism. But this should be an implementation detail that developers don't have to know or care about.
One potential way to do this is to create a 'parent' taskcluster task that spawns N 'child' tasks, collects all their results and displays the results back to the user in treeherder. Each 'child' job could potentially run a single directory/manifest of tests, as we likely need to keep directories of tests together. Taskcluster has an API to insert new jobs into an existing task graph, which I believe the 'parent' task could make use of.
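To make the 'parent'/'child' split a bit more concrete, here is a rough sketch of how a parent task might divide a suite among N children while keeping each directory of tests together. This is only an illustration: the function names (group_by_directory, assign_to_children) are made up, it uses test count as a stand-in for runtime, and it does not touch any taskcluster API.

```python
from collections import defaultdict
import posixpath


def group_by_directory(test_paths):
    """Map each directory to the list of tests it contains."""
    groups = defaultdict(list)
    for path in test_paths:
        groups[posixpath.dirname(path)].append(path)
    return dict(groups)


def assign_to_children(test_paths, num_children):
    """Spread whole directories across num_children child tasks.

    Assumption: test count approximates cost. Real scheduling would
    more likely use historical runtimes per directory/manifest.
    """
    groups = group_by_directory(test_paths)
    children = [[] for _ in range(num_children)]
    # Largest directories first, each going to the least-loaded child.
    for directory in sorted(groups, key=lambda d: (-len(groups[d]), d)):
        target = min(children, key=len)
        target.extend(groups[directory])
    return children


tests = ["a/t1", "a/t2", "a/t3", "b/t1", "b/t2", "c/t1"]
children = assign_to_children(tests, 2)
```

With the sample list above, directory a ends up in one child and directories b and c share the other, so no directory is ever split between children.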
This will also require some (possibly heavy) modifications in treeherder, and there are a lot of UX problems to deal with. For example, should the 'child' tasks still be available but hidden by default (e.g. maybe tier 3)? How can we display an aggregate result of all the 'child' tasks in an easy-to-understand format? Some suites may become nearly perma-orange simply because of the sheer number of tests they'll be running in a single job; how do we deal with this?
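As a starting point for the aggregation question, one possible shape for the combined result is sketched below. Everything here is an assumption for discussion: the aggregate_results name, the "pass"/"fail" status strings, and the child task labels are hypothetical and not an existing treeherder or taskcluster format.

```python
from collections import Counter


def aggregate_results(child_results):
    """Fold per-child results into a single suite-level summary.

    child_results maps a (hypothetical) child task label to a dict of
    test path -> status string.
    """
    counts = Counter()
    failures = []
    for label, results in sorted(child_results.items()):
        for test, status in sorted(results.items()):
            counts[status] += 1
            if status != "pass":
                # Remember which child ran the test so its log can be found.
                failures.append((test, status, label))
    return {
        "overall": "failed" if failures else "success",
        "counts": dict(counts),
        "failures": failures,
    }


summary = aggregate_results({
    "child-1": {"a/t1": "pass"},
    "child-2": {"b/t1": "fail", "b/t2": "pass"},
})
```

Keeping the child label next to each failure means the developer only sees one result for the suite, but can still get from a failing test to the log of the child that ran it.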
This is a large project that will at least span taskcluster, treeherder and test harnesses. This bug will act as a tracking bug for more specific work.
Adding joel as, in many of my conversations with him, he's often addressed specific chunks, rather than whole suites. So hiding this too deeply may be problematic.
I'm changing the title to better reflect the terminology we've been using.
"Hyper-chunking" is the term we've started using to describe the case where a job is split up into so many chunks that individual chunks become meaningless to developers/sheriffs. Instead of creating tools to help developers/sheriffs deal with chunks, we're basically starting fresh and seeing if we can come up with a better system altogether. The main benefit we hope to achieve is super fast task run times and a lot of scheduling flexibility.
Note: when I say "chunk" in this context, I don't necessarily mean a "number in treeherder". I mean an execution instance of a test harness that runs a subset of the overall tests. Whether that execution instance happens on a separate worker, a different container on the same worker, or something else is part of the implementation details that this bug will aim to figure out.