Open Bug 1323839 Opened 3 years ago Updated 2 years ago

Turn off chunk-by-runtime for test suites currently using it


(Testing :: General, defect, P3)



(Not tracked)


(Reporter: RyanVM, Unassigned)


While chunk-by-runtime was a good idea in theory, I'm of the opinion that it's ultimately done more harm than good and should be shut off in favor of going back to plain run-by-dir.

* It leads to ongoing developer confusion as it makes which tests run in which chunk completely unpredictable. It can vary by platform/flavor on the same revision and move around from push to push even for the same platform/flavor. This leads to frustration and wasted resources as developers have no choice but to run the full mochitest-bc/dt suite on a Try push to test their patches rather than just the specific chunk containing the test they care about (i.e. test gets backed out for failures in a test that ran in dt4 on their push to inbound but the test is no longer in dt4 on their Try push and there's no way to predict where it *will* be).

* It requires actively maintaining in-tree runtime data to work effectively. A cursory glace at Treeherder says that things are way out balance at this point and that data hasn't been updated in over a year. While we could rely on people happening to notice when this happens and force an update, that clearly hasn't been happening up to this point. It also seems unlikely that an automated solution is going to be prioritized against other Engineering Productivity projects.

* The cost of chunking isn't as high as it used to be. Builder limits for buildbot jobs are largely irrelevant nowadays and Taskcluster allows for easy chunking changes as needed. While there is still some per-job overhead we'll face for adding more chunks, it's my understanding that we're in a better place than we used to be there due to focused efforts on reducing that overhead. Adding more chunks leads to much easier to understand and predictable behavior.

Awhile back I attempted a Try push with run-by-dir enabled for m-bc/dt again, but I actually did encounter some test failures along the way. That was surprising to me since AFAIK, chunk-by-runtime still uses run-by-dir behind the scenes, but it at least says that we'll need an updated Try push to see where things stand. Maybe the work Joel and company have done to isolate tests that don't run well in isolation has helped in the mean time.
we still have a high cost for adjusting chunks- all windows/osx tests will require buildbot changes, I don't see that changing for at least 3 months if not 6 months.  Then we have to ride trains once we get trunk all on taskcluster.

I do agree that per job overhead has been reduced- I believe there is a lot more we can do to reduce some of the mozharness and virtualenv cleanup- likewise for linux/android reducing our docker download/setup costs have a lot of room for improvement.

This should be an easy change as there are few places where we actually define chunk-by-runtime:

the chunk-by-dir work is where we actually take the list of directories and chunk based on the number of directories found.  In that case we still do --run-by-dir, so I would be surprised if we have any differences by removing chunk-by-runtime.

RyanVM- can you do a new try push without chunk-by-runtime for |try: -b do -p all -u mochitests -t none --rebuild 2| ?  that would give us a good idea of where we stand today
Changing the number of chunks for a buildbot job is about as trivial as you can get as far as editing buildbot-configs is concerned, so I don't see that as a high cost here.
yes/no- there are so many exceptions right now and will it work in trunk/trains, it isn't impossible, but it is still problematic.
In the Q1/Q2 timeframe, we're going to be moving to a manifest-based display in Treeherder; along with try syntax changes this will mean developers don't need to know what chunk their tests are running in any longer.

As we increase the number of chunks, the value of chunk-by-runtime increases; without it we get wildly unbalanced chunks (in your try run, linux64 opt bc ranges from 8 min to 46 min), which makes reducing E2E times efficiently difficult. There's substantial overhead in TC related to docker setup (which isn't obvious from looking at Treeherder, and can be more than 10 min a chunk).

If we do decide to do this as a stop gap to solve developer confusion, it's very likely we'll turn it back on as we move to manifest-based display and execution.
If that's the case, then we definitely need a solution for keeping the timing information up-to-date. One need only look at Linux32 debug mochitest-dt to see how all over the place things currently are.
I've asked ahal to implement a solution to make keep the timing information automatically up-to-date (the current process is manual). I don't think that will take long to implement.
if we really get to hyper chunking, won't we be running one manifest/chunk in an ideal world?  If so, then keeping the timing up to date is a temporary measure as well.
The Try push is hitting timeouts for some chunks, so this is a non-starter for now without adjustments to the number of chunks. As a band-aid, let's at least update the in-tree timings for better balancing. I've filed bug 1324047 for that.
(In reply to Joel Maher ( :jmaher) from comment #8)
> if we really get to hyper chunking, won't we be running one manifest/chunk
> in an ideal world?  If so, then keeping the timing up to date is a temporary
> measure as well.

Maybe. The current thinking is that some manifests are so small that running them independently won't make sense; we'll likely still combine the shortest-running manifests. But that's all subject to change as we decide on implementation strategies.
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.