Closed Bug 1474005 (chunk-buster) Opened 7 years ago Closed 6 years ago

[meta] Tracking bug to hide test chunking from developers

Categories

(Testing :: General, enhancement, P3)

Version 3
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ahal, Unassigned)

Details

(Keywords: meta)

In order to parallelize the massive number of tests that need to run on every push, we break test tasks up into chunks. This decreases the end-to-end time of an overall push and lets CI keep up with the rate of pushes to our trees. However, there are several drawbacks to chunks:

1. They move tests around from task to task, which complicates things like treeherder bisection, reproducing failures locally and scheduling specific tests on try.
2. They make it difficult to summarize the overall result of a suite. Results must be gathered from every chunk task and combined in order to do this.
3. They add complexity for tools and visual clutter for humans.
4. They increase the overall runtime (due to the extra setup/teardown costs).

Ideally, chunks are something that should happen "behind the scenes" such that developers aren't even aware of them. Accomplishing this will be no small feat and will likely involve many quarters of work from several teams (taskcluster, treeherder, test harnesses). This bug will track the work items towards our end goal.

For more information on the problems this is trying to solve, requirements and brainstorming, see this doc: https://docs.google.com/document/d/1D4_wxi45vIKYxSe-5VzXV8Vde9QiYw0KstxvMmZF4p8/edit#heading=h.5x0d5h95i329
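Drawback 1 can be illustrated with a small sketch. It assumes a deliberately simple strategy (sort the tests, then cut the list into contiguous, roughly equal slices); the function name `chunk_tests` and the test names are hypothetical, and the real harnesses may chunk differently. The point is only that removing one test shifts an unrelated test into a different chunk.

```python
def chunk_tests(tests, total_chunks):
    """Split tests into contiguous, roughly equal chunks (simplified assumption)."""
    tests = sorted(tests)
    size, extra = divmod(len(tests), total_chunks)
    chunks = []
    start = 0
    for index in range(total_chunks):
        # The first `extra` chunks each take one additional test.
        end = start + size + (1 if index < extra else 0)
        chunks.append(tests[start:end])
        start = end
    return chunks


# Hypothetical test names, before and after test_a is removed from the tree.
before = chunk_tests(["test_a", "test_b", "test_c", "test_d"], 2)
after = chunk_tests(["test_b", "test_c", "test_d"], 2)

# test_c ran in the second chunk before the change and in the first chunk after.
print(before)
print(after)
```

Because test_c changes tasks between the two pushes, comparing "chunk 2" across those pushes no longer compares the same set of tests.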
Priority: -- → P3
A brief chat yesterday yielded a couple of next steps:

1) Treeherder-specific changes needed to group tests. Slight adjustments may be needed to change metadata in taskgraph and tasks, but the primary work here is to get treeherder to have a different view.
2) Focus on run-by-manifest. If we need to backfill or query whether a test was run, we can go down to the manifest level instead of the test level. This would include making it easy to schedule a manifest of tests instead of a task or a single test, ideally in a way that lets us backfill and retrigger easily.

:ahal, could you outline more thoughts on this? I think you had a lot of thoughts on #1.
Flags: needinfo?(ahal)
Originally, I thought this was a problem that would need to be (partially) solved in taskcluster. I thought we would need to define some kind of "parent" task that contains "subtasks", or else invent some kind of notion of a "group of tasks" so treeherder could display them all together. But after conversations at the work week and with :bstack on irc, I think this doesn't need any taskcluster modifications at all.

There was general consensus at the work week that treeherder is the thing that should be responsible for aggregating the results of these "subtasks" into some kind of parent task. Originally, I thought the definition of this "parent" task was something that taskcluster would need to provide to treeherder, but I no longer think that is the case. This is something that can be defined solely with the in-tree parts of taskgraph.

Currently, there is a 'treeherder' section in a task definition that treeherder uses to grab things like the description and symbol to display. I propose that we extend this value to support something like this:

    treeherder:
        parent:
            platform: linux64/opt
            symbol: M
        ...

If treeherder finds that a task doesn't have a symbol of its own (but does have a parent key), it would know that all tasks with the same "parent" should be logically grouped together as subtasks. Treeherder could then insert the `parent['symbol']` into its display and do whatever log aggregating of subtasks it needs to. From taskcluster's point of view this "parent" doesn't exist.

Since we'll need to continue supporting the current `treeherder` schema, it may be worth formalizing this a little more. Maybe we can invent a concept of a treeherder "view" or "type" which is defined per-task:

    treeherder:
        type: subtask
        parent:
            ...
        ...

    treeherder:
        type: default
        symbol: 1
        groupSymbol: M
        ...

This proposal puts basically 100% of the onus on treeherder for this project, and I recognize it's going to be a massive amount of work. Plus, there are tons of details that still haven't been thought through. But from an extremely high level, this looks like a decent way to move forward. At least it seems like the best bet so far.

I'd love to hear opinions on this approach. I'd also like to spend a bit of time familiarizing myself with the treeherder code base to try and get a sense of how crazy this is going to be :)
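To make the grouping rule concrete, here is a minimal sketch of what the treeherder side might do with the proposed `parent` key. It is not treeherder code: the helper names (`group_by_parent`, `aggregate_result`), the `result` field and its values ("passed", "failed", "running"), and the task labels are all assumptions made for illustration. Only the rule itself (no symbol of its own, but a parent key) comes from the proposal.

```python
from collections import defaultdict


def group_by_parent(tasks):
    """Group tasks that have no symbol of their own but declare a parent."""
    groups = defaultdict(list)
    standalone = []
    for task in tasks:
        th = task.get("treeherder", {})
        parent = th.get("parent")
        if "symbol" not in th and parent is not None:
            # Tasks sharing the same parent are shown as one logical task.
            groups[(parent["platform"], parent["symbol"])].append(task)
        else:
            standalone.append(task)
    return dict(groups), standalone


def aggregate_result(subtasks):
    """Collapse subtask results into one parent result (assumed result values)."""
    results = {task["result"] for task in subtasks}
    if "failed" in results:
        return "failed"
    if "running" in results:
        return "running"
    return "passed"


# Hypothetical task definitions, reduced to the fields used above.
parent = {"platform": "linux64/opt", "symbol": "M"}
tasks = [
    {"label": "mochitest-1", "result": "passed", "treeherder": {"parent": parent}},
    {"label": "mochitest-2", "result": "failed", "treeherder": {"parent": parent}},
    {"label": "build", "result": "passed", "treeherder": {"symbol": "B"}},
]

groups, standalone = group_by_parent(tasks)
for (platform, symbol), subtasks in groups.items():
    print(platform, symbol, aggregate_result(subtasks))
```

Nothing here touches taskcluster: the grouping is derived entirely from metadata already present in each task definition, which is the property the proposal relies on.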
Flags: needinfo?(ahal)
I didn't address point 2) from comment 1. One nice property of this proposal is that it doesn't change how the tasks work from taskcluster's perspective. So backfills and retriggers can continue to work the same way they do now with action tasks. There will be non-trivial UX complexity around how to schedule retriggers/backfills from the treeherder parent task, but on the back end things shouldn't look much different. This means that this "subtask" view and moving to a more "manifest-centric" test model are kind of two separate projects. We could work on retriggering/backfilling specific manifests right now without needing to block on this massive treeherder UX project.
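As a rough sketch of the manifest-centric side, the lookup below answers "which existing tasks would need a retrigger to re-run this manifest?". The mapping from task label to manifests, the task labels, and the function name `tasks_running_manifest` are hypothetical; how such a mapping would actually be produced or stored is not decided in this bug.

```python
def tasks_running_manifest(task_manifests, manifest):
    """Return the labels of tasks whose manifest list includes `manifest`."""
    return sorted(
        label
        for label, manifests in task_manifests.items()
        if manifest in manifests
    )


# Hypothetical mapping of chunk task label -> manifests it ran on one push.
task_manifests = {
    "mochitest-1": ["dom/tests/mochitest.ini", "layout/tests/mochitest.ini"],
    "mochitest-2": ["toolkit/tests/mochitest.ini"],
}

print(tasks_running_manifest(task_manifests, "toolkit/tests/mochitest.ini"))
```

Since the tasks themselves are unchanged, the labels returned could be handed to the existing retrigger or backfill actions without waiting on the treeherder "subtask" view.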

The meta keyword is there, the bug doesn't depend on other bugs and there has been no activity for 12 months.
:gbrown, maybe it's time to close this bug?

Flags: needinfo?(gbrown)

I could probably add bug 1583353 as a dependency, but IMO it's not worth it. Most of the discussion in here likely isn't going to be the route we take anyway. If we find we need a meta bug after 1583353 is landed then we can file a new one with better context.

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WONTFIX