Open Bug 1653058 Opened 5 years ago Updated 3 years ago

Allow sheriffs to identify that a current push is waiting on tasks from prior push(es)

Categories

(Tree Management :: Treeherder, enhancement)

Tracking

(Not tracked)

People

(Reporter: Callek, Unassigned)

User Story

From a Sheriff Point of view (after Bug 1653050):

A sheriff starts their shift, or starts to look at a tree, and sees several pushes with only the Decision task having run and no other indication of anything happening on screen. They suspect something is wrong with scheduling/Taskcluster and escalate it to the Taskcluster/Releng teams as a likely outage.

The most recent pushes were all actually waiting on pending/started tasks from prior pushes that were off screen. The sheriff has no indication of which recent pushes are waiting on those tasks, nor any indication of if/when one of those older tasks fails and is thus blocking a more recent push...

We should strive to make it clear that a given push has pending jobs from an earlier push and make it easy for sheriffs to identify what is going on with a given push.

Some ideas floated in a joint meeting with myself, tom, jordan, camd, armenzg, ekyle, and sclements today:

UX:

  • A header item in the Treeherder view of the push header indicating that this push is waiting/blocked on X tasks from a prior push, clickable to show which tasks/pushes.
  • An extra tab that shows which tasks/pushes have items blocking
  • Allow the display of Unscheduled tasks on treeherder as well (instead of just ignoring them)

Implementation:

  • A TC index we add to these pushes when replacement happens that Treeherder can look up somehow.
  • An artifact published as part of the decision task identifying what tasks we are replacing with, from earlier pushes.
    • This seemed like the likely choice in discussion
    • Possibly JSON artifact, and we could always add fields/details to this file if treeherder needs more data.
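
As a rough illustration of the JSON artifact idea above, the sketch below serializes a list of replaced task IDs and reads it back while ignoring fields it does not know about, so the file can grow later. The field name `replaced_tasks` and both helper names are hypothetical; nothing in this bug fixes the artifact's schema.

```python
import json


def build_optimization_artifact(replaced_task_ids):
    """Serialize replaced task IDs as a JSON artifact body.

    `replaced_tasks` is a hypothetical field name, not a confirmed schema.
    """
    # Sort and de-duplicate so the artifact is stable between runs.
    payload = {"replaced_tasks": sorted(set(replaced_task_ids))}
    return json.dumps(payload, indent=2, sort_keys=True)


def parse_optimization_artifact(text):
    """Return the replaced task IDs, tolerating unknown extra fields."""
    data = json.loads(text)
    return list(data.get("replaced_tasks", []))
```

Because the reader only looks up the one key it needs, extra fields could be added to the file later without breaking an older consumer.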

CCs, did I get the summary here right?

(In reply to Justin Wood (:Callek) from comment #0)

> Implementation:
>
>   • A TC index we add to these pushes when replacement happens that Treeherder can look up somehow.
>   • An artifact published as part of the decision task identifying what tasks we are replacing with, from earlier pushes.
>     • This seemed like the likely choice in discussion
>     • Possibly JSON artifact, and we could always add fields/details to this file if treeherder needs more data.

JSON artifact seems like it would be easy to implement. Artifacts are currently only accessed via the firefox-ci API for both display in the UI and as part of task ingestion. Plus, we might not need to store anything in our databases this way.

Depends on: 1657939

Sarah, Armen asked me to n-i you about this...

How does the artifact at https://firefoxci.taskcluster-artifacts.net/FIKKv92cR1GxU2LnriSLUg/0/public/optimization-data.json look as a first pass here?

I'll paste the Matrix chat between armen and myself:

[armenzg] Callek: let's say there's 10 task ids on that artifact. What does it mean to me?
[armenzg] was that generated by the gecko decision task?
[Callek] armenzg: that is "TaskIDs that are deps of at least 1 task in the current decision task, that were 'replaced' and not in this currently generated graph."
[Callek] armenzg: so, for example, if any of those ids are not complete we'd want some UX to identify which ones, and how many are not complete
[Callek] armenzg: this is so that, for example, if a build is waiting on a docker image, we'd be able to see that the docker image, from an earlier push, is pending/running and not complete.
[Callek] if we need more/different data that is fine, but this was my "Keep it Simple" principle at play, until and unless we need more.
[armenzg] Callek: for instance, push A has 20 tasks from another push and some tasks with push A are waiting on those 20 tasks
[armenzg] so we need a way to show that some tasks have not yet run on push A bc some of those 20 tasks have not yet completed
[Callek] exactly
[armenzg] Callek: from my POV it looks OK, however, if possible please NI sclements on that bug
[Callek] this artifact (decision task) doesn't know [or care] whether or not those replaced tasks have completed or not, at least how I wrote it.
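
To make the "which ones, and how many are not complete" idea from the chat concrete, here is a minimal sketch. It assumes the consumer already has a mapping of taskId to task state and treats only `completed` as done; the function name `summarize_blocking` is hypothetical and not part of Treeherder.

```python
from collections import Counter


def summarize_blocking(states_by_task_id):
    """Given {taskId: state}, return (blocking_ids, counts_by_state).

    Assumption: any state other than "completed" counts as still blocking.
    """
    blocking = {
        task_id: state
        for task_id, state in states_by_task_id.items()
        if state != "completed"
    }
    # Sorted IDs give a stable order for display; the Counter gives the totals.
    return sorted(blocking), Counter(blocking.values())
```

A push header could then show the length of the ID list as the "waiting on X tasks" count and list the IDs on click.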

Flags: needinfo?(sclements)

I think this artifact should work great. Is there an API in Taskcluster where you can pass multiple task IDs and get their status and push revision, perhaps? Anyway, this sounds like a great start. :)

(In reply to Cameron Dawson [:camd] from comment #3)

> I think this artifact should work great. Is there an API in Taskcluster where you can pass multiple task IDs and get their status and push revision, perhaps? Anyway, this sounds like a great start. :)

I don't think there is a single API that accepts multiple IDs and returns the push revision. In my mind, the "push revision" would come from Treeherder's DB, since it would already have ingested these taskIDs (I think).

Dustin may know about APIs that exist around this area.

Flags: needinfo?(dustin)

All TC APIs are going to be for a single taskId, but you're welcome to call that API method multiple times. TC has no native notion of a revision, but if you want to bake in some knowledge of what Firefox CI tasks look like, you may be able to determine that. I don't know how Firefox-specific (or taskgraph-specific) you want to make things.
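
Calling the single-task method once per ID could look roughly like the sketch below. It assumes the Queue status endpoint at `/api/queue/v1/task/<taskId>/status` and the Firefox CI root URL shown; the helper names and the injectable `get_json` parameter are hypothetical, added so the loop can be exercised without network access.

```python
import json
from urllib.request import urlopen

# Assumed root URL of the Firefox CI Taskcluster deployment.
ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"


def status_url(task_id, root_url=ROOT_URL):
    """Build the Queue status URL for one taskId."""
    return f"{root_url}/api/queue/v1/task/{task_id}/status"


def _http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_states(task_ids, get_json=_http_get_json):
    """Call the status endpoint once per taskId; return {taskId: state}."""
    states = {}
    for task_id in task_ids:
        body = get_json(status_url(task_id))
        states[task_id] = body["status"]["state"]
    return states
```

This issues one request per task, so a push that depends on many replaced tasks would need batching or caching on the caller's side.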

Flags: needinfo?(dustin)

I think all the tasks should already be known to Treeherder, so it would be possible to query Treeherder for that data, rather than Taskcluster.

I agree that JSON artifact looks good. I'm wondering if we would need to know the run (retry) id too for each of those task ids?

Flags: needinfo?(sclements)

(In reply to Sarah Clements [:sclements] from comment #7)

> I agree that JSON artifact looks good. I'm wondering if we would need to know the run (retry) id too for each of those task ids?

In this case the task dependencies do not care about the Run/Retry ID, because a dependent task waits for the "first successful run with that taskID". Generally speaking, once a run succeeds, the jobs that depend on it will start. If it fails, those jobs won't start, but a rerun of the failed task (if the failure was intermittent) would still kick off the dependent jobs.
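
The rule described here can be written as a one-line check. This is only a sketch of the behaviour as stated in this comment, assuming each run reports a state string such as `failed` or `completed`; it is not a full model of Taskcluster's dependency options, and `dependency_satisfied` is a hypothetical name.

```python
def dependency_satisfied(run_states):
    """True once any run of the depended-on task has completed successfully.

    The run index is irrelevant: a failed first run followed by a
    successful rerun still satisfies the dependency.
    """
    return any(state == "completed" for state in run_states)
```

For example, `dependency_satisfied(["failed", "completed"])` is true, which is why the artifact would only need task IDs and not run IDs.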

Is this a duplicate of bug 1066272?
