Closed Bug 1205871 Opened 10 years ago Closed 9 years ago

Jobs cannot be transitioned from a ``complete`` state to ``retry``

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: camd, Assigned: camd)

Details

Attachments

(1 file)

46 bytes, text/x-github-pull-request
In Buildbot, a job transitions directly from ``running`` to ``retry`` when it will be retried automatically. But for the upcoming Buildbot Bridge and Task Cluster jobs that we will ingest, we need to be able to transition a job from ``complete`` to ``retry``.
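To make the request concrete, here is a minimal sketch of a state-transition guard. The state names, the ``ALLOWED`` table, and ``can_transition`` are hypothetical and for illustration only; this is not Treeherder's actual ingestion code.

```python
# Hypothetical transition table; not Treeherder's real implementation.
ALLOWED = {
    "pending": {"running"},
    "running": {"complete", "retry"},  # Buildbot goes running -> retry directly
    "complete": set(),                 # nothing may follow "complete" today
}


def can_transition(current, new, allowed=ALLOWED):
    """Return True if a job may move from state `current` to state `new`."""
    return new in allowed.get(current, set())


# The change requested here would amount to also permitting complete -> retry.
PROPOSED = {**ALLOWED, "complete": {"retry"}}

print(can_transition("complete", "retry"))            # current behaviour
print(can_transition("complete", "retry", PROPOSED))  # requested behaviour
```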
Assignee: nobody → cdawson
Attached file PR
Attachment #8662661 - Flags: review?(emorley)
Comment on attachment 8662661 (PR): Left a comment on the PR :-)
Attachment #8662661 - Flags: review?(emorley)
Answering the question in the PR about "why do we need this?": Yes, this is needed for our new pulse ingestion work. But there is no current bug with this on our buildbot side. We don't come across it due to the way BB works. My understanding is that, for Task Cluster, a job will be marked completed/failed. Then a retry will happen, creating a new job. It then marks the old job as "retried" when the new one is scheduled. However, your question is a fair one. I'm uncertain if Jonas and I actually discussed it or I somehow got that understanding based on my own assumption. I don't recall now. Maybe TC does it the same way as BB in that an auto-retry job is never marked as ``completed`` without also getting a result of ``retry``. Needinfo'ing Jonas to help clarify.
Flags: needinfo?(jopsen)
Greg: I found out Jonas is on PTO till the 28th. Are you able to answer the above question?
Flags: needinfo?(garndt)
Hrm, maybe I'm misunderstanding the question. Taskcluster only has a limited set of states for a task... exception, failed, completed, unscheduled, running, and pending. When a task is retriggered or retried, we do not update the state of the previous run. If the retry was because of an automatic rerun of the task, a new run for that task ID will be created with a reasonCreated of "rerun". Retriggers (like when clicking retrigger from the TH UI) from my understanding are just new tasks added to the graph along with all the dependents. I didn't see anything immediately obvious in the mozilla-taskcluster code to indicate that we update the previous state in TH for a job. Maybe I'm mistaken about that.
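The model described in this comment could be sketched roughly as follows. The class and field names (``Task``, ``Run``, ``reason_created``) and the ``"scheduled"`` value for the first run are assumptions made for illustration, not the Taskcluster API; only the state list and the "rerun" reason come from the comment.

```python
from dataclasses import dataclass, field
from typing import List

# Task states as listed in the comment above.
TASK_STATES = {"exception", "failed", "completed", "unscheduled", "running", "pending"}


@dataclass
class Run:
    state: str
    reason_created: str  # "rerun" for an automatic rerun, per the comment

    def __post_init__(self):
        if self.state not in TASK_STATES:
            raise ValueError(f"unknown state: {self.state}")


@dataclass
class Task:
    task_id: str
    runs: List[Run] = field(default_factory=list)

    def add_rerun(self) -> Run:
        """Append a new run; earlier runs keep the state they ended in."""
        run = Run(state="pending", reason_created="rerun")
        self.runs.append(run)
        return run


# "abc123" and "scheduled" are placeholder values for this sketch.
task = Task("abc123", [Run(state="exception", reason_created="scheduled")])
task.add_rerun()
```

The point of the sketch is that the first run is left as ``exception``; nothing goes back and marks it as retried.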
Flags: needinfo?(garndt)
Thanks for the info, Greg. It sounds like, for Task Cluster, if a job is retried, the NEW job that is the retry has that "rerun" notation, but the older failed job that was the reason for the auto-retry is not updated. Since this works differently than BuildBot, we may need to either have TH figure out how to make it look the same in the UI, or change people's expectations for auto-retries. I'll put in my calendar to chat with you and Jonas some time next quarter to nail down our story on it. I know that this info is important to a few folks.
I think Greg nailed it. If we have two jobs: j1 and j2 (where j2 is a retry of j1) then I imagine the following event stream:
j1 is unscheduled
j1 is scheduled
j1 is pending
j1 is running
j1 is exception
j2 is unscheduled (and some property on the message says: "j2 is a retry of j1")
j2 is scheduled (and some property on the message says: "j2 is a retry of j1")
j2 is pending (and some property on the message says: "j2 is a retry of j1")
j2 is running (and some property on the message says: "j2 is a retry of j1")
j2 is completed (and some property on the message says: "j2 is a retry of j1")
---
You can have multiple retries of the same task (at least that makes sense in theory), not sure we will ever have that. But it's not natural to send a message:
j1 is retried (and some property on messages says: "j2 is the retry")
because at such a state j2 won't even exist yet.
---
Just my random thoughts here, that being a retry of a task is a one-to-many relation from j2 -> j1.
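The event stream imagined in this comment could be sketched as below. The message shape and the ``retry_of`` key are hypothetical names for "some property on the message"; no real Pulse or Taskcluster message schema is implied.

```python
from typing import Dict, Iterator, Optional

# States every job passes through before its final state, per the comment.
LIFECYCLE = ["unscheduled", "scheduled", "pending", "running"]


def events(job: str, final_state: str, retry_of: Optional[str] = None) -> Iterator[Dict[str, str]]:
    """Yield one message per state change; retries carry a back-reference."""
    for state in LIFECYCLE + [final_state]:
        msg = {"job": job, "state": state}
        if retry_of is not None:
            msg["retry_of"] = retry_of  # set on the retry, never on the original
        yield msg


stream = list(events("j1", "exception")) + list(events("j2", "completed", retry_of="j1"))

for msg in stream:
    print(msg)
```

Because the link lives on j2's messages only, j1 never needs a later "retried" update, and several retries can point back at the same original job.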
Flags: needinfo?(jopsen)
I think the state flow explained in comment 7 is fine. I don't think it's necessary to mark j1 any differently than we do now. The fact that Buildbot differentiated between a "retry" that was automatic and a retry that a human performed, by using a different state for j1, is perhaps unhelpful. Instead it's more useful to:
* call j1 what it is - "exception"
* optionally annotate j2 in some way
Just noticed this was still open. We don't need it.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID