Closed
Bug 1205871
Opened 10 years ago
Closed 9 years ago
Jobs cannot be transitioned from a ``complete`` state to ``retry``
Categories
(Tree Management :: Treeherder: Data Ingestion, defect)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: camd, Assigned: camd)
Details
Attachments
(1 file)
In Buildbot, a job transitions directly from ``running`` to ``retry`` in the cases where it will be retried automatically, so it never passes through ``complete`` first. But for the upcoming Buildbot Bridge and Task Cluster jobs that we will ingest, we need to be able to transition a job from ``complete`` to ``retry``.
Updated•10 years ago (Assignee)
Assignee: nobody → cdawson
Comment 1•10 years ago (Assignee)
Attachment #8662661 -
Flags: review?(emorley)
Comment 2•10 years ago
Comment on attachment 8662661 [details] [review]
PR
Left a comment on the PR :-)
Attachment #8662661 -
Flags: review?(emorley)
Comment 3•10 years ago (Assignee)
Answering the question in the PR about "why do we need this?":
Yes, this is needed for our new pulse ingestion work. But there is no current bug with this on our buildbot side. We don't come across it due to the way BB works.
My understanding is that, for Task Cluster, a job will be marked completed/failed. Then a retry will happen, creating a new job. It then marks the old job as "retried" when the new one is scheduled.
However, your question is a fair one. I'm uncertain if Jonas and I actually discussed it or I somehow got that understanding based on my own assumption. I don't recall now. Maybe TC does it the same way as BB in that an auto-retry job is never marked as ``completed`` without also getting a result of ``retry``.
Needinfo'ing Jonas to help clarify.
Flags: needinfo?(jopsen)
Comment 4•10 years ago (Assignee)
Greg: I found out Jonas is on PTO till the 28th. Are you able to answer the above question?
Flags: needinfo?(garndt)
Comment 5•10 years ago
Hrm, maybe I'm misunderstanding the question. Taskcluster only has a limited set of states for a task: exception, failed, completed, unscheduled, running, and pending. When a task is retriggered or retried, we do not update the state of the previous run. If the retry was because of an automatic rerun of the task, a new run for that task ID will be created with a reasonCreated of "rerun". Retriggers (like clicking retrigger in the TH UI) are, from my understanding, just new tasks added to the graph along with all the dependents.
I didn't see anything immediately obvious in the mozilla-taskcluster code to indicate that we update the previous state in TH for a job. Maybe I'm mistaken about that.
Flags: needinfo?(garndt)
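For illustration, the behaviour described in comment 5 can be sketched roughly as follows. This is only a minimal sketch, assuming a task status shaped like a dictionary with a `runs` list whose entries carry `runId`, `state` and `reasonCreated`; the sample data and the helper name `automatic_reruns` are hypothetical and are not taken from mozilla-taskcluster or Treeherder.

```python
# Hypothetical task status: one task ID with two runs. The earlier run keeps
# its original state; the automatic rerun shows up as a new run entry.
SAMPLE_STATUS = {
    "taskId": "example-task-id",  # made-up identifier
    "runs": [
        {"runId": 0, "state": "exception", "reasonCreated": "scheduled"},
        {"runId": 1, "state": "completed", "reasonCreated": "rerun"},
    ],
}


def automatic_reruns(status):
    """Return the IDs of runs that were created as automatic reruns."""
    return [
        run["runId"]
        for run in status["runs"]
        if run["reasonCreated"] == "rerun"
    ]


if __name__ == "__main__":
    print(automatic_reruns(SAMPLE_STATUS))
    # The earlier run is not rewritten to any "retry" state.
    print(SAMPLE_STATUS["runs"][0]["state"])
```

Under these assumptions, only the new run carries the rerun marker, so a consumer would have to look at the later run to learn that the earlier one was retried.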
Comment 6•10 years ago (Assignee)
Thanks for the info, Greg.
It sounds like, for Task Cluster, if a job is retried, the NEW job that is the retry has that "rerun" notation, but the older failed job that was the reason for the auto-retry is not updated.
Since this works differently than BuildBot, we may need to either have TH figure out how to make it look the same in the UI, or change people's expectations for auto-retries.
I'll put in my calendar to chat with you and Jonas some time next quarter to nail down our story on it. I know that this info is important to a few folks.
Comment 7•10 years ago
I think Greg nailed it. If we have two jobs:
j1 and j2 (where j2 is a retry of j1)
then I imagine the following event stream:
j1 is unscheduled
j1 is scheduled
j1 is pending
j1 is running
j1 is exception
j2 is unscheduled (and some property on the message says: "j2 is a retry of j1")
j2 is scheduled (and some property on the message says: "j2 is a retry of j1")
j2 is pending (and some property on the message says: "j2 is a retry of j1")
j2 is running (and some property on the message says: "j2 is a retry of j1")
j2 is completed (and some property on the message says: "j2 is a retry of j1")
---
You can have multiple retries of the same task (at least that makes sense in theory), though I'm not sure we will ever have that. But it's not natural to send a message:
j1 is retried (and some property on messages says: "j2 is the retry")
because at such a state j2 won't even exist yet.
---
Just my random thoughts here: being a retry of a task is a many-to-one relation from j2 -> j1, since one original task can have several retries.
Flags: needinfo?(jopsen)
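The event stream imagined in comment 7 could be replayed along these lines. This is a sketch under stated assumptions: the `(job, state, retry_of)` tuple layout, the `EVENTS` list and the `replay` function are all hypothetical, standing in for whatever property the real messages would use to say "j2 is a retry of j1".

```python
from collections import defaultdict

# Hypothetical event stream: (job, state, retry_of).
# retry_of is None for the original job and "j1" on every j2 message.
EVENTS = [
    ("j1", "unscheduled", None),
    ("j1", "scheduled", None),
    ("j1", "pending", None),
    ("j1", "running", None),
    ("j1", "exception", None),
    ("j2", "unscheduled", "j1"),
    ("j2", "scheduled", "j1"),
    ("j2", "pending", "j1"),
    ("j2", "running", "j1"),
    ("j2", "completed", "j1"),
]


def replay(events):
    """Return (latest state per job, retries grouped by original job)."""
    latest_state = {}
    retries = defaultdict(set)
    for job, state, retry_of in events:
        latest_state[job] = state
        if retry_of is not None:
            # The link is learned from the retry's own messages, so the
            # original job never needs a separate "retried" message.
            retries[retry_of].add(job)
    return latest_state, {orig: sorted(jobs) for orig, jobs in retries.items()}


if __name__ == "__main__":
    states, retries = replay(EVENTS)
    print(states)
    print(retries)
```

Because the relation is stored per original job as a list, a second retry of j1 would simply add another entry without touching j1's recorded state.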
Comment 8•10 years ago
I think this state flow explained in comment 7 is fine.
I don't think it's necessary to mark j1 any differently than we are now.
The fact that buildbot differentiated between a "retry" that was automatic, and a retry that a human performed, by using a different state for j1 is perhaps unhelpful. Instead it's more useful to:
* call j1 what it is - "exception"
* optionally annotate j2 in some way
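As a rough illustration of the two points above, a label could be built from a job's own state plus an optional retry note. The function name `display_label` and the label wording are hypothetical, not something Treeherder is known to implement.

```python
def display_label(state, retry_of=None):
    """Build a label from a job's own state plus an optional retry note."""
    if retry_of is None:
        # The original job is shown as what it is, for example "exception".
        return state
    # The retry is annotated with the job it retries.
    return f"{state} (retry of {retry_of})"


if __name__ == "__main__":
    print(display_label("exception"))
    print(display_label("completed", retry_of="j1"))
```

With this shape, j1 never needs a special ``retry`` state; any extra information lives on j2.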
Comment 9•9 years ago (Assignee)
Just noticed this was still open. We don't need it.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID