Closed Bug 1148965 Opened 10 years ago Closed 9 years ago

mozilla-taskcluster: treeherder integration - don't report end_timestamp without start_timestamp

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: garndt)

References

Details

Attachments

(2 files)

github PR queue part (A) 10 years ago Jonas Finnemann Jensen (:jonasfj) 56 bytes, text/x-github-pull-request	pmoore : review+	Details \| Review
Mozilal-taskcluster PR 20 9 years ago Greg Arndt [:garndt] 58 bytes, text/x-github-pull-request		Details \| Review

Jonas Finnemann Jensen (:jonasfj)

Reporter

Description

•

10 years ago

run.started is not present if deadline-exceeded before the scheduled run started. mozilla-taskcluster should fallback to the run.scheduled timestamp if, the run.started isn't present. See: https://queue.taskcluster.net/v1/task/IQCtuwNUSv2o04Fz27PG_Q/status Which is reported as taking 1444 min in treeherder: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=db80ac5a48aa This is probably because TH is confused about missing start_timestamp. We can easily mitigate this confusing by reporting run.scheduled in place of run.started. Code: https://github.com/taskcluster/mozilla-taskcluster/blob/master/src/treeherder/job_handler.js#L163-L164 Note. this was a minor change with upgraded queue to better facilitate cancelTask, we did indeed used to promise presence of run.started if run.resolved was present. But it seems to make little sense with cancelTask and deadline expiration.

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 1

•

10 years ago

Actually looking at the state: { "runId": 0, "state": "exception", "reasonCreated": "scheduled", "reasonResolved": "deadline-exceeded", "scheduled": "2015-03-29T19:59:50.932Z", "resolved": "2015-03-29T19:59:50.932Z" } It's clear that this task was never really scheduled. But when a task expires by deadline the queue creates a run with scheduled = resolved, in-order to mark the task as resolved. Perhaps reasonCreated should be "deadline-exceeded" or something. I'm not sure, how we add runs when we do cancelTask or resolved an expired deadline is not perfect. I'm open to suggestions. I believe the reasoning is that a task always has the state unscheduled if it doesn't have any runs. Also we add runId in pulse message routing keys... Basically it's very clean to say that whenever the queue makes a change to the state, it involves some run having properties added to it. This way, it's easier to see how decisions progress (I think of runs and sort of a per task log). For cancelTask it is a necessity to add a run, because scheduleTask, which only schedules the run zero should never succeed after cancelTask. Hence, adding run zero is necessary when canceling an unscheduled task.

James Lal [:lightsofapollo]

Comment 2

•

10 years ago

Hrm I will dig more into this Monday but is there an easy way to detect if this task was never scheduled? I don't really care to show exceptions (purple) on TH if this is just a task that never ran.

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 3

•

10 years ago

Sorry, Monday came and went so fast... For now the hack is to check if run.scheduled === run.resolved Then clearly it's a created to facilitate an exception state. @jlal, Going forward I propose that we set {reasonCreated: 'exception'} When we create a run in-order to cancel a task, or declare it expired by deadline. If the run was created before the task was canceled or the deadline expired, we'll leave the reasonCreated as 'scheduled', 'retry' or 'rerun', depending on why the run was created. So reasonCreated: 'exception' will only be used when we create a run on with the single purpose of putting the task in an exception state. Really there is only two cases: - task canceled before first run was scheduled - deadline expired before first run was scheduled @jlal, do you agree? I think it's sane, but I'm cautious about changes to the queue. So speak up if you have any doubts about this design.

Flags: needinfo?(jlal)

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 4

•

10 years ago

Update: Sheriffs are greatly annoyed by tests showing up with "deadline-exceeded" 24 hours later, because the build failed, obviously that's not a good TH flow (credits to philor for making this clear). I propose that we: A) Set {reasonCreated: 'exception'}, if a run was created solely for one of the two cases: 1) task canceled before first run was scheduled 2) deadline expired before first run was scheduled B) mozilla-taskcluster doesn't report tasks/runs with to TreeHerder if {reasonCreated: 'exception'} Basically, we never want to report anything to TH that wasn't scheduled. And we want to know if a run was created only to signal an exception, as oppose to being created because the task was "scheduled", "retry", or "rerun".

Severity: normal → major

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 5

•

10 years ago

Attached file github PR queue part (A) — Details

This does the (A) part. I'm not completely convinced that we want to do this. So please put some thought into whether or not this is the right design too! The pro/cons are as follows: Cons: (things that worry me a bit) - Adding entries to the enum, means we have more special cases when consuming task status structures else where. - We could argue that cancelTask should be implemented as scheduleTask/cancelTask (and result in same state) for task that are "unscheduled". (this is not a crazy argument) - Are we sure we want to hide canceled tasks, if they were unscheduled before we canceled them? - Are we catering specifically to Treeherder here, if so this is possibly an indicator of bad design. Pros: (things that makes me thing this is good design) + We won't loose information, hence, it's evident if the task was "unscheduled" or "pending" before it was canceled or deadline exceeded + reasonCreated is an enum explicitly marked as "may be expended" in the future, so we won't break any promises. + the reasonCreated, reasonResolved, were explicitly created in mind that we could have all sorts of special values here. As simple consumers of task status structures will rely on "state", rather than the more complicated _reason_ for the state. ---- Note, the (B) part has to be fixed in mozilla-taskcluster treeherder integration.

Attachment #8606764 - Flags: review?(jlal)

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 6

•

10 years ago

Comment on attachment 8606764 [details] [review] github PR queue part (A) @pmoore, James didn't get around to reviewing this. And you accidentally communicated your interest, so now I'm flagging you for reviewing the sanity here :) Note, we already use a run to signal deadline-exceeded or task-canceled when deadline is reached or task is canceled before the first run is scheduled. I think it makes a lot of sense for cancelTask, because we basically say that calling cancelTask is equivalent to calling scheduleTask/cancelTask, if the task is unscheduled. It's a bit more sketchy with deadline-exceeded, but I think we want to use a run to signal this too. Basically, we want some sort of resolution, and keeping it so that tasks are resolved by runs is nice. A nice result of this is that the flow of state changes and timestamps of these changes is evident from the .runs property of the task status structure. This is mostly about introducing a new value for "reasonCreated". So that it's clear that the run was made to signal an exception.

Attachment #8606764 - Flags: review?(jlal) → review?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 7

•

10 years ago

Comment on attachment 8606764 [details] [review] github PR queue part (A) See PR comments for details.

Attachment #8606764 - Flags: review?(pmoore) → review+

Jonas Finnemann Jensen (:jonasfj)

Reporter

Comment 8

•

10 years ago

Debated (A) with pmoore, so clearing jlal NI. Note, we still need to fix (B) to fix this bug.

Flags: needinfo?(jlal)

Pete Moore [:pmoore][:pete]

Updated

•

9 years ago

Component: TaskCluster → General

Product: Testing → Taskcluster

Jonas Finnemann Jensen (:jonasfj)

Reporter

Updated

•

9 years ago

Component: General → Integration

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Depends on: 1183528

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Blocks: q3-bb-tc-migration

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Blocks: 1183528

No longer depends on: 1183528

Greg Arndt [:garndt]

Assignee

Comment 9

•

9 years ago

just as a follow up, the solution to this is pretty straightforward, but running the tests much less so. I would like to get the tests running for this change before deploying. Working on that today and hopefully will have an update soon.

Greg Arndt [:garndt]

Assignee

Comment 10

•

9 years ago

Attached file Mozilal-taskcluster PR 20 — Details

Greg Arndt [:garndt]

Assignee

Comment 11

•

9 years ago

Part B has been submitted to production. Mailing lists have been emailed about this behavior change. Confirmed with :tomcat today that he saw a lot less exceptions today. https://github.com/taskcluster/mozilla-taskcluster/commit/5b23872a666fd7c1b42c9e55c2696e8ab3363079

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Greg Arndt [:garndt]

Assignee

Updated

•

9 years ago

Blocks: 1182491

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Assignee: nobody → garndt

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: Integration → Services

You need to log in before you can comment on or make changes to this bug.