Closed Bug 1148965 Opened 9 years ago Closed 9 years ago

mozilla-taskcluster: treeherder integration - don't report end_timestamp without start_timestamp

Categories

(Taskcluster :: Services, defect)

x86_64
Linux
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: garndt)

References

Details

Attachments

(2 files)

run.started is not present if deadline-exceeded before the scheduled run started.

mozilla-taskcluster should fallback to the run.scheduled timestamp if,
the run.started isn't present.
See: https://queue.taskcluster.net/v1/task/IQCtuwNUSv2o04Fz27PG_Q/status

Which is reported as taking 1444 min in treeherder:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=db80ac5a48aa
This is probably because TH is confused about missing start_timestamp.
We can easily mitigate this confusing by reporting run.scheduled in place of
run.started.

Code:
https://github.com/taskcluster/mozilla-taskcluster/blob/master/src/treeherder/job_handler.js#L163-L164

Note. this was a minor change with upgraded queue to better facilitate
cancelTask, we did indeed used to promise presence of run.started if
run.resolved was present. But it seems to make little sense with cancelTask
and deadline expiration.
Actually looking at the state:
      {
        "runId": 0,
        "state": "exception",
        "reasonCreated": "scheduled",
        "reasonResolved": "deadline-exceeded",
        "scheduled": "2015-03-29T19:59:50.932Z",
        "resolved": "2015-03-29T19:59:50.932Z"
      }

It's clear that this task was never really scheduled.
But when a task expires by deadline the queue creates a run with scheduled = resolved,
in-order to mark the task as resolved.

Perhaps reasonCreated should be "deadline-exceeded" or something. I'm not sure, how we add runs
when we do cancelTask or resolved an expired deadline is not perfect. I'm open to suggestions.
I believe the reasoning is that a task always has the state unscheduled if it doesn't have any runs.
Also we add runId in pulse message routing keys... Basically it's very clean to say that whenever the
queue makes a change to the state, it involves some run having properties added to it. This way,
it's easier to see how decisions progress (I think of runs and sort of a per task log).

For cancelTask it is a necessity to add a run, because scheduleTask, which only schedules the run zero
should never succeed after cancelTask. Hence, adding run zero is necessary when canceling an unscheduled task.
Hrm I will dig more into this Monday but is there an easy way to detect if this task was never scheduled? I don't really care to show exceptions (purple) on TH if this is just a task that never ran.
Sorry, Monday came and went so fast...

For now the hack is to check if run.scheduled === run.resolved
Then clearly it's a created to facilitate an exception state.

@jlal,
Going forward I propose that we set {reasonCreated: 'exception'}
When we create a run in-order to cancel a task, or declare it expired by deadline.

If the run was created before the task was canceled or the deadline expired, we'll leave the
reasonCreated as 'scheduled', 'retry' or 'rerun', depending on why the run was created.

So reasonCreated: 'exception' will only be used when we create a run on with the single purpose
of putting the task in an exception state. Really there is only two cases:
 - task canceled before first run was scheduled
 - deadline expired before first run was scheduled

@jlal, do you agree?
I think it's sane, but I'm cautious about changes to the queue. So speak up if you have any doubts
about this design.
Flags: needinfo?(jlal)
Update:
  Sheriffs are greatly annoyed by tests showing up with "deadline-exceeded" 24 hours later, because
  the build failed, obviously that's not a good TH flow (credits to philor for making this clear).

I propose that we:
  A) Set {reasonCreated: 'exception'}, if a run was created solely for one of the two cases:
     1) task canceled before first run was scheduled
     2) deadline expired before first run was scheduled
  B) mozilla-taskcluster doesn't report tasks/runs with to TreeHerder if {reasonCreated: 'exception'}

Basically, we never want to report anything to TH that wasn't scheduled.
And we want to know if a run was created only to signal an exception, as oppose to being created
because the task was "scheduled", "retry", or "rerun".
Severity: normal → major
This does the (A) part.

I'm not completely convinced that we want to do this.
So please put some thought into whether or not this is the right design too!

The pro/cons are as follows:

Cons: (things that worry me a bit)
 - Adding entries to the enum, means we have more special cases when consuming
   task status structures else where.
 - We could argue that cancelTask should be implemented as
   scheduleTask/cancelTask (and result in same state) for task that are
   "unscheduled". (this is not a crazy argument)
 - Are we sure we want to hide canceled tasks, if they were unscheduled before
   we canceled them?
 - Are we catering specifically to Treeherder here, if so this is possibly an
   indicator of bad design.

Pros: (things that makes me thing this is good design)
 + We won't loose information, hence, it's evident if the task was
   "unscheduled" or "pending" before it was canceled or deadline exceeded
 + reasonCreated is an enum explicitly marked as "may be expended" in the
   future, so we won't break any promises.
 + the reasonCreated, reasonResolved, were explicitly created in mind that we
   could have all sorts of special values here.
   As simple consumers of task status structures will rely on "state", rather
   than the more complicated _reason_ for the state.

----
Note, the (B) part has to be fixed in mozilla-taskcluster treeherder integration.
Attachment #8606764 - Flags: review?(jlal)
Comment on attachment 8606764 [details] [review]
github PR queue part (A)

@pmoore,
James didn't get around to reviewing this.
And you accidentally communicated your interest, so now I'm flagging you
for reviewing the sanity here :)

Note, we already use a run to signal deadline-exceeded or task-canceled
when deadline is reached or task is canceled before the first run is scheduled.

I think it makes a lot of sense for cancelTask, because we basically say that
calling cancelTask is equivalent to calling scheduleTask/cancelTask, if the
task is unscheduled.

It's a bit more sketchy with deadline-exceeded, but I think we want to use
a run to signal this too. Basically, we want some sort of resolution, and
keeping it so that tasks are resolved by runs is nice.
A nice result of this is that the flow of state changes and timestamps of these
changes is evident from the .runs property of the task status structure.

This is mostly about introducing a new value for "reasonCreated". So that it's
clear that the run was made to signal an exception.
Attachment #8606764 - Flags: review?(jlal) → review?(pmoore)
Comment on attachment 8606764 [details] [review]
github PR queue part (A)

See PR comments for details.
Attachment #8606764 - Flags: review?(pmoore) → review+
Debated (A) with pmoore, so clearing jlal NI.

Note, we still need to fix (B) to fix this bug.
Flags: needinfo?(jlal)
Component: TaskCluster → General
Product: Testing → Taskcluster
Component: General → Integration
Blocks: 1183528
No longer depends on: 1183528
just as a follow up, the solution to this is pretty straightforward, but running the tests much less so.  I would like to get the tests running for this change before deploying.  Working on that today and hopefully will have an update soon.
Part B has been submitted to production.  Mailing lists have been emailed about this behavior change.  Confirmed with :tomcat today that he saw a lot less exceptions today.

https://github.com/taskcluster/mozilla-taskcluster/commit/5b23872a666fd7c1b42c9e55c2696e8ab3363079
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1182491
Assignee: nobody → garndt
Component: Integration → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: