Closed Bug 1059909 Opened 7 years ago Closed 3 years ago

Rethink the way we track cancellation of pending/running jobs to avoid incorrect job states

Categories

(Tree Management :: Treeherder: Job Triggering & Cancellation, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: camd, Unassigned)

References

Details

(Keywords: regression)

If a job is in ``pending`` state, and self-serve is called directly to cancel it, Treeherder will not know to transition it to ``usercancel``.  So the job will show as ``pending`` indefinitely.

This bug is the second part of bug 1052891, but this part should be able to wait until after the full transition from TBPL to Treeherder.

We should create some background "auditor" task that looks for jobs pending in Treeherder that are no longer in ``pending.js`` and have been gone long enough that we trust they are not simply in the process of transitioning to ``running`` on their own.
Depends on: 1052891
Ed - can you assess how "blocky" this one is? Can we place it in the post-transition list?
Flags: needinfo?(emorley)
The approach would likely be to add a new process to the ``etl`` layer in ``etl/buildapi.py``, in the form of a new class called ``PendingJobsPruneProcess``.  It would need its own transformer mixin that would just extract all the ``job_guids`` from all the pending jobs.  Here's what I have in mind:

1. Extract ``job_guids`` from ``pending.js``
2. For each repo, check whether we have ``pending`` jobs that are not in that list of ``job_guids``
3. For jobs not in that list, transition them to ``usercancel``

Caveats: there is a possible race condition here.  Jobs are removed from ``pending.js`` as they're transitioned to ``running.js``, and if we happen to catch one right in the middle of that transition, we'll mark the job ``usercancel`` and then try to make it ``running``.  So we'd either need to allow transitioning in that direction (which is currently prohibited by the SQL update procs) or use some kind of timeout so that we only transition jobs that have been gone from ``pending.js`` for a certain amount of time.  The latter will be harder to get right than the former.

Another approach would be to ALSO fetch the ``job_guids`` from ``running.js`` and add them to the list of ``job_guids`` NOT to transition to ``usercancel``.  However, I suspect there is a lag between disappearing from ``pending.js`` and appearing in ``running.js``.  So, again, timing.
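
Very roughly, the prune step I'm picturing would look something like this. This is purely a sketch: the buildjson layout, the ``pending_jobs_older_than``/``set_state`` helpers and the 15-minute grace period are all assumptions, not existing Treeherder code.

```python
import datetime

import requests

# URLs and field names below are assumptions for illustration; the real ETL
# code in etl/buildapi.py already knows how to fetch and parse these files.
BUILDAPI_BASE = "https://secure.pub.build.mozilla.org/builddata/buildjson"
GRACE_PERIOD = datetime.timedelta(minutes=15)  # cushion for the pending -> running race


def fetch_job_guids(kind):
    """Return the set of job ids buildapi currently lists as `kind` ('pending' or 'running')."""
    data = requests.get("%s/builds-%s.js" % (BUILDAPI_BASE, kind), timeout=30).json()
    guids = set()
    # builds-{pending,running}.js group jobs by branch and then revision (assumed layout).
    for revisions in data[kind].values():
        for jobs in revisions.values():
            # Real job_guids are derived from request ids; 'id' is a simplification here.
            guids.update(str(job["id"]) for job in jobs)
    return guids


def prune_orphaned_pending_jobs(repo):
    """Transition Treeherder pending jobs to usercancel once buildapi has forgotten them."""
    still_known = fetch_job_guids("pending") | fetch_job_guids("running")
    cutoff = datetime.datetime.utcnow() - GRACE_PERIOD
    # `pending_jobs_older_than` and `set_state` are hypothetical stand-ins for
    # whatever the jobs model actually exposes.
    for job in repo.pending_jobs_older_than(cutoff):
        if job.job_guid not in still_known:
            job.set_state("usercancel")
```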

Yet ANOTHER approach would be to modify self-serve so that it posts something to pulse when a job is canceled while pending.  Or, even simpler, self-serve could transition canceled pending jobs to ``builds4hr.js`` as ``usercancel``.  If we could do that last thing, we wouldn't need any modifications in Treeherder at all.
(In reply to Cameron Dawson [:camd] from comment #2)
> Ed - can you assess how "blocky" this one is? Can we place it in the
> post-transition list?

Hmm, on the fence. The problem is that it's not just direct self-serve uses that will trigger this issue, but anyone using cancel in TBPL (which is everyone at the current time). That said, if having these orphans doesn't cause any harm (and we can educate people to ignore them), then I guess it doesn't matter as much.

(In reply to Cameron Dawson [:camd] from comment #3)
> The approach would likely be to add a new process to the ``etl`` layer in
> ``etl/buildapi.py`` in the form of a new class called
> ``PendingJobsPruneProcess``.  It would need its own transformer mixin that
> would just extract all the job_guids from all the pending jobs.  Here's what
> I have in mind:
> 
> 1. extract ``job_guids`` from ``pending.js``
> 2. For each repo, check if we have ``pending`` jobs that are not in that
> list of ``job_guids``
> 3. for jobs not in that list, transition them to ``usercancel``
> 
> caveats: There is a race condition possible here.  Jobs are removed from
> ``pending.js`` as they're transitioned to ``running.js`` and if we happen to
> catch it right in the middle of that transition, we'll make the job
> ``usercancel`` and then try to make it ``running``.  So we'd either need to
> allow transitioning that direction (which is currently prohibited by the SQL
> update procs) or use some kind of timeout to only transition jobs gone from
> ``pending.js`` for a certain amount of time.  The latter will be harder to
> get right than the former.
> 
> Another approach would be to ALSO fetch the ``job_guids`` from running.js
> and add them to the list of ``job_guids`` NOT to transition to
> ``usercancel``.  However, I suspect there is a lag time between disappearing
> from ``pending.js`` and appearing in ``running.js``.  So, again, timing.

What about:
1) Each time we ingest builds-pending.js, we update a last_seen timestamp for that job in the jobs table
2) Periodically a cleanup job runs that removes/marks as inactive any job if:
 * job state == pending
 * last_seen < (current time - N minutes) ...where N could be, say, 15?
 * time_of_last_successful_builds-{pending,running,4hr}_ingestion > (current time - N)

(The last clause ensures a buildapi downtime doesn't cause pending jobs to be deleted; a rough sketch follows.)
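
Very roughly (``jobs_model``, ``last_seen`` and ``last_successful_ingestion`` are assumed names here, not existing columns or helpers):

```python
import datetime

N_MINUTES = 15  # the grace period suggested above; tune as needed


def cleanup_stale_pending_jobs(jobs_model, last_successful_ingestion):
    """Delete (or mark inactive) pending jobs that buildapi has stopped reporting.

    `jobs_model` and `last_successful_ingestion` are hypothetical stand-ins for
    Treeherder's jobs store and the timestamp of the most recent successful
    builds-{pending,running,4hr} ingestion run.
    """
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=N_MINUTES)

    # Bail out if ingestion itself has been failing, so that a buildapi
    # downtime can't wipe out jobs that are genuinely still pending.
    if last_successful_ingestion < cutoff:
        return

    for job in jobs_model.filter(state="pending", last_seen__lt=cutoff):
        job.delete()  # or mark it inactive rather than deleting outright
```

This could run as a periodic task alongside the existing ingestion tasks.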

> Yet ANOTHER approach would be to modify self-serve so that it posts
> something to pulse when a job is canceled while pending.  Or, even simpler
> would be for self-serve to transition canceled pending jobs to builds4hr.js
> as ``usercancel``.  If we were able to do that last thing, then we wouldn't
> need any modifications in Treeherder at all.

I think this approach makes things too interlinked with buildapi. buildapi isn't the most reliable, and glitches could result in orphans. Plus, once we start having more non-buildbot jobs, I think we would be better served by a generic cleanup that handles all of them.
Flags: needinfo?(emorley)
OS: Mac OS X → All
Hardware: x86 → All
Priority: -- → P2
No longer blocks: treeherder-dev-transition
Keywords: regression
Component: Treeherder → Treeherder: Data Ingestion
Duplicate of this bug: 1131277
Blocks: 1131614
I found out that you can find cancellation requests here:
https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs?format=json

armenzg@mozilla.com cancel_revision {'branch': 'try', 'revision': '98e9b20f9d4d'} 2015-02-10 07:21:41 	2015-02-10 07:21:45 {'body': {'msg': u'', 'errors': False}, 'request_id': 1167985}

It would probably be great if you could pass a parameter to request only the job requests submitted since the last time we checked.
I also believe we can use the "errors" key to review requests that did not work as expected.
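
As an illustration, something like this could poll that endpoint and pick out the cancellation requests. The field names ('action', 'body', 'errors') are guesses based on the sample line above, not a documented schema:

```python
import requests

JOBS_URL = "https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs?format=json"


def recent_cancel_requests(auth):
    """Yield cancel_revision requests reported by self-serve.

    Since there is no 'since this time' parameter yet, every call returns the
    full list and the caller has to de-duplicate by request_id.
    """
    for req in requests.get(JOBS_URL, auth=auth, timeout=30).json():
        if req.get("action") != "cancel_revision":
            continue
        if req.get("body", {}).get("errors"):
            continue  # the request itself reported an error; review separately
        yield req
```
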
The other thing that occurred to me is that it's not just self-serve cancellations we have to watch for, but general scheduler issues and manual job deletes.

Seeing as neither builds-4hr nor self-serve's API keeps any record of jobs that were cancelled while in the _pending_ state - I propose that treeherder does the same - ie: if a job is in the _pending_ state and is cancelled by any means, it gets _deleted_ from the Treeherder DB, rather than the current behaviour of marking it as cancelled.

ie:

1) For _running_ jobs that are cancelled (via any means), don't update their state directly, but let builds-4hr show the job as cancelled - which we'll ingest just like any other completed state (eg success, testfailed). This differs from the current behaviour, where we mark them as cancelled in the DB ourselves, which is risky.

2) If a _pending_ job is cancelled via treeherder, we'll perform the buildapi call, _check_ that the request was actually successful, then delete the pending job from the treeherder DB (rather than marking the job as cancelled); a rough sketch of this follows the list. If for some reason the request was successful and yet the job did not get killed (eg some issue with self-serve killing the task), then that's fine - we'll re-ingest the job from builds-{pending,running,4hr} anyway.

3) For jobs cancelled elsewhere, or via manual scheduler deletes when there are infra issues, we'll have to prune these by cross-referencing against what's listed in builds-pending.js and the last-modified times - ie the suggestion halfway through comment 4. Again, we'll be pretty safe if we accidentally delete a job, since we'll just re-ingest it when it appears back in builds-{pending,running,4hr}.
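
A minimal sketch of the pending-job path in (2), where the self-serve URL shape and the ``job.delete()`` helper are both assumptions:

```python
import requests

SELF_SERVE = "https://secure.pub.build.mozilla.org/buildapi/self-serve"


def cancel_pending_job(repo_name, request_id, job, auth):
    """Ask buildapi to cancel a pending request, then drop the job from Treeherder.

    The DELETE /{branch}/request/{request_id} URL shape is assumed here, and
    `job.delete()` stands in for however the jobs model actually removes a row.
    """
    resp = requests.delete(
        "%s/%s/request/%s" % (SELF_SERVE, repo_name, request_id),
        auth=auth,
        timeout=30,
    )
    resp.raise_for_status()  # only touch our own DB once buildapi has accepted the cancel

    # If self-serve later fails to actually kill the request, that's fine: the
    # job will simply be re-ingested from builds-{pending,running,4hr}.
    job.delete()
```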

We'll also need to fix the "cancel all" case, since at the moment it is inconsistent with the cancel single job case (it doesn't have any handling at all):
https://github.com/mozilla/treeherder-ui/blob/dfc1c53797000c4dd978fbded4fa4001576e3343/webapp/app/plugins/controller.js#L259

Adjusting summary to cover all of the above (which will also fix bug 1131614).
Summary: Create pruning process for pending jobs orphaned by direct self-serve cancel calls → Rethink the way we track cancellation of pending/running jobs to avoid incorrect job states
No longer blocks: 1131614
Duplicate of this bug: 1131614
Depends on: 1093743
Though the only problem with deleting the jobs outright is that, now that the UI requests jobs based on last-modified, we'll end up with them stuck in the UI until a refresh. I guess we could just mark them as active_status onhold - then handle those separately? Eugh.
Priority: P2 → P3
Duplicate of this bug: 1174240
> We currently try to manage this by actively marking the job as cancelled in Treeherder's DB, when someone
> uses the cancel button. 
> 
> However:
> 1) Using the cancel-all button (vs cancel one at a time) is broken iirc
> 2) People can circumvent this by going straight to self-serve (or say if jobs are manually pruned from the scheduler DB)
> 
> Now bug 1059909 is filed for improving our handling of this - but it's a case of finding time to fix it. It's also
> pretty wearying since it's "yet another annoying issue we're having to work around due to the awfulness that is 
> builds-{pending,running,4hr}" and also once we start using buildbot less and less this problem will go away
> on its own - so it's hard to justify spending too much time on it really.

Once we've lost track of a pending job, we can check whether it has been coalesced by calling mozci's _is_coallesced() functionality:
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L209
Alice has written a script that lists all the coalesced jobs for a specific revision, and it is accurate afaik.

Right now, I'm using buildapi/buildjson from mozci because I know I can find accurate data.
I've wanted to switch to TH as my source of status for mozci; however, I can't because of bugs like this (which are caused by the buildbot design).

I think the way to switch to TH will be by not trusting the state of pending jobs from TH and falling back to buildapi/buildjson in those cases (like we currently do).
The other option is to not switch to TH as a status source for jobs running on buildbot and keep relying on buildapi/buildjson.
Component: Treeherder: Data Ingestion → Treeherder: Job Triggering & Cancellation
Buildbot is going away and this bug isn't an issue with Taskcluster.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX