Closed Bug 1052891 Opened 7 years ago Closed 7 years ago

If a job is canceled while in the pending state, the pending job will never be removed

Categories

(Tree Management :: Treeherder, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: camd, Assigned: camd)

References

Details

When a job is running, and is canceled by the user, then a "completed" job will be generated of the result "usercancel".  However, if the job is pending, and gets canceled, then the pending job just disappears and no "usercancel" job is generated.

In TBPL, this has the effect of the pending job just disappearing as intended.  However, in Treeherder, the effect is that the pending job sticks around waiting for a running or complete job to replace it.  But that job never comes.

So we need to create a process that "audits" our pending jobs and ensures they still exist in ``pending.js``.  If they don't, and are perhaps older than some threshold, then we should delete them.  That timing threshold is necessary because the job may be in the process of changing from pending to running.

pivotal story: https://www.pivotaltracker.com/story/show/71979924
Good spot :-)

Yeah this is along the lines of one of the too many discussions to remember at the previous work weeks - we need to expire the pending jobs if they don't shortly thereafter appear in running.js

The only caveat is that it would be good to avoid pruning them too quickly - such that the pending job disappears from the UI, only to reappear 60s later (this is currently what happens in TBPL, due to there being no cross-referencing between data sets). Given that the "pending job was cancelled" case is much less frequent than "pending job started running", I think we should optimise for the latter - given that it won't hurt for a pending job to take N seconds/minutes to disappear from the UI vs making every pending job disappear for a few seconds.

Do you know how quickly jobs appear in running.js after being removed from pending.js? How about for running.js -> builds4hr? My only context is TBPL, which adds so much delay itself (5 min cron importing builds-4hr etc) that it's hard to tell.
Priority: -- → P2
Ed--  The good news here is that we are already optimized for the case of a job transitioning from pending to running.  Instead of removing the pending and adding the running, we actually switch the job status.  Same deal with transitioning to complete.  So in Treeherder you won't see the running job disappear.  It will just change state to running.

So this issue is just that we haven't, to date, gotten a trigger to remove dead pending jobs.  So, after a chat with Jeads, we think the best approach is to have the "cancel" send a trigger to the server to let it know to delete any related pending jobs (the single, or all for a resultset).  Running jobs will transition to usercancel.
That does make sense for the jobs cancelled from treeherder - but what about cancellations from buildapi/self-serve? :-)
Yeah, that's an issue.  I think for that, we may need a process that runs in the background and audits our pending jobs against pending.js.  And, if they're more than <some time period> old, then prune them.  Not quite sure what that time period is, though.  It'll have to do this for every repo, since we won't get any indication from self-serve.
iirc when we discussed this at the work week, we mentioned a last_seen (or similar) field on the pending job entry in the jobs table, which would be updated each time builds-pending.js was imported, which if not within the last N minutes would cause the row to be marked as usercancel?
Duplicate of this bug: 1044348
Ed, that's a good point. Should we transition pending to usercancel?  or remove them like tbpl does?  My personal opinion would be better to move pending and running to usercancel.  And that's safer, data-wise, in that it'll be easier to prevent re-importing from pending.js if the timing is a little off on the import data-processes.
I think transitioning both to usercancel should be fine? (and improved UX from TBPL).
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Blocks: 1059909
This is fixed with the commits Ed mentioned above.  A follow-up to this bug, which we are hoping can wait till after transition, is bug 1059909.  That bug is to address comment 4 above
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.