Cancelling jobs through Treeherder does not always work but the UI says they are cancelled

RESOLVED DUPLICATE of bug 1059909

Status

Tree Management
Treeherder: Data Ingestion
3 years ago
3 years ago

People

(Reporter: emorley, Unassigned)

(Reporter)

Description

3 years ago
Broken out of an email sent by Armen:

> For some reason, some canceled jobs through treeherder did not actually
> cancel:
> https://treeherder.mozilla.org/#/jobs?repo=ash&revision=e0f56d3f1fc0
> Treeherder marked them as canceled but I think the releng systems did
> not receive my request.
> 
> For instance, if we load:
> https://secure.pub.build.mozilla.org/buildapi/self-serve/ash/rev/e0f56d3f1fc0
> and search for "linux64-br-haz_ash_dep" we will see it as green.
> 
> If we look at all jobs, you will see no "user canceled" jobs.
> 
> In fact, if we look at the Linux64 jobs, we can see that the test jobs
> still run.
> 
> Can someone please help me figure out why this happened?
(Reporter)

Comment 1

3 years ago
Unlike TBPL, which has no concept of pending/running jobs on the backend (it just temporarily layers them over the top in the client), treeherder stores pending/running jobs in the DB and then keeps track of them as they change from one state to another. The problem is that if a job was cancelled whilst it was pending, it disappears from builds-pending but never ends up in builds-4hr. As a result, Treeherder tracks the cancels performed within its UI and proactively marks the jobs as cancelled.
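
To illustrate the gap described above, here is a minimal Python sketch. It assumes each of builds-pending, builds-running and builds-4hr can be reduced to a set of request ids; the function name and the sample ids are hypothetical and are not taken from Treeherder's code. A job can also leave builds-pending for other reasons (coalescing, for example), so a vanished id is only a hint, not proof of a cancellation.

```python
def find_vanished_pending(previous_pending, pending, running, completed):
    """Return ids that were pending on the last poll but now appear nowhere.

    All arguments are sets of request ids. This is a simplifying assumption;
    the real builds-* files carry far more data per job.
    """
    still_visible = pending | running | completed
    return previous_pending - still_visible


# Hypothetical snapshots from two consecutive polls.
previous_pending = {101, 102, 103}
pending = {103}    # 103 is still waiting
running = {101}    # 101 has started running
completed = set()  # nothing has finished, and 102 is not here either

vanished = find_vanished_pending(previous_pending, pending, running, completed)
print(sorted(vanished))
```

In this toy data, request 102 is the kind of job that has to be tracked by hand, because no builds-* source ever reports what happened to it.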

In this case it sounds like for whatever reason the call to buildapi to mark the job as cancelled failed, but treeherder still marked it as cancelled internally.

Can you remember if the jobs were pending or running when you tried to cancel them?

There are many things we could do to help this:

1) Allow builds-{4hr,running,pending} to override the job result for jobs mistakenly marked as cancelled. I know previously we've said "a job can't go backwards in result, ie from complete to unfinished" but I think it's daft to not believe builds-* when we have the correct result there.

2) Make sure we wait for a success response from buildapi before marking the job as cancelled in the treeherder DB.

3) Perhaps stop proactively changing the state of the jobs. ie:
  - for a running job, display the "cancellation request sent" message, but don't visually update the job state until we see it as cancelled in builds-4hr, at which point we ingest its new cancelled state from there.
  - for a pending job, finally fix bug 1059909 and wait for the pruning to remove it.
...though for both of these, it worsens the UX for the user, since it becomes harder to keep track of what was cancelled already, since the UI takes ages to update. Perhaps as a compromise between the two, we could have a new colour or UI indicator which means "cancellation in progress".

4) Completely overhaul the way we handle cancellations - and use a combination of bug 1059909 and buildapi emitting pulse messages or somehow telling treeherder the job has been cancelled. Alternatively I wonder how hard it would be to make the "cancelled whilst pending" jobs appear in builds-4hr (which would solve the whole problem).


...and of course, it would be great to know why the buildapi call failed in the first place in comment 0.
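
Options 1 and 3 above could be combined roughly as in the following Python sketch. Everything in it is an assumption made for illustration: the `Job` fields, the `cancel_requested` flag (standing in for a "cancellation in progress" indicator) and the state/result strings are hypothetical and do not reflect Treeherder's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    state: str                      # "pending", "running" or "completed"
    result: Optional[str] = None    # e.g. "success" or "cancelled"; None until completed
    cancel_requested: bool = False  # hypothetical "cancellation in progress" indicator


def request_cancel(job: Job) -> None:
    """Option 3: remember that a cancel was requested, but leave state/result alone."""
    job.cancel_requested = True


def ingest(job: Job, state: str, result: Optional[str] = None) -> None:
    """Option 1: whatever builds-{4hr,running,pending} report overrides local guesses."""
    job.state = state
    job.result = result
    if state == "completed":
        # The real outcome is now known, so the indicator is no longer needed.
        job.cancel_requested = False


# A running job whose cancel request never reached buildapi and which finished green.
job = Job(state="running")
request_cancel(job)
ingest(job, "completed", "success")
print(job)
```

With this shape the UI can still show that a cancel was requested, while the stored result only ever comes from the builds-* data.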
Depends on: 1059909
(Reporter)

Updated

3 years ago
Summary: Cancelling jobs through Treeherder does not always work → Cancelling jobs through Treeherder does not always work but the UI says they are cancelled

Comment 2

3 years ago
(In reply to Ed Morley [:edmorley] from comment #1)
> 
> Can you remember if the jobs were pending or running when you tried to
> cancel them?
> 
I don't remember. I will keep it in mind.


Since I'm working on mozci, I know how difficult the tasks you're trying to solve are!

I also know that we hit buildjson dumps because hitting self-serve api thousands of times from treeherder to get live information would bring self-serve down.

Perhaps for cancellation requests it is fine to hit self-serve for the data we want.

If I knew how builds-4hr is generated (or builds-pending, builds-running) I could give you a hand.

> ...and of course, it would be great to know why the buildapi call failed in
> the first place in comment 0.

We get a return code and a message. Do we check that?
(Reporter)

Comment 3

3 years ago
(In reply to Ed Morley [:edmorley] from comment #1)
> ... Alternatively I wonder how
> hard it would be to make the "cancelled whilst pending" jobs appear in
> builds-4hr (which would solve the whole problem).

Would it be viable to make jobs cancelled whilst they are pending, appear in builds-4hr (or another data-source)? The problem at the moment is that we have to manually keep track of these cancelled-whilst-pending jobs.
Flags: needinfo?(catlee)
(Reporter)

Comment 4

3 years ago
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #2)
> > ...and of course, it would be great to know why the buildapi call failed in
> > the first place in comment 0.
> 
> We get a return code and a message. Do we check that?

We don't :-(
https://github.com/mozilla/treeherder-ui/blob/a7aecdad13fb8203769b133c55eb884bf2c9f427/webapp/app/plugins/controller.js#L149

We also appear to change the state of jobs to cancelled, even if they were running jobs (which is unnecessary, since they will appear in builds-4hr, albeit more slowly):
https://github.com/mozilla/treeherder-service/blob/88fc50e26765bb67faed19d6f9026ac055d7c6e0/treeherder/model/sql/jobs.json#L253
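
For illustration, checking the return code and message before touching the job state could look like the Python sketch below. `send_cancel_request` is a hypothetical stand-in for the buildapi call, and the `(status_code, message)` return shape and the 2xx success test are assumptions; this bug does not show what buildapi actually returns.

```python
def cancel_job(job, send_cancel_request):
    """Mark the job cancelled only if the cancel request reports success.

    send_cancel_request is a hypothetical callable standing in for the
    buildapi call. It is assumed to return (status_code, message).
    """
    status_code, message = send_cancel_request(job["request_id"])
    if 200 <= status_code < 300:
        job["state"] = "cancelled"
        return True, message
    # Leave the job untouched so the UI does not claim a cancel that never happened.
    return False, message


# Fake transports used only to exercise both branches.
def fake_ok(request_id):
    return 202, "accepted"


def fake_fail(request_id):
    return 500, "internal error"


job_a = {"request_id": 1, "state": "pending"}
job_b = {"request_id": 2, "state": "pending"}
ok_a, msg_a = cancel_job(job_a, fake_ok)
ok_b, msg_b = cancel_job(job_b, fake_fail)
print(job_a["state"], job_b["state"])
```

On failure the message is handed back to the caller, so it could be shown to the user instead of silently marking the job as cancelled.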
(Reporter)

Comment 5

3 years ago
Actually $http returns a promise, so I guess it may work:
https://docs.angularjs.org/api/ng/service/$http
https://github.com/mozilla/treeherder-ui/blob/3ccfa7466999e8d19273caf705c54e5fea167f96/webapp/app/js/services/buildapi.js#L51

That said, we don't proactively change state for "cancel all", but do for "cancel single job", which would perhaps explain the prevalence of bug 1059909.

Comment 6

3 years ago
I bet we're also hitting bug 1093743 here. If we use the wrong request_id to cancel, I'm sure it just won't cancel it, since it was already coalesced and not active anyway.
(Reporter)

Comment 7

3 years ago
Ah yeah great point :-0
(Reporter)

Comment 8

3 years ago
Marking this a dupe of bug 1059909, since I've broadened that bug to be about overhauling the way we handle cancellations, since all the fixes are intertwined.

catlee, I'll leave the needinfo, since I'm curious to know if builds-4hr could be made to show cancelled _pending_ jobs, since it would simplify the solution in bug 1059909 (specifically it would avoid the need for step #3 in bug 1059909 comment 8).
Status: NEW → RESOLVED
Last Resolved: 3 years ago
No longer depends on: 1059909
Resolution: --- → DUPLICATE
Duplicate of bug: 1059909
(Reporter)

Updated

3 years ago
Flags: needinfo?(catlee)