Closed Bug 1130772 Opened 9 years ago Closed 9 years ago

scheduler: Define & Handle behavior for task graph creation failure

Categories

(Taskcluster Graveyard :: Scheduler, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlal, Unassigned)

References

Details

While running various tests I have encountered a few situations (https://papertrailapp.com/systems/taskcluster-scheduler/events?r=500802446155886597-500802483808153610) where the task graph creation fails the problem here is we still end up with a number of defined tasks (which never get scheduled).

The effect here for users sucks since we still get the defined events (and display them in the UI) but can never recover from this failed state leaving this pretty confusing (https://treeherder.allizom.org/#/jobs?repo=try&revision=ddac8e2b2b35)... 

I believe the underlying error here is some kind of bug or (gasp) collision in either case it is likely fixable (and we should) but ultimately I think we must define the behavior of what happens to these "zombie" tasks ideally they are marked into an "exception state" and not rerun (so maybe failed?) in either case our TH integration needs some hit as to what to do with these tasks.
@jonasfj any thoughts here? I can think of some ugly workarounds but as it stands we can't just leave these jobs in pending state forever (so making it blocking the overall release of gecko+tc).
Flags: needinfo?(jopsen)
This is not related to task-graph creation, this error is related to task-graph extension.
All the "failed to insert" errors in the log, or at least most of them are the result of retrying the extendTaskGraph request. It'll try to insert the same entities... Ideally it should figure out that they already
exists and just check that they match what already exists. But from looking at the code neither createTaskGraph or extendTaskGraph are idempotent. I suspect that is the issue here.

---- investigation:

The task-graph: ay8cvF2jSJ-_YeTKaKKcXQ
Seems to exist:
   https://tools.taskcluster.net/task-graph-inspector/#ay8cvF2jSJ-_YeTKaKKcXQ/
But yes, something is clearly wrong with it.

If we look at the decision task, we see that the extendTaskGraph operation failed with a
incidentId: a2f777fb-cf1c-41f6-92bc-8ff2f5fd8c3d
See the log:
    + ./mach taskcluster-graph --project=try '--message=try: -b do -p emulator -u mochitest +tc' \
                               --owner=jlal@mozilla.com --revision-hash=749f0b8f8b0b6aeaf36b3813ad8caef0cea15a39 \
                               --extend-graph
    [taskcluster] Graph server error while extending task graph id
    ay8cvF2jSJ-_YeTKaKKcXQ : Error: Internal Server Error, {"message":"Internal
    Server Error","error":{"info":"Ask administrator to lookup incidentId in
    log-file","incidentId":"a2f777fb-cf1c-41f6-92bc-8ff2f5fd8c3d"}}
    [taskcluster] Done extending graph
    [taskcluster] Successful task run with exit code: 0 completed in 498.419

Clearly the decision task should not have been successful.
It should have failed, we must propagate critical errors.
 - Though that doesn't solve the problem :)

From papertrail:
----------------
Feb 07 11:37:13 taskcluster-scheduler app/web.2:  Sat, 07 Feb 2015 19:37:13 GMT base:api Error occurred handling: /v1/task-graph/ay8cvF2jSJ-_YeTKaKKcXQ/extend, err: [object Object], as JSON: {"statusCode":409,"code":"EntityAlreadyExists","body":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:f1aba0b7-0002-0070-4cb5-8a38bd000000\nTime:2015-02-07T19:37:12.5796747Z"}},"retriesMade":0}, incidentId: a2f777fb-cf1c-41f6-92bc-8ff2f5fd8c3d undefined 
----------------

Hence, the issue is that extendTaskGraph isn't idempotent... Note I suspect createTaskGraph isn't idempotent either. Why the first attempt failed I don't know. From looking at the log I have a suspicion that the first call is actually successful.

The error log from docker-worker is here:
https://papertrailapp.com/systems/71875494/events?r=500800448215953427-500803885578104838
I think the first retry returns 503, where as the retries returns 500. So this is probably a case of blame HTTP.
Perhaps heroku restarted, or something... I had enough digging for now :)


------------------------------------- On topic:
When the task-graph scheduler returns 5xx with internal error the result is undefined behaviour. The queue will eventually get around to resolving the tasks by deadline which should make us happy (eventually).

I think that in cases such as this, it's okay that the error is evident because clearly something is wrong somewhere. In this case I suspect HTTP + non-idempotent operations is to blame. Surely, if extendTaskGraph had been idempotent this would likely not have been a problem.

Either way this kind of error is IMO not something we should try to work around. The task will eventually resolve as exception due to deadline expiration. Ensuring an eventual consistent state was exactly the purpose of the deadline.
We could do a similar thing to the task-graph, if we wanted to make sure that eventually it was resolved.

Eitherway, the bug here is making extendTaskGraph and createTaskGraph idempotent. 503 is probably a networking error of some sorts, heroku interrupting, node restart or a timeout... TC-base resolves unhandled errors as 500.
Flags: needinfo?(jopsen)
Is this bug still relevant?
Flags: needinfo?(garndt)
Jonas, is this still an issue? I haven't heard it talked about recently.
Flags: needinfo?(garndt) → needinfo?(jopsen)
createTaskGraph and extendTaskGraph should be idempotent now.

I'm pretty sure they have been for a while. I fixed it sometime early this Q or late Q1, I don't recall...
But yeah, I haven't heard about double created tasks due to lack of idempotent APIs for a while.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jopsen)
Resolution: --- → FIXED
Component: TaskCluster → Scheduler
Product: Testing → Taskcluster
Product: Taskcluster → Taskcluster Graveyard
You need to log in before you can comment on or make changes to this bug.