Closed Bug 1274929 Opened 8 years ago Closed 8 years ago

No tests executed on all branches via taskcluster (hanging queue service)

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: jonasfj)

References

Details

All test jobs for linux64 debug on mozilla-inbound via taskcluster stopped working on Sunday:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=d53726702252f8dea95878c721f8b2f652accb89&filter-searchStr=linux%20x64%20debug&selectedJob=28388961

The last successful push with all the tests executed is 8978551c29be on Sunday 2:42pm UTC.
I want to note that the decision graph contains all the test tasks, but somehow they are not being executed.

https://queue.taskcluster.net/v1/task/Bp8nVESeS-eZ1Q-awNzXug/runs/0/artifacts/public%2Ftask-graph.json
Component: Task Configuration → Integration
Looks like all integration branches are affected (including fx-team, try).
Summary: No tests executed on mozilla-inbound via taskcluster for linux64 debug → No tests executed on integration brances via taskcluster for linux64 debug
Summary: No tests executed on integration brances via taskcluster for linux64 debug → No tests executed on integration branches via taskcluster for linux64 debug
Peter pointed me to the new docs for the task graphs:
http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/taskgraph.html#graph-generation

Maybe this is related to the optimized task graph, which seems to be new?
I've called Dustin (it is 6:20am for him now - poor guy!) and left a message. I think we'll need to pull in the big guns to get this one resolved swiftly.
It's tree-closing for try, mozilla-inbound, and fx-team, so this is a blocker :(
Severity: normal → blocker
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Maybe this is some kind of corruption? The task group ids are different for
> the decision task and the build task. Take this example:
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=fe1a7608cfd77794732ce6d1f9a2ee7cc33e213f
> 
> Decision task:
> https://tools.taskcluster.net/task-inspector/#QbKLAN91QPy0XhxdSQA0zw/
> Build task:
> https://tools.taskcluster.net/task-inspector/#WYbksZGMS-OwGgaJu2KCYQ/
> 
> https://tools.taskcluster.net/task-graph-inspector/#KW6IBRtITDKC6BgohfuHVw/
> https://tools.taskcluster.net/task-graph-inspector/#ccrp678YT-6oFADQDB8meA/
> 
> KW6IBRtITDKC6BgohfuHVw != ccrp678YT-6oFADQDB8meA
> 
> Not sure where ccrp678YT-6oFADQDB8meA is getting from.

Right now there is a transition in how task graphs are scheduled.  Previously, tasks were scheduled as part of the same graph as the decision task, and this was done in bulk through the TaskCluster scheduler.

The new behavior is for the decision task to create each of these tasks within the TaskCluster queue so that they all share the same "task group id" (no longer a "graph"), with the dependencies defined within each task definition itself rather than in a larger graph.  When this happens, the decision task creates a task group ID which is different from the task graph ID that the decision task itself belongs to.

For the link that wasn't working, you can view it here in the task group inspector:
https://tools.taskcluster.net/task-group-inspector/#ccrp678YT-6oFADQDB8meA/

Changes will be rolled out in the future to redefine how we schedule decision tasks so that they are part of the same task group as the tasks they schedule.

I also believe there is a bug open to correct the link in the task inspector so that it opens the task group inspector instead of the graph inspector when the task is part of the new task group system.
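
To make the new model concrete, here is a minimal sketch of how a decision task might create a dependent test task in the same task group.  It is an illustration only, assuming the taskcluster-client Queue.createTask API and the slugid package; all IDs, worker types, and other values below are made up.

  // Illustration only, not the actual decision-task code.  Assumes the
  // taskcluster-client Queue.createTask API and the slugid package; all IDs,
  // worker types, and other values below are made up.
  const taskcluster = require('taskcluster-client');
  const slugid = require('slugid');

  async function main() {
    const queue = new taskcluster.Queue({
      credentials: {/* clientId, accessToken */},
      // newer client versions also require a rootUrl option here
    });

    const taskGroupId = slugid.nice();  // shared by every task the decision task creates
    const buildTaskId = slugid.nice();
    const testTaskId = slugid.nice();

    // The test task declares its dependency in its own definition; there is
    // no separate "graph" object any more.
    await queue.createTask(testTaskId, {
      taskGroupId,
      dependencies: [buildTaskId],   // only run after the build task resolves
      requires: 'all-completed',     // ...and only if it completes successfully
      provisionerId: 'aws-provisioner-v1',
      workerType: 'desktop-test',
      created: new Date().toJSON(),
      deadline: taskcluster.fromNow('1 day').toJSON(),
      payload: {/* docker image, command, artifacts, ... */},
      metadata: {
        name: 'example linux64 debug test task',
        description: 'illustration of taskGroupId + dependencies',
        owner: 'example@mozilla.com',
        source: 'https://example.invalid/decision-task',
      },
    });
  }

  main().catch(err => {
    console.error(err);
    process.exit(1);
  });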
From what I can tell, the portion of taskcluster-queue that is responsible for scheduling dependent tasks stopped doing so because of a connection issue with Azure that it experienced on Sunday [1].  After this connection issue, the dependency resolver within the queue was still running, but it was no longer correctly polling for tasks it should schedule.  Usually messages like [2] appear when things are working correctly, but those messages stopped appearing in the logs after the connection issue.

I restarted the queue and it appears to be processing the backlog of items.  I see multiple messages being fetched from Azure along with many pending-task messages being published.


ni? Jonas on this for him to take a look.


[1] Sun, 22 May 2016 20:49:14 GMT app:dependency-resolver Error: OperationTimedOutError: Operation could not be completed within the specified time. 
[2]  app:dependency-resolver Fetched 0 messages
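
For context, here is a hypothetical sketch (not the queue's actual dependency-resolver code) of how this kind of symptom can arise: a polling loop whose promise chain rejects once and is never restarted leaves the process alive but silently no longer fetching messages.

  // Hypothetical illustration, not the queue's actual dependency-resolver
  // code: a polling loop whose promise chain rejects once (e.g. on an Azure
  // OperationTimedOutError) and is never restarted leaves the process alive
  // but silently no longer logging "Fetched N messages".
  async function pollLoop(fetchMessages, handleMessage) {
    while (true) {
      const messages = await fetchMessages();  // a single throw here ends the loop for good
      console.log(`Fetched ${messages.length} messages`);
      for (const message of messages) {
        await handleMessage(message);
      }
    }
  }

  // If started like this, one failure kills the loop but not the process:
  //   pollLoop(fetchFromAzure, scheduleDependentTask)
  //     .catch(err => console.error('dependency resolver stopped:', err));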
Flags: needinfo?(jopsen)
Component: Integration → Queue
We fixed the queue dependency scheduling crashes yesterday, and added a bit of extra error reporting.
Let me know if this happens again, thanks.
Flags: needinfo?(jopsen)
Assignee: nobody → jopsen
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
This occurred again just now.  Garndt restarted the queue service and now jobs which were incorrectly in the "scheduled" state are in the "pending" state.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like Jonas is working on changes in this PR to make Heroku restart the app when this condition happens: https://github.com/taskcluster/taskcluster-queue/pull/98
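
The general shape of such a fix is to crash the process when the resolver stops making progress so that Heroku restarts the dyno.  The sketch below is an assumption about the approach, not the contents of the PR; the function name and threshold are made up.

  // Hypothetical sketch, not the contents of the PR: exit when the dependency
  // resolver has not reported progress recently, so Heroku restarts the dyno.
  const MAX_SILENCE_MS = 5 * 60 * 1000;  // assumed threshold
  let lastProgress = Date.now();

  // Call this from the resolver whenever it fetches or schedules something.
  function reportProgress() {
    lastProgress = Date.now();
  }

  setInterval(() => {
    if (Date.now() - lastProgress > MAX_SILENCE_MS) {
      console.error('dependency resolver appears hung; exiting so Heroku restarts the app');
      process.exit(1);
    }
  }, 60 * 1000).unref();  // unref() so this check alone does not keep the process alive

  module.exports = {reportProgress};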
Summary: No tests executed on integration branches via taskcluster for linux64 debug → No tests executed on all branches via taskcluster (hanging queue service)
It looks like we are hitting this issue again. No tests are getting scheduled across branches. Here for example inbound:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc

Last tests got run 3h ago. Since then no other tests are getting triggered.
(In reply to Henrik Skupin (:whimboo) from comment #14)
> It looks like we are hitting this issue again. No tests are getting
> scheduled across branches. Here for example inbound:
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=tc
> 
> Last tests got run 3h ago. Since then no other tests are getting triggered.

fx-team and mozilla-inbound are now closed for this
Pete restarted the queue, so we are backfilling the jobs now.  Those are the missing tasks from the last 5h, so we will hit AWS hard again.
@jonasfj shall we restart the queue on an hourly cron until this bug is resolved? The only downside I see is that anything that tries to hit a queue endpoint without a backoff strategy could be impacted.
Flags: needinfo?(jopsen)
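
If we did go the hourly-restart route suggested above, it could be as simple as a cron entry driving the Heroku CLI.  This is only a sketch; the app name is an assumption.

  # Hypothetical crontab entry; the app name "taskcluster-queue" is assumed.
  0 * * * * heroku ps:restart --app taskcluster-queue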
Trees reopened.
So it seems like we have a common failure pattern in TaskCluster services.  We have lots of timed iteration loops which all have roughly the same requirements.  If any of them gets stuck, we get into really bad situations like this one.  Right now, each tool has its own iteration logic.  That iteration logic really ought to be centralized so that we can share the fail-safes and make everything rugged.

What about making a node class which takes the following options (sketched in the example further below)?

  maxIterations: 0 or undef for infinite, a positive integer for a limited number of iterations
  maxFailures: undef for a default of 7, a positive integer for how many back-to-back failing iterations are allowed before crashing
  maxIterationTime: a positive integer of seconds stating the maximum amount of time an iteration can take before crashing
  minIterationTime: 0 or undef for no minimum, a positive integer for the minimum amount of time an iteration may take; crash if it takes less
  watchDog: a positive integer of seconds the watchdog should be set to
  waitTime: a positive integer of seconds for how long to wait between iteration attempts
  waitTimeAfterFail: 0 to use waitTime, a positive integer of seconds to wait after a failed iteration
  handler: an async function that resolves when an iteration is good and rejects when it is bad
  dmsConfig: undef, or an object with a deadman's snitch to hit after each iteration completes

The class would have the methods:

  start() -- start the loop
  stop() -- stop the loop after the next iteration

The handler function would be called as:
    
  let result = await this.handler(watchDog);

Passing the watchDog in as a parameter instead of binding the function to an object is so we can pass methods from an object that still use the correct `this`.  The watchdog is an instance of https://github.com/taskcluster/aws-provisioner/blob/master/src/watchdog.js, which will abort the iteration if it hasn't been touched in the required number of seconds.  The reason we pass it into the function instead of just handling it automatically is that some long-lasting iterations have repeating actions which can be used as a sign that they are not frozen.  Example: provisioner loops can take up to 20-ish minutes but submit something every half second; as long as those half-second submissions keep happening the loop isn't frozen and we should wait the full 20 minutes, but if they aren't happening at least once a minute, we should probably crash.

This watchdog and max-iteration-time system ensures that inside the process we try really, really hard for things not to fail, and that when they do fail, we fail as safely as possible.

The dmsConfig is used to configure an instance of a Deadman's Snitch (https://deadmanssnitch.com/) which is hit after each iteration.  This is an external service, so we'll be notified if the queue process is stuck without relying on the queue service itself to notify us.
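
Here is a minimal sketch of what such a class could look like, under the assumptions above.  The names (Iterate, WatchDog) and defaults are illustrative, not an existing library API; maxIterations, minIterationTime, and the actual Deadman's Snitch call are omitted for brevity.

  'use strict';

  // Watchdog in the spirit of aws-provisioner's src/watchdog.js: crash the
  // process if touch() is not called within `seconds`, so the platform
  // (e.g. Heroku) restarts the service instead of letting it hang.
  class WatchDog {
    constructor(seconds) {
      this.seconds = seconds;
      this.timer = null;
    }
    start() { this.touch(); }
    stop() { if (this.timer) { clearTimeout(this.timer); } }
    touch() {
      this.stop();
      this.timer = setTimeout(() => {
        console.error('watchdog expired: iteration appears frozen');
        process.exit(1);
      }, this.seconds * 1000);
    }
  }

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  class Iterate {
    constructor(opts) {
      this.maxFailures = opts.maxFailures === undefined ? 7 : opts.maxFailures;
      this.maxIterationTime = opts.maxIterationTime;  // seconds
      this.watchDogTime = opts.watchDog;              // seconds
      this.waitTime = opts.waitTime;                  // seconds
      this.waitTimeAfterFail = opts.waitTimeAfterFail || opts.waitTime;
      this.handler = opts.handler;                    // async (watchDog) => {...}
      this.failures = 0;
      this.keepGoing = false;
    }

    start() {
      this.keepGoing = true;
      this.run();  // intentionally not awaited; loops until stop() is called
    }

    stop() {
      this.keepGoing = false;
    }

    async run() {
      while (this.keepGoing) {
        const watchDog = new WatchDog(this.watchDogTime);
        // Hard upper bound on a single iteration, independent of the watchdog.
        const hardLimit = setTimeout(() => {
          console.error('iteration exceeded maxIterationTime');
          process.exit(1);
        }, this.maxIterationTime * 1000);
        watchDog.start();
        try {
          // The handler calls watchDog.touch() whenever it makes progress.
          await this.handler(watchDog);
          this.failures = 0;
          // The Deadman's Snitch URL from dmsConfig would be hit here.
        } catch (err) {
          console.error('iteration failed:', err);
          if (++this.failures >= this.maxFailures) {
            process.exit(1);  // too many back-to-back failures: crash and restart
          }
        } finally {
          clearTimeout(hardLimit);
          watchDog.stop();
        }
        await sleep((this.failures ? this.waitTimeAfterFail : this.waitTime) * 1000);
      }
    }
  }

  module.exports = {Iterate, WatchDog};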

I'd also like to provide a promise timeout function, which rejects a promise that takes too long to complete as a helper for writing robust services.
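
Something like the following sketch, where the helper name and signature are my assumptions rather than an existing API:

  // A sketch of the promise-timeout helper; rejects if the wrapped promise
  // does not settle within ms.
  function withTimeout(promise, ms, message) {
    let timer;
    const limit = new Promise((resolve, reject) => {
      timer = setTimeout(() => {
        reject(new Error(message || `promise timed out after ${ms}ms`));
      }, ms);
    });
    return Promise.race([promise, limit]).then(
      result => { clearTimeout(timer); return result; },
      err => { clearTimeout(timer); throw err; }
    );
  }

  // Hypothetical usage: fail a hung Azure poll instead of stalling the loop.
  //   const messages = await withTimeout(fetchFromAzure(), 30 * 1000, 'Azure poll hung');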

Doing these things for the provisioner has been extremely helpful in making it a lot more resilient.  Does anyone disagree with using something like this?
I'm definitely in favor; we have some similar stuff in Buildbot because we ran into many of the same kind of problems!  Want to move that to a new bug so we can concentrate on propping the queue up here?
I just started an email conversation about enabling more coalescing, which should also help us dig out of holes faster when this sort of thing happens.  We're currently at 8000 pending desktop-test tasks, up from 6600 about 10 minutes ago, so things are not looking rosy.
Closing try again to try to get ahead of this load before the US daytime really picks up.
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Closing try again to try to get ahead of this load before the US daytime
> really picks up.

inbound, fx-team, autoland, and try are now closed, and the tree-closing message points to this bug
See Also: → 1283461
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> I'm definitely in favor; we have some similar stuff in Buildbot because we
> ran into many of the same kind of problems!  Want to move that to a new bug
> so we can concentrate on propping the queue up here?

Created bug 1283461
We should be able to gradually reopen trees.  We cleared out the backlog a while ago and are waiting on new tasks.
Fixed in the queue, we should no longer see these issues...
Flags: needinfo?(jopsen)
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Component: Queue → Services