Closed Bug 1274929 Opened 8 years ago Closed 8 years ago

No tests executed on all branches via taskcluster (hanging queue service)

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: jonasfj)

References

Details

All test jobs for linux64 debug on mozilla-inbound via taskcluster stopped working on Sunday:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=d53726702252f8dea95878c721f8b2f652accb89&filter-searchStr=linux%20x64%20debug&selectedJob=28388961

The last successful push with all the tests executed is 8978551c29be on Sunday 2:42pm UTC.
I want to note that the decision graph contains all the test tasks, but somehow they are not being executed.

https://queue.taskcluster.net/v1/task/Bp8nVESeS-eZ1Q-awNzXug/runs/0/artifacts/public%2Ftask-graph.json
Component: Task Configuration → Integration
Looks like all integration branches are affected (including fx-team, try).
Summary: No tests executed on mozilla-inbound via taskcluster for linux64 debug → No tests executed on integration brances via taskcluster for linux64 debug
Summary: No tests executed on integration brances via taskcluster for linux64 debug → No tests executed on integration branches via taskcluster for linux64 debug
Peter pointed me to the new docs for the task graphs:
http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/taskgraph.html#graph-generation

Maybe this is related to the optimized task graph, which seems to be new?
I've called Dustin (it is 6:20am for him now - poor guy!) and left a message. I think we'll need to pull in the big guns to get this one resolved swiftly.
It's tree-closing for try, mozilla-inbound, and fx-team, so this is a blocker :(
Severity: normal → blocker
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Maybe this is some kind of corruption? The task group ids are different for
> the decision task and the build task. Take this example:
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=fe1a7608cfd77794732ce6d1f9a2ee7cc33e213f
> 
> Decision task:
> https://tools.taskcluster.net/task-inspector/#QbKLAN91QPy0XhxdSQA0zw/
> Build task:
> https://tools.taskcluster.net/task-inspector/#WYbksZGMS-OwGgaJu2KCYQ/
> 
> https://tools.taskcluster.net/task-graph-inspector/#KW6IBRtITDKC6BgohfuHVw/
> https://tools.taskcluster.net/task-graph-inspector/#ccrp678YT-6oFADQDB8meA/
> 
> KW6IBRtITDKC6BgohfuHVw != ccrp678YT-6oFADQDB8meA
> 
> Not sure where ccrp678YT-6oFADQDB8meA is getting from.

Right now there is a transition in how task graphs are scheduled.  Previously, tasks were scheduled as part of the same graph as the decision task, and this was done in bulk through the TaskCluster scheduler.

The new behavior is for the decision task to create each of these tasks within the TaskCluster queue so that they all share the same "task group id" (no longer a "graph"), with the dependencies defined within each task definition itself rather than in a larger graph.  When this happens, the decision task creates a task group ID which is different from the task graph ID that the decision task itself belongs to.

For the link that wasn't working, you can view it here in the task group inspector:
https://tools.taskcluster.net/task-group-inspector/#ccrp678YT-6oFADQDB8meA/

Changes will be rolled out in the future to redefine how we schedule decision tasks so that they are part of the same task group as the tasks they schedule.

I also believe there is a bug open to correct the link in the task inspector so that it opens the task group inspector instead of the graph inspector when the task is part of the new task group system.
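
To make the new model concrete, here is a minimal sketch of how a decision task might create a dependent test task in the same task group.  It is an illustration only, assuming the taskcluster-client Queue.createTask API and the slugid package; all IDs, worker types, and other values below are made up.

  // Illustration only, not the actual decision-task code.  Assumes the
  // taskcluster-client Queue.createTask API and the slugid package; all IDs,
  // worker types, and other values below are made up.
  const taskcluster = require('taskcluster-client');
  const slugid = require('slugid');

  async function main() {
    const queue = new taskcluster.Queue({
      credentials: {/* clientId, accessToken */},
      // newer client versions also require a rootUrl option here
    });

    const taskGroupId = slugid.nice();  // shared by every task the decision task creates
    const buildTaskId = slugid.nice();
    const testTaskId = slugid.nice();

    // The test task declares its dependency in its own definition; there is
    // no separate "graph" object any more.
    await queue.createTask(testTaskId, {
      taskGroupId,
      dependencies: [buildTaskId],   // only run after the build task resolves
      requires: 'all-completed',     // ...and only if it completes successfully
      provisionerId: 'aws-provisioner-v1',
      workerType: 'desktop-test',
      created: new Date().toJSON(),
      deadline: taskcluster.fromNow('1 day').toJSON(),
      payload: {/* docker image, command, artifacts, ... */},
      metadata: {
        name: 'example linux64 debug test task',
        description: 'illustration of taskGroupId + dependencies',
        owner: 'example@mozilla.com',
        source: 'https://example.invalid/decision-task',
      },
    });
  }

  main().catch(err => {
    console.error(err);
    process.exit(1);
  });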
From what I can tell, the portion of taskcluster-queue that is responsible for scheduling dependent tasks stopped doing so because of a connection issue with Azure that it experienced on Sunday [1].  After this connection issue, the dependency resolver within the queue was still running, but it was no longer correctly polling for tasks it should schedule.  Usually messages like [2] appear when things are working correctly, but those messages stopped appearing in the logs after the connection issue.

I restarted the queue and it appears to be processing the backlog of items.  I see multiple messages being fetched from Azure along with many pending-task messages being published.


ni? Jonas on this for him to take a look.


[1] Sun, 22 May 2016 20:49:14 GMT app:dependency-resolver Error: OperationTimedOutError: Operation could not be completed within the specified time. 
[2]  app:dependency-resolver Fetched 0 messages
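
For context, here is a hypothetical sketch (not the queue's actual dependency-resolver code) of how this kind of symptom can arise: a polling loop whose promise chain rejects once and is never restarted leaves the process alive but silently no longer fetching messages.

  // Hypothetical illustration, not the queue's actual dependency-resolver
  // code: a polling loop whose promise chain rejects once (e.g. on an Azure
  // OperationTimedOutError) and is never restarted leaves the process alive
  // but silently no longer logging "Fetched N messages".
  async function pollLoop(fetchMessages, handleMessage) {
    while (true) {
      const messages = await fetchMessages();  // a single throw here ends the loop for good
      console.log(`Fetched ${messages.length} messages`);
      for (const message of messages) {
        await handleMessage(message);
      }
    }
  }

  // If started like this, one failure kills the loop but not the process:
  //   pollLoop(fetchFromAzure, scheduleDependentTask)
  //     .catch(err => console.error('dependency resolver stopped:', err));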
Flags: needinfo?(jopsen)
Component: Integration → Queue
We fixed the queue dependency scheduling crashes yesterday, and added a bit of extra error reporting.
Let me know if this happens again, thanks.
Flags: needinfo?(jopsen)
Assignee: nobody → jopsen
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
This occurred again just now.  Garndt restarted the queue service and now jobs which were incorrectly in the "scheduled" state are in the "pending" state.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like Jonas is working on changes in this PR to make Heroku restart the app when this condition happens: https://github.com/taskcluster/taskcluster-queue/pull/98
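
The general shape of such a fix is to crash the process when the resolver stops making progress so that Heroku restarts the dyno.  The sketch below is an assumption about the approach, not the contents of the PR; the function name and threshold are made up.

  // Hypothetical sketch, not the contents of the PR: exit when the dependency
  // resolver has not reported progress recently, so Heroku restarts the dyno.
  const MAX_SILENCE_MS = 5 * 60 * 1000;  // assumed threshold
  let lastProgress = Date.now();

  // Call this from the resolver whenever it fetches or schedules something.
  function reportProgress() {
    lastProgress = Date.now();
  }

  setInterval(() => {
    if (Date.now() - lastProgress > MAX_SILENCE_MS) {
      console.error('dependency resolver appears hung; exiting so Heroku restarts the app');
      process.exit(1);
    }
  }, 60 * 1000).unref();  // unref() so this check alone does not keep the process alive

  module.exports = {reportProgress};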
Summary: No tests executed on integration branches via taskcluster for linux64 debug → No tests executed on all branches via taskcluster (hanging queue service)
It looks like we are hitting this issue again. No tests are getting scheduled across branches. Here for example inbound:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc

Last tests got run 3h ago. Since then no other tests are getting triggered.
(In reply to Henrik Skupin (:whimboo) from comment #14)
> It looks like we are hitting this issue again. No tests are getting
> scheduled across branches. Here for example inbound:
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=tc
> 
> Last tests got run 3h ago. Since then no other tests are getting triggered.

fx-team and mozilla-inbound are now closed for this
Pete restarted the queue, so we are backfilling the jobs now.  Those are the missing tasks from the last 5h, so we will hit AWS hard again.
@jonasfj shall we restart the queue on an hourly cron until this bug is resolved? The only downside I see is that anything that tries to hit a queue endpoint without a backoff strategy could be impacted.
Flags: needinfo?(jopsen)
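
If we did go the hourly-restart route suggested above, it could be as simple as a cron entry driving the Heroku CLI.  This is only a sketch; the app name is an assumption.

  # Hypothetical crontab entry; the app name "taskcluster-queue" is assumed.
  0 * * * * heroku ps:restart --app taskcluster-queue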
Trees reopened.
So it seems like we have a common failure pattern in TaskCluster services.  We have lots of timed iteration loops which all have roughly the same requirements.  If any of them gets stuck, we get into really bad situations like this one.  Right now, each tool has its own iteration logic.  That iteration logic really ought to be centralized so that we can share the fail-safes and make everything rugged.

What about making a node class which takes the following options (sketched in the example further below)?

  maxIterations: 0 or undef for infinite, a positive integer for a limited number of iterations
  maxFailures: undef for a default of 7, a positive integer for how many back-to-back failing iterations are allowed before crashing
  maxIterationTime: a positive integer of seconds stating the maximum amount of time an iteration can take before crashing
  minIterationTime: 0 or undef for no minimum, a positive integer for the minimum amount of time an iteration may take; crash if it takes less
  watchDog: a positive integer of seconds the watchdog should be set to
  waitTime: a positive integer of seconds for how long to wait between iteration attempts
  waitTimeAfterFail: 0 to use waitTime, a positive integer of seconds to wait after a failed iteration
  handler: an async function that resolves when an iteration is good and rejects when it is bad
  dmsConfig: undef, or an object with a deadman's snitch to hit after each iteration completes

The class would have the methods:

  start() -- start the loop
  stop() -- stop the loop after the next iteration

The handler function would be called as:
    
  let result = await this.handler(watchDog);

Passing the watchDog in as a parameter instead of binding the function to an object is so we can pass methods from an object that still use the correct `this`.  The watchdog is an instance of https://github.com/taskcluster/aws-provisioner/blob/master/src/watchdog.js, which will abort the iteration if it hasn't been touched in the required number of seconds.  The reason we pass it into the function instead of just handling it automatically is that some long-lasting iterations have repeating actions which can be used as a sign that they are not frozen.  Example: provisioner loops can take up to 20-ish minutes but submit something every half second; as long as those half-second submissions keep happening the loop isn't frozen and we should wait the full 20 minutes, but if they aren't happening at least once a minute, we should probably crash.

This watchdog and max-iteration-time system ensures that inside the process we try really, really hard for things not to fail, and that when they do fail, we fail as safely as possible.

The dmsConfig is used to configure an instance of a Deadman's Snitch (https://deadmanssnitch.com/) which is hit after each iteration.  This is an external service, so we'll be notified if the queue process is stuck without relying on the queue service itself to notify us.
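
Here is a minimal sketch of what such a class could look like, under the assumptions above.  The names (Iterate, WatchDog) and defaults are illustrative, not an existing library API; maxIterations, minIterationTime, and the actual Deadman's Snitch call are omitted for brevity.

  'use strict';

  // Watchdog in the spirit of aws-provisioner's src/watchdog.js: crash the
  // process if touch() is not called within `seconds`, so the platform
  // (e.g. Heroku) restarts the service instead of letting it hang.
  class WatchDog {
    constructor(seconds) {
      this.seconds = seconds;
      this.timer = null;
    }
    start() { this.touch(); }
    stop() { if (this.timer) { clearTimeout(this.timer); } }
    touch() {
      this.stop();
      this.timer = setTimeout(() => {
        console.error('watchdog expired: iteration appears frozen');
        process.exit(1);
      }, this.seconds * 1000);
    }
  }

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  class Iterate {
    constructor(opts) {
      this.maxFailures = opts.maxFailures === undefined ? 7 : opts.maxFailures;
      this.maxIterationTime = opts.maxIterationTime;  // seconds
      this.watchDogTime = opts.watchDog;              // seconds
      this.waitTime = opts.waitTime;                  // seconds
      this.waitTimeAfterFail = opts.waitTimeAfterFail || opts.waitTime;
      this.handler = opts.handler;                    // async (watchDog) => {...}
      this.failures = 0;
      this.keepGoing = false;
    }

    start() {
      this.keepGoing = true;
      this.run();  // intentionally not awaited; loops until stop() is called
    }

    stop() {
      this.keepGoing = false;
    }

    async run() {
      while (this.keepGoing) {
        const watchDog = new WatchDog(this.watchDogTime);
        // Hard upper bound on a single iteration, independent of the watchdog.
        const hardLimit = setTimeout(() => {
          console.error('iteration exceeded maxIterationTime');
          process.exit(1);
        }, this.maxIterationTime * 1000);
        watchDog.start();
        try {
          // The handler calls watchDog.touch() whenever it makes progress.
          await this.handler(watchDog);
          this.failures = 0;
          // The Deadman's Snitch URL from dmsConfig would be hit here.
        } catch (err) {
          console.error('iteration failed:', err);
          if (++this.failures >= this.maxFailures) {
            process.exit(1);  // too many back-to-back failures: crash and restart
          }
        } finally {
          clearTimeout(hardLimit);
          watchDog.stop();
        }
        await sleep((this.failures ? this.waitTimeAfterFail : this.waitTime) * 1000);
      }
    }
  }

  module.exports = {Iterate, WatchDog};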

I'd also like to provide a promise timeout function, which rejects a promise that takes too long to complete as a helper for writing robust services.
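
Something like the following sketch, where the helper name and signature are my assumptions rather than an existing API:

  // A sketch of the promise-timeout helper; rejects if the wrapped promise
  // does not settle within ms.
  function withTimeout(promise, ms, message) {
    let timer;
    const limit = new Promise((resolve, reject) => {
      timer = setTimeout(() => {
        reject(new Error(message || `promise timed out after ${ms}ms`));
      }, ms);
    });
    return Promise.race([promise, limit]).then(
      result => { clearTimeout(timer); return result; },
      err => { clearTimeout(timer); throw err; }
    );
  }

  // Hypothetical usage: fail a hung Azure poll instead of stalling the loop.
  //   const messages = await withTimeout(fetchFromAzure(), 30 * 1000, 'Azure poll hung');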

Doing these things for the provisioner has been extremely helpful in making it a lot more resilient.  Does anyone disagree with using something like this?
I'm definitely in favor; we have some similar stuff in Buildbot because we ran into many of the same kind of problems!  Want to move that to a new bug so we can concentrate on propping the queue up here?
I just started an email conversation about enabling more coalescing, which should also help us dig out of holes faster when this sort of thing happens.  We're currently at 8000 pending desktop-test tasks, up from 6600 about 10 minutes ago, so things are not looking rosy.
Closing try again to try to get ahead of this load before the US daytime really picks up.
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Closing try again to try to get ahead of this load before the US daytime
> really picks up.

inbound, fx-team, autoland, and try are now closed, and the tree-closing message points to this bug
See Also: → 1283461
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> I'm definitely in favor; we have some similar stuff in Buildbot because we
> ran into many of the same kind of problems!  Want to move that to a new bug
> so we can concentrate on propping the queue up here?

Created bug 1283461
We should be able to gradually reopen trees.  We cleared out the backlog a while ago and are waiting on new tasks.
Fixed in the queue, we should no longer see these issues...
Flags: needinfo?(jopsen)
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Component: Queue → Services