Closed
Bug 1274929
Opened 8 years ago
Closed 8 years ago
No tests executed on all branches via taskcluster (hanging queue service)
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: whimboo, Assigned: jonasfj)
Any test job for linux64 debug on mozilla-inbound via taskcluster stopped working on Sunday: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=d53726702252f8dea95878c721f8b2f652accb89&filter-searchStr=linux%20x64%20debug&selectedJob=28388961 The last successful push with all the tests executed is 8978551c29be on Sunday 2:42pm UTC.
Reporter
Comment 1•8 years ago
I want to note that the decision graph contains all the test tasks but somehow those are not getting executed. https://queue.taskcluster.net/v1/task/Bp8nVESeS-eZ1Q-awNzXug/runs/0/artifacts/public%2Ftask-graph.json
Reporter
Updated•8 years ago
Component: Task Configuration → Integration
Reporter
Comment 2•8 years ago
Looks like all integration branches are affected (including fxteam, try).
Summary: No tests executed on mozilla-inbound via taskcluster for linux64 debug → No tests executed on integration brances via taskcluster for linux64 debug
Reporter
Updated•8 years ago
Summary: No tests executed on integration brances via taskcluster for linux64 debug → No tests executed on integration branches via taskcluster for linux64 debug
Reporter
Comment 3•8 years ago
It looks like for both the build and hazard tasks the referenced task group id is invalid:

https://tools.taskcluster.net/task-inspector/#FMl8gG_5QdSBZqx1X6-tSQ/
https://tools.taskcluster.net/task-graph-inspector/#b2d8v17-QEK4O4EmpIB6DA/
Reporter
Comment 4•8 years ago
Maybe this is some kind of corruption? The task group ids are different for the decision task and the build task. Take this example:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=fe1a7608cfd77794732ce6d1f9a2ee7cc33e213f

Decision task: https://tools.taskcluster.net/task-inspector/#QbKLAN91QPy0XhxdSQA0zw/
Build task: https://tools.taskcluster.net/task-inspector/#WYbksZGMS-OwGgaJu2KCYQ/

https://tools.taskcluster.net/task-graph-inspector/#KW6IBRtITDKC6BgohfuHVw/
https://tools.taskcluster.net/task-graph-inspector/#ccrp678YT-6oFADQDB8meA/

KW6IBRtITDKC6BgohfuHVw != ccrp678YT-6oFADQDB8meA

Not sure where ccrp678YT-6oFADQDB8meA is coming from.
Reporter
Comment 5•8 years ago
Peter pointed me to the new docs for the task graphs: http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/taskgraph.html#graph-generation Maybe this is related to the optimized task graph, which seems to be new?
Comment 6•8 years ago
I've called Dustin (it is 6:20am for him now - poor guy!) and left a message. I think we'll need to pull in the big guns to get this one resolved swiftly.
Comment 7•8 years ago
It's tree-closing for try, mozilla-inbound and fx-team, so a blocker :(
Severity: normal → blocker
Comment 8•8 years ago
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Maybe this is some kind of corruption? The task group ids are different for
> the decision task and the build task. Take this example:
>
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=fe1a7608cfd77794732ce6d1f9a2ee7cc33e213f
>
> Decision task:
> https://tools.taskcluster.net/task-inspector/#QbKLAN91QPy0XhxdSQA0zw/
> Build task:
> https://tools.taskcluster.net/task-inspector/#WYbksZGMS-OwGgaJu2KCYQ/
>
> https://tools.taskcluster.net/task-graph-inspector/#KW6IBRtITDKC6BgohfuHVw/
> https://tools.taskcluster.net/task-graph-inspector/#ccrp678YT-6oFADQDB8meA/
>
> KW6IBRtITDKC6BgohfuHVw != ccrp678YT-6oFADQDB8meA
>
> Not sure where ccrp678YT-6oFADQDB8meA is getting from.

Right now there is a transition in how task graphs are being scheduled. Previously, they were scheduled as part of the same graph as the decision task, and they were submitted in bulk through the taskcluster scheduler. The new behavior is for the decision task to create each of these tasks within the Taskcluster queue so that they all share the same "task group id" (no longer a "graph"), with the dependencies defined within each task definition itself rather than in a larger graph. When this happens, the decision task creates a task group id which is different from the task graph id the decision task has.

For the link that wasn't working, you can view it here in the task group inspector:
https://tools.taskcluster.net/task-group-inspector/#ccrp678YT-6oFADQDB8meA/

Changes will be rolled out in the future to redefine how we schedule the decision tasks so that they are part of the same task group as the tasks they schedule. I also believe there is a bug open to correct the link on the task inspector to open the task group inspector instead of the graph inspector when the task is part of the new task group system.
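Under the new scheme, a test task's definition would carry the group and dependency information roughly like the fragment below. This is an illustrative sketch built from the ids in comment 4, not an actual task pulled from the queue, and the `requires` value is my assumption about the queue API's shape:

```json
{
  "taskGroupId": "ccrp678YT-6oFADQDB8meA",
  "dependencies": ["WYbksZGMS-OwGgaJu2KCYQ"],
  "requires": "all-completed"
}
```

The key point is that the dependency on the build task lives in the task definition itself, so the queue, not a separate scheduler graph, decides when the task becomes pending.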
Comment 9•8 years ago
s/heavior/behavior
Comment 10•8 years ago
From what I can tell, the portion of taskcluster-queue that is responsible for scheduling dependent tasks was not doing so because of a connection issue it experienced on Sunday with Azure [1]. After this connection issue, the dependency resolver within the queue was still running, but it was no longer correctly polling for the tasks it should schedule. Usually messages like [2] appear when things are working correctly, but those messages stopped appearing in the logs after the connection issue. I restarted the queue and it appears to be processing the backlog of items; I see multiple messages fetched from Azure along with many pending task messages that it is publishing. ni? Jonas on this for him to take a look.

[1] Sun, 22 May 2016 20:49:14 GMT app:dependency-resolver Error: OperationTimedOutError: Operation could not be completed within the specified time.
[2] app:dependency-resolver Fetched 0 messages
Flags: needinfo?(jopsen)
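A generic guard against exactly this failure mode is to race each poll against a timeout, so a hung call rejects loudly instead of stalling the loop forever. A minimal sketch, assuming a hypothetical `pollAzureQueue` function and an arbitrary 30-second limit (this is not the queue's actual code):

```javascript
// Reject `promise` with `message` if it does not settle within `ms`.
function withTimeout(promise, ms, message) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// One dependency-resolver iteration; pollAzureQueue is a hypothetical
// stand-in for the queue's Azure message fetch.
async function dependencyResolverIteration(pollAzureQueue) {
  const messages = await withTimeout(
    pollAzureQueue(), 30 * 1000,
    'dependency-resolver poll exceeded 30s');
  console.log(`app:dependency-resolver Fetched ${messages.length} messages`);
  return messages;
}
```

If the rejection is allowed to crash the process, the platform supervisor (Heroku here) restarts the dyno automatically, which is what the manual restarts in this bug did by hand.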
Updated•8 years ago
Component: Integration → Queue
Assignee
Comment 11•8 years ago
We fixed the queue dependency scheduling crashes yesterday, and added a bit of extra error reporting. Let me know if this happens again, thanks.
Flags: needinfo?(jopsen)
Updated•8 years ago
Assignee: nobody → jopsen
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 12•8 years ago
This occurred again just now. Garndt restarted the queue service and now jobs which were incorrectly in the "scheduled" state are in the "pending" state.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•8 years ago
It looks like Jonas is working on some changes in this PR to make Heroku restart the app when this condition happens: https://github.com/taskcluster/taskcluster-queue/pull/98
Reporter
Updated•8 years ago
Summary: No tests executed on integration branches via taskcluster for linux64 debug → No tests executed on all branches via taskcluster (hanging queue service)
Reporter
Comment 14•8 years ago
It looks like we are hitting this issue again. No tests are getting scheduled across branches. Here for example inbound: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc Last tests got run 3h ago. Since then no other tests are getting triggered.
Comment 15•8 years ago
(In reply to Henrik Skupin (:whimboo) from comment #14)
> It looks like we are hitting this issue again. No tests are getting
> scheduled across branches. Here for example inbound:
>
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc
>
> Last tests got run 3h ago. Since then no other tests are getting triggered.

fx-team and mozilla-inbound are now closed for this
Reporter
Comment 16•8 years ago
Pete restarted the queue, so the jobs are being backfilled now. Those are the missing tasks from the last 5 hours, so we will hit AWS massively again.
Comment 17•8 years ago
@jonasfj shall we restart the queue on an hourly cron until this bug is resolved? The only downside I see is that anything that tries to hit a queue endpoint without a backoff strategy could be impacted.
Flags: needinfo?(jopsen)
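The hourly restart proposed above could be as simple as a crontab entry driving the Heroku CLI. This is a sketch only; the app name here is a guess, and it assumes the box running cron has an authenticated Heroku CLI:

```
# Restart the queue dynos at the top of every hour until the root cause is fixed.
0 * * * * heroku ps:restart --app taskcluster-queue
```

As noted, the downside is that any client hitting a queue endpoint during the restart window needs a retry/backoff strategy.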
Comment 18•8 years ago
Trees reopened.
Comment 19•8 years ago
So it seems like we have a common failure pattern in taskcluster services. We have lots of timed iteration loops which all have roughly the same requirements, and if any of them gets stuck, we get into really bad situations like this one. Right now, each tool has its own iteration logic. The iteration logic used everywhere really ought to be centralized so that we can share the fail-safes and make everything rugged.

What about making a node class which takes the following options:

- maxIterations: 0 or undef for infinite, or a positive integer for a limited number
- maxFailures: undef for 7, or a positive integer saying how many back-to-back failing iterations are allowed before crashing
- maxIterationTime: positive integer of seconds stating the maximum amount of time an iteration can take before crashing
- minIterationTime: 0 or undef for no minimum, or a positive integer for the minimum amount of time an iteration can take; crash if it takes less
- watchDog: positive integer of the number of seconds the watchdog should be set to
- waitTime: positive integer of seconds for how long to wait between iteration attempts
- waitTimeAfterFail: 0 to use waitTime, or a positive integer to use instead
- handler: an async function that resolves when an iteration is good and rejects when bad
- dmsConfig: undef, or an object configuring a Dead Man's Snitch to hit after each iteration completes

The class would have the methods:

- start() -- start the loop
- stop() -- stop the loop after the next iteration

The handler function would be called as:

  let result = await this.handler(watchDog);

Passing the watchDog in as a parameter instead of binding the function to an object is so we can pass methods from an object that use the correct `this`. This watchdog is an instance of https://github.com/taskcluster/aws-provisioner/blob/master/src/watchdog.js, which will abort the iteration if it hasn't been touched in the required number of seconds.

The reason we pass it into the function instead of handling it automatically is that some long-lasting iterations have repeating actions which can be used as a sign that they are not frozen. Example: provisioner loops can take up to 20-ish minutes but submit something every half second. As long as those half-second actions keep happening, we should wait out the 20 minutes; if they aren't happening at least once a minute, we should probably crash. This watchdog plus the max-iteration-time limit ensures that inside the process we try really hard for things not to fail, but when they do, we fail as safely as possible.

The dmsConfig is used to configure an instance of a Dead Man's Snitch (https://deadmanssnitch.com/) which is hit after each iteration. This is an external service, so we'll be notified if the queue process is stuck without relying on the queue service itself to notify us.

I'd also like to provide a promise timeout function, which rejects a promise that takes too long to complete, as a helper for writing robust services. Doing these things for the provisioner has been extremely helpful in making it a lot more resilient. Does anyone disagree with using something like this?
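To make the proposal concrete, here is a hedged sketch of such a class. The name `Iterate` and all internals are illustrative rather than actual taskcluster code, and the watchDog parameter, minIterationTime, and Dead Man's Snitch reporting are omitted for brevity:

```javascript
class Iterate {
  constructor({maxIterations = 0, maxFailures = 7, maxIterationTime = 300,
               waitTime = 1, waitTimeAfterFail = 0, handler} = {}) {
    Object.assign(this, {maxIterations, maxFailures, maxIterationTime,
                         waitTime, waitTimeAfterFail, handler});
    this.stopped = false;
  }

  // Run one handler call, rejecting if it exceeds maxIterationTime seconds.
  async runOnce() {
    let timer;
    try {
      return await Promise.race([
        this.handler(),
        new Promise((resolve, reject) => {
          timer = setTimeout(() => reject(new Error('iteration timed out')),
                             this.maxIterationTime * 1000);
        }),
      ]);
    } finally {
      clearTimeout(timer);
    }
  }

  async start() {
    let iterations = 0;
    let failures = 0;
    while (!this.stopped) {
      try {
        await this.runOnce();
        failures = 0; // only back-to-back failures count
      } catch (err) {
        // Crash after maxFailures consecutive failures, so the process
        // supervisor (e.g. Heroku) restarts us instead of us hanging.
        if (++failures >= this.maxFailures) throw err;
      }
      if (this.maxIterations && ++iterations >= this.maxIterations) break;
      const wait = failures ? (this.waitTimeAfterFail || this.waitTime)
                            : this.waitTime;
      await new Promise(resolve => setTimeout(resolve, wait * 1000));
    }
  }

  stop() { this.stopped = true; }
}
```

A loop built this way fails safely in both directions: a hung handler is killed by the per-iteration timeout, and repeated failures escalate to a process crash rather than a silent stall.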
Comment 20•8 years ago
I'm definitely in favor; we have some similar stuff in Buildbot because we ran into many of the same kinds of problems! Want to move that to a new bug so we can concentrate on propping the queue up here?
Comment 21•8 years ago
I just started an email conversation about enabling more coalescing, which should also help us dig out of holes faster when this sort of thing happens. We're currently at 8000 pending desktop-test tasks, up from 6600 about 10 minutes ago, so things are not looking rosy.
Comment 22•8 years ago
Closing try again to try to get ahead of this load before the US daytime really picks up.
Comment 23•8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Closing try again to try to get ahead of this load before the US daytime
> really picks up.

inbound, fx-team, autoland and try are now closed, and the tree closing message points to this bug
Comment 24•8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> I'm definitely in favor; we have some similar stuff in Buildbot because we
> ran into many of the same kind of problems! Want to move that to a new bug
> so we can concentrate on propping the queue up here?

Created bug 1283461
Comment 25•8 years ago
We should be able to gradually reopen trees. We cleared out the backlog a while ago and are waiting on new tasks.
Assignee
Comment 26•8 years ago
Fixed in the queue, we should no longer see these issues...
Flags: needinfo?(jopsen)
Updated•8 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated•5 years ago
Component: Queue → Services