Bug 1517645 (Closed) - Opened 5 years ago, Closed 5 years ago

cancel-all action timing out

Categories: Taskcluster :: General
Type: enhancement
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla65

People: Reporter: apavel, Assignee: dustin

Attachments: 3 files

Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=usercancel%2Ctestfailed%2Cbusted%2Cexception&revision=0097778657282369dc3e4201818aff8c7ebfa9c3&selectedJob=219883495

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219883495&repo=autoland&lineNumber=28

[task 2019-01-03T23:29:35.851Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy
[task 2019-01-03T23:30:20.729Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.737Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.740Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.741Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.752Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.757Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.794Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.812Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.873Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.918Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.930Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.985Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.986Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.994Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.003Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.009Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.010Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.017Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.024Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.028Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.028Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.034Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.040Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.046Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.049Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.964Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.965Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.968Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:22.006Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:22.019Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.250Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.251Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.252Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.253Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.262Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.265Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.360Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.361Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.363Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.363Z] Connection pool is full, discarding connection: taskcluster
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2019-01-03 23:59:30.109Z] === Task Finished ===
[taskcluster 2019-01-03 23:59:30.109Z] Unsuccessful task run with exit code: -1 completed in 1803.258 seconds
We should add more logging. That output is just a warning from requests, so it doesn't explain what took so long. Better logging of the tasks being cancelled would probably show whether the action spent those 30 minutes cancelling or got wedged somehow.
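
For what it's worth, that warning comes from urllib3's "urllib3.connectionpool" logger, which also logs each request line at DEBUG level. A minimal sketch of turning that up while debugging the action -- just an illustration, not the eventual fix:

import logging

# Sketch only: urllib3 logs every request line at DEBUG on the
# "urllib3.connectionpool" logger (the same logger that emits the
# "Connection pool is full" warnings), so raising its level would show
# which /task/<taskId>/cancel calls are still in flight, if any.
logging.basicConfig(level=logging.INFO)
logging.getLogger("urllib3.connectionpool").setLevel(logging.DEBUG)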
Assignee: nobody → dustin
Summary: Gecko Decision Task action-callback failing with Connection pool is full, discarding connection: taskcluster → cancel-all action timing out
Looking in the queue logs, that task seems to have stopped making requests at:

Jan 04 21:27:16 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/LZmIzdo2Tw6HROn8X7T_pQ/cancel" host=queue.taskcluster.net request_id=93a6ece5-52bb-4e86-a33e-1dc4f62f6fcc fwd="34.201.30.0" dyno=web.5 connect=0ms service=1837ms status=200 bytes=1313 protocol=https 

(they're not retrying 500s or anything like that)
Compared with the decision task's code that calls queue.createTask, we're misusing the session a bit by not attaching a big enough connection pool. I'm surprised that requests' behavior is to hang when that happens, though.
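
A minimal sketch of what sizing the session's pool to match the thread count could look like (the names and the CONCURRENCY value here are assumptions, not the actual taskgraph setup):

import requests
from requests.adapters import HTTPAdapter

CONCURRENCY = 50  # assumed value; whatever the action passes to ThreadPoolExecutor

# One urllib3 pool slot per worker thread.  With the default pool_maxsize
# of 10, connections beyond the first 10 are discarded when returned (the
# "Connection pool is full" warning above) and have to be re-established.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=CONCURRENCY, pool_maxsize=CONCURRENCY)
session.mount('https://', adapter)
session.mount('http://', adapter)

(pool_block=True would instead make threads wait for a free connection, but that shouldn't be necessary once the pool is at least as large as the executor.)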
I can't understand why, but adding

diff --git a/taskcluster/taskgraph/actions/cancel_all.py b/taskcluster/taskgraph/actions/cancel_all.py
--- a/taskcluster/taskgraph/actions/cancel_all.py
+++ b/taskcluster/taskgraph/actions/cancel_all.py
@@ -38,10 +38,13 @@ def cancel_all_action(parameters, graph_
         logger.info('Cancelling task {}'.format(task_id))
         cancel_task(task_id, use_proxy=True)

     own_task_id = os.environ.get('TASK_ID', '')
     to_cancel = [t for t in list_task_group(task_group_id) if t != own_task_id]
     logger.info("Cancelling {} tasks".format(len(to_cancel)))
     with futures.ThreadPoolExecutor(CONCURRENCY) as e:
         cancel_futs = [e.submit(do_cancel_task, t) for t in to_cancel]
+        finished = 0
         for f in futures.as_completed(cancel_futs):
             f.result()
+            finished = finished + 1
+            logger.info("{} cancellations complete".format(finished))


causes the task to hang after only one line of logging from the mach command:

[task 2019-01-04T23:40:44.798Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy

Oh, that's because I broke list_task_group! Should have a fix shortly.

Yikes, sorry for the delay -- I thought I'd clicked Lando for this.

Oh, right, Lando couldn't rebase it, but I couldn't figure out why. It still can't.

