Closed Bug 1517645 Opened 5 years ago, Closed 5 years ago

cancel-all action timing out

Categories: Taskcluster :: General (enhancement)
Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla65
People: Reporter: apavel; Assignee: dustin
Attachments: 3 files
Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=usercancel%2Ctestfailed%2Cbusted%2Cexception&revision=0097778657282369dc3e4201818aff8c7ebfa9c3&selectedJob=219883495

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219883495&repo=autoland&lineNumber=28

[task 2019-01-03T23:29:35.851Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy
[task 2019-01-03T23:30:20.729Z] Connection pool is full, discarding connection: taskcluster
[... the same warning repeats roughly 40 more times between 23:30:20 and 23:30:23 ...]
[task 2019-01-03T23:30:23.363Z] Connection pool is full, discarding connection: taskcluster
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2019-01-03 23:59:30.109Z] === Task Finished ===
[taskcluster 2019-01-03 23:59:30.109Z] Unsuccessful task run with exit code: -1 completed in 1803.258 seconds
Comment 1 • 5 years ago (Assignee)
We should add more logging. That message is just a warning from requests. I'm not sure what took so long; better logging of the tasks being cancelled would probably show whether the action was cancelling for those 30 minutes or got wedged somehow.
Assignee: nobody → dustin
Summary: Gecko Decision Task action-callback failing with Connection pool is full, discarding connection: taskcluster → cancel-all action timing out
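The kind of per-task progress logging suggested here can be sketched with the same `concurrent.futures` pattern the action uses. This is a minimal illustration, not the actual taskgraph code: `do_cancel_task` and `CONCURRENCY` are stand-ins for the real helpers.

```python
import logging
from concurrent import futures

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

CONCURRENCY = 50  # assumed thread-pool size, not the real constant


def cancel_with_progress(task_ids, do_cancel_task):
    """Cancel tasks concurrently, logging each completion so a stall
    shows up in the task log instead of 30 minutes of silence."""
    with futures.ThreadPoolExecutor(CONCURRENCY) as executor:
        # Map each future back to its task id so we can name it in the log.
        cancel_futs = {executor.submit(do_cancel_task, t): t for t in task_ids}
        finished = 0
        for fut in futures.as_completed(cancel_futs):
            fut.result()  # re-raise any error from the cancel call
            finished += 1
            logger.info("Cancelled %s (%d/%d)",
                        cancel_futs[fut], finished, len(cancel_futs))
```

With this in place, a wedged cancellation is visible as the point where the "(n/total)" counter stops advancing.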
Comment 2 • 5 years ago
The cancel-all job failed for 2 out of 3 pushes, which I cancelled in less than a minute (I've done similar actions in the past): https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=usercancel%2Cretry%2Ctestfailed%2Cbusted%2Cexception&group_state=expanded&tochange=c11865063e01c0f64acbcd935c82e4aeb0b3b092&fromchange=ec8667678f3a7fa4d6259b070a35a4c4a5c14209
Comment 3 • 5 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=369171679559880fb4e4f52a5545f8b5e150a1e8
Comment 4 • 5 years ago (Assignee)
I added some debugging, and it seems to hang for me too: https://tools.taskcluster.net/groups/M2rlmngqQCCoJG5o99Ol7w/tasks/SnpAcQihSXiGvq5Gb44CFw/runs/0/logs/public%2Flogs%2Flive.log
Comment 5 • 5 years ago (Assignee)
Looking in the queue logs, that task seems to have stopped making requests at:

Jan 04 21:27:16 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/LZmIzdo2Tw6HROn8X7T_pQ/cancel" host=queue.taskcluster.net request_id=93a6ece5-52bb-4e86-a33e-1dc4f62f6fcc fwd="34.201.30.0" dyno=web.5 connect=0ms service=1837ms status=200 bytes=1313 protocol=https

(They're not retrying 500s or anything like that.)
Comment 6 • 5 years ago (Assignee)
Comparing to the decision task's code that calls queue.createTask, we're misusing the session a little by not attaching a large enough connection pool. I'm surprised that requests' behavior is to hang when that happens, though.
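For reference, sizing the session's pool to match the thread count avoids the "Connection pool is full" warning. A minimal sketch, not the actual taskgraph code — the `CONCURRENCY` value is an assumption and should match the thread-pool size in use:

```python
import requests
from requests.adapters import HTTPAdapter

CONCURRENCY = 50  # assumed; should match the ThreadPoolExecutor size


def make_session(pool_size=CONCURRENCY):
    # By default an HTTPAdapter keeps only 10 pooled connections per host.
    # With more worker threads than that, requests opens extra connections
    # and discards them on return ("Connection pool is full, discarding
    # connection"). Mounting a larger pool lets all threads reuse
    # connections instead.
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session
```

Note the warning itself is harmless (the request still succeeds); it just signals the pool is undersized for the concurrency level.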
Comment 7 • 5 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=34b49c1e0792d4e5942fe5e09b5da82c5197c714
Comment 8 • 5 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=49b44b91beca9b5231a7dda776b38eb9b8ccd353
Comment 9 • 5 years ago (Assignee)
I can't understand why, but adding

diff --git a/taskcluster/taskgraph/actions/cancel_all.py b/taskcluster/taskgraph/actions/cancel_all.py
--- a/taskcluster/taskgraph/actions/cancel_all.py
+++ b/taskcluster/taskgraph/actions/cancel_all.py
@@ -38,10 +38,13 @@ def cancel_all_action(parameters, graph_
         logger.info('Cancelling task {}'.format(task_id))
         cancel_task(task_id, use_proxy=True)

     own_task_id = os.environ.get('TASK_ID', '')
     to_cancel = [t for t in list_task_group(task_group_id) if t != own_task_id]
     logger.info("Cancelling {} tasks".format(len(to_cancel)))
     with futures.ThreadPoolExecutor(CONCURRENCY) as e:
         cancel_futs = [e.submit(do_cancel_task, t) for t in to_cancel]
+        finished = 0
         for f in futures.as_completed(cancel_futs):
             f.result()
+            finished = finished + 1
+            logger.info("{} cancellations complete".format(finished))

causes the task to hang after only one line of logging from the mach command:

[task 2019-01-04T23:40:44.798Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy
Comment hidden (Intermittent Failures Robot)
Comment 11 • 5 years ago (Assignee)
Oh, that's because I broke list_task_group! I should have a fix shortly.
Comment 12 • 5 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fdefc94ac947e5f31a39ff9dc72bc9c47cb7ae18
Comment 13 • 5 years ago (Assignee)
Comment 14 • 5 years ago (Assignee)
Comment 15 • 5 years ago (Assignee)
Comment 16 • 5 years ago (Assignee)
Yikes, sorry for the delay -- I thought I'd clicked Lando for this.
Comment 17 • 5 years ago (Assignee)
Oh, right, Lando couldn't rebase it, and I couldn't figure out why. It still can't.
Comment 18 • 5 years ago (Assignee)
https://hg.mozilla.org/integration/mozilla-inbound/rev/96a7386c6e5bda914f59fd651b01ec7c03b644d9
Bug 1517645 - log each task as it is cancelled
https://hg.mozilla.org/integration/mozilla-inbound/rev/53380666b5c70c68da513dc8ba0c8ff7dfca48c5
Bug 1517645 - always use a highly concurrent requests session
https://hg.mozilla.org/integration/mozilla-inbound/rev/e256dba8d711838fdb983f287d8f0311241cb8ff
Bug 1517645 - fix list_task_group and rename it
Comment 19 • 5 years ago (Assignee)
https://hg.mozilla.org/integration/mozilla-inbound/rev/26624f7c3b4d591883f159fc2c7ee3900049a836
Bug 1517645 - fix lint issues in previous push; a=bustage CLOSED TREE
Comment 20 • 5 years ago (bugherder)
https://hg.mozilla.org/mozilla-central/rev/96a7386c6e5b
https://hg.mozilla.org/mozilla-central/rev/53380666b5c7
https://hg.mozilla.org/mozilla-central/rev/e256dba8d711
https://hg.mozilla.org/mozilla-central/rev/26624f7c3b4d
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Comment 21 • 5 years ago (bugherder uplift)