Bug 1517645 (Closed) - Opened 5 years ago, Closed 5 years ago

cancel-all action timing out

Categories: Taskcluster :: General
Type: enhancement
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla65

People: Reporter: apavel, Assignee: dustin

Attachments: 3 files

Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=usercancel%2Ctestfailed%2Cbusted%2Cexception&revision=0097778657282369dc3e4201818aff8c7ebfa9c3&selectedJob=219883495

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219883495&repo=autoland&lineNumber=28

[task 2019-01-03T23:29:35.851Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy
[task 2019-01-03T23:30:20.729Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.737Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.740Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.741Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.752Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.757Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.794Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.812Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.873Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.918Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.930Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.985Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.986Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:20.994Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.003Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.009Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.010Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.017Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.024Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.028Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.028Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.034Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.040Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.046Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.049Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.964Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.965Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:21.968Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:22.006Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:22.019Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.250Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.251Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.252Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.253Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.262Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.265Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.360Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.361Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.363Z] Connection pool is full, discarding connection: taskcluster
[task 2019-01-03T23:30:23.363Z] Connection pool is full, discarding connection: taskcluster
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2019-01-03 23:59:30.109Z] === Task Finished ===
[taskcluster 2019-01-03 23:59:30.109Z] Unsuccessful task run with exit code: -1 completed in 1803.258 seconds
We should add more logging. That output is just a warning from requests, so it doesn't explain what took so long. Better logging of the tasks being cancelled would probably show whether the action spent those 30 minutes cancelling or got wedged somehow.
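
For what it's worth, that warning comes from urllib3's "urllib3.connectionpool" logger, which also logs each request line at DEBUG level. A minimal sketch of turning that up while debugging the action -- just an illustration, not the eventual fix:

import logging

# Sketch only: urllib3 logs every request line at DEBUG on the
# "urllib3.connectionpool" logger (the same logger that emits the
# "Connection pool is full" warnings), so raising its level would show
# which /task/<taskId>/cancel calls are still in flight, if any.
logging.basicConfig(level=logging.INFO)
logging.getLogger("urllib3.connectionpool").setLevel(logging.DEBUG)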
Assignee: nobody → dustin
Summary: Gecko Decision Task action-callback failing with Connection pool is full, discarding connection: taskcluster → cancel-all action timing out
Looking in the queue logs, that task seems to have stopped making requests at:

Jan 04 21:27:16 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/LZmIzdo2Tw6HROn8X7T_pQ/cancel" host=queue.taskcluster.net request_id=93a6ece5-52bb-4e86-a33e-1dc4f62f6fcc fwd="34.201.30.0" dyno=web.5 connect=0ms service=1837ms status=200 bytes=1313 protocol=https 

(they're not retrying 500s or anything like that)
Compared with the decision task's code that calls queue.createTask, we're misusing the session a bit by not attaching a big enough connection pool. I'm surprised that requests' behavior is to hang when that happens, though.
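
A minimal sketch of what sizing the session's pool to match the thread count could look like (the names and the CONCURRENCY value here are assumptions, not the actual taskgraph setup):

import requests
from requests.adapters import HTTPAdapter

CONCURRENCY = 50  # assumed value; whatever the action passes to ThreadPoolExecutor

# One urllib3 pool slot per worker thread.  With the default pool_maxsize
# of 10, connections beyond the first 10 are discarded when returned (the
# "Connection pool is full" warning above) and have to be re-established.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=CONCURRENCY, pool_maxsize=CONCURRENCY)
session.mount('https://', adapter)
session.mount('http://', adapter)

(pool_block=True would instead make threads wait for a free connection, but that shouldn't be necessary once the pool is at least as large as the executor.)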
I can't understand why, but adding

diff --git a/taskcluster/taskgraph/actions/cancel_all.py b/taskcluster/taskgraph/actions/cancel_all.py
--- a/taskcluster/taskgraph/actions/cancel_all.py
+++ b/taskcluster/taskgraph/actions/cancel_all.py
@@ -38,10 +38,13 @@ def cancel_all_action(parameters, graph_
         logger.info('Cancelling task {}'.format(task_id))
         cancel_task(task_id, use_proxy=True)

     own_task_id = os.environ.get('TASK_ID', '')
     to_cancel = [t for t in list_task_group(task_group_id) if t != own_task_id]
     logger.info("Cancelling {} tasks".format(len(to_cancel)))
     with futures.ThreadPoolExecutor(CONCURRENCY) as e:
         cancel_futs = [e.submit(do_cancel_task, t) for t in to_cancel]
+        finished = 0
         for f in futures.as_completed(cancel_futs):
             f.result()
+            finished = finished + 1
+            logger.info("{} cancellations complete".format(finished))


causes the task to hang after only one line of logging from the mach command:

[task 2019-01-04T23:40:44.798Z] Running in Taskcluster instance https://taskcluster.net with taskcluster-proxy

Oh, that's because I broke list_task_group! Should have a fix shortly.

Yikes, sorry for the delay -- I thought I'd clicked Lando for this.

Oh, right, Lando couldn't rebase it, but I couldn't figure out why. It still can't.

