Open Bug 1420643 Opened 7 years ago Updated 7 years ago

Intermittent gecko decision tasks failing with ReadTimeout: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Read timed out. (read timeout=5)

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: aryx, Unassigned)

References

Details

+++ This bug was initially created as a clone of Bug #1417719 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=147647781&repo=mozilla-inbound

[task 2017-11-26T00:00:22.910648Z] Generating target task graph
[task 2017-11-26T00:00:22.917192Z] Adding 0 tasks with `always_target` attribute
[task 2017-11-26T00:00:22.933443Z] Generating optimized task graph
[task 2017-11-26T00:00:22.937789Z] Querying version control for metadata: https://hg.mozilla.org/integration/mozilla-inbound/json-automationrelevance/3e14872b31a7b1b207605d09b78fbaaf21f1bba7
[task 2017-11-26T00:00:22.937877Z] attempt 1/2
[task 2017-11-26T00:00:22.938124Z] retry: calling get_automationrelevance, attempt #1
[task 2017-11-26T00:00:22.945783Z] Starting new HTTPS connection (1): hg.mozilla.org
[task 2017-11-26T00:00:28.039379Z] retry: Caught exception:
[task 2017-11-26T00:00:28.039528Z] sleeping for 10.00s (attempt 1/2)
[task 2017-11-26T00:00:38.049886Z] attempt 2/2
[task 2017-11-26T00:00:38.050028Z] retry: calling get_automationrelevance, attempt #2
[task 2017-11-26T00:00:38.051405Z] Starting new HTTPS connection (1): hg.mozilla.org
[task 2017-11-26T00:00:43.081313Z] retry: Caught exception:
[task 2017-11-26T00:00:43.081414Z] retry: Giving up on get_automationrelevance
[task 2017-11-26T00:00:43.082160Z] Traceback (most recent call last):
[task 2017-11-26T00:00:43.082345Z]   File "/builds/worker/checkouts/gecko/taskcluster/mach_commands.py", line 165, in taskgraph_decision
[task 2017-11-26T00:00:43.082792Z]     return taskgraph.decision.taskgraph_decision(options)
[task 2017-11-26T00:00:43.082852Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/decision.py", line 128, in taskgraph_decision
[task 2017-11-26T00:00:43.082901Z]     write_artifact('task-graph.json', tgg.morphed_task_graph.to_json())
[task 2017-11-26T00:00:43.083119Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/generator.py", line 213, in morphed_task_graph
[task 2017-11-26T00:00:43.083161Z]     return self._run_until('morphed_task_graph')
[task 2017-11-26T00:00:43.083240Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/generator.py", line 319, in _run_until
[task 2017-11-26T00:00:43.083416Z]     k, v = self._run.next()
[task 2017-11-26T00:00:43.083464Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/generator.py", line 306, in _run
[task 2017-11-26T00:00:43.083520Z]     existing_tasks=existing_tasks)
[task 2017-11-26T00:00:43.083569Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/optimize.py", line 55, in optimize_task_graph
[task 2017-11-26T00:00:43.083595Z]     do_not_optimize=do_not_optimize)
[task 2017-11-26T00:00:43.083644Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/optimize.py", line 126, in remove_tasks
[task 2017-11-26T00:00:43.083678Z]     if opt.should_remove_task(task, params, arg):
[task 2017-11-26T00:00:43.083881Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/optimize.py", line 383, in should_remove_task
[task 2017-11-26T00:00:43.083961Z]     scheduled = self.scheduled_by_push(params['head_repository'], params['head_rev'])
[task 2017-11-26T00:00:43.084018Z]   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/util.py", line 944, in method_call
[task 2017-11-26T00:00:43.084400Z]     cache[args] = self.func(instance, *args)
[task 2017-11-26T00:00:43.084483Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/optimize.py", line 365, in scheduled_by_push
[task 2017-11-26T00:00:43.084764Z]     changed_files = files_changed.get_changed_files(repository, revision)
[task 2017-11-26T00:00:43.084828Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/files_changed.py", line 33, in get_changed_files
[task 2017-11-26T00:00:43.084870Z]     contents = retry(get_automationrelevance, attempts=2, sleeptime=10)
[task 2017-11-26T00:00:43.084914Z]   File "/builds/worker/checkouts/gecko/third_party/python/redo/redo/__init__.py", line 162, in retry
[task 2017-11-26T00:00:43.084938Z]     return action(*args, **kwargs)
[task 2017-11-26T00:00:43.085203Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/files_changed.py", line 31, in get_automationrelevance
[task 2017-11-26T00:00:43.085272Z]     response = requests.get(url, timeout=5)
[task 2017-11-26T00:00:43.085321Z]   File "/builds/worker/checkouts/gecko/third_party/python/requests/requests/api.py", line 67, in get
[task 2017-11-26T00:00:43.085359Z]     return request('get', url, params=params, **kwargs)
[task 2017-11-26T00:00:43.085512Z]   File "/builds/worker/checkouts/gecko/third_party/python/requests/requests/api.py", line 53, in request
[task 2017-11-26T00:00:43.085549Z]     return session.request(method=method, url=url, **kwargs)
[task 2017-11-26T00:00:43.085594Z]   File "/builds/worker/checkouts/gecko/third_party/python/requests/requests/sessions.py", line 468, in request
[task 2017-11-26T00:00:43.085621Z]     resp = self.send(prep, **send_kwargs)
[task 2017-11-26T00:00:43.085666Z]   File "/builds/worker/checkouts/gecko/third_party/python/requests/requests/sessions.py", line 576, in send
[task 2017-11-26T00:00:43.085691Z]     r = adapter.send(request, **kwargs)
[task 2017-11-26T00:00:43.085736Z]   File "/builds/worker/checkouts/gecko/third_party/python/requests/requests/adapters.py", line 449, in send
[task 2017-11-26T00:00:43.085763Z]     raise ReadTimeout(e, request=request)
[task 2017-11-26T00:00:43.085806Z] ReadTimeout: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Read timed out. (read timeout=5)
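For reference, a minimal sketch of the code path the traceback goes through, reconstructed from the frames above (an illustration, not the verbatim contents of taskcluster/taskgraph/files_changed.py; the URL construction is inferred from the "Querying version control for metadata" line):

# Approximate reconstruction of the failing path, for illustration only.
import requests
from redo import retry

def get_changed_files(repository, revision):
    # e.g. https://hg.mozilla.org/integration/mozilla-inbound/json-automationrelevance/<rev>
    url = '%s/json-automationrelevance/%s' % (repository.rstrip('/'), revision)

    def get_automationrelevance():
        # The 5-second read timeout here is what raises the ReadTimeout above.
        response = requests.get(url, timeout=5)
        return response.json()

    # redo.retry makes two attempts 10 seconds apart and then re-raises,
    # matching the "attempt 1/2 ... attempt 2/2 ... Giving up" lines in the log.
    return retry(get_automationrelevance, attempts=2, sleeptime=10)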
I've hit this at least three times in the last two days... Greg, do you know anything about this setup, or who would?
Flags: needinfo?(gps)
The code already retries the connection on timeout. That's good: 99% of failures should be intermittent, and the retry should paper over them. But the fact that you're hitting this more reliably is a bit concerning. Could you please link to logs where this is happening so I can correlate with the server logs and see what's going on?
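(If loosening the client side ever turns out to be necessary, the change would presumably look something like the sketch below; the 30-second timeout, the attempt count, and the backoff factor are illustrative assumptions, not values agreed on in this bug.)

# Hypothetical mitigation sketch with illustrative, assumed values.
import requests
from redo import retry

def get_automationrelevance(url):
    # Assumed: a more generous read timeout than the current 5 seconds.
    response = requests.get(url, timeout=30)
    return response.json()

url = ('https://hg.mozilla.org/integration/mozilla-inbound/'
       'json-automationrelevance/3e14872b31a7b1b207605d09b78fbaaf21f1bba7')
# Assumed: more attempts with a growing sleep (redo.retry's sleepscale) to
# paper over transient server-side slowness.
contents = retry(get_automationrelevance, attempts=5, sleeptime=10,
                 sleepscale=1.5, args=(url,))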
Flags: needinfo?(gps) → needinfo?(bzbarsky)
For what it's worth, it hasn't happened to me since comment 2. Here are the logs I see from groveling through my try pushes:

https://taskcluster-artifacts.net/ADzbEzCZSEaHWtkr114_kQ/0/public/logs/live_backing.log
https://taskcluster-artifacts.net/bj_UqMb0TGaEA79dZ6Ka-Q/0/public/logs/live_backing.log
https://taskcluster-artifacts.net/eYrHpgvYSZ68E_Zfw5yN6w/0/public/logs/live_backing.log

I have 7 total "Task failed" and "Task exception" mails for "Gecko Decision Task", but they don't clearly link back to the push, so I can't tell whether they were caused by me canceling before the D task was done, by including a nonexistent platform (which trychooser will happily do for you), or by something else. The three above are red D jobs on treeherder that were definitely not canceled and not the "unknown platform" thing. Thank you for looking into this!
Flags: needinfo?(bzbarsky)
See Also: → 1473734