Bug 1452927 (Open) · Opened 2 years ago · Updated 4 months ago

Intermittent partials concurrent.futures._base.TimeoutError

Categories: Release Engineering :: General
Type: defect
Priority: not set
Severity: normal
Tracking: not tracked
Reporter: nataliaCs
Assignee: sfraser
Keywords: leave-open
Attachments: 1 file

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=172849870&repo=mozilla-central

2018-04-10 12:05:33,157 - INFO - Downloading https://hg.mozilla.org/mozilla-central/raw-file/default/tools/update-packaging/unwrap_full_update.pl to /tmp/tmp1eqix4v3/unwrap_full_update.pl
2018-04-10 12:06:46,980 - INFO - Downloading https://hg.mozilla.org/mozilla-central/raw-file/default/tools/update-packaging/unwrap_full_update.pl to /tmp/tmp1eqix4v3/unwrap_full_update.pl
2018-04-10 12:08:12,510 - INFO - Downloading https://hg.mozilla.org/mozilla-central/raw-file/default/tools/update-packaging/unwrap_full_update.pl to /tmp/tmp1eqix4v3/unwrap_full_update.pl
2018-04-10 12:10:09,551 - INFO - Downloading https://hg.mozilla.org/mozilla-central/raw-file/default/tools/update-packaging/unwrap_full_update.pl to /tmp/tmp1eqix4v3/unwrap_full_update.pl
2018-04-10 12:12:49,807 - INFO - Downloading https://hg.mozilla.org/mozilla-central/raw-file/default/tools/update-packaging/unwrap_full_update.pl to /tmp/tmp1eqix4v3/unwrap_full_update.pl
2018-04-10 12:13:50,865 - WARNING - retry_async: <function download at 0x7f21cc158bf8>: too many retries!
Traceback (most recent call last):
  File "/home/worker/bin/funsize.py", line 504, in <module>
    main()
  File "/home/worker/bin/funsize.py", line 478, in main
    manifest = loop.run_until_complete(async_main(args, signing_certs))
  File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
    return future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/home/worker/bin/funsize.py", line 405, in async_main
    await workenv.setup()
  File "/home/worker/bin/funsize.py", line 245, in setup
    await retry_download(url, dest=self.paths[filename], mode=0o755)
  File "/home/worker/bin/funsize.py", line 104, in retry_download
    kwargs=kwargs
  File "/usr/local/lib/python3.5/dist-packages/scriptworker/utils.py", line 252, in retry_async
    return await func(*args, **kwargs)
  File "/home/worker/bin/funsize.py", line 113, in download
    async with session.get(url, timeout=60) as resp:
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 690, in __aenter__
    self._resp = yield from self._coro
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 341, in _request
    break
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/helpers.py", line 727, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
[taskcluster 2018-04-10 12:13:51.172Z] === Task Finished ===
[taskcluster 2018-04-10 12:13:51.236Z] Artifact "public/build/gu-IN/target.partial-2.mar" not found at "/home/worker/artifacts/target.partial-2.mar"
[taskcluster 2018-04-10 12:13:51.312Z] Artifact "public/build/gu-IN/target.partial-1.mar" not found at "/home/worker/artifacts/target.partial-1.mar"
[taskcluster 2018-04-10 12:13:51.368Z] Artifact "public/build/gu-IN/manifest.json" not found at "/home/worker/artifacts/manifest.json"
[taskcluster 2018-04-10 12:13:51.424Z] Artifact "public/build/gu-IN/target.partial-4.mar" not found at "/home/worker/artifacts/target.partial-4.mar"
[taskcluster 2018-04-10 12:13:51.488Z] Artifact "public/build/gu-IN/target.partial-3.mar" not found at "/home/worker/artifacts/target.partial-3.mar"
[taskcluster 2018-04-10 12:13:52.011Z] Unsuccessful task run with exit code: 1 completed in 527.45 seconds
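The failure surfaces as `concurrent.futures._base.TimeoutError` because, on Python 3.5, `asyncio.TimeoutError` is an alias of that class: the `timeout=60` passed to `session.get()` expired before the response completed, and aiohttp raised the timeout from its helper context manager. The mechanism can be reproduced with only the standard library (a minimal sketch using `asyncio.wait_for` as a loose analogue of aiohttp's per-request timeout; the function names here are illustrative, not funsize.py code):

```python
import asyncio

async def slow_body_read():
    # Stands in for an HTTP response whose body never finishes arriving.
    await asyncio.sleep(10)

async def fetch_with_deadline():
    try:
        # Cancel the read and raise asyncio.TimeoutError once the deadline
        # expires; aiohttp's timeout= option fails in a similar way.
        await asyncio.wait_for(slow_body_read(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"
    return "ok"

print(asyncio.run(fetch_with_deadline()))
```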
See Also: → 1430600
Assignee: nobody → sfraser
These seem to be legitimate errors. The timing is spread out a bit, so I'm not convinced it's the number of concurrent tasks. I'll add more retry catching, but it feels like a band-aid over an underlying issue.
Attempt to get more information about download timeouts, and
also retry the partial generation if download timeouts happen too often.
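The retry-on-timeout pattern described above can be sketched with only the standard library; `retry_async` and `flaky_download` here are illustrative stand-ins for `scriptworker.utils.retry_async` and the funsize download coroutine, not the actual patch:

```python
import asyncio
import logging

async def retry_async(func, attempts=5, sleeptime=0.1,
                      retry_exceptions=(asyncio.TimeoutError,)):
    """Call an async function, retrying on timeout with a growing delay
    (an illustrative stand-in for scriptworker.utils.retry_async)."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except retry_exceptions:
            if attempt == attempts:
                # Mirrors the "retry_async: ... too many retries!" warning.
                logging.warning("retry_async: %s: too many retries!", func)
                raise
            await asyncio.sleep(sleeptime * attempt)

async def flaky_download(_state={"calls": 0}):
    """Simulate a download that times out twice, then succeeds."""
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise asyncio.TimeoutError()
    return b"partial-mar-bytes"

print(asyncio.run(retry_async(flaky_download)))
```

The added logging in the real patch serves the same purpose as the warning branch here: recording how often downloads time out before giving up, so the underlying issue can be diagnosed rather than just papered over.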
Comment on attachment 8981797 [details]
Bug 1452927 Improve logging and retries for partials r=mtabara

Mihai Tabara [:mtabara]⌚️GMT has approved the revision.

https://phabricator.services.mozilla.com/D1467
Attachment #8981797 - Flags: review+
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9afdede8e0f
Improve logging and retries for partials r=mtabara
Keywords: leave-open

Beetmover occurrence: https://tools.taskcluster.net/groups/eeyI-m9lS8yCOvYPdgX6hA/tasks/HXYUD_-WTPiKpP-JDAIzWA/runs/0/logs/public%2Flogs%2Fchain_of_trust.log

2019-09-12T11:46:21 DEBUG - Reclaim task response:
{'credentials': '{********}',
 'runId': 0,
 'status': {'deadline': '2019-09-13T10:06:13.077Z',
            'expires': '2020-09-11T10:06:13.077Z',
            'provisionerId': 'scriptworker-k8s',
            'retriesLeft': 5,
            'runs': [{'reasonCreated': 'scheduled',
                      'runId': 0,
                      'scheduled': '2019-09-12T11:40:09.834Z',
                      'started': '2019-09-12T11:41:19.627Z',
                      'state': 'running',
                      'takenUntil': '2019-09-12T12:06:21.099Z',
                      'workerGroup': 'gecko-3-beetmover',
                      'workerId': 'gecko-3-beetmover-zp6byvbxt9krrnqsaxsb'}],
            'schedulerId': 'gecko-level-3',
            'state': 'running',
            'taskGroupId': 'eeyI-m9lS8yCOvYPdgX6hA',
            'taskId': 'HXYUD_-WTPiKpP-JDAIzWA',
            'workerType': 'gecko-3-beetmover'},
 'takenUntil': '2019-09-12T12:06:21.099Z',
 'workerGroup': 'gecko-3-beetmover',
 'workerId': 'gecko-3-beetmover-zp6byvbxt9krrnqsaxsb'}
2019-09-12T11:46:21 DEBUG - waiting 300 seconds before reclaiming...
2019-09-12T11:46:31 ERROR - SCRIPTWORKER_UNEXPECTED_EXCEPTION task
Traceback (most recent call last):
  File "/app/lib/python3.7/site-packages/scriptworker/worker.py", line 55, in do_run_task
    await run_cancellable(verify_chain_of_trust(chain))
  File "/app/lib/python3.7/site-packages/scriptworker/worker.py", line 158, in _run_cancellable
    result = await self.future
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 2063, in verify_chain_of_trust
    task_count = await verify_task_types(chain)
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 1813, in verify_task_types
    await valid_task_types[task_type](chain, obj)
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 1712, in verify_parent_task
    await verify_parent_task_definition(chain, link)
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 1572, in verify_parent_task_definition
    chain, parent_link, decision_link, tasks_for
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 1534, in get_jsone_context_and_template
    tmpl = await get_in_tree_template(decision_link)
  File "/app/lib/python3.7/site-packages/scriptworker/cot/verify.py", line 1377, in get_in_tree_template
    context.config["work_dir"], "{}_taskcluster.yml".format(link.name)
  File "/app/lib/python3.7/site-packages/scriptworker/utils.py", line 633, in load_json_or_yaml_from_url
    retry_exceptions=(DownloadError, aiohttp.ClientError),
  File "/app/lib/python3.7/site-packages/scriptworker/utils.py", line 261, in retry_async
    return await func(*args, **kwargs)
  File "/app/lib/python3.7/site-packages/scriptworker/utils.py", line 565, in download_file
    chunk = await resp.content.read(chunk_size)
  File "/app/lib/python3.7/site-packages/aiohttp/streams.py", line 369, in read
    await self._wait('read')
  File "/app/lib/python3.7/site-packages/aiohttp/streams.py", line 297, in _wait
    await waiter
  File "/app/lib/python3.7/site-packages/aiohttp/helpers.py", line 585, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
2019-09-12T11:46:31 DEBUG - "/app/artifacts/public/logs/chain_of_trust.log" is encoded with "None" and has mime/type "text/plain"
2019-09-12T11:46:31 INFO - "/app/artifacts/public/logs/chain_of_trust.log" can be gzip'd. Compressing...

I think the beetmover error is going to be unrelated, since it's different code, and beetmover was just moved to GCP. I'll see what we can dig up.

I think that'll be a separate infrastructure issue, and worth raising a new bug about. Perhaps ping :aki
Thanks!

Flags: needinfo?(sfraser)

Filed Bug 1624643.
Thank you!

See Also: → 1624643
Duplicate of this bug: 1624643