Closed
Bug 1450075
Opened 7 years ago
Closed 2 years ago
partials and repackage tasks are flaky
Categories
(Release Engineering :: Release Automation, defect)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: mozilla, Unassigned)
Attachments
(1 obsolete file)
We regularly need to rerun failing repackage and partials tasks during a release; needing 1-10 reruns of these tasks has become the norm rather than the exception.
We can attack this from two directions:
- in task:
  - if the task is failing on I/O, it should use retries with exponential backoff whenever possible.
  - if something else is failing, we should find and address the problem.
- in taskcluster:
  - if the task is failing in a way that's detectable, we should exit with a specific exit code. Then the task should retry on that exit code.
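A rough sketch of both directions, not taken from any existing task script. The helper names, the exit code value 4, and the choice of `OSError` as the "I/O failure" class are all assumptions made for illustration; the worker would separately have to be configured to rerun on that exit code (docker-worker's `onExitStatus.retry` payload list is the mechanism I'd expect, but I haven't verified it against these tasks).

```python
import random
import sys
import time

# Hypothetical exit code reserved for "transient failure, please rerun".
RETRY_EXIT_CODE = 4


def retry_with_backoff(func, attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small jitter keeps parallel tasks from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


def run_task(func, **retry_kwargs):
    """Run func with retries; return an exit code the worker could retry on."""
    try:
        retry_with_backoff(func, **retry_kwargs)
    except OSError as exc:
        print("transient I/O failure: %s" % exc, file=sys.stderr)
        return RETRY_EXIT_CODE
    return 0
```

A task script could then end with something like `sys.exit(run_task(fetch_artifacts))`, where `fetch_artifacts` is a placeholder for whatever I/O step is flaky. Non-I/O exceptions are deliberately left to propagate so real bugs still fail loudly.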
Example of a dying repackage task: https://taskcluster-artifacts.net/TVoDXGX6TJi15rfz-VoNMQ/0/public/logs/live_backing.log. It dies while retrieving artifacts for the artifact build.
[task 2018-03-29T18:33:05.643Z] 18:33:05 INFO - Error running mach:
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - ['artifact', 'toolchain', '-v', '--retry', '4', '--artifact-manifest', '/builds/worker/workspace/build/src/toolchains.json', '--cache-dir', '/builds/worker/tooltool-cache', 'public/build/dmg.tar.xz@Qtt6ENCrTXqwsW_Lz3cJBA', 'public/build/hfsplus-tools.tar.xz@Zc4cUPVqQ1a66IBeSqkY_g']
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - The error occurred in code that was called by the mach command. This is either
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - a bug in the called code itself or in the way that mach is calling it.
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - You should consider filing a bug for this issue.
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - If filing a bug, please include the full output of mach, including this error
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - message.
[task 2018-03-29T18:33:05.644Z] 18:33:05 INFO - The details of the failure are as follows:
[task 2018-03-29T18:33:05.645Z] 18:33:05 INFO - ConnectionError: HTTPSConnectionPool(host='cloud-mirror-production-us-east-1.s3.amazonaws.com', port=443): Max retries exceeded with url: /https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQtt6ENCrTXqwsW_Lz3cJBA%2F0%2Fpublic%2FchainOfTrust.json.asc (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f6dc091f350>: Failed to establish a new connection: [Errno 110] Connection timed out',))
Comment 1 • 7 years ago (Reporter)
For the artifact build issue above, we could either detect timeouts and retry this command [1] in mozharness, or we could have `./mach artifact toolchain` itself retry on timeouts.
[1] https://searchfox.org/mozilla-central/rev/7e663b9fa578d425684ce2560e5fa2464f504b34/testing/mozharness/scripts/repackage.py#82
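For the first option, a minimal sketch of "detect timeouts and retry this command", assuming we wrap the command from the outside and look for the strings seen in the log above. This is plain Python 3 `subprocess`, not mozharness's own helpers, and `run_with_timeout_retry`, `TRANSIENT_MARKERS`, and the delay values are made up for illustration.

```python
import subprocess
import time

# Assumed markers of a transient network failure, copied from the log above.
TRANSIENT_MARKERS = ("Connection timed out", "Max retries exceeded")


def run_with_timeout_retry(cmd, attempts=3, delay=10, sleep=time.sleep):
    """Run cmd; rerun it only if it failed and its output looks like a timeout.

    Returns (returncode, combined_output) from the last run.
    """
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        if proc.returncode == 0:
            return proc.returncode, proc.stdout
        transient = any(marker in proc.stdout for marker in TRANSIENT_MARKERS)
        # Give up on non-network failures, or once attempts are exhausted.
        if not transient or attempt == attempts:
            return proc.returncode, proc.stdout
        # Linear backoff between reruns: delay, 2*delay, ...
        sleep(delay * attempt)
```

Matching on output strings is fragile, which is an argument for the second option: having `./mach artifact toolchain` retry on the exception itself.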
Comment 2 • 7 years ago (Reporter)
A partials task also hit a timeout; we may want to add a retry:
https://taskcluster-artifacts.net/PIu7BemuTfO_h1EFior_iA/0/public/logs/live_backing.log
2018-03-29 19:07:15,197 - DEBUG - target-60.0b5.partial.mar: Finished
2018-03-29 19:07:15,197 - DEBUG - target-60.0b5.partial.mar:
Traceback (most recent call last):
  File "/home/worker/bin/funsize.py", line 511, in <module>
    main()
  File "/home/worker/bin/funsize.py", line 485, in main
    manifest = loop.run_until_complete(async_main(args, signing_certs))
  File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
    return future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/home/worker/bin/funsize.py", line 420, in async_main
    manifest = await asyncio.gather(*tasks)
  File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
    yield self # This tells Task to wait for completion.
  File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/home/worker/bin/funsize.py", line 295, in manage_partial
    await retry_download(f, dest)
  File "/home/worker/bin/funsize.py", line 103, in retry_download
    kwargs=kwargs
  File "/usr/local/lib/python3.5/dist-packages/scriptworker/utils.py", line 252, in retry_async
    return await func(*args, **kwargs)
  File "/home/worker/bin/funsize.py", line 116, in download
    chunk = await resp.content.read(4096)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 607, in read
    return (yield from super().read(n))
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 330, in read
    yield from self._wait('read')
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 259, in _wait
    yield from waiter
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/helpers.py", line 727, in __exit__
    raise asyncio.TimeoutError from None
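The traceback shows `asyncio.TimeoutError` escaping even though `download` is already called through `retry_async`; I haven't checked which exceptions that call treats as retryable. As a standalone sketch of the retry being suggested, assuming we simply want timeouts retried with backoff (`retry_on_timeout` and its parameters are hypothetical, not funsize or scriptworker API):

```python
import asyncio


async def retry_on_timeout(coro_func, *args, attempts=5, base_delay=1.0,
                           sleep=asyncio.sleep, **kwargs):
    """Await coro_func(*args, **kwargs), retrying when it times out."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_func(*args, **kwargs)
        except asyncio.TimeoutError:
            # Out of attempts: let the timeout propagate to the caller.
            if attempt == attempts:
                raise
            # Exponential backoff: base, 2*base, 4*base, ...
            await sleep(base_delay * 2 ** (attempt - 1))
```

In funsize this would wrap the download coroutine, e.g. `await retry_on_timeout(download, url, dest)`, with the caveat that a partially written `dest` would need to be truncated or rewritten on each attempt.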
Comment 3 • 7 years ago (Reporter)
Hm, another repackage failure:
[task 2018-03-29T18:24:20.502Z] 18:24:20 INFO - /builds/worker/workspace/build/src/configure
[task 2018-03-29T18:24:20.582Z] 18:24:20 INFO - Creating Python environment
[task 2018-03-29T18:24:21.846Z] 18:24:21 INFO - New python executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python2.7
[task 2018-03-29T18:24:21.846Z] 18:24:21 INFO - Also creating executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python
[task 2018-03-29T18:24:21.846Z] 18:24:21 INFO - Installing setuptools, pip, wheel...done.
[task 2018-03-29T18:24:22.023Z] 18:24:22 INFO - WARNING: Python.h not found. Install Python development headers.
[task 2018-03-29T18:24:22.023Z] 18:24:22 INFO - Error processing command. Ignoring because optional. (optional:setup.py:third_party/python/psutil:build_ext:--inplace)
[task 2018-03-29T18:24:22.024Z] 18:24:22 INFO - Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[taskcluster 2018-03-29 19:24:09.729Z] === Task Finished ===
[taskcluster 2018-03-29 19:24:09.730Z] Unsuccessful task run with exit code: -1 completed in 3601.732 seconds
This one appears to have hung for an hour with no output.
Comment 8 • 7 years ago
Updated • 7 years ago
Attachment #8983454 - Attachment is obsolete: true
Comment 9 • 2 years ago
Looks like this was fixed at some point or started getting starred somewhere else. Let's resolve and open a new bug for any future flakiness.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID