Closed Bug 1450075 Opened 7 years ago Closed 2 years ago

partials and repackage tasks are flaky

Categories

(Release Engineering :: Release Automation, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mozilla, Unassigned)

References

Details

Attachments

(1 obsolete file)

We regularly need to rerun failing repackage and partials tasks during a release; 1-10 reruns of these tasks has become the norm rather than the exception. We can attack this in two directions:

- in task:
  - if the task is failing on i/o, it should use retries with exponential backoff whenever possible.
  - if something else is failing, we should find and address the problem.
- in taskcluster:
  - if the task is failing in a way that's detectable, we should exit with a specific exit code. Then the task should retry on that exit code.

https://taskcluster-artifacts.net/TVoDXGX6TJi15rfz-VoNMQ/0/public/logs/live_backing.log is an example of a dying repackage task. It's dying while retrieving artifacts for the artifact build:

[task 2018-03-29T18:33:05.643Z] 18:33:05     INFO - Error running mach:
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO -     ['artifact', 'toolchain', '-v', '--retry', '4', '--artifact-manifest', '/builds/worker/workspace/build/src/toolchains.json', '--cache-dir', '/builds/worker/tooltool-cache', 'public/build/dmg.tar.xz@Qtt6ENCrTXqwsW_Lz3cJBA', 'public/build/hfsplus-tools.tar.xz@Zc4cUPVqQ1a66IBeSqkY_g']
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - The error occurred in code that was called by the mach command. This is either
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - a bug in the called code itself or in the way that mach is calling it.
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - You should consider filing a bug for this issue.
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - If filing a bug, please include the full output of mach, including this error
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - message.
[task 2018-03-29T18:33:05.644Z] 18:33:05     INFO - The details of the failure are as follows:
[task 2018-03-29T18:33:05.645Z] 18:33:05     INFO - ConnectionError: HTTPSConnectionPool(host='cloud-mirror-production-us-east-1.s3.amazonaws.com', port=443): Max retries exceeded with url: /https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQtt6ENCrTXqwsW_Lz3cJBA%2F0%2Fpublic%2FchainOfTrust.json.asc (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f6dc091f350>: Failed to establish a new connection: [Errno 110] Connection timed out',))
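The "retries with exponential backoff" idea above can be sketched as a small helper. This is a minimal illustration, not the existing mozharness or funsize code; the `retry_io` name and its parameters are made up for this example. Jitter is included so many tasks hitting the same flaky endpoint don't all retry in lockstep.

```python
import random
import time


def retry_io(func, attempts=5, base_delay=1.0, max_delay=60.0):
    """Call func(), retrying on OSError (i/o failures, connection timeouts)
    with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except OSError:
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay *= random.uniform(0.5, 1.5)  # jitter spreads out retry storms
            time.sleep(delay)


# Example: a flaky operation that fails twice, then succeeds on the third try.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection timed out")
    return "ok"

print(retry_io(flaky, base_delay=0.01))  # prints "ok" after two retries
```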
For the artifact build issue above, we could either detect timeouts and retry this command [1] in mozharness, or we could have `./mach artifact toolchain` itself retry on timeouts.

[1] https://searchfox.org/mozilla-central/rev/7e663b9fa578d425684ce2560e5fa2464f504b34/testing/mozharness/scripts/repackage.py#82
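The coarser of the two options, retrying the whole command from mozharness, might look roughly like this. The helper name and arguments are hypothetical; this is a sketch of the approach, not the actual mozharness `run_command` retry machinery.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=4, base_delay=2.0):
    """Run a command; on nonzero exit, back off exponentially and rerun it.

    A blunt instrument compared to teaching `./mach artifact toolchain`
    about timeouts, but it covers any transient failure mode.
    """
    for attempt in range(attempts):
        proc = subprocess.run(cmd)
        if proc.returncode == 0:
            return
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("command failed after %d attempts: %r" % (attempts, cmd))


run_with_retries(["true"])  # succeeds on the first try
```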
Partial task also hit a timeout; we may want to add a retry:
https://taskcluster-artifacts.net/PIu7BemuTfO_h1EFior_iA/0/public/logs/live_backing.log

2018-03-29 19:07:15,197 - DEBUG - target-60.0b5.partial.mar: Finished
2018-03-29 19:07:15,197 - DEBUG - target-60.0b5.partial.mar: Traceback (most recent call last):
  File "/home/worker/bin/funsize.py", line 511, in <module>
    main()
  File "/home/worker/bin/funsize.py", line 485, in main
    manifest = loop.run_until_complete(async_main(args, signing_certs))
  File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
    return future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/home/worker/bin/funsize.py", line 420, in async_main
    manifest = await asyncio.gather(*tasks)
  File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/home/worker/bin/funsize.py", line 295, in manage_partial
    await retry_download(f, dest)
  File "/home/worker/bin/funsize.py", line 103, in retry_download
    kwargs=kwargs
  File "/usr/local/lib/python3.5/dist-packages/scriptworker/utils.py", line 252, in retry_async
    return await func(*args, **kwargs)
  File "/home/worker/bin/funsize.py", line 116, in download
    chunk = await resp.content.read(4096)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 607, in read
    return (yield from super().read(n))
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 330, in read
    yield from self._wait('read')
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/streams.py", line 259, in _wait
    yield from waiter
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/helpers.py", line 727, in __exit__
    raise asyncio.TimeoutError from None
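The traceback shows the timeout escaping through funsize's `retry_download` / scriptworker's `retry_async`, which suggests `asyncio.TimeoutError` isn't in the set of retried exceptions there. A minimal sketch of an async retry wrapper that does catch it, with made-up names (this is not the scriptworker implementation, whose signature differs):

```python
import asyncio


async def retry_async(coro_fn, attempts=5, base_delay=1.0):
    """Await coro_fn(); on asyncio.TimeoutError, back off and try again."""
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except asyncio.TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries
            await asyncio.sleep(base_delay * (2 ** attempt))


# Example: a stand-in for the download step that times out twice, then succeeds.
state = {"tries": 0}

async def fake_download():
    state["tries"] += 1
    if state["tries"] < 3:
        raise asyncio.TimeoutError
    return b"partial-mar-bytes"

result = asyncio.run(retry_async(fake_download, base_delay=0))
print(result)  # b'partial-mar-bytes'
```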
Hm, another repackage:

[task 2018-03-29T18:24:20.502Z] 18:24:20     INFO - /builds/worker/workspace/build/src/configure
[task 2018-03-29T18:24:20.582Z] 18:24:20     INFO - Creating Python environment
[task 2018-03-29T18:24:21.846Z] 18:24:21     INFO - New python executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python2.7
[task 2018-03-29T18:24:21.846Z] 18:24:21     INFO - Also creating executable in /builds/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python
[task 2018-03-29T18:24:21.846Z] 18:24:21     INFO - Installing setuptools, pip, wheel...done.
[task 2018-03-29T18:24:22.023Z] 18:24:22     INFO - WARNING: Python.h not found. Install Python development headers.
[task 2018-03-29T18:24:22.023Z] 18:24:22     INFO - Error processing command. Ignoring because optional. (optional:setup.py:third_party/python/psutil:build_ext:--inplace)
[task 2018-03-29T18:24:22.024Z] 18:24:22     INFO - Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[taskcluster 2018-03-29 19:24:09.729Z] === Task Finished ===
[taskcluster 2018-03-29 19:24:09.730Z] Unsuccessful task run with exit code: -1 completed in 3601.732 seconds

This one appears to have hung for an hour with no output.
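For the taskcluster-side idea from comment 0 (exit with a specific code and have the worker retry on it), the task script would map detectable transient failures to an agreed-upon exit status. The code value and the worker setting name here are hypothetical; the real retry-on-exit-status knob lives in the worker payload configuration.

```python
# Hypothetical retryable exit code; the worker payload would list it in a
# retry-on-exit-status setting so the queue reruns the task automatically
# instead of marking the run failed.
EXIT_TRANSIENT = 17


def run_task():
    """Stand-in for the real task body; fails with a transient network error."""
    raise TimeoutError("artifact download timed out")


def task_exit_code():
    """Map detectable transient failures to the retryable exit code."""
    try:
        run_task()
        return 0
    except TimeoutError:
        return EXIT_TRANSIENT  # worker retries this run
    except Exception:
        return 1  # genuine failure: no retry


print(task_exit_code())  # prints 17
# A real script would end with: sys.exit(task_exit_code())
```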
Blocks: 1461919
Attachment #8983454 - Attachment is obsolete: true

Looks like this was fixed at some point or started getting starred somewhere else. Let's resolve and open a new bug for any future flakiness.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
