Closed Bug 1832206 Opened 1 year ago Closed 9 months ago

Intermittent beetmover asyncio.exceptions.TimeoutError: Operation did not complete within the designated timeout.

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Filed by: sstanca [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=415163147&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KDWRBjwOQK2_Loo6jXAyCw/runs/0/artifacts/public/logs/live_backing.log


parent: "projects/moz-fx-productdelivery-pr-38b5/locations/us/repositories/mozilla"

2023-05-10 00:42:13,967 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 0.9s ...
2023-05-10 00:42:15,083 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 0.5s ...
2023-05-10 00:42:15,845 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 3.6s ...
2023-05-10 00:42:19,708 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 6.5s ...
2023-05-10 00:42:26,471 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 0.3s ...
2023-05-10 00:42:27,008 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 12.4s ...
2023-05-10 00:42:39,585 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 21.7s ...
2023-05-10 00:43:01,501 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 18.6s ...
2023-05-10 00:43:20,344 - google.api_core.retry_async - DEBUG - Retrying due to , sleeping 53.4s ...
Traceback (most recent call last):
  File "/app/lib/python3.9/site-packages/google/api_core/future/async_future.py", line 76, in _done_or_raise
    async def _done_or_raise(self):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/lib/python3.9/site-packages/google/api_core/retry_async.py", line 121, in retry_target
    return await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/lib/python3.9/site-packages/google/api_core/future/async_future.py", line 101, in _blocking_poll
    await retry_(self._done_or_raise)()
  File "/app/lib/python3.9/site-packages/google/api_core/retry_async.py", line 223, in retry_wrapped_func
    return await retry_target(
  File "/app/lib/python3.9/site-packages/google/api_core/retry_async.py", line 140, in retry_target
    raise exceptions.RetryError(
google.api_core.exceptions.RetryError: Timeout of 120.0s exceeded while calling target function, last exception: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/bin/beetmoverscript", line 8, in <module>
    sys.exit(main())
  File "/app/lib/python3.9/site-packages/beetmoverscript/script.py", line 693, in main
    client.sync_main(async_main, config_path=config_path, default_config=default_config, should_validate_task=False)
  File "/app/lib/python3.9/site-packages/scriptworker/client.py", line 166, in sync_main
    loop.run_until_complete(_handle_asyncio_loop(async_main, context))
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/app/lib/python3.9/site-packages/scriptworker/client.py", line 205, in _handle_asyncio_loop
    await async_main(context)
  File "/app/lib/python3.9/site-packages/beetmoverscript/script.py", line 345, in async_main
    await action_map[context.action](context)
  File "/app/lib/python3.9/site-packages/beetmoverscript/gcloud.py", line 172, in import_from_gcs_to_artifact_registry
    result = await async_operation.result()
  File "/app/lib/python3.9/site-packages/google/api_core/future/async_future.py", line 122, in result
    await self._blocking_poll(timeout=timeout)
  File "/app/lib/python3.9/site-packages/google/api_core/future/async_future.py", line 103, in _blocking_poll
    raise asyncio.TimeoutError(
asyncio.exceptions.TimeoutError: Operation did not complete within the designated timeout.
exit code: 1
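
For reference, the traceback above ends in beetmoverscript's import_from_gcs_to_artifact_registry, which awaits async_operation.result(); the polling retry in google.api_core gives up after its 120.0s deadline and surfaces the asyncio.TimeoutError. A minimal sketch of one way to give the import more time to finish, assuming the 120s deadline is the library default and relying on the fact (visible in the traceback, async_future.py line 122) that result(timeout=...) forwards its timeout to _blocking_poll(); the helper name and timeout value below are illustrative only:

    import asyncio
    import logging

    log = logging.getLogger(__name__)

    # Assumption: 120s is the default polling deadline; a larger explicit
    # timeout is used here purely for illustration.
    IMPORT_RESULT_TIMEOUT = 600

    async def wait_for_import(async_operation):
        """Await a long-running Artifact Registry import with a longer deadline."""
        try:
            # result(timeout=...) is forwarded to _blocking_poll(timeout=timeout),
            # per the traceback above, so the retry loop polls for longer before
            # raising asyncio.TimeoutError.
            return await async_operation.result(timeout=IMPORT_RESULT_TIMEOUT)
        except asyncio.TimeoutError:
            log.error(
                "Artifact Registry import still running after %ss",
                IMPORT_RESULT_TIMEOUT,
            )
            raise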

Update:

There have been 33 total failures within the last 7 days, all of them Firefox Release Tasks opt.

The failure first appeared on the 9th of May, with increased frequency starting from the 18th of May.

Ben, any chance this was triggered by Bug 1831881? It landed around the same time the first failure of this kind appeared.
Thank you.

Flags: needinfo?(bhearsum)
Summary: Intermittent asyncio.exceptions.TimeoutError: Operation did not complete within the designated timeout. → Intermittent beetmover asyncio.exceptions.TimeoutError: Operation did not complete within the designated timeout.
Whiteboard: [stockwell needswork:owner]

(In reply to Natalia Csoregi [:nataliaCs] from comment #6)

> Update:
>
> There have been 33 total failures within the last 7 days, all of them Firefox Release Tasks opt.
>
> The failure first appeared on the 9th of May, with increased frequency starting from the 18th of May.
>
> Ben, any chance this was triggered by Bug 1831881? It landed around the same time the first failure of this kind appeared.
> Thank you.

I doubt that increasing the overall timeout of push-to-releases has affected this. The linked job is a beetmover-apt one, which is new. It's more likely that something in that job's configuration, or the server it pushes to, is causing this.

Gabriel - can you take a look?

Flags: needinfo?(bhearsum) → needinfo?(gabriel)

Ah. These APT tasks are busted because of a change in the structure of beetmover permissions. I'd expect some sort of 403, but "Retrying due to <empty string>" is not very informative... I need to patch beetmover so these APT tasks stop failing, but since they are leaf tasks this hasn't been an issue for releases.

Flags: needinfo?(gabriel)

Never mind. I see that some tasks worked, so it must be something else. I will do some digging.

I did some digging. I think we could make this more reliable by sending a single request to the GCP API that lists all the .debs in storage to import into the APT repo. Right now we make several concurrent requests to the Artifact Registry API because of the Taskcluster dependency limit (the beetmover APT task can't depend on all of the repackage beetmover tasks). I wonder how I could get around the dependency limit to consolidate all the .deb GCS storage paths into one Artifact Registry payload; a rough sketch of that idea follows below.
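
A rough sketch of what that consolidated request could look like, assuming the google-cloud-artifact-registry v1 async client (the repository path in the log above is an Artifact Registry parent). The function name, GCS prefix, and wildcard approach are hypothetical; the real beetmover payload, task wiring, and permissions may differ:

    from google.cloud import artifactregistry_v1

    async def import_all_debs(parent: str, gcs_prefix: str):
        # parent is the repository path seen in the log, e.g.
        # "projects/moz-fx-productdelivery-pr-38b5/locations/us/repositories/mozilla"
        client = artifactregistry_v1.ArtifactRegistryAsyncClient()
        request = artifactregistry_v1.ImportAptArtifactsRequest(
            parent=parent,
            gcs_source=artifactregistry_v1.ImportAptArtifactsGcsSource(
                # One wildcard URI instead of one URI per upstream task, so
                # the APT task does not have to depend on every repackage
                # beetmover task to learn the individual .deb paths.
                uris=[f"{gcs_prefix}/**/*.deb"],
                use_wildcards=True,
            ),
        )
        operation = await client.import_apt_artifacts(request=request)
        # Single long-running operation to await instead of several
        # concurrent ones competing against the polling deadline.
        return await operation.result()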

I am moving this to the backlog ("P3") because I think I can improve the situation.

I bumped the severity to "S3" because to releng "S3" means "Issues that are either intermittent or not too cumbersome to manually work around."

Severity: S4 → S3
Priority: P5 → P3
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → INCOMPLETE