[meta] High frequency [taskcluster:error] Task timeout after 2700 seconds / 1800 seconds / 3600 seconds / 7200 seconds. Force killing container. -> Hang during various stages of the build
Categories
(Firefox Build System :: Task Configuration, defect, P1)
Tracking
(firefox-esr102 fixed, firefox112 fixed, firefox113 fixed)
People
(Reporter: intermittent-bug-filer, Assigned: jcristau)
References
(Depends on 1 open bug)
Details
(Keywords: intermittent-failure, meta, Whiteboard: [stockwell infra])
Attachments
(10 files)
10 × 48 bytes, text/x-phabricator-request (Phabricator review requests)
Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=408420613&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SxIBvUG0T0aoVj9ZBkrtwg/runs/0/artifacts/public/logs/live_backing.log
[taskcluster 2023-03-09 17:21:13.355Z] === Task Starting ===
* Starting system message bus dbus
[setup 2023-03-09T17:21:16.027Z] run-task started in /builds/worker
[cache 2023-03-09T17:21:16.030Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[cache 2023-03-09T17:21:16.030Z] cache /builds/worker/tooltool-cache is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2023-03-09T17:21:16.030Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2023-03-09T17:21:16.030Z] volume /builds/worker/checkouts is a cache
[volume 2023-03-09T17:21:16.030Z] volume /builds/worker/tooltool-cache is a cache
[volume 2023-03-09T17:21:16.030Z] changing ownership of volume /builds/worker/workspace to 1000:1000
[setup 2023-03-09T17:21:16.030Z] running as worker:worker
[vcs 2023-03-09T17:21:16.030Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2023-03-09T17:21:16.127Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2023-03-09T17:21:16.127Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '8aea0e78341420231309f814d26a9819e600c94f', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/checkouts/gecko']
[vcs 2023-03-09T17:21:16.479Z] (using Mercurial 5.8.1)
[vcs 2023-03-09T17:21:16.480Z] ensuring https://hg.mozilla.org/integration/autoland@8aea0e78341420231309f814d26a9819e600c94f is available at /builds/worker/checkouts/gecko
[vcs 2023-03-09T17:21:17.007Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2023-03-09T17:21:17.724Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2023-03-09T17:21:18.377Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/4d5e640759776d678d17822c70cece3e395e72c1.stream-v2.hg
[vcs 2023-03-09T17:21:18.460Z] 703336 files to transfer, 4.39 GB of data
[vcs 2023-03-09T17:21:20.466Z]
[vcs 2023-03-09T17:21:21.463Z] clone [=> ] 269189773/4713057607 34s
[vcs 2023-03-09T17:21:22.463Z] clone [==> ] 359632057/4713057607 37s
<...>
[vcs 2023-03-09T17:22:25.470Z] clone [==================================> ] 3673829845/4713057607 19s
[vcs 2023-03-09T17:22:26.470Z] clone [==================================> ] 3762565975/4713057607 17s
[vcs 2023-03-09T17:22:27.502Z] clone [===================================> ] 3823083326/4713057607 16s
[taskcluster:error] Task timeout after 2700 seconds. Force killing container.
[taskcluster 2023-03-09 18:06:13.764Z] === Task Finished ===
[taskcluster 2023-03-09 18:06:13.765Z] Unsuccessful task run with exit code: -1 completed in 2779.496 seconds
Comment 1•1 year ago
Hi Andrew! Can you please take a look at this? In the last few days we have seen an increase in the frequency of "run" build bustages.
They fail like this or with failure lines that would go to Bug 1705852.
Thank you!
Comment 2•1 year ago
From what I can see so far, the ones that fail because of the hang during cloning are all on Android 5.0 x86 pgo.
And the ones that fail with
[task 2023-03-09T13:43:26.860Z] adb Using adb 1.0.41
[task 2023-03-09T13:48:26.863Z] 13:48:26 INFO - Sleeping 10 seconds
[task 2023-03-09T13:48:36.867Z] 13:48:36 INFO - >> Verify Android boot completed: Attempt #2 of 30
[task 2023-03-09T13:48:36.872Z] adb Using adb 1.0.41
[taskcluster:error] Task timeout after 2700 seconds. Force killing container.
[taskcluster 2023-03-09 13:53:02.734Z] === Task Finished ===
[taskcluster 2023-03-09 13:53:02.735Z] Unsuccessful task run with exit code: -1 completed in 2780.163 seconds
are on Android 5.0 x86-64 pgo.
Comment 3•1 year ago
Updated•1 year ago
Updated•1 year ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 7•1 year ago
Could these hangs be explained by the upgrade of Python to version 3.7 in bug 1734402? The failures initially got classified against other, existing bugs for intermittent task timeouts.
Comment hidden (Intermittent Failures Robot)
Comment 9•1 year ago
It's possible. The build script for mercurial 5.8.1 was slightly altered to work with python 3.7 for ubuntu1804. There could just be an unknown incompatibility there causing the hangs. I will investigate; I'll start by trying to build different (newer) versions of mercurial to see if that changes the behavior here.
Updated•1 year ago
Updated•1 year ago
Comment hidden (Intermittent Failures Robot)
Comment 14•1 year ago
The severity field for this bug is set to S4. However, the following bug duplicate has higher severity:
- Bug 1821537: S2
:ahal, could you consider increasing the severity of this bug to S2?
For more information, please visit the auto_nag documentation.
Assignee
Comment 15•1 year ago
Updated•1 year ago
Assignee
Updated•1 year ago
Comment 16•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/1e48ac78e98c add worker image/pool for testing. r=MasterWayZ
Assignee
Comment 17•1 year ago
For some reason b-linux-gcp works fine with a 20GB disk, but we were getting errors:
Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '20'. Requested disk size cannot be smaller than the image size (75 GB)
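For context, a minimal sketch of where this constraint comes from, using the google-cloud-compute Python client (this is not how ci-configuration provisions workers, and the project/image names are hypothetical; the point is that a boot disk's diskSizeGb must be at least the source image's disk size):

from google.cloud import compute_v1

image_client = compute_v1.ImagesClient()
# Hypothetical project and image names, for illustration only.
image = image_client.get(project="some-project", image="docker-worker-image")

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image=image.self_link,
        # GCE rejects values below image.disk_size_gb, which is why a
        # 20 GB request failed against a 75 GB image.
        disk_size_gb=max(20, image.disk_size_gb),
    ),
)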
Comment 18•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/da908765ecd3 bump disk size for test pool.
Updated•1 year ago
Comment hidden (Intermittent Failures Robot)
Comment 20•1 year ago
Comment 21•1 year ago
Pushed by mgoossens@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/d444721848b0 Revert disk size change and use new image r=jcristau
Comment hidden (Intermittent Failures Robot)
Updated•1 year ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 26•1 year ago
Comment 27•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/90efbd04c604 update l1 docker-worker image. r=MasterWayZ
Assignee
Updated•1 year ago
Assignee
Comment 28•1 year ago
Imported from version-control-tools revision 0698b5050920.
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 30•1 year ago
So far I've been able to reproduce hangs during robustcheckout; a combination of updating robustcheckout to include the change from bug 1822044 and updating mercurial per bug 1822044 comment 3 will hopefully resolve that.
I haven't yet been able to root-cause the hanging connections, or to reproduce other causes of hangs, though.
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 32•1 year ago
Assignee
Comment 33•1 year ago
Cherry-pick from https://www.mercurial-scm.org/repo/hg/rev/3bb7c56e8fe6
Comment 34•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/5e146d52ea54 update l3 docker-worker image. r=MasterWayZ
Assignee
Comment 35•1 year ago
Setting a timeout on the download means we can retry if the connection hangs, instead of sitting around idle until the task itself hits its maxRunTime.
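In outline, the approach amounts to something like the following minimal sketch (this is not the actual fetch-content code; the timeout, chunk size, and retry count are illustrative):

import time
import urllib.request

def download(url, dest, timeout=60, retries=5):
    """Download url to dest, retrying when the connection stalls."""
    for attempt in range(retries):
        try:
            # The timeout covers the connect and each blocking read, so a
            # silently hung connection raises an exception instead of idling
            # until the task's maxRunTime force-kills the container.
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                 open(dest, "wb") as out:
                while chunk := resp.read(64 * 1024):
                    out.write(chunk)
            return
        except OSError as exc:  # URLError and socket.timeout are OSErrors
            print(f"attempt {attempt + 1}/{retries} failed: {exc}")
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"could not download {url} after {retries} attempts")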
Comment hidden (Intermittent Failures Robot)
Comment 37•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b3d278d4a6d8 update robustcheckout hg extension. r=releng-reviewers,bhearsum
Comment 38•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/528f46a1abf8 patch mercurial debian package to make http.timeout option work r=releng-reviewers,bhearsum
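For reference, the option being fixed is Mercurial's http.timeout setting, which bounds how long each HTTP request may take; a minimal hgrc excerpt (the 60-second value is illustrative, not necessarily what the CI images use):

[http]
timeout = 60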
Comment 39•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c7a5550cc22f set a timeout for fetch-content downloads. r=releng-reviewers,gbrown
Assignee
Comment 40•1 year ago
Assignee
Comment 41•1 year ago
The patches from comments 37-39 should help with the most common issues we've seen here. If that's confirmed, I should be able to move on to the remaining places where we see timeouts next week.
Comment 42•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ebb447880cf2 bump decision docker image version. r=releng-reviewers,gbrown DONTBUILD
Comment 43•1 year ago
bugherder
Comment 44•1 year ago
bugherder
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 46•1 year ago
It looks like this is pretty much gone now. I see 2 tasks starred with this bug so far this week, neither of which actually shows a hang; they're just tests that legitimately ran longer than their maxRunTime.
Assignee
Comment 47•1 year ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-esr102/rev/13caf33f49e6
https://hg.mozilla.org/releases/mozilla-esr102/rev/f814b6f367ea
https://hg.mozilla.org/releases/mozilla-esr102/rev/e402ffc75cb5
Assignee
Comment 48•1 year ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/af658fd3f204
https://hg.mozilla.org/releases/mozilla-beta/rev/3bfad2337556
https://hg.mozilla.org/releases/mozilla-beta/rev/3a45228e40b6
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 50•1 year ago
Comment 51•1 year ago
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/981118187641 remove test pool, no longer necessary. r=MasterWayZ
Assignee
Comment 52•1 year ago
Calling this done.
Updated•1 year ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)