Closed Bug 1821513 Opened 1 year ago Closed 1 year ago

[meta] High frequency [taskcluster:error] Task timeout after 2700 seconds / 1800 seconds / 3600 seconds / 7200 seconds. Force killing container. -> Hang during various stages of the build

Categories

(Firefox Build System :: Task Configuration, defect, P1)

defect

Tracking

(firefox-esr102 fixed, firefox112 fixed, firefox113 fixed)

RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr102 --- fixed
firefox112 --- fixed
firefox113 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jcristau)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, meta, Whiteboard: [stockwell infra])

Attachments

(10 files)

10 attachments, 48 bytes each, text/x-phabricator-request

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=408420613&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SxIBvUG0T0aoVj9ZBkrtwg/runs/0/artifacts/public/logs/live_backing.log


[taskcluster 2023-03-09 17:21:13.355Z] === Task Starting ===
 * Starting system message bus dbus       
[setup 2023-03-09T17:21:16.027Z] run-task started in /builds/worker
[cache 2023-03-09T17:21:16.030Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[cache 2023-03-09T17:21:16.030Z] cache /builds/worker/tooltool-cache is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2023-03-09T17:21:16.030Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2023-03-09T17:21:16.030Z] volume /builds/worker/checkouts is a cache
[volume 2023-03-09T17:21:16.030Z] volume /builds/worker/tooltool-cache is a cache
[volume 2023-03-09T17:21:16.030Z] changing ownership of volume /builds/worker/workspace to 1000:1000
[setup 2023-03-09T17:21:16.030Z] running as worker:worker
[vcs 2023-03-09T17:21:16.030Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2023-03-09T17:21:16.127Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2023-03-09T17:21:16.127Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '8aea0e78341420231309f814d26a9819e600c94f', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/checkouts/gecko']
[vcs 2023-03-09T17:21:16.479Z] (using Mercurial 5.8.1)
[vcs 2023-03-09T17:21:16.480Z] ensuring https://hg.mozilla.org/integration/autoland@8aea0e78341420231309f814d26a9819e600c94f is available at /builds/worker/checkouts/gecko
[vcs 2023-03-09T17:21:17.007Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2023-03-09T17:21:17.724Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2023-03-09T17:21:18.377Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/4d5e640759776d678d17822c70cece3e395e72c1.stream-v2.hg
[vcs 2023-03-09T17:21:18.460Z] 703336 files to transfer, 4.39 GB of data
[vcs 2023-03-09T17:21:20.466Z] 
[vcs 2023-03-09T17:21:21.463Z] clone [=>                                           ]  269189773/4713057607 34s
[vcs 2023-03-09T17:21:22.463Z] clone [==>                                          ]  359632057/4713057607 37s
<...>
[vcs 2023-03-09T17:22:25.470Z] clone [==================================>          ] 3673829845/4713057607 19s
[vcs 2023-03-09T17:22:26.470Z] clone [==================================>          ] 3762565975/4713057607 17s
[vcs 2023-03-09T17:22:27.502Z] clone [===================================>         ] 3823083326/4713057607 16s
[taskcluster:error] Task timeout after 2700 seconds. Force killing container.
[taskcluster 2023-03-09 18:06:13.764Z] === Task Finished ===
[taskcluster 2023-03-09 18:06:13.765Z] Unsuccessful task run with exit code: -1 completed in 2779.496 seconds

Hi Andrew! Can you please take a look at this? In the last few days we have seen an increase in the frequency of "run" build bustages.
They fail either like this or with failure lines that would be classified under Bug 1705852.
Thank you!

Flags: needinfo?(ahal)

From what I can see so far, the tasks that fail because of the hang during cloning are all on Android 5.0 x86 pgo.
And the ones that fail with

[task 2023-03-09T13:43:26.860Z] adb Using adb 1.0.41
[task 2023-03-09T13:48:26.863Z] 13:48:26     INFO - Sleeping 10 seconds
[task 2023-03-09T13:48:36.867Z] 13:48:36     INFO - >> Verify Android boot completed: Attempt #2 of 30
[task 2023-03-09T13:48:36.872Z] adb Using adb 1.0.41
[taskcluster:error] Task timeout after 2700 seconds. Force killing container.
[taskcluster 2023-03-09 13:53:02.734Z] === Task Finished ===
[taskcluster 2023-03-09 13:53:02.735Z] Unsuccessful task run with exit code: -1 completed in 2780.163 seconds

are on Android 5.0 x86-64 pgo.

It looks like this is not just Android-related; I found the same hang on OSX and Windows.

Summary: Frequent Android [taskcluster:error] Task timeout after 2700 seconds. Force killing container. -> Hang during cloning phase. → Frequent [taskcluster:error] Task timeout after 2700 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. -> Hang during cloning phase.
See Also: → 1821537
Summary: Frequent [taskcluster:error] Task timeout after 2700 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. -> Hang during cloning phase. → Frequent [taskcluster:error] Task timeout after 2700 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. / Task timeout after 3600 seconds. -> Hang during cloning phase.
Summary: Frequent [taskcluster:error] Task timeout after 2700 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. / Task timeout after 3600 seconds. -> Hang during cloning phase. → Frequent [taskcluster:error] Task timeout after 2700 seconds. / Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. -> Hang during cloning phase.
See Also: → 1821586

Could these hangs be explained by the upgrade of Python to version 3.7 in bug 1734402? The failures were initially classified against other, existing bugs for intermittent task timeouts.

Flags: needinfo?(ahochheiden)

It's possible. The build script for mercurial 5.8.1 was slightly altered to work with python 3.7 for ubuntu1804. There could just be an unknown incompatibility there causing the hangs. I will investigate. I'll start by trying to build different (newer) versions of mercurial to see if that changes the behavior here.

Flags: needinfo?(ahochheiden)
Flags: needinfo?(ahal)
See Also: → 1822427
Summary: Frequent [taskcluster:error] Task timeout after 2700 seconds. / Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container. -> Hang during cloning phase. → High frequency [taskcluster:error] Task timeout after 2700 seconds / 1800 seconds / 3600 seconds / 7200 seconds. Force killing container. -> Hang during various stages of the build
Duplicate of this bug: 1821537
Duplicate of this bug: 1821586
Duplicate of this bug: 1822427
Summary: High frequency [taskcluster:error] Task timeout after 2700 seconds / 1800 seconds / 3600 seconds / 7200 seconds. Force killing container. -> Hang during various stages of the build → [meta] High frequency [taskcluster:error] Task timeout after 2700 seconds / 1800 seconds / 3600 seconds / 7200 seconds. Force killing container. -> Hang during various stages of the build

The severity field for this bug is set to S4. However, the following bug duplicate has higher severity:

:ahal, could you consider increasing the severity of this bug to S2?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ahal)
Assignee: nobody → jcristau
Status: NEW → ASSIGNED
Severity: S4 → S2
Flags: needinfo?(ahal)
Priority: P5 → P1
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/1e48ac78e98c
add worker image/pool for testing. r=MasterWayZ

For some reason b-linux-gcp works fine with a 20 GB disk, but we were getting errors:
Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '20'. Requested disk size cannot be smaller than the image size (75 GB)

Whiteboard: [stockwell disable-recommended]
See Also: → 1822798
Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/d444721848b0
Revert disk size change and use new image r=jcristau
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/90efbd04c604
update l1 docker-worker image. r=MasterWayZ
Keywords: leave-open

Imported from version-control-tools revision 0698b5050920.

So far I've been able to reproduce hangs during robustcheckout; a combination of updating robustcheckout to include the change from bug 1822044 and updating Mercurial per bug 1822044 comment 3 will hopefully resolve that.
I haven't yet been able to root-cause the hanging connections, or to reproduce the other kinds of hangs, though.
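
For context, the Mercurial piece of this relies on a client-side HTTP timeout, so a stalled clone connection gets aborted and retried rather than hanging until the task is killed. A minimal sketch of the kind of configuration involved, assuming Mercurial's http.timeout option (the value below is illustrative, not the one used in CI; the debian package patch pushed later in this bug is what makes the option take effect):

[http]
# Give up on a stalled HTTP request after this many seconds so that
# robustcheckout's retry logic can re-attempt the operation.
timeout = 60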

Depends on: 1822044
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/5e146d52ea54
update l3 docker-worker image. r=MasterWayZ
See Also: → 1823936

Setting a timeout on the download means we can retry if the connection
hangs, instead of sitting around idle until the task itself hits its
maxRunTime.
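
As an illustration of that approach only (a rough sketch, not the actual fetch-content code; the helper name and values are made up):

import time
import urllib.request

def download_with_timeout(url, dest, timeout=60, attempts=5):
    # Hypothetical helper, not fetch-content's real implementation.
    # The read timeout turns a stalled connection into an exception,
    # so the download can be retried well before the task's maxRunTime.
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                    open(dest, "wb") as out:
                while True:
                    chunk = resp.read(64 * 1024)  # raises socket.timeout if the peer stops sending
                    if not chunk:
                        return
                    out.write(chunk)
        except OSError as exc:  # socket.timeout is a subclass of OSError
            if attempt == attempts:
                raise
            print("download stalled (%s); retrying (%d/%d)" % (exc, attempt, attempts))
            time.sleep(attempt * 5)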

Depends on: 1823992
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b3d278d4a6d8
update robustcheckout hg extension. r=releng-reviewers,bhearsum
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/528f46a1abf8
patch mercurial debian package to make http.timeout option work r=releng-reviewers,bhearsum
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c7a5550cc22f
set a timeout for fetch-content downloads. r=releng-reviewers,gbrown

The patches from comments 37-39 should help with the most common issues we've seen here. If that's confirmed, I should be able to move on next week to the remaining places where we see timeouts.

Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ebb447880cf2
bump decision docker image version. r=releng-reviewers,gbrown DONTBUILD
Depends on: 1824937
See Also: → 1823120

It looks like this is pretty much gone now. I see 2 tasks starred with this bug so far this week, neither of which actually shows a hang; they are just tests that legitimately ran longer than their maxRunTime.

Depends on: 1825503
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/981118187641
remove test pool, no longer necessary. r=MasterWayZ

Calling this done.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch