Closed Bug 1749266 Opened 2 years ago Closed 2 years ago

Intermittent pup [taskcluster:error] Task timeout after 1800 seconds. Force killing container

Categories

(Remote Protocol :: CDP, defect, P5)

Firefox 97
defect

Tracking

(firefox-esr91 unaffected, firefox96 unaffected, firefox97 fixed, firefox98 fixed)

RESOLVED FIXED
98 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox96 --- unaffected
firefox97 --- fixed
firefox98 --- fixed

People

(Reporter: whimboo, Assigned: jdescottes)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

For bug 1580652 we currently see a lot of task timeouts and nearly all are coming from our Puppeteer job. It looks like this is a regression from bug 1732958.

Checking some of them we seem to stall in the following test:

Target Browser.waitForTarget should wait for a target
Target should create a worker from a shared worker
Target should report when a target url changes
waittask specs Page.waitFor should allow you to select an element with parenthesis-starting xpath
waittask specs Page.waitFor should wait for selector
waittask specs Frame.waitForSelector should respond to node attribute mutation
waittask specs Frame.waitForXPath should allow you to select an element with single slash
waittask specs Frame.waitForXPath should throw when frame is detached

Julian, could you please have a look?

Flags: needinfo?(jdescottes)

I requested from the sheriffs the failures to be re-classified for this bug.

Summary: Intermittent Pup [taskcluster:error] Task timeout after 1800 seconds. Force killing container → Intermittent pup [taskcluster:error] Task timeout after 1800 seconds. Force killing container

As discussed on elements, failing runs seem to waste around 10 minutes on a "clone" step:

[vcs 2022-01-08T11:39:54.869Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2022-01-08T11:39:55.013Z] hgmointernal rate miss; using public hg.mozilla.org service
[vcs 2022-01-08T11:39:55.013Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2022-01-08T11:39:55.091Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:4D:EB:21:6E:35:2F:99:C6:8F:C3:47:9B:57:B8:6C:17:15:8F:86:09:D4:6C:17:1D:87:B0:DE:F9:0E:51:70:FC,sha256:FF:E7:8D:93:E9:56:3C:C0:19:FC:00:4C:18:B9:86:E5:08:E5:10:F5:E2:EA:48:E8:22:D3:A3:3A:CA:99:C3:4C', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '4427e203c432ae8a2db310112bbc86a598ca1203', 'https://hg.mozilla.org/mozilla-central', '/builds/worker/checkouts/gecko']
[vcs 2022-01-08T11:39:55.249Z] (using Mercurial 5.8.1)
[vcs 2022-01-08T11:39:55.249Z] ensuring https://hg.mozilla.org/mozilla-central@4427e203c432ae8a2db310112bbc86a598ca1203 is available at /builds/worker/checkouts/gecko
[vcs 2022-01-08T11:39:56.046Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2022-01-08T11:39:56.216Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2022-01-08T11:39:57.010Z] applying clone bundle from https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-unified/4427e203c432ae8a2db310112bbc86a598ca1203.stream-v2.hg

https://firefoxci.taskcluster-artifacts.net/Jft1NJWuTm2TmGnWknp8dQ/0/public/logs/live_backing.log

A successful run will not have this 10mn clone: https://firefoxci.taskcluster-artifacts.net/GCjlo76NQ9G09xrDrQyK0w/0/public/logs/live_backing.log

Since a successful run already takes close to 30mn, the 10mn delay could be enough to make the test exceed the 1800s (=30mn) timeout. Since the last sync added new tests, I imagine we run into this issue more consistently anytime this clone happens.

My suggestion is to increase the timeout and see if it solves the issue. Alternatively we could skip failing/timeout tests, which would reduce the runtime to 10mn ish.

Flags: needinfo?(jdescottes)

Slow cloning is known via bug 1487587. So yes, lets increase the timeout for the puppeteer job to circumvent this problem for now. Thanks!

Depends on: 1487587

Slow cloning is about clone taking 15-30mn instead of 10mn, however we have the issue even if the clone is fast.
The tests are only successful when we are able to start without cloning, by using a cached version of the repo.

That's what I assume the following log line is about:

(existing repository shared store: /builds/worker/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)

The test itself takes around 25 minutes. If the task hits the clone step, it will take an additional 10minutes.
Note that there is also a slow clone issue where clone can take between 15 and 30 mn.

Assignee: nobody → jdescottes
Status: NEW → ASSIGNED
Pushed by jdescottes@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f106f16199d0
[remote] Increase timeout for puppeteer tasks to 3600 seconds r=jmaher,whimboo
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch

Set release status flags based on info from the regressing bug 1732958

Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: