Closed Bug 1904963 Opened 3 months ago Closed 21 days ago

Perma linux asan canvas [tier 2] [taskcluster:error] Task timeout after 1800 seconds. Force killing container. | single tracking bug

Categories

(Core :: DOM: Content Processes, defect, P5)

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox127 --- unaffected
firefox128 --- unaffected
firefox129 --- wontfix
firefox130 --- wontfix
firefox131 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

(Depends on 1 open bug, Blocks 1 open bug, Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=464147572&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/bMi-owgmS5a7mCdjaeQWUQ/runs/0/artifacts/public/logs/live_backing.log


[task 2024-06-26T19:03:18.467Z] 19:03:18     INFO - TEST-START | /html/canvas/offscreen/path-objects/2d.path.clip.basic.2.html
[task 2024-06-26T19:03:18.545Z] 19:03:18     INFO - Closing window 3ab6258d-39ca-4ed1-8f41-80cd56bf4685
[task 2024-06-26T19:03:18.619Z] 19:03:18     INFO - PID 1012 | -----------------------------------------------------
[task 2024-06-26T19:03:18.620Z] 19:03:18     INFO - PID 1012 | Suppressions used:
[task 2024-06-26T19:03:18.621Z] 19:03:18     INFO - PID 1012 |   count      bytes template
[task 2024-06-26T19:03:18.621Z] 19:03:18     INFO - PID 1012 |      31      16288 nsComponentManagerImpl
[task 2024-06-26T19:03:18.622Z] 19:03:18     INFO - PID 1012 |       2        288 libfontconfig.so
[task 2024-06-26T19:03:18.623Z] 19:03:18     INFO - PID 1012 |       1       9240 style::sharing::SHARING_CACHE_KEY
[task 2024-06-26T19:03:18.624Z] 19:03:18     INFO - PID 1012 |       1       4104 style::bloom::BLOOM_KEY
[task 2024-06-26T19:03:18.624Z] 19:03:18     INFO - PID 1012 | -----------------------------------------------------
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2024-06-26 19:03:21.518Z] === Task Finished ===
[taskcluster 2024-06-26 19:03:21.518Z] Unsuccessful task run with exit code: -1 completed in 1804.01 seconds

Hi Andrew! Can you please take a look at this? Could this be something regressed by the recent changes from Bug 1901076?
Thank you!

Flags: needinfo?(aosmond)
Summary: Perma linux asan canvas [taskcluster:error] Task timeout after 1800 seconds. Force killing container. | single tracking bug → Perma linux asan canvas [tier 2] [taskcluster:error] Task timeout after 1800 seconds. Force killing container. | single tracking bug

The regressor cannot be identified on autoland as the job will fail with No checks run.
It first started with this pushlog merged to central: https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=653f0dc8442dd9cae70845896eb3c0a0252677b3

Component: Graphics: ImageLib → Graphics: Canvas2D
Flags: needinfo?(aosmond)

I bisected this on try to bug 1728331.

Component: Graphics: Canvas2D → DOM: Content Processes
Keywords: regression
Regressed by: 1728331

Set release status flags based on info from the regressing bug 1728331

:nika, since you are the author of the regressor, bug 1728331, could you take a look?

For more information, please visit BugBot documentation.

It appears that this timeout is happening because the test is taking too long to run, rather than because of a hang or similar while running the test.

Given that this is a nofis (Fission-disabled) test job, my guess is that the change to remove E10S-only process recycling is interacting badly with this particular test suite, causing it to run much longer than before. In the log linked from comment 0, "Suppressions used:" output appears relatively frequently, which I am guessing corresponds to content processes shutting down. Looking at a non-failing run from the backlog, I see "Suppressions used:" logged much less frequently (e.g. https://treeherder.mozilla.org/logviewer?job_id=464147617&repo=mozilla-central&lineNumber=6899).
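One way to put a number on that comparison is to count the "Suppressions used:" headers in each log. The following is only a rough sketch: the function names are made up for illustration, the regular expression assumes the line layout shown in comment 0, and treating each header as one content process shutdown is the guess from the comment above, not a confirmed fact.

```python
import re

# Matches the header line as it appears in the log excerpt from comment 0,
# e.g. "... INFO - PID 1012 | Suppressions used:".
SUPPRESSIONS_RE = re.compile(r"PID (\d+) \| Suppressions used:")


def count_suppression_reports(log_text: str) -> int:
    """Return how many 'Suppressions used:' headers the log contains."""
    return len(SUPPRESSIONS_RE.findall(log_text))


def pids_reporting(log_text: str) -> list[int]:
    """Return the sorted, de-duplicated PIDs that printed the header."""
    return sorted({int(pid) for pid in SUPPRESSIONS_RE.findall(log_text)})
```

Running both functions over the downloaded live_backing.log of a failing run and of a passing run would show whether the failing run really reports noticeably more of these blocks.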

We probably don't want to try to restore the E10S recycling logic, as we don't support e10s on desktop anymore, and it's a significant amount of complexity which would only be used for test code. It might be possible to mitigate the process recycling by setting dom.ipc.keepProcessesAlive.web to a non-zero number in this test suite when running with Fission disabled, which may reduce process shutdown/startup cycles.
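As a rough illustration of that mitigation, the conditional can be sketched as a small helper that returns extra prefs only for Fission-disabled runs. The helper name and its parameters are hypothetical and do not correspond to any harness API; the pref name is the one used in the title of the patch attached to this bug (dom.ipc.keepProcessesAlive.web), and the value 1 is likewise taken from that title.

```python
def extra_prefs(fission_enabled: bool, keep_alive: int = 1) -> dict[str, int]:
    """Hypothetical helper: prefs to add for the canvas test suite.

    With Fission enabled nothing changes. With Fission disabled, ask the
    browser to keep `keep_alive` web content processes around instead of
    shutting them down between tests.
    """
    if fission_enabled:
        return {}
    return {"dom.ipc.keepProcessesAlive.web": keep_alive}


# Only the Fission-disabled (nofis) configuration gets the extra pref.
nofis_prefs = extra_prefs(fission_enabled=False)
fission_prefs = extra_prefs(fission_enabled=True)
```

Whether the pref actually reduces the shutdown/startup cycles in this suite is exactly what would need to be verified on try.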

Flags: needinfo?(nika)

It might be possible to mitigate the process recycling by setting dom.ipc.keepProcessesAlive.web to a non-zero number in this test suite when running with Fission disabled, which may reduce process shutdown/startup cycles.

Artur, would you mind trying this?

Flags: needinfo?(aiunusov)

Set release status flags based on info from the regressing bug 1728331

I just changed the .ini file and then asked myself whether this pref is actually used anywhere:

https://searchfox.org/mozilla-central/search?q=keepProcessesAlive.web&path=&case=false&regexp=false

(still looking)

Assignee: nobody → aiunusov
Flags: needinfo?(aiunusov)
Attachment #9411796 - Attachment description: Bug 1904963 - Set dom.ipc.keepProcessesAlive.web to non zero value, r=jstutte → WIP: Bug 1904963 - canvas wpt: set dom.ipc.keepProcessesAlive.web = 1
Attachment #9411796 - Attachment description: WIP: Bug 1904963 - canvas wpt: set dom.ipc.keepProcessesAlive.web = 1 → Bug 1904963 - canvas wpt: set dom.ipc.keepProcessesAlive.web = 1 when fission is disabled, r=jstutte
Pushed by aiunusov@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2bfa2e58766c canvas wpt: set dom.ipc.keepProcessesAlive.web = 1 when fission is disabled, r=jstutte
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 130 Branch

Hi Artur! Can you please take another look at this? The issue is still happening -> check here
Thank you!

Status: RESOLVED → REOPENED
Flags: needinfo?(aiunusov)
Resolution: FIXED → ---
Target Milestone: 130 Branch → ---

(looking)

See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1891526

(going to try them locally in asan environment)

We recently noticed a bug in Marionette that I'm going to fix in bug 1761634. Once it lands, each test will need a few more milliseconds to complete because we will now correctly wait for the initial about:blank to be loaded. Because of that, I'm going ahead and splitting the canvas jobs into 3 chunks and also increasing the task timeout from 1800 to 2700 seconds, which brings us to the same settings as for other jobs. This should also help here, so that we no longer see these task timeouts.
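The effect of chunking and of a longer timeout can be estimated with some back-of-the-envelope arithmetic. This is only a sketch: the test count, per-test duration, and per-test overhead in the usage lines are invented for illustration and are not measurements from this job, and the model ignores browser startup and other fixed per-task costs.

```python
import math


def estimated_chunk_seconds(num_tests: int, seconds_per_test: float,
                            extra_ms_per_test: float, chunks: int) -> float:
    """Estimate the runtime of the largest chunk, in seconds.

    Tests are assumed to be spread evenly across chunks, and every test is
    assumed to pay the same extra overhead (given in milliseconds).
    """
    tests_per_chunk = math.ceil(num_tests / chunks)
    return tests_per_chunk * (seconds_per_test + extra_ms_per_test / 1000.0)


def fits_in_timeout(num_tests: int, seconds_per_test: float,
                    extra_ms_per_test: float, chunks: int,
                    timeout_seconds: float) -> bool:
    """Return True if the estimated chunk runtime stays below the timeout."""
    estimate = estimated_chunk_seconds(
        num_tests, seconds_per_test, extra_ms_per_test, chunks)
    return estimate < timeout_seconds


# Hypothetical numbers: 3000 tests at 0.6 s each plus 50 ms of extra waiting.
single_chunk_ok = fits_in_timeout(3000, 0.6, 50, chunks=1, timeout_seconds=1800)
three_chunks_ok = fits_in_timeout(3000, 0.6, 50, chunks=3, timeout_seconds=2700)
```

Under these made-up numbers a single chunk would need roughly 1950 seconds and exceed the old 1800 second limit, while each of three chunks would need roughly 650 seconds and fit comfortably within 2700 seconds.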

Depends on: 1761634

The failures seem to have stopped after my patch landed. I'll re-check later this week and will mark the bug as fixed if there are still no failures reported.

Flags: needinfo?(aiunusov) → needinfo?(hskupin)

I can verify that ASAN builds no longer cause a timeout of the task since bug 1761634 landed. Marking this bug as fixed.

Assignee: aiunusov → hskupin
Blocks: 1891526
Status: REOPENED → RESOLVED
Closed: 2 months ago → 21 days ago
Flags: needinfo?(hskupin)
Resolution: --- → FIXED

The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox130 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)