Open Bug 1954000 Opened 1 month ago Updated 5 days ago

High frequency Linux [tier 2] wpt TEST-UNEXPECTED-CRASH <test-name> | expected OK when Gecko 138 merges to beta on 2025-03-31

Categories

(Testing :: web-platform-tests, defect)

x86
Linux
defect

Tracking

(firefox-esr115 unaffected, firefox-esr128 unaffected, firefox136 unaffected, firefox137 unaffected, firefox138+ affected)

People

(Reporter: abutkovits, Unassigned, NeedInfo)

Details

(Keywords: intermittent-failure)

Central-as-beta-simulation

How to run these simulations

Failure log

[task 2025-03-13T15:36:18.966Z] 15:36:18     INFO - PID 9779 | [9779] Sandbox: SandboxBroker: thread creation failed: ENOMEM
[task 2025-03-13T15:36:18.967Z] 15:36:18     INFO - PID 9779 | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2025-03-13T15:36:18.974Z] 15:36:18     INFO - PID 9779 | [Parent 9779, IPC I/O Parent] WARNING: process 20721 exited on signal 15: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc:132
[task 2025-03-13T15:36:18.984Z] 15:36:18     INFO - PID 9779 | [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
[task 2025-03-13T15:36:19.375Z] 15:36:19     INFO - Closing window 064e1624-b26c-41b2-ab8e-862686a57712
[task 2025-03-13T15:36:19.383Z] 15:36:19     INFO - PID 9779 | 1741880179382	Marionette	INFO	Stopped listening on port 37321
[task 2025-03-13T15:36:19.420Z] 15:36:19     INFO - NoSuchWindowException on command, setting status to CRASH
[task 2025-03-13T15:36:19.422Z] 15:36:19     INFO - TEST-UNEXPECTED-CRASH | /css/css-grid/alignment/grid-row-axis-alignment-positioned-items-014.html | expected OK
[task 2025-03-13T15:36:19.422Z] 15:36:19     INFO - TEST-INFO took 472ms
[task 2025-03-13T15:36:19.426Z] 15:36:19     INFO - PID 9779 | JavaScript error: chrome://remote/content/marionette/cert.sys.mjs, line 47: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
[task 2025-03-13T15:36:19.509Z] 15:36:19     INFO - Browser exited with return code -15
[task 2025-03-13T15:36:19.511Z] 15:36:19     INFO - Closing logging queue
[task 2025-03-13T15:36:19.511Z] 15:36:19     INFO - queue closed
[task 2025-03-13T15:36:19.551Z] 15:36:19     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmppran9exy
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | Gtk-Message: 15:35:10.646: Failed to load module "canberra-gtk-module"
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | Gtk-Message: 15:35:10.647: Failed to load module "canberra-gtk-module"
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | [GFX1-]: glxtest: libpci missing
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | [GFX1-]: glxtest: libEGL missing
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | [GFX1-]: glxtest: libGL.so.1 missing
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | [GFX1-]: No GPUs detected via PCI
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | 1741880111084	Marionette	INFO	Marionette enabled
[task 2025-03-13T15:36:19.564Z] 15:36:19     INFO - PID 10911 | 1741880111316	Marionette	INFO	Listening on port 39140
[task 2025-03-13T15:36:19.565Z] 15:36:19     INFO - PID 10911 | [GFX1-]: Failed GL context creation for WebRender: 0
[task 2025-03-13T15:36:19.565Z] 15:36:19     INFO - PID 10911 | [GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED
[task 2025-03-13T15:36:19.565Z] 15:36:19     INFO - PID 10911 | [GFX1-]: Failed to connect WebRenderBridgeChild. isParent=true
[task 2025-03-13T15:36:19.565Z] 15:36:19     INFO - PID 10911 | [GFX1-]: Fallback WR to SW-WR
[task 2025-03-13T15:36:19.566Z] 15:36:19     INFO - PID 10911 | console.error: ({})
[task 2025-03-13T15:36:19.567Z] 15:36:19     INFO - PID 10911 | [ERROR fog_control] Boo, couldn't open serverknobs file at /builds/worker/workspace/build/application/firefox/interesting_serverknobs.json
[task 2025-03-13T15:36:19.567Z] 15:36:19     INFO - PID 10911 | GLib-GIO-Message: 15:35:28.903: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2025-03-13T15:36:19.567Z] 15:36:19     INFO - Starting runner
[task 2025-03-13T15:36:20.132Z] 15:36:20     INFO - TEST-START | /css/css-grid/alignment/grid-row-axis-alignment-positioned-items-015.html

The bug is marked as tracked for firefox138 (nightly). We have limited time to fix this: the soft freeze is in 10 days. However, the bug still isn't assigned.

:Honza, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(odvarko)

Henrik, could you please take a look, thank you.

Flags: needinfo?(odvarko) → needinfo?(hskupin)

This is not Marionette-related; it strongly looks like we are hitting out-of-memory situations in all of the failing cases. Here is the related line from the log:

[task 2025-03-13T15:51:13.362Z] 15:51:13 INFO - PID 18494 | [18494] Sandbox: SandboxBroker: thread creation failed: ENOMEM

Maybe Jed knows whether something changed in the SandboxBroker recently.

Flags: needinfo?(hskupin) → needinfo?(jld)

The most recent thing that seems relevant is bug 1553850, but that's in 137. And this isn't an issue with exceeding the thread limit (RLIMIT_NPROC), because that would fail with EAGAIN, not ENOMEM.
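
The EAGAIN versus ENOMEM distinction above can be applied mechanically when scanning logs. The following is only a minimal sketch: the function name `triage` and the `HINTS` table are hypothetical, and the hint strings merely restate the reading given in this comment, not an authoritative diagnosis.

```python
import errno
import re

# Matches failure lines such as:
#   "[9779] Sandbox: SandboxBroker: thread creation failed: ENOMEM"
FAILURE_RE = re.compile(
    r"SandboxBroker: thread creation failed: (?P<name>E[A-Z0-9]+)"
)

# Hypothetical table: errno name -> interpretation suggested in this thread.
HINTS = {
    "EAGAIN": "resource limit reached (for example RLIMIT_NPROC)",
    "ENOMEM": "out of memory or address space",
}


def triage(line):
    """Return (errno name, errno number, hint) for a failure line, else None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    name = match.group("name")
    # Look the symbolic name up in the errno module; None if unknown here.
    number = getattr(errno, name, None)
    return name, number, HINTS.get(name, "no hint recorded")
```

Called on the failure line from the log in this bug, `triage` returns the `ENOMEM` entry; lines without a broker thread failure return `None`.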

There were a couple of issues with memory leaks or increased memory use from bug 1942129, but I think those were resolved before this test run.

I notice that the failing run is 32-bit, so we're probably running out of address space rather than memory per se. Two things I can think of:

  1. Keep a count of the number of extant SandboxBroker instances and log that in the crashing case, to see if it looks unreasonably large.
  2. If it's large but doesn't seem to be a leak or otherwise fixable, the per-broker address space consumption can be optimized somewhat.
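
The first idea, counting live broker instances and logging the count when thread creation fails, could look roughly like the sketch below. It is illustrative Python, not the Gecko code: the `Broker` class, the `start_thread` callback, and the log wording are all assumptions made for this example.

```python
import logging
import threading

log = logging.getLogger("broker-sketch")


class Broker:
    """Stand-in for a broker object; this is not the real SandboxBroker."""

    _live = 0                  # number of brokers currently alive
    _lock = threading.Lock()   # protects _live

    def __init__(self, start_thread):
        with Broker._lock:
            Broker._live += 1
            count = Broker._live
        self._closed = False
        try:
            # start_thread is a hypothetical callback that may raise OSError.
            start_thread()
        except OSError as exc:
            # Report how many brokers existed at the moment of failure.
            log.error("thread creation failed: %s (%d brokers)", exc, count)
            self.close()
            raise

    def close(self):
        """Release this broker; safe to call more than once."""
        if self._closed:
            return
        self._closed = True
        with Broker._lock:
            Broker._live -= 1

    @classmethod
    def live_count(cls):
        with cls._lock:
            return cls._live
```

The counter is incremented before the thread is started, so the logged number includes the broker that failed; a failed broker is released again before the exception propagates.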

Leaving needinfo to myself to look into this a little more.

I can't remember exactly, but this very likely rings a bell for me.

This is a reminder regarding comment #1!

The bug is marked as tracked for firefox138 (nightly). We have limited time to fix this: the soft freeze is in 3 days. However, the bug still isn't assigned.

I couldn't reproduce this on Try, even by using the same mach try release as the original failing run. I'll see if I can improve log messages to narrow it down a little from “nonspecific 32-bit OOM”.

Flags: needinfo?(jld)
OS: Unspecified → Linux
Hardware: Unspecified → x86

Jed, here are three failures from the latest beta simulation:

What is the threshold when we are talking about too many brokers? Is that the case for those runs?

Flags: needinfo?(jld)

As far as I recall from investigating this, we were talking about just a few brokers. That many is not expected; I'll take a look today.

 8:23.51 TEST_START: /service-workers/service-worker/xsl-base-url.https.html
 8:23.51 INFO Closing window f5c2853c-3753-408b-8ba1-b26d6d791d75
 8:23.52 pid:4422 [4422] Sandbox: SandboxBroker: socketpair success (362 brokers)
 8:23.52 pid:4422 [4422] Sandbox: SandboxBroker: thread creation success (362 brokers)
 8:23.60 pid:4422 [4422] Sandbox: SandboxBroker: socketpair success (363 brokers)
 8:23.60 pid:4422 [4422] Sandbox: SandboxBroker: thread creation success (363 brokers)
 8:23.74 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
 8:23.74 INFO No more tests
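
To find the highest broker count in a longer log, lines in the format shown above can be scanned with a small script. This is a sketch that assumes every instrumented line ends in `(N brokers)` exactly as in this excerpt; the helper name `peak_broker_count` is made up for illustration.

```python
import re

# Matches e.g. "SandboxBroker: thread creation success (363 brokers)".
BROKER_RE = re.compile(
    r"SandboxBroker: (?P<event>.+?) \((?P<count>\d+) brokers\)"
)


def peak_broker_count(lines):
    """Return the largest broker count seen in the log lines, or None."""
    peak = None
    for line in lines:
        match = BROKER_RE.search(line)
        if match is None:
            continue  # not an instrumented broker line
        count = int(match.group("count"))
        if peak is None or count > peak:
            peak = count
    return peak


sample = [
    " 8:23.52 pid:4422 [4422] Sandbox: SandboxBroker: socketpair success (362 brokers)",
    " 8:23.60 pid:4422 [4422] Sandbox: SandboxBroker: thread creation success (363 brokers)",
    " 8:23.74 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0",
]
print(peak_broker_count(sample))  # prints 363
```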

This is from just running an amd64 opt build with the wpt service worker tests. I'm wondering whether those tests simply keep too many service workers alive, so that we keep the sandbox broker references alive as well.

So I came up with a bit of a hack (maybe we should be releasing something earlier) that brings the number of brokers down to ~8, yet we still crash with ENOMEM: https://treeherder.mozilla.org/logviewer?job_id=501882332&repo=try&lineNumber=11568

(In reply to :gerard-majax from comment #11)

So I came up with a bit of a hack, maybe we should be releasing something earlier and I could improve the number of brokers down to ~8, yet we still crash: https://treeherder.mozilla.org/logviewer?job_id=501882332&repo=try&lineNumber=11568 we still hit ENOMEM

That’s a nice dropdown of brokers! Do you think having a patch for that in a separate bug — like you suggested — and landing it once it’s polished could help at least address one symptom? It might even reduce the number of crashes as a result; even though it doesn't fix it completely.

Flags: needinfo?(lissyx+mozillians)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #13)

(In reply to :gerard-majax from comment #11)

So I came up with a bit of a hack, maybe we should be releasing something earlier and I could improve the number of brokers down to ~8, yet we still crash: https://treeherder.mozilla.org/logviewer?job_id=501882332&repo=try&lineNumber=11568 we still hit ENOMEM

That’s a nice dropdown of brokers! Do you think having a patch for that in a separate bug — like you suggested — and landing it once it’s polished could help at least address one symptom? It might even reduce the number of crashes as a result; even though it doesn't fix it completely.

That was really a hack to check whether the theory holds. I think Jed mentioned that we also depend on CC (cycle collection) happening: https://bugzilla.mozilla.org/show_bug.cgi?id=1936938#c10

Flags: needinfo?(lissyx+mozillians)