Closed Bug 1890113 Opened 2 years ago Closed 1 year ago

Intermittent [tier 2] /css/css-grid/alignment/grid-row-axis-alignment-sticky-positioned-items-002.html | single tracking bug

Categories

(Core :: Layout: Grid, defect, P5)

x86
Linux
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox125 --- unaffected
firefox126 --- unaffected

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, intermittent-testcase, regression, Whiteboard: [retriggered])

Filed by: sstanca [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=453745674&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ANUBZeQPSZOBO6kqxSV9hg/runs/0/artifacts/public/logs/live_backing.log


[task 2024-04-06T10:55:34.452Z] 10:55:34     INFO - TEST-OK | /css/css-grid/alignment/grid-row-axis-alignment-sticky-positioned-items-001.html | took 338ms
[task 2024-04-06T10:55:34.453Z] 10:55:34     INFO - TEST-START | /css/css-grid/alignment/grid-row-axis-alignment-sticky-positioned-items-002.html
[task 2024-04-06T10:55:34.457Z] 10:55:34     INFO - Closing window e31cb419-d2d8-498e-adf7-e97e48432f69
[task 2024-04-06T10:55:34.476Z] 10:55:34     INFO - PID 32485 | [GFX1-]: wr_renderer_render: OutOfMemory
[task 2024-04-06T10:55:34.499Z] 10:55:34     INFO - PID 32485 | [GFX1-]: Handling webrender error 2
[task 2024-04-06T10:55:34.500Z] 10:55:34     INFO - PID 32485 | [GFX1-]: Fallback remains SW-WR
[task 2024-04-06T10:55:34.647Z] 10:55:34     INFO - PID 32485 | [32485] Sandbox: SandboxBroker: thread creation failed: ENOMEM
[task 2024-04-06T10:55:34.650Z] 10:55:34     INFO - PID 32485 | [32485] Sandbox: SandboxBroker: thread creation failed: ENOMEM
[task 2024-04-06T10:55:34.650Z] 10:55:34     INFO - PID 32485 | [Parent 32485, IPC I/O Parent] WARNING: process 12541 exited on signal 15: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:335
[task 2024-04-06T10:55:34.651Z] 10:55:34     INFO - PID 32485 | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2024-04-06T10:55:34.653Z] 10:55:34     INFO - PID 32485 | [Parent 32485, IPC I/O Parent] WARNING: process 12543 exited on signal 15: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:335
[task 2024-04-06T10:55:34.661Z] 10:55:34     INFO - PID 32485 | [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
[task 2024-04-06T10:55:35.033Z] 10:55:35     INFO - PID 32485 | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2024-04-06T10:55:35.038Z] 10:55:35     INFO - PID 32485 | 1712400935037	Marionette	INFO	Stopped listening on port 51613
[task 2024-04-06T10:55:35.077Z] 10:55:35     INFO - NoSuchWindowException on command, setting status to CRASH
[task 2024-04-06T10:55:35.078Z] 10:55:35     INFO - TEST-UNEXPECTED-CRASH | /css/css-grid/alignment/grid-row-axis-alignment-sticky-positioned-items-002.html | expected OK
[task 2024-04-06T10:55:35.078Z] 10:55:35     INFO - TEST-INFO took 626ms
[task 2024-04-06T10:55:35.082Z] 10:55:35     INFO - PID 32485 | JavaScript error: chrome://remote/content/marionette/cert.sys.mjs, line 47: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
[task 2024-04-06T10:55:35.083Z] 10:55:35     INFO - PID 32485 | JavaScript error: chrome://remote/content/marionette/cert.sys.mjs, line 47: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
[task 2024-04-06T10:55:45.096Z] 10:55:45     INFO - PID 32485 | WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"ShieldRecipeClient: Cleaning up","state":"(none)","filename":"resource://normandy/lib/CleanupManager.sys.mjs","lineNumber":39,"stack":["resource://normandy/lib/CleanupManager.sys.mjs:cleanup:39","resource://normandy/Normandy.sys.mjs:uninit:161","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted/tasks<:2135","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted:2153","resource:///modules/BrowserGlue.sys.mjs:BG_observe:1120","resource:///modules/ContentCrashHandlers.sys.mjs:observe:165"]}] Barrier: profile-before-change
[task 2024-04-06T10:55:50.095Z] 10:55:50     INFO - PID 32485 | console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.sys.mjs", 438))
[task 2024-04-06T10:56:36.096Z] 10:56:36     INFO - PID 32485 | FATAL ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"ShieldRecipeClient: Cleaning up","state":"(none)","filename":"resource://normandy/lib/CleanupManager.sys.mjs","lineNumber":39,"stack":["resource://normandy/lib/CleanupManager.sys.mjs:cleanup:39","resource://normandy/Normandy.sys.mjs:uninit:161","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted/tasks<:2135","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted:2153","resource:///modules/BrowserGlue.sys.mjs:BG_observe:1120","resource:///modules/ContentCrashHandlers.sys.mjs:observe:165"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.
[task 2024-04-06T10:56:36.097Z] 10:56:36     INFO - PID 32485 | [Parent 32485, Main Thread] ###!!! ABORT: file resource://normandy/lib/CleanupManager.sys.mjs:39
[task 2024-04-06T10:56:36.100Z] 10:56:36     INFO - PID 32485 | ExceptionHandler::GenerateDump cloned child 12549
[task 2024-04-06T10:56:36.101Z] 10:56:36     INFO - PID 32485 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2024-04-06T10:56:36.101Z] 10:56:36     INFO - PID 32485 | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2024-04-06T10:56:36.263Z] 10:56:36     INFO - Browser exited with return code 11
[task 2024-04-06T10:56:36.264Z] 10:56:36     INFO - Closing logging queue
[task 2024-04-06T10:56:36.265Z] 10:56:36     INFO - queue closed
[task 2024-04-06T10:56:36.284Z] 10:56:36     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpea84wgwc
[task 2024-04-06T10:56:36.293Z] 10:56:36     INFO - PID 2348 | Gtk-Message: 10:54:36.773: Failed to load module "canberra-gtk-module"
[task 2024-04-06T10:56:36.294Z] 10:56:36     INFO - PID 2348 | Gtk-Message: 10:54:36.775: Failed to load module "canberra-gtk-module"
[task 2024-04-06T10:56:36.294Z] 10:56:36     INFO - PID 2348 | [GFX1-]: glxtest: libpci missing
[task 2024-04-06T10:56:36.294Z] 10:56:36     INFO - PID 2348 | [GFX1-]: glxtest: libEGL missing
[task 2024-04-06T10:56:36.295Z] 10:56:36     INFO - PID 2348 | [GFX1-]: glxtest: libGL.so.1 missing
[task 2024-04-06T10:56:36.295Z] 10:56:36     INFO - PID 2348 | [GFX1-]: No GPUs detected via PCI
[task 2024-04-06T10:56:36.295Z] 10:56:36     INFO - PID 2348 | 1712400877170	Marionette	INFO	Marionette enabled
[task 2024-04-06T10:56:36.296Z] 10:56:36     INFO - PID 2348 | 1712400877301	Marionette	INFO	Listening on port 38192
[task 2024-04-06T10:56:36.296Z] 10:56:36     INFO - PID 2348 | [GFX1-]: Failed GL context creation for WebRender: 0
[task 2024-04-06T10:56:36.297Z] 10:56:36     INFO - PID 2348 | [GFX1-]: FEATURE_FAILURE_WEBRENDER_INITIALIZE_UNSPECIFIED
[task 2024-04-06T10:56:36.297Z] 10:56:36     INFO - PID 2348 | [GFX1-]: Failed to connect WebRenderBridgeChild. isParent=true
[task 2024-04-06T10:56:36.298Z] 10:56:36     INFO - PID 2348 | [GFX1-]: Fallback WR to SW-WR
[task 2024-04-06T10:56:36.298Z] 10:56:36     INFO - PID 2348 | GLib-GIO-Message: 10:54:39.375: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2024-04-06T10:56:36.299Z] 10:56:36     INFO - PID 2348 | console.error: ({})
[task 2024-04-06T10:56:36.299Z] 10:56:36     INFO - PID 2348 | [ERROR error_support::handling] suggest-unexpected: Error from Remote Settings: Error parsing URL: relative URL with a cannot-be-a-base base
[task 2024-04-06T10:56:36.300Z] 10:56:36     INFO - PID 2348 | console.error: URLBar - QuickSuggest.SuggestBackendRust: "Ingest error: Error from Remote Settings: Error parsing URL: relative URL with a cannot-be-a-base base"
[task 2024-04-06T10:56:36.300Z] 10:56:36     INFO - Starting runner
[task 2024-04-06T10:56:36.899Z] 10:56:36     INFO - TEST-START | /css/css-grid/alignment/grid-row-axis-self-baseline-synthesized-001.html

Hi Kelly! Can you please take a look at this? I think this might be something regressed by the recent changes from Bug 1889769.
Thank you!

Flags: needinfo?(kcochrane)

:emilio, do you know what might have caused this? kelly's patch was just to firefoxview CSS and is not a potential cause. But knowing what is might get her off the hook and give the sherifs something to work with.

Flags: needinfo?(kcochrane) → needinfo?(emilio)

[GFX1-]: wr_renderer_render: OutOfMemory

So a graphics issue? Can we get this bisected if it's frequent enough?

Flags: needinfo?(emilio) → needinfo?(imoraru)

This is the way the backfills and retriggers look like. First time it appeared, from 14 runs, it only failed 2 times. I can do a more in depth bisection if you want.

Flags: needinfo?(imoraru) → needinfo?(emilio)

I don't see much there that would obviously increase the chances of OOM in webrender in those patches. I guess the culprint could be anywhere really, so if you could bisect the intermediate patches it'd be great....

Flags: needinfo?(emilio) → needinfo?(imoraru)
See Also: → 1890183
See Also: → 1891288

Hi Henrik! Can you please take a look at this and the bugs in the see also section? It seems that they are regressed by Bug 1522790.

Backfills and retriggers

Emilio if you want to take a look at the in depth bisection you can take a look at the link above!

Thank you!

Flags: needinfo?(imoraru) → needinfo?(hskupin)
Keywords: regression
Regressed by: 1522790
Whiteboard: [retriggered]

The patch for bug 1522790 cannot have caused it but made it potentially more prominent and easier to reproduce. Please note that the original patch got backed out but we will re-land shortly again. The number of failures is still low so I would probably advice to not mark the test as intermittent crash.

Also it looks like that only Linux 18.04 WebRender Shippable is affected here. But what's strange is that we do not have any generated minidump file for all of these crashes. Gabriele, is that something you have seen before?

Here the related lines from the log:

[task 2024-04-12T23:01:01.145Z] 23:01:01     INFO - PID 32290 | [GFX1-]: wr_renderer_render: OutOfMemory
[task 2024-04-12T23:01:01.152Z] 23:01:01     INFO - PID 32290 | [GFX1-]: Handling webrender error 2
[task 2024-04-12T23:01:01.153Z] 23:01:01     INFO - PID 32290 | [GFX1-]: Fallback remains SW-WR
[task 2024-04-12T23:01:01.332Z] 23:01:01     INFO - PID 32290 | [32290] Sandbox: SandboxBroker: thread creation failed: ENOMEM
[task 2024-04-12T23:01:01.332Z] 23:01:01     INFO - PID 32290 | [32290] Sandbox: SandboxBroker: thread creation failed: ENOMEM
[task 2024-04-12T23:01:01.333Z] 23:01:01     INFO - PID 32290 | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2024-04-12T23:01:01.338Z] 23:01:01     INFO - PID 32290 | [Parent 32290, IPC I/O Parent] WARNING: process 12280 exited on signal 15: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:335
[task 2024-04-12T23:01:01.338Z] 23:01:01     INFO - PID 32290 | [Parent 32290, IPC I/O Parent] WARNING: process 12282 exited on signal 15: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:335
[task 2024-04-12T23:01:01.348Z] 23:01:01     INFO - PID 32290 | [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
[task 2024-04-12T23:01:01.720Z] 23:01:01     INFO - PID 32290 | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2024-04-12T23:01:01.725Z] 23:01:01     INFO - PID 32290 | 1712962861723	Marionette	INFO	Stopped listening on port 42181
[task 2024-04-12T23:01:01.767Z] 23:01:01     INFO - NoSuchWindowException on command, setting status to CRASH
[task 2024-04-12T23:01:01.767Z] 23:01:01     INFO - TEST-UNEXPECTED-CRASH | /css/css-grid/alignment/grid-row-axis-alignment-sticky-positioned-items-002.html | expected OK
[task 2024-04-12T23:01:01.767Z] 23:01:01     INFO - TEST-INFO took 633ms
[task 2024-04-12T23:01:01.768Z] 23:01:01     INFO - PID 32290 | JavaScript error: chrome://remote/content/marionette/cert.sys.mjs, line 47: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
[task 2024-04-12T23:01:01.816Z] 23:01:01     INFO - PID 32290 | JavaScript error: chrome://remote/content/marionette/cert.sys.mjs, line 47: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsICertOverrideService.setDisableAllSecurityChecksAndLetAttackersInterceptMyData]
[task 2024-04-12T23:01:11.832Z] 23:01:11     INFO - PID 32290 | WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"ShieldRecipeClient: Cleaning up","state":"(none)","filename":"resource://normandy/lib/CleanupManager.sys.mjs","lineNumber":39,"stack":["resource://normandy/lib/CleanupManager.sys.mjs:cleanup:39","resource://normandy/Normandy.sys.mjs:uninit:161","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted/tasks<:2135","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted:2153","resource:///modules/BrowserGlue.sys.mjs:BG_observe:1120","resource:///modules/ContentCrashHandlers.sys.mjs:observe:165"]}] Barrier: profile-before-change
[task 2024-04-12T23:01:16.832Z] 23:01:16     INFO - PID 32290 | console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.sys.mjs", 438))
[task 2024-04-12T23:02:02.833Z] 23:02:02     INFO - PID 32290 | FATAL ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"ShieldRecipeClient: Cleaning up","state":"(none)","filename":"resource://normandy/lib/CleanupManager.sys.mjs","lineNumber":39,"stack":["resource://normandy/lib/CleanupManager.sys.mjs:cleanup:39","resource://normandy/Normandy.sys.mjs:uninit:161","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted/tasks<:2135","resource:///modules/BrowserGlue.sys.mjs:_onQuitApplicationGranted:2153","resource:///modules/BrowserGlue.sys.mjs:BG_observe:1120","resource:///modules/ContentCrashHandlers.sys.mjs:observe:165"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.
[task 2024-04-12T23:02:02.833Z] 23:02:02     INFO - PID 32290 | [Parent 32290, Main Thread] ###!!! ABORT: file resource://normandy/lib/CleanupManager.sys.mjs:39
[task 2024-04-12T23:02:02.836Z] 23:02:02     INFO - PID 32290 | ExceptionHandler::GenerateDump cloned child 12318
[task 2024-04-12T23:02:02.837Z] 23:02:02     INFO - PID 32290 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2024-04-12T23:02:02.837Z] 23:02:02     INFO - PID 32290 | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2024-04-12T23:02:02.977Z] 23:02:02     INFO - Browser exited with return code 11

Maybe it's also due to the delayed shutdown for ShieldRecipeClient?

Flags: needinfo?(hskupin) → needinfo?(gsvelto)
OS: Unspecified → Linux
Hardware: Unspecified → x86

I can see that the crash machinery kicked in from the log, so I would expect a minidump to have been generated... however there's signs that the machine was running out of memory at that point. It's possible that minidump generation failed silently because of that.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #9)

I can see that the crash machinery kicked in from the log, so I would expect a minidump to have been generated... however there's signs that the machine was running out of memory at that point. It's possible that minidump generation failed silently because of that.

But that is for the shutdown hang that we experience here as well, or isn't it? Given that there was no such bug yet I filed bug 1891418 for Normandy client.

The resource usage of the machine doesn't actually see a OOM but maybe it's not seen by the resource usage profile?

There's a cascade of OOM failures here though they appear to be gracefully handled. Maybe it's not memory per se we're running out of, maybe it's some hard limit? Without knowing this code in detail it's hard to tell.

Hm, maybe there is a problem with GFX code that causes our crash reporter to miss the creation of minidump files? I just saw something similar on bug 1785367 comment 28 which is not a OOM situation.

This seems to have been fixed-by-backout of bug 1522790, which is still backed out at the moment (most recently as of bug 1522790 comment 69), though presumably it's re-landing soonish.

I see there's been various activity on that bug after the initial landing to avoid various test failures; it's not clear to me at this point whether that's going to fix this when bug 1522790 re-lands or not.

[setting status to 'unaffected' for 125 and 126 to get this off the bugdash "important regressions needing decision" list, since this doesn't actually seem to be a gecko regression per se, and the thing that made it start happening on automation is backed out for the time being]

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #12)

Hm, maybe there is a problem with GFX code that causes our crash reporter to miss the creation of minidump files? I just saw something similar on bug 1785367 comment 28 which is not a OOM situation.

When checking the results of this job it can be seen in the profile resource usage (profiler data) that the memory usage indeed didn't went up. So maybe some code tried to allocate too much memory? Or the error code is misleading? Not sure how much memory it costs to create a new SandboxBroker thread but I also don't see how NOMEM gets set when logging.

Jed, do you have an idea?

Blocks: 1522790
Flags: needinfo?(jld)
No longer regressed by: 1522790

The error for exceeding RLIMIT_NPROC is EAGAIN. From a quick look at kernel/fork.c in the kernel source, ENOMEM from clone really does mean that a kernel memory allocation failed.

We are doing something slightly wrong here, because pthread_create is specified to return an error code and not (necessarily) set errno, but from a quick look at nptl/pthread_create.c in the glibc source it does both, at least for the call to clone. So the ENOMEM in the error message really does look like it's indicating an actual OOM, especially combined with the OOM error from WebRender right before it.

Flags: needinfo?(jld)

Daniel, do you remember which platforms were affected here? Sadly there is no intermittent failure accessible (and listed) anymore in the above table even not with the full history. As of now the only expected meta data that is set is for Android with Fission enabled. Here we have [OK, TIMEOUT]. Someone could try to remove and push to try and check if the test is stable now.

Flags: needinfo?(dholbert)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #16)

Daniel, do you remember which platforms were affected here? Sadly there is no intermittent failure accessible (and listed) anymore in the above table even not with the full history.

Comment 8 says this only affected Linux 18.04 WebRender Shippable.

Flags: needinfo?(dholbert)

Ah I missed that. In that case lets close this bug as incomplete.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.