Crash in [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped]
Categories
(Core :: Graphics: WebRender, defect, P3)
Tracking
Release | Tracking | Status
---|---|---
firefox-esr78 | --- | unaffected
firefox87 | --- | unaffected
firefox88 | --- | unaffected
firefox89 | --- | wontfix
firefox90 | --- | fix-optional
People
(Reporter: calixte, Unassigned)
References
(Depends on 1 open bug, Blocks 2 open bugs, Regression)
Details
(Keywords: crash, regression, Whiteboard: [not-a-fission-bug])
Crash Data
Maybe Fission related. (DOMFissionEnabled=1)
Crash report: https://crash-stats.mozilla.org/report/index/b621fe01-471f-4b1a-a0bb-ca7ce0210331
MOZ_CRASH Reason: MOZ_CRASH(OOM)
Top 10 frames of crashing thread:
0 xul.dll NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:618
1 xul.dll mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped gfx/layers/SourceSurfaceSharedData.cpp:76
2 xul.dll mozilla::gfx::SourceSurfaceSharedDataWrapper::Init gfx/layers/SourceSurfaceSharedData.cpp:48
3 xul.dll static mozilla::layers::SharedSurfacesParent::Add gfx/layers/ipc/SharedSurfacesParent.cpp:215
4 xul.dll mozilla::layers::CompositorManagerParent::RecvAddSharedSurface gfx/layers/ipc/CompositorManagerParent.cpp:265
5 xul.dll mozilla::layers::PCompositorManagerParent::OnMessageReceived ipc/ipdl/PCompositorManagerParent.cpp:291
6 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2078
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1149
8 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:332
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:328
There is 1 crash in nightly 89 with buildid 20210330035059. Based on the backtrace, the regression may have been introduced by patch [1], which fixed bug 1699224.
[1] https://hg.mozilla.org/mozilla-central/rev?node=cd7cfc8fc140
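For reference, here is a minimal, hypothetical sketch of the failure mode the stack trace shows: the compositor-side wrapper tries to map the shared surface data, and when the map fails it deliberately aborts with an OOM annotation carrying the requested size. This is not the actual Gecko code (the real path is SharedSurfacesParent::Add → Init → EnsureMapped → NS_ABORT_OOM); the names below are illustrative stand-ins.

```cpp
// Illustrative sketch only; SharedBuffer and EnsureMappedSketch are
// hypothetical stand-ins, not Gecko types.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct SharedBuffer {
  void* mView = nullptr;
  // Stand-in for the real shared-memory mapping call; malloc is used here
  // only so the sketch compiles and runs anywhere.
  bool Map(size_t aBytes) {
    mView = malloc(aBytes);
    return mView != nullptr;
  }
};

void EnsureMappedSketch(SharedBuffer& aBuf, size_t aDataSize) {
  if (!aBuf.Map(aDataSize)) {
    // In Gecko this is NS_ABORT_OOM(size), which produces the
    // "OOM | large" crash signature with the requested allocation size.
    fprintf(stderr, "ABORT_OOM: failed to map %zu bytes\n", aDataSize);
    abort();
  }
}

int main() {
  SharedBuffer buf;
  EnsureMappedSketch(buf, 14802944);  // ~14.8 MB, a size seen in later reports
  free(buf.mView);
  return 0;
}
```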
Comment 1•4 years ago
Adding [not-a-fission-bug] whiteboard tag because this is not a Fission-related crash, even though the crash report in comment 0 has DOMFissionEnabled=1. Only 2 of the 7 crash reports (29%) have Fission enabled.
Comment 2•4 years ago
(In reply to Chris Peterson [:cpeterson] from comment #1)
> Adding [not-a-fission-bug] whiteboard tag because this is not a Fission-related crash, even though the crash report in comment 0 has DOMFissionEnabled=1. Only 2 of the 7 crash reports (29%) have Fission enabled.
Turns out those 2 Fission crash reports are from me! If you would like me to test anything, just ping me.
Comment 3•4 years ago
Adding to gfx-triage.
Comment 4•4 years ago
My GPU process crashes with this signature about 2-4 times a day. Here are my SourceSurfaceSharedDataWrapper::EnsureMapped crash reports from today. I am running a 32-bit Firefox build on 64-bit Windows 10 with Fission and gfx.webrender.software = true (and gfx.webrender.software.d3d11 = true).
bp-3c548121-1332-4abf-b40d-780b80210402
bp-603eb2fb-adae-4f4d-b961-12af80210402
bp-b0174f6e-2d45-4037-ae1c-00b4b0210402
bp-2cbd17c4-6546-4e63-9d5c-5493c0210402
Comment 5•4 years ago
To a certain extent, I expect this crash rate to go up as it consolidates crashes from other signatures.
However, all of the crashes in comment 4 are situations with high available virtual memory (>2 GB) and small allocation sizes (~14 MB). It seems unlikely to me that the address space is so fragmented that we can't fit 14 MB anywhere, so something else went wrong.
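One way to test the fragmentation hypothesis directly would be logging that walks the address space and reports the largest contiguous free region; if that region is smaller than the ~14 MB being requested, the failure is fragmentation rather than a genuine shortage. A rough diagnostic sketch (Windows-only, not Firefox code):

```cpp
// Diagnostic sketch, not Firefox code: report the largest contiguous free
// region in the process address space using VirtualQuery.
#include <windows.h>
#include <cstdio>

int main() {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const BYTE* addr = static_cast<const BYTE*>(si.lpMinimumApplicationAddress);
  const BYTE* end = static_cast<const BYTE*>(si.lpMaximumApplicationAddress);

  SIZE_T largestFree = 0;
  MEMORY_BASIC_INFORMATION mbi;
  while (addr < end && VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
    if (mbi.State == MEM_FREE && mbi.RegionSize > largestFree) {
      largestFree = mbi.RegionSize;
    }
    addr = static_cast<const BYTE*>(mbi.BaseAddress) + mbi.RegionSize;
  }
  printf("largest contiguous free region: %zu bytes\n", largestFree);
  return 0;
}
```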
Comment 6•4 years ago
The other pattern is that roughly a third of the crashes are very likely due to rasterized SVGs, since they are nearly 200 MB, our rasterization size limit in the content process. Since deferrable blobs allocate in the GPU process and can tile (reducing the individual allocation sizes, and the total, since not all of the tiles are needed), this should be fixed by bug 1673653.
Comment 7•4 years ago
A special build with extra logging produced this crash report:
https://crash-stats.mozilla.org/report/index/f481bddf-993d-482e-9a03-3eb3c0210406#tab-details
The extra logs are:
[G0][GFX1-]: SharedMemory::Map 8 (t=231977)
[G1][GFX1-]: SharedMemory::Map 8 (t=231977)
[G2][GFX1-]: SharedMemory::Map 8 (t=231977)
[G3][GFX1-]: SharedMemory::Map 8 (t=231977)
[G4][GFX1-]: SharedMemory::Map 8 (t=231977)
It tried to map several times and failed with ERROR_NOT_ENOUGH_MEMORY (8) each time. When it finally crashed, there was plenty of virtual memory available. I've done a respin with extra logging to give a better picture of the state at each map attempt.
This error is somewhat expected if there isn't enough virtual memory, but given how much virtual memory we have, it is either a bug or the address space is highly fragmented, to the point where there is no 14.8 MB chunk in 2.57 GB. At this point I am assuming each map message is for the same-sized buffer, given the pattern from previous reports.
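For context, the "8" in those log lines is the Win32 error code ERROR_NOT_ENOUGH_MEMORY. Assuming SharedMemory::Map on Windows ultimately goes through CreateFileMapping/MapViewOfFile, the failing call would look roughly like this sketch (illustrative only, not the Gecko implementation):

```cpp
// Sketch, under the assumption that the shared-memory map on Windows is
// backed by MapViewOfFile: a failed view mapping returns nullptr and
// GetLastError() yields ERROR_NOT_ENOUGH_MEMORY (8) when no suitable range
// of address space can be found for the view.
#include <windows.h>
#include <cstdio>

int main() {
  const SIZE_T kBytes = 14802944;  // ~14.8 MB, the size seen in the reports
  HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                      PAGE_READWRITE, 0,
                                      static_cast<DWORD>(kBytes), nullptr);
  if (!section) {
    printf("CreateFileMapping failed: %lu\n", GetLastError());
    return 1;
  }
  void* view = MapViewOfFile(section, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0,
                             kBytes);
  if (!view) {
    // Likely the path the "Map 8" log lines correspond to.
    printf("MapViewOfFile failed: %lu\n", GetLastError());
  } else {
    UnmapViewOfFile(view);
  }
  CloseHandle(section);
  return 0;
}
```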
Comment 8•4 years ago
A follow-up build with extra logging produced this crash report:
https://crash-stats.mozilla.org/report/index/1190a550-754c-4b20-863f-098610210407#tab-annotations
[G0][GFX1-]: Shm 14802944 8 (t=33770.5)
[G1][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5)
[G2][GFX1-]: Shm 14802944 8 (t=33770.5)
[G3][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5)
[G4][GFX1-]: Shm 14802944 8 (t=33770.5)
[G5][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5)
[G6][GFX1-]: Shm 14802944 8 (t=33770.5)
[G7][GFX1-]: V 318734336 PH 21325021184 PA 21276147712 L 37 (t=33770.5)
[G8][GFX1-]: Shm 14802944 8 (t=33772.4)
[G9][GFX1-]: V 318734336 PH 21467959296 PA 21215850496 L 37 (t=33772.4)
This paints a more complete picture. The low available virtual memory, ~310 MB, persists even as we unmap shared-memory images from the expiration tracker. By the time the crash report is generated, we see a figure of >2 GB.
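My reading of the fields (an assumption, not confirmed from the logging patch) is that V is available virtual memory and PH/PA are physical totals/available. Figures of that kind are typically obtained from GlobalMemoryStatusEx, roughly as in this sketch:

```cpp
// Sketch of gathering the kind of memory figures shown in the log lines
// above via GlobalMemoryStatusEx. The mapping of the single-letter fields
// to these values is my assumption about the extra logging, not something
// confirmed by the Gecko code.
#include <windows.h>
#include <cstdio>

int main() {
  MEMORYSTATUSEX status;
  status.dwLength = sizeof(status);
  if (GlobalMemoryStatusEx(&status)) {
    printf("avail virtual:  %llu bytes\n", status.ullAvailVirtual);
    printf("total physical: %llu bytes\n", status.ullTotalPhys);
    printf("avail physical: %llu bytes\n", status.ullAvailPhys);
  }
  return 0;
}
```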
The documentation of UnmapViewOfFileEx and MEM_UNMAP_WITH_TRANSIENT_BOOST in particular:
https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-unmapviewoffileex
suggests that there may be a delay between our unmap and its actual removal from virtual memory. This could explain what we are seeing here.
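For reference, a minimal sketch of the call shape that documentation describes (whether Gecko actually passes MEM_UNMAP_WITH_TRANSIENT_BOOST is not established in this bug; the sketch only illustrates the API, available on Windows 8 and later):

```cpp
// Minimal sketch of UnmapViewOfFileEx with MEM_UNMAP_WITH_TRANSIENT_BOOST,
// which temporarily boosts the priority of the backing pages because they
// are expected to be accessed again shortly. Not Gecko code.
#include <windows.h>
#include <cstdio>

int main() {
  HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                      PAGE_READWRITE, 0, 1 << 20, nullptr);
  if (!section) return 1;
  void* view = MapViewOfFile(section, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 0);
  if (view) {
    if (!UnmapViewOfFileEx(view, MEM_UNMAP_WITH_TRANSIENT_BOOST)) {
      printf("UnmapViewOfFileEx failed: %lu\n", GetLastError());
    }
  }
  CloseHandle(section);
  return 0;
}
```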
Comment 9•4 years ago
The probably relevant thread documentation:
> Memory priority helps to determine how long pages remain in the working set of a process before they are trimmed. A thread's memory priority determines the minimum priority of the physical pages that are added to the process working set by that thread. When the memory manager trims the working set, it trims lower priority pages before higher priority pages. This improves overall system performance because higher priority pages are less likely to be trimmed from the working set and then trigger a page fault when they are accessed again.
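A minimal sketch of the API that passage appears to describe (assumed to be SetThreadInformation with ThreadMemoryPriority, Windows 8 and later):

```cpp
// Illustration of the mechanism the quoted documentation describes:
// lowering a thread's memory priority makes the pages it adds to the
// working set more likely to be trimmed first. Not Gecko code.
#include <windows.h>
#include <cstdio>

int main() {
  MEMORY_PRIORITY_INFORMATION info = {};
  info.MemoryPriority = MEMORY_PRIORITY_LOW;  // default is MEMORY_PRIORITY_NORMAL
  if (!SetThreadInformation(GetCurrentThread(), ThreadMemoryPriority, &info,
                            sizeof(info))) {
    printf("SetThreadInformation failed: %lu\n", GetLastError());
    return 1;
  }
  printf("thread memory priority lowered\n");
  return 0;
}
```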
Comment 10•4 years ago
I would note that we only see this particular pattern on Windows, where we have a relatively small allocation and a large pool of available virtual memory. On Fenix 32-bit ARM, we only see OOMs when the available virtual memory is low relative to the allocation size. This suggests it is indeed related to the API behavior described above.
Comment 11•4 years ago
We'll have a better idea once we hit 89 beta, but it is looking like bug 1703839 may have mostly resolved this. I don't expect the crash rate to be zero, but deferring the OOM crash to the latest possible point, as well as fixing the bug where the first map wasn't added to the unmapping tracker (it would eventually get added once WR used the surface in the next frame), got us to a good spot.
Comment 12•4 years ago
At this point, the volume is low for an OOM, and all of the crashes are genuine cases of little available virtual memory. The patches let us unmap as much as possible from the image cache to reclaim memory in a timely fashion, so I think we can decrease the priority and move this to the backlog.
Comment 13•4 years ago
(In reply to Andrew Osmond [:aosmond] from comment #12)
> At this point, the volume is low for an OOM, and all of the crashes are genuine cases of little available virtual memory. The patches let us unmap as much as possible from the image cache to reclaim memory in a timely fashion, so I think we can decrease the priority and move this to the backlog.

Andrew, this is a medium-volume crasher on 89 beta and we had no crashes in 88 beta; it seems to me that the crash situation is worse than it was a month ago.
Comment 14•4 years ago
Pascal, the signature is new in 89 because I added the explicit crash in 89. These used to show up under other signatures, such as bug 1645841; likely they also showed up under other crash signatures I haven't identified.
There is really nothing more I can do aside from bug 1704792, which changes how we rasterize SVG images. It may help a bit. At some point, we have to live with OOMs because the user has too little memory and too much of it already in use. That is what the existing crash reports show. If they have < 1 GB of virtual memory available after we unmap every single buffer we can with WR, and they want to map in 200+ MB or more, there is little to be done.
Comment 15•4 years ago
Reviewing:
https://sql.telemetry.mozilla.org/queries/78915?p_channel=beta#196121
https://sql.telemetry.mozilla.org/queries/78743?p_channel=beta#195731
We can see that, yes, OOMs have gone up (probably because more crashes are classified as such due to my work), but the overall crash rate has not increased; indeed, it looks like it has declined. It's too early to say for sure (but I'm also clearly not causing all the crashes in 89 ;)).