Open Bug 1702132 Opened 6 months ago Updated 5 months ago

Crash in [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped]

Categories

(Core :: Graphics: WebRender, defect, P3)

Firefox 89
x86
Windows 10
defect


Tracking Status
firefox-esr78 --- unaffected
firefox87 --- unaffected
firefox88 --- unaffected
firefox89 --- wontfix
firefox90 --- fix-optional

People

(Reporter: calixte, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs, Regression)

Details

(Keywords: crash, regression, Whiteboard: [not-a-fission-bug])

Crash Data

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/b621fe01-471f-4b1a-a0bb-ca7ce0210331

MOZ_CRASH Reason: MOZ_CRASH(OOM)

Top 10 frames of crashing thread:

0 xul.dll NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:618
1 xul.dll mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped gfx/layers/SourceSurfaceSharedData.cpp:76
2 xul.dll mozilla::gfx::SourceSurfaceSharedDataWrapper::Init gfx/layers/SourceSurfaceSharedData.cpp:48
3 xul.dll static mozilla::layers::SharedSurfacesParent::Add gfx/layers/ipc/SharedSurfacesParent.cpp:215
4 xul.dll mozilla::layers::CompositorManagerParent::RecvAddSharedSurface gfx/layers/ipc/CompositorManagerParent.cpp:265
5 xul.dll mozilla::layers::PCompositorManagerParent::OnMessageReceived ipc/ipdl/PCompositorManagerParent.cpp:291
6 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2078
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1149
8 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:332
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:328

There is 1 crash in nightly 89 with buildid 20210330035059. Based on the backtrace, the regression may have been introduced by patch [1], which fixed bug 1699224.

[1] https://hg.mozilla.org/mozilla-central/rev?node=cd7cfc8fc140

Flags: needinfo?(aosmond)

Adding [not-a-fission-bug] whiteboard tag because this is not a Fission-related crash, even though the crash report in comment 0 has DOMFissionEnabled=1. Only 2 of the 7 crash reports (29%) have Fission enabled.

Whiteboard: [not-a-fission-bug]

(In reply to Chris Peterson [:cpeterson] from comment #1)

Adding [not-a-fission-bug] whiteboard tag because this is not a Fission-related crash, even though the crash report in comment 0 has DOMFissionEnabled=1. Only 2 of the 7 crash reports (29%) have Fission enabled.

Turns out those 2 Fission crash reports are from me! If you would like me to test anything, just ping me.

Adding to gfx-triage.

Blocks: gfx-triage
Severity: -- → S2
Priority: -- → P2
No longer blocks: 1699224

My GPU process crashes with this signature about 2-4 times a day. Here are my SourceSurfaceSharedDataWrapper::EnsureMapped crash reports from today. I am running a 32-bit Firefox build on 64-bit Windows 10 with Fission and gfx.webrender.software = true (and gfx.webrender.software.d3d11 = true).

bp-3c548121-1332-4abf-b40d-780b80210402
bp-603eb2fb-adae-4f4d-b961-12af80210402
bp-b0174f6e-2d45-4037-ae1c-00b4b0210402
bp-2cbd17c4-6546-4e63-9d5c-5493c0210402

To a certain extent, I expect this crash rate to go up as crashes consolidate here from other signatures.

However, all of the crashes in comment 4 occur with high available virtual memory (>2GB) and small allocation sizes (~14MB). It seems unlikely to me that the cache is so fragmented that we can't fit 14MB somewhere, so something else must have gone wrong.

Flags: needinfo?(aosmond)

The other pattern: ~1/3 of the crashes are very likely due to rasterized SVGs, since those allocations are nearly 200 MB, our rasterization size limit in the content process. Since deferrable blobs allocate in the GPU process and can tile (reducing both the individual allocation sizes and the total, since not all tiles are needed at once), this should be fixed by bug 1673653.
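For intuition on the size math, here is a rough sketch (my own illustration, not code from any patch): at 4 bytes per RGBA pixel, a ~7000x7000 rasterized SVG lands right at the ~200 MB limit, while tiling breaks it into ~1 MB pieces. The 512x512 tile size and the "quarter of tiles visible" fraction are hypothetical.

```python
# Rough sketch of why tiling shrinks allocations for a large rasterized SVG.
# Assumes 4 bytes per RGBA pixel; the 512x512 tile size and the visible
# fraction are hypothetical, chosen only to illustrate the effect.

BYTES_PER_PIXEL = 4

def full_raster_bytes(width, height):
    """One contiguous buffer for the whole rasterized image."""
    return width * height * BYTES_PER_PIXEL

def tiled_raster_bytes(width, height, tile=512, visible_fraction=1.0):
    """Many small per-tile buffers; only a fraction may be needed at once."""
    tiles_x = -(-width // tile)   # ceiling division
    tiles_y = -(-height // tile)
    per_tile = tile * tile * BYTES_PER_PIXEL
    needed = int(tiles_x * tiles_y * visible_fraction)
    return per_tile, needed * per_tile

# A ~7000x7000 surface is close to the ~200 MB rasterization limit:
whole = full_raster_bytes(7000, 7000)  # 196,000,000 bytes (~196 MB)

# With tiling, each individual allocation is ~1 MB, and if only a quarter
# of the tiles are needed the total mapped is ~51 MB instead of ~196 MB:
per_tile, total = tiled_raster_bytes(7000, 7000, visible_fraction=0.25)
```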

Depends on: deferrable-blobs

A special build with extra logging produced this crash report:

https://crash-stats.mozilla.org/report/index/f481bddf-993d-482e-9a03-3eb3c0210406#tab-details

The extra logs are:

|[G0][GFX1-]: SharedMemory::Map 8 (t=231977)
|[G1][GFX1-]: SharedMemory::Map 8 (t=231977)
|[G2][GFX1-]: SharedMemory::Map 8 (t=231977)
|[G3][GFX1-]: SharedMemory::Map 8 (t=231977)
|[G4][GFX1-]: SharedMemory::Map 8 (t=231977)

It tried to map several times and failed with ERROR_NOT_ENOUGH_MEMORY (8) each time. When it finally crashed, there was plenty of virtual memory available. I've done a respin with extra logging to give a better picture of the state at each map attempt.

This error is expected if there isn't enough virtual memory, but given how much virtual memory we have, either this is a bug or the address space is so highly fragmented that there is no contiguous 14.8 MB chunk anywhere in 2.57 GB. At this point I am assuming each map message is for the same sized buffer, given the pattern from previous reports.
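A toy model of what that fragmentation scenario would mean (the 2.57 GB and 14.8 MB figures come from the report; the free-run layout below is invented for illustration): a mapping can fail even with gigabytes free in total, if no single contiguous run fits the request.

```python
# Toy model of address-space fragmentation: plenty of free memory in total,
# but no single contiguous run large enough for one allocation.
# The 2.57 GB / 14.8 MB figures come from the crash report; the gap layout
# is invented for illustration.

REQUEST = 14_802_944  # the ~14.8 MB shared-memory buffer from the logs

def can_map(free_runs, request):
    """A mapping succeeds only if some contiguous free run fits it."""
    return any(run >= request for run in free_runs)

# ~2.6 GB of free space, but chopped into 10 MiB runs:
fragmented = [10 * 1024 * 1024] * 263
assert sum(fragmented) > 2_570_000_000   # plenty of memory in total
assert not can_map(fragmented, REQUEST)  # yet the 14.8 MB map fails

# The same order of total free space with one 100 MiB run would succeed:
healthy = fragmented[:253] + [100 * 1024 * 1024]
assert can_map(healthy, REQUEST)
```

As the comment notes, this degree of fragmentation seems unlikely in practice, which is what makes the failures suspicious.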

A follow up build with extra logging produced this crash report:

https://crash-stats.mozilla.org/report/index/1190a550-754c-4b20-863f-098610210407#tab-annotations

|[G0][GFX1-]: Shm 14802944 8 (t=33770.5) 
|[G1][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5) 
|[G2][GFX1-]: Shm 14802944 8 (t=33770.5) 
|[G3][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5) 
|[G4][GFX1-]: Shm 14802944 8 (t=33770.5) 
|[G5][GFX1-]: V 318734336 PH 21323972608 PA 21276147712 L 37 (t=33770.5) 
|[G6][GFX1-]: Shm 14802944 8 (t=33770.5) 
|[G7][GFX1-]: V 318734336 PH 21325021184 PA 21276147712 L 37 (t=33770.5) 
|[G8][GFX1-]: Shm 14802944 8 (t=33772.4) 
|[G9][GFX1-]: V 318734336 PH 21467959296 PA 21215850496 L 37 (t=33772.4)

This paints a more complete picture. The low virtual memory state, ~310MB, persists even as we unmap shared memory images from the expiration tracker. By the time the report is generated, we see >2GB available again.

The documentation of UnmapViewOfFileEx and MEM_UNMAP_WITH_TRANSIENT_BOOST in particular:

https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-unmapviewoffileex

suggests that there may be a delay between our unmap and its actual removal from virtual memory. This could explain what we are seeing here.

The likely relevant thread documentation is at:

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadinformation

Memory priority helps to determine how long pages remain in the working set of a process before they are trimmed. A thread's memory priority determines the minimum priority of the physical pages that are added to the process working set by that thread. When the memory manager trims the working set, it trims lower priority pages before higher priority pages. This improves overall system performance because higher priority pages are less likely to be trimmed from the working set and then trigger a page fault when they are accessed again.

I would note that we only see this particular pattern on Windows, where a relatively small allocation fails despite a large pool of available virtual memory. On Fenix 32-bit ARM we only see OOMs when the available virtual memory is low relative to the allocation size. This suggests it is indeed related to the API behavior described above.

Depends on: 1703839
No longer blocks: gfx-triage
Crash Signature: [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped] → [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped] [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::Map ]
Crash Signature: [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped] [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::Map ] → [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::EnsureMapped] [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper::Map ] [@ OOM | large | NS_ABORT_OOM | mozilla::gfx::SourceSurfaceSharedDataWrapper:…

We'll have a better idea once we hit 89 beta, but it is looking like bug 1703839 may have mostly resolved this. I don't expect the crash rate to be zero, but deferring the OOM crash to the latest possible point, as well as fixing the bug where the first map wasn't added to the unmapping tracker (it would eventually get added once WR used the surface in the next frame), got us to a good spot.

Blocks: wr-oom

At this point, the volume is low for an OOM, and all of the crashes are genuine cases of little available virtual memory. The patches let us unmap as much as possible from the image cache to reclaim some in a timely fashion, so I think we can decrease the priority to backlog.

Severity: S2 → S3
Priority: P2 → P3

(In reply to Andrew Osmond [:aosmond] from comment #12)

At this point, the volume is low for an OOM, and all of the crashes are genuine cases of little available virtual memory. The patches let us unmap as much as possible from the image cache to reclaim some in a timely fashion, so I think we can decrease the priority to backlog.

Andrew, this is a medium-volume crasher on 89 beta and we had no crashes in 88 beta; it seems to me that the crash situation is worse than a month ago.

Flags: needinfo?(aosmond)

Pascal, the signature is new in 89 because I added the explicit crash in 89. These crashes used to show up under other signatures, such as bug 1645841. Likely they also showed up under other signatures I haven't identified.

There is really nothing more I can do aside from bug 1704792, which changes how we rasterize SVG images. It may help a bit. At some point, we have to live with OOMs because the user has too little memory and too much competing for what they have. That is what the existing crash reports show. If they have < 1 GB of virtual memory available after we have unmapped every single buffer we can in WR, and they want to map in 200+ MB, there is little to be done.

Flags: needinfo?(aosmond)

Reviewing:

https://sql.telemetry.mozilla.org/queries/78915?p_channel=beta#196121
https://sql.telemetry.mozilla.org/queries/78743?p_channel=beta#195731

We can see that, yes, OOMs have gone up (probably because more crashes are classified as such due to my work), but the overall crash rate has not increased; indeed, it looks like it has declined. It's too early to say for sure (but I'm also clearly not causing all the crashes in 89 ;)).
