Closed Bug 1681107 Opened 5 years ago Closed 5 years ago

[Wayland][Webrender] Webrender fails on popup opening because wl_surface is not ready to draw

Tracking

()

Status:

RESOLVED FIXED

Milestone:

86 Branch

Tracking Flags:

Tracking

Status

firefox86

---

fixed

People

(Reporter: rmader, Assigned: rmader)

References

(Blocks 1 open bug)

Details

Crash Data

Attachments

(3 files, 1 obsolete file)

Bug 1681107 - Resume WR syncronously, r=stransky 5 years ago Robert Mader [:rmader] 48 bytes, text/x-phabricator-request		Details \| Review
WR crash 5 years ago Robert Mader [:rmader] 7.69 MB, video/webm		Details
temporal patch - Add logs around pause/resume 5 years ago Sotaro Ikeda [:sotaro] 15.40 KB, patch		Details \| Diff \| Splinter Review
Bug 1681107 - Fix race condition of calling CompositorBridgeChild::SendPause() 5 years ago Sotaro Ikeda [:sotaro] 48 bytes, text/x-phabricator-request		Details \| Review

Robert Mader [:rmader]

Assignee

Description

•

5 years ago

STR:

open Firefox with WR/Wayland enabled
click the "Firefox Account" button very fast for some 10-20 times
Observed results:
Webrender crashes, fails back to basic with the message:

Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) [GFX1-]: Failed to create EGLSurface
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) |[2][GFX1-]: Compositors might be mixed (5,1) (t=6.21808) [GFX1-]: Compositors might be mixed (5,1)

I could track this down to https://searchfox.org/mozilla-central/source/widget/gtk/MozContainerWayland.cpp#377 if (!wl_container->ready_to_draw) (via https://searchfox.org/mozilla-central/source/widget/gtk/nsWindow.cpp#2457-2458).

So this looks like a race between frame callbacks arriving soon enough and Webrender starting.

Robert Mader [:rmader]

Assignee

Comment 1

•

5 years ago

Martin, maybe you have an idea how to solve this best? Is there a way to delay WR?

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Comment 2

•

5 years ago

I think we solved that somehow already:

https://searchfox.org/mozilla-central/rev/1843375acbbca68127713e402be222350ac99301/widget/gtk/nsWindow.cpp#4575
https://searchfox.org/mozilla-central/rev/1843375acbbca68127713e402be222350ac99301/widget/gtk/nsWindow.cpp#5076

Is the initial callback set?

Flags: needinfo?(stransky)

Robert Mader [:rmader]

Assignee

Comment 3

•

5 years ago

•

Edited

Thanks! So what happens is that moz_container_wayland_unmap_internal is called before the egl surface is even created. Thus ready_to_draw is false again - and we end up crashing WR. We maybe need some kind of softfail in RenderCompositorEGL::Resume() for this case I guess :/

P.S.: and we should probably make sure moz_container_wayland_surface_lock and moz_container_wayland_get_egl_window return nullptr instead of trying to create a new surface in moz_container_wayland_get_surface_locked after unmapping?

Robert Mader [:rmader]

Assignee

Comment 4

•

5 years ago

FTR: I do have a theory that this is actually what's causing occasional crashes with the following message:

WL: unknown object (3857049061), message get_subsurface(noo)
WL: error in client communication (pid 281076)

It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.

Robert Mader [:rmader]

Assignee

Updated

•

5 years ago

Updated

•

5 years ago

Comment 5

•

5 years ago

(In reply to Robert Mader [:rmader] from comment #4)

...
It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.

FTR: Wouldn't be surprised if the other crash is the one from bug 1648698

Robert Mader [:rmader]

Assignee

Comment 6

•

5 years ago

I found a very fast reproducer for this:

open an optimized build and go to https://webglsamples.org/aquarium/aquarium.html
click on the hamburger menu button very fast (because of bug 1645695 it will close and reopen every time)

For me, after a few clicks FF usually gets into into a broken state - first getting into the 'window is null' state in RenderCompositorEGL::Resume() because moz_container_wayland_get_surface_locked returned nullptr, thus running into https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#180, falling back to basic and then sometimes running into bug 1648698

So it looks like to me like there are several bugs which we have to address one by one to make things really stable.

Robert Mader [:rmader]

Assignee

Comment 7

•

5 years ago

Attached file Bug 1681107 - Resume WR syncronously, r=stransky (obsolete) — Details

The SendResumeAsync() call in MaybeResumeCompositor() races with
SendPause() in PauseRemoteRenderer(), occasionally resulting in
WR trying to resume on an unmapped widget.

This would then break in different ways - most importantly failing in
RenderCompositorEGL::Resume(), triggering a fallback to the basic
compositor.

That again, while already bad, would occasionally crash because of
bug 1648698.

Use the syncronous method instead, preventing that and thus fixing
some popup related instabilities.

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → robert.mader

Status: NEW → ASSIGNED

Robert Mader [:rmader]

Assignee

Comment 8

•

5 years ago

The patch above appears to fix the reproducer described in comment 6 \o/

Robert Mader [:rmader]

Assignee

Comment 9

•

5 years ago

https://treeherder.mozilla.org/jobs?repo=try&revision=9f07ace949c2dfd903d1fa002d74e3016f42214a

Robert Mader [:rmader]

Assignee

Updated

•

5 years ago

Updated

•

5 years ago

Comment 10

•

5 years ago

When I tried to reproduce the problem, I saw the following error message. It was a different bug. See Bug 1685241.

We don't have EGLSurface to draw into. Called too early?

Sotaro Ikeda [:sotaro]

Comment 11

•

5 years ago

I tried to reproduce the problem. But failed to reproduce it :(

From source code, it seems not like a race of SendResumeAsync() and SendPause(). The cause of problem seems like that SendPause() is called after moz_container_wayland_unmap_internal(). SendPause() is sync IPC. Then previous SendResumeAsync() message should be completed before SendPause() message handling. IPC message handlling does not change message sent order.

D100577 could address the problem. But it blocks main thread during resuming. It is not good.

Sotaro Ikeda [:sotaro]

Comment 12

•

5 years ago

:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?

Flags: needinfo?(robert.mader)

Robert Mader [:rmader]

Assignee

Comment 13

•

5 years ago

•

Edited

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

I tried to reproduce the problem. But failed to reproduce it :(

Hm, that's odd - I'll attach a video, it very reliably reproducible for me, on release as well as nightly. Also we got lots of reports that probably hit this.

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?

AFAICS that's already the case - the culprit seems to be if a popup is opened and closed very fast, SendResumeAsync() arrives after SendPause() - i.e. we get:

popup surface gets mapped
SendResumeAsync()
Here the above call has not yet fired in the render thread - at least RenderCompositorEGL::Resume() was not yet called. Now the popup gets closed again by the following calls
SendPause()
moz_container_wayland_unmap_internal()
Only now the renderer thread calls RenderCompositorEGL::Resume() -> wl_surface is null and we fail in https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#181, resulting in a fallback to basic (printing window is null and Compositors might be mixed (5,1))

So the problem appears to be that SendPause() does not prevent SendResumeAsync() to arrive late (while SendResume() reliably gets called before SendPause()).

Flags: needinfo?(robert.mader)

Robert Mader [:rmader]

Assignee

Comment 14

•

5 years ago

Attached video WR crash — Details

Sotaro Ikeda [:sotaro]

Comment 15

•

5 years ago

I managed to reproduce the problem by adding more cpu tasks during Youtube playback with PIP. I am going to look into what is going on.

Sotaro Ikeda [:sotaro]

Comment 16

•

5 years ago

Attached patch temporal patch - Add logs around pause/resume — Details — Splinter Review

Sotaro Ikeda [:sotaro]

Comment 17

•

5 years ago

•

Edited

(In reply to Robert Mader [:rmader] from comment #13)

So the problem appears to be that SendPause() does not prevent SendResumeAsync() to arrive late (while SendResume() reliably gets called before SendPause()).

I tested with Attachment 9195772 [details] [diff]. When problem happened, SendPause() was not called since last SendResumeAsync(). Then it seems like a problem of widget side. RenderCompositorEGL expects that SendPause() is called before moz_container_wayland_unmap_internal().

Log with Attachment 9195772 [details] [diff].

nsWindow::PauseRemoteRenderer() 5 SendPause() this 0x7f08c739f800
CompositorBridgeParent::PauseComposition() this 0x7f08c8cd0c00
RenderCompositorEGL::Pause() this 0x7f08c65a36a0
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
moz_container_wayland_map() widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
moz_container_wayland_frame_callback_handler()_E
moz_container_wayland_frame_callback_handler() 5
nsWindow::PauseRemoteRenderer() 8 callback MaybeResumeCompositor()
nsWindow::MaybeResumeCompositor() 5 SendResumeAsync() this 0x7f08c739f800
moz_container_wayland_frame_callback_handler()_X
CompositorBridgeParent::ResumeComposition()_E this 0x7f08c8cd0c00
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
RenderCompositorEGL::Resume() this 0x7f08c65a36a0
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) |[1][GFX1-]: Failed to create EGLSurface (t=61.7385) [GFX1-]: Failed to create EGLSurface
RenderCompositorEGL::Resume() 5 failed to create EGLSurface *************** this 0x7f08c65a36a0
CompositorBridgeParent::ResumeComposition()_X this 0x7f08c8cd0c00

BugBot [:suhaib / :marco/ :calixte]

Updated

•

5 years ago

Crash Signature: [@ libEGL_mesa.so.0@0x23831]

dano.kutka

Comment 19

•

5 years ago

That signature is for bug 1645677

Sotaro Ikeda [:sotaro]

Comment 20

•

5 years ago

I looked into more. The problem happens since moz_container_wayland_has_egl_window(mContainer) is used to check the necessity of calling "remoteRenderer->SendPause()".

https://searchfox.org/mozilla-central/rev/07342ce09126c513540c1c343476e026cfa907bf/widget/gtk/nsWindow.cpp#5093

But the check does not work when SendResumeAsync() is not completed yet.

Sotaro Ikeda [:sotaro]

Comment 21

•

5 years ago

Attached file Bug 1681107 - Fix race condition of calling CompositorBridgeChild::SendPause() — Details

Phabricator Automation

Updated

•

5 years ago

Attachment #9195021 - Attachment is obsolete: true

Pulsebot

Comment 22

•

5 years ago

Pushed by btara@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c989e16ae8d0 Fix race condition of calling CompositorBridgeChild::SendPause() r=rmader,stransky

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 23

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c989e16ae8d0

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox86: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 86 Branch

You need to log in before you can comment on or make changes to this bug.