Closed Bug 1681107 Opened 5 years ago Closed 5 years ago

[Wayland][Webrender] Webrender fails on popup opening because wl_surface is not ready to draw

Categories

(Core :: Widget: Gtk, defect)

defect

Tracking

()

RESOLVED FIXED
86 Branch
Tracking Status
firefox86 --- fixed

People

(Reporter: rmader, Assigned: rmader)

References

(Blocks 1 open bug)

Details

Crash Data

Attachments

(3 files, 1 obsolete file)

STR:

  • open Firefox with WR/Wayland enabled
  • click the "Firefox Account" button very fast for some 10-20 times
    Observed results:
    Webrender crashes, fails back to basic with the message:
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) [GFX1-]: Failed to create EGLSurface
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) |[2][GFX1-]: Compositors might be mixed (5,1) (t=6.21808) [GFX1-]: Compositors might be mixed (5,1)

I could track this down to https://searchfox.org/mozilla-central/source/widget/gtk/MozContainerWayland.cpp#377 if (!wl_container->ready_to_draw) (via https://searchfox.org/mozilla-central/source/widget/gtk/nsWindow.cpp#2457-2458).

So this looks like a race between frame callbacks arriving soon enough and Webrender starting.

Martin, maybe you have an idea how to solve this best? Is there a way to delay WR?

Flags: needinfo?(stransky)

Thanks! So what happens is that moz_container_wayland_unmap_internal is called before the egl surface is even created. Thus ready_to_draw is false again - and we end up crashing WR. We maybe need some kind of softfail in RenderCompositorEGL::Resume() for this case I guess :/

P.S.: and we should probably make sure moz_container_wayland_surface_lock and moz_container_wayland_get_egl_window return nullptr instead of trying to create a new surface in moz_container_wayland_get_surface_locked after unmapping?

FTR: I do have a theory that this is actually what's causing occasional crashes with the following message:

WL: unknown object (3857049061), message get_subsurface(noo)
WL: error in client communication (pid 281076)

It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.

See Also: → 1648698
See Also: → 1680961

(In reply to Robert Mader [:rmader] from comment #4)

...
It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.

FTR: Wouldn't be surprised if the other crash is the one from bug 1648698

I found a very fast reproducer for this:

  1. open an optimized build and go to https://webglsamples.org/aquarium/aquarium.html
  2. click on the hamburger menu button very fast (because of bug 1645695 it will close and reopen every time)

For me, after a few clicks FF usually gets into into a broken state - first getting into the 'window is null' state in RenderCompositorEGL::Resume() because moz_container_wayland_get_surface_locked returned nullptr, thus running into https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#180, falling back to basic and then sometimes running into bug 1648698

So it looks like to me like there are several bugs which we have to address one by one to make things really stable.

The SendResumeAsync() call in MaybeResumeCompositor() races with
SendPause() in PauseRemoteRenderer(), occasionally resulting in
WR trying to resume on an unmapped widget.

This would then break in different ways - most importantly failing in
RenderCompositorEGL::Resume(), triggering a fallback to the basic
compositor.

That again, while already bad, would occasionally crash because of
bug 1648698.

Use the syncronous method instead, preventing that and thus fixing
some popup related instabilities.

Assignee: nobody → robert.mader
Status: NEW → ASSIGNED

The patch above appears to fix the reproducer described in comment 6 \o/

See Also: → 1655282
See Also: → 1645677

When I tried to reproduce the problem, I saw the following error message. It was a different bug. See Bug 1685241.

  • We don't have EGLSurface to draw into. Called too early?

I tried to reproduce the problem. But failed to reproduce it :(

From source code, it seems not like a race of SendResumeAsync() and SendPause(). The cause of problem seems like that SendPause() is called after moz_container_wayland_unmap_internal(). SendPause() is sync IPC. Then previous SendResumeAsync() message should be completed before SendPause() message handling. IPC message handlling does not change message sent order.

D100577 could address the problem. But it blocks main thread during resuming. It is not good.

:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?

Flags: needinfo?(robert.mader)

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

I tried to reproduce the problem. But failed to reproduce it :(

Hm, that's odd - I'll attach a video, it very reliably reproducible for me, on release as well as nightly. Also we got lots of reports that probably hit this.

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?

AFAICS that's already the case - the culprit seems to be if a popup is opened and closed very fast, SendResumeAsync() arrives after SendPause() - i.e. we get:

  1. popup surface gets mapped
  2. SendResumeAsync()
  3. Here the above call has not yet fired in the render thread - at least RenderCompositorEGL::Resume() was not yet called. Now the popup gets closed again by the following calls
  4. SendPause()
  5. moz_container_wayland_unmap_internal()
  6. Only now the renderer thread calls RenderCompositorEGL::Resume() -> wl_surface is null and we fail in https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#181, resulting in a fallback to basic (printing window is null and Compositors might be mixed (5,1))

So the problem appears to be that SendPause() does not prevent SendResumeAsync() to arrive late (while SendResume() reliably gets called before SendPause()).

Flags: needinfo?(robert.mader)
Attached video WR crash

I managed to reproduce the problem by adding more cpu tasks during Youtube playback with PIP. I am going to look into what is going on.

(In reply to Robert Mader [:rmader] from comment #13)

So the problem appears to be that SendPause() does not prevent SendResumeAsync() to arrive late (while SendResume() reliably gets called before SendPause()).

I tested with Attachment 9195772 [details] [diff]. When problem happened, SendPause() was not called since last SendResumeAsync(). Then it seems like a problem of widget side. RenderCompositorEGL expects that SendPause() is called before moz_container_wayland_unmap_internal().


Log with Attachment 9195772 [details] [diff].

nsWindow::PauseRemoteRenderer() 5 SendPause() this 0x7f08c739f800
CompositorBridgeParent::PauseComposition() this 0x7f08c8cd0c00
RenderCompositorEGL::Pause() this 0x7f08c65a36a0
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
moz_container_wayland_map() widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
moz_container_wayland_frame_callback_handler()_E
moz_container_wayland_frame_callback_handler() 5
nsWindow::PauseRemoteRenderer() 8 callback MaybeResumeCompositor()
nsWindow::MaybeResumeCompositor() 5 SendResumeAsync() this 0x7f08c739f800
moz_container_wayland_frame_callback_handler()_X
CompositorBridgeParent::ResumeComposition()_E this 0x7f08c8cd0c00
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
RenderCompositorEGL::Resume() this 0x7f08c65a36a0
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) |[1][GFX1-]: Failed to create EGLSurface (t=61.7385) [GFX1-]: Failed to create EGLSurface
RenderCompositorEGL::Resume() 5 failed to create EGLSurface *************** this 0x7f08c65a36a0
CompositorBridgeParent::ResumeComposition()_X this 0x7f08c8cd0c00

Crash Signature: [@ libEGL_mesa.so.0@0x23831]

That signature is for bug 1645677

I looked into more. The problem happens since moz_container_wayland_has_egl_window(mContainer) is used to check the necessity of calling "remoteRenderer->SendPause()".

But the check does not work when SendResumeAsync() is not completed yet.

Attachment #9195021 - Attachment is obsolete: true
Pushed by btara@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c989e16ae8d0 Fix race condition of calling CompositorBridgeChild::SendPause() r=rmader,stransky
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: