[Wayland][Webrender] Webrender fails on popup opening because wl_surface is not ready to draw
Categories
(Core :: Widget: Gtk, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox86 | --- | fixed |
People
(Reporter: rmader, Assigned: rmader)
References
(Blocks 1 open bug)
Details
Crash Data
Attachments
(3 files, 1 obsolete file)
STR:
- open Firefox with WR/Wayland enabled
- click the "Firefox Account" button very fast for some 10-20 times
Observed results:
Webrender crashes, fails back to basic with the message:
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) [GFX1-]: Failed to create EGLSurface
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=6.20896) |[1][GFX1-]: Failed to create EGLSurface (t=6.20898) |[2][GFX1-]: Compositors might be mixed (5,1) (t=6.21808) [GFX1-]: Compositors might be mixed (5,1)
I could track this down to https://searchfox.org/mozilla-central/source/widget/gtk/MozContainerWayland.cpp#377 if (!wl_container->ready_to_draw)
(via https://searchfox.org/mozilla-central/source/widget/gtk/nsWindow.cpp#2457-2458).
So this looks like a race between frame callbacks arriving soon enough and Webrender starting.
Assignee | ||
Comment 1•5 years ago
|
||
Martin, maybe you have an idea how to solve this best? Is there a way to delay WR?
Comment 2•5 years ago
|
||
I think we solved that somehow already:
https://searchfox.org/mozilla-central/rev/1843375acbbca68127713e402be222350ac99301/widget/gtk/nsWindow.cpp#4575
https://searchfox.org/mozilla-central/rev/1843375acbbca68127713e402be222350ac99301/widget/gtk/nsWindow.cpp#5076
Is the initial callback set?
Assignee | ||
Comment 3•5 years ago
•
|
||
Thanks! So what happens is that moz_container_wayland_unmap_internal
is called before the egl surface is even created. Thus ready_to_draw
is false
again - and we end up crashing WR. We maybe need some kind of softfail in RenderCompositorEGL::Resume()
for this case I guess :/
P.S.: and we should probably make sure moz_container_wayland_surface_lock
and moz_container_wayland_get_egl_window
return nullptr
instead of trying to create a new surface in moz_container_wayland_get_surface_locked
after unmapping?
Assignee | ||
Comment 4•5 years ago
|
||
FTR: I do have a theory that this is actually what's causing occasional crashes with the following message:
WL: unknown object (3857049061), message get_subsurface(noo)
WL: error in client communication (pid 281076)
It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to Robert Mader [:rmader] from comment #4)
...
It'd kinda make sense when we try to create a subsurface in the unmapping process - which may be caused by some other crash, which then gets lost or so. Will try to come up with something.
FTR: Wouldn't be surprised if the other crash is the one from bug 1648698
Assignee | ||
Comment 6•5 years ago
|
||
I found a very fast reproducer for this:
- open an optimized build and go to https://webglsamples.org/aquarium/aquarium.html
- click on the hamburger menu button very fast (because of bug 1645695 it will close and reopen every time)
For me, after a few clicks FF usually gets into into a broken state - first getting into the 'window is null' state in RenderCompositorEGL::Resume()
because moz_container_wayland_get_surface_locked
returned nullptr
, thus running into https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#180, falling back to basic and then sometimes running into bug 1648698
So it looks like to me like there are several bugs which we have to address one by one to make things really stable.
Assignee | ||
Comment 7•5 years ago
|
||
The SendResumeAsync()
call in MaybeResumeCompositor()
races with
SendPause()
in PauseRemoteRenderer()
, occasionally resulting in
WR trying to resume on an unmapped widget.
This would then break in different ways - most importantly failing in
RenderCompositorEGL::Resume()
, triggering a fallback to the basic
compositor.
That again, while already bad, would occasionally crash because of
bug 1648698.
Use the syncronous method instead, preventing that and thus fixing
some popup related instabilities.
Updated•5 years ago
|
Assignee | ||
Comment 8•5 years ago
|
||
The patch above appears to fix the reproducer described in comment 6 \o/
Assignee | ||
Comment 9•5 years ago
|
||
Comment 10•5 years ago
|
||
When I tried to reproduce the problem, I saw the following error message. It was a different bug. See Bug 1685241.
- We don't have EGLSurface to draw into. Called too early?
Comment 11•5 years ago
|
||
I tried to reproduce the problem. But failed to reproduce it :(
From source code, it seems not like a race of SendResumeAsync() and SendPause(). The cause of problem seems like that SendPause() is called after moz_container_wayland_unmap_internal(). SendPause() is sync IPC. Then previous SendResumeAsync() message should be completed before SendPause() message handling. IPC message handlling does not change message sent order.
D100577 could address the problem. But it blocks main thread during resuming. It is not good.
Comment 12•5 years ago
|
||
:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?
Assignee | ||
Comment 13•5 years ago
•
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #12)
I tried to reproduce the problem. But failed to reproduce it :(
Hm, that's odd - I'll attach a video, it very reliably reproducible for me, on release as well as nightly. Also we got lots of reports that probably hit this.
(In reply to Sotaro Ikeda [:sotaro] from comment #12)
:rmader, is it possible to call SendPause() before moz_container_wayland_unmap_internal()?
AFAICS that's already the case - the culprit seems to be if a popup is opened and closed very fast, SendResumeAsync()
arrives after SendPause()
- i.e. we get:
- popup surface gets mapped
SendResumeAsync()
- Here the above call has not yet fired in the render thread - at least
RenderCompositorEGL::Resume()
was not yet called. Now the popup gets closed again by the following calls SendPause()
moz_container_wayland_unmap_internal()
- Only now the renderer thread calls
RenderCompositorEGL::Resume()
->wl_surface
is null and we fail in https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorEGL.cpp#181, resulting in a fallback to basic (printingwindow is null
andCompositors might be mixed (5,1)
)
So the problem appears to be that SendPause()
does not prevent SendResumeAsync()
to arrive late (while SendResume()
reliably gets called before SendPause()
).
Assignee | ||
Comment 14•5 years ago
|
||
Comment 15•5 years ago
|
||
I managed to reproduce the problem by adding more cpu tasks during Youtube playback with PIP. I am going to look into what is going on.
Comment 16•5 years ago
|
||
Comment 17•5 years ago
•
|
||
(In reply to Robert Mader [:rmader] from comment #13)
So the problem appears to be that
SendPause()
does not preventSendResumeAsync()
to arrive late (whileSendResume()
reliably gets called beforeSendPause()
).
I tested with Attachment 9195772 [details] [diff]. When problem happened, SendPause() was not called since last SendResumeAsync(). Then it seems like a problem of widget side. RenderCompositorEGL expects that SendPause() is called before moz_container_wayland_unmap_internal().
Log with Attachment 9195772 [details] [diff].
nsWindow::PauseRemoteRenderer() 5 SendPause() this 0x7f08c739f800
CompositorBridgeParent::PauseComposition() this 0x7f08c8cd0c00
RenderCompositorEGL::Pause() this 0x7f08c65a36a0
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
moz_container_wayland_map() widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
oz_container_wayland_map_event()_E widget 0x7f08d52aaf50
oz_container_wayland_map_event()_X widget 0x7f08d52aaf50
moz_container_wayland_frame_callback_handler()_E
moz_container_wayland_frame_callback_handler() 5
nsWindow::PauseRemoteRenderer() 8 callback MaybeResumeCompositor()
nsWindow::MaybeResumeCompositor() 5 SendResumeAsync() this 0x7f08c739f800
moz_container_wayland_frame_callback_handler()_X
CompositorBridgeParent::ResumeComposition()_E this 0x7f08c8cd0c00
moz_container_wayland_unmap() widget 0x7f08d52aaf50
moz_container_wayland_unmap_internal() container 0x7f08d52aaf50
RenderCompositorEGL::Resume() this 0x7f08c65a36a0
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) [GFX1-]: window is null
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: window is null (t=61.7385) |[1][GFX1-]: Failed to create EGLSurface (t=61.7385) [GFX1-]: Failed to create EGLSurface
RenderCompositorEGL::Resume() 5 failed to create EGLSurface *************** this 0x7f08c65a36a0
CompositorBridgeParent::ResumeComposition()_X this 0x7f08c8cd0c00
Updated•5 years ago
|
Comment 19•5 years ago
|
||
That signature is for bug 1645677
Comment 20•5 years ago
|
||
I looked into more. The problem happens since moz_container_wayland_has_egl_window(mContainer) is used to check the necessity of calling "remoteRenderer->SendPause()".
But the check does not work when SendResumeAsync() is not completed yet.
Comment 21•5 years ago
|
||
Updated•5 years ago
|
Comment 22•5 years ago
|
||
![]() |
||
Comment 23•5 years ago
|
||
bugherder |
Description
•