Closed Bug 1464823 Opened 2 years ago Closed 2 years ago

[Wayland] GLContextEGL::SwapBuffers() races with main thread wl_display_* calls.

Categories

(Core :: Widget: Gtk, defect)

60 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox62 --- fixed

People

(Reporter: moz, Assigned: stransky)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Build ID: 20100101

Steps to reproduce:

1. Open firefox-wayland (works from a fresh profile)
2. open about:config
3. change layers.acceleration.force-enabled to true
4. restart firefox-wayland
5. open anything which has a popover, e.g. the menu shown from sandwich button or the library menu or the downloads symbol. It is also being triggered when bookmarking a page or opening the "page actions" ("…") symbol or the lock symbol in url bar.


Actual results:

GUI freezes completely. No interaction possible. No error messages or crashes either. Firefox can be terminated cleanly by sending it a SIGTERM


Expected results:

No freeze. Just behave as without acceleration.


Affected version:

Firefox build on Fedora 28, firefox-60.0.1-3.fc28.x86_64


Additional info:
This does not affect the main menu (which is visible after pressing the Alt key) nor the context menu at any position I checked including the URL bar.
Blocks: wayland
Summary: [Wayland] with layers.acceleration.force-enabled=true, any subsurface makes firefox freeze → [Wayland] with layers.acceleration.force-enabled=true, some subsurfaces make firefox freeze
This is the bug I mentioned at bug1438144comment13.
Blocks: 1438144
Component: Untriaged → Widget: Gtk
Product: Firefox → Core
There's the backtrace from Compositor thread which causes the freeze. It's because mozilla::gl::GLContextEGL::SwapBuffers() calls wl_display_dispatch_queue() on default display/queue as well as the main thread which causes a race condition here.

#0  0x00007ffff6da6589 in poll () at /lib64/libc.so.6
#1  0x00007ffff28776a9 in poll (__timeout=-1, __nfds=1, __fds=0x7fffc6f65ce0) at /usr/include/bits/poll2.h:46
#2  0x00007ffff28776a9 in wl_display_poll (display=display@entry=0x7ffff6a61440, events=events@entry=1)
    at src/wayland-client.c:1717
#3  0x00007ffff287913c in wl_display_dispatch_queue (display=0x7ffff6a61440, queue=0x7fffb49ec380)
    at src/wayland-client.c:1790
#4  0x00007fffc19650bb in  () at /lib64/libEGL_mesa.so.0
#5  0x00007fffc19560ea in eglSwapBuffers () at /lib64/libEGL_mesa.so.0
#6  0x00007fffe18010bd in mozilla::gl::GLLibraryEGL::fSwapBuffers(void*, void*) const (this=0x7fffeb8c5780 <mozilla::gl::sEGLLibrary>, dpy=0x7fffc2db2800, surface=0x7fffade57000) at /home/komat/tmp676-trunk-gtk3/src2/gfx/gl/GLLibraryEGL.h:229
#7  0x00007fffe1811ad8 in mozilla::gl::GLContextEGL::SwapBuffers() (this=0x7fffacc87000)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/gl/GLContextProviderEGL.cpp:501
#8  0x00007fffe18e4dc2 in mozilla::layers::CompositorOGL::EndFrame() (this=0x7fffade6d400)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/opengl/CompositorOGL.cpp:1771
#9  0x00007fffe1a5c0b8 in mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&) (
    this=0x7fffaccf5af0, aInvalidRegion=..., aOpaqueRegion=...)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:995
#10 0x00007fffe1a59d8a in mozilla::layers::LayerManagerComposite::UpdateAndRender() (this=0x7fffaccf5af0)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:534
#11 0x00007fffe1a59990 in mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags) (this=0x7fffaccf5af0, aTimeStamp=..., aFlags=mozilla::layers::LayerManager::END_DEFAULT)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:464
#12 0x00007fffe1a82127 in mozilla::layers::CompositorBridgeParent::CompositeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade56800, aTarget=0x0, aRect=0x0)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:1062
#13 0x00007fffe1a8a7c1 in mozilla::layers::CompositorVsyncScheduler::ForceComposeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade6d280, aTarget=0x0, aRect=0x0)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorVsyncScheduler.cpp:279
#14 0x00007fffe1a8240b in mozilla::layers::CompositorBridgeParent::ForceComposeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade56800, aTarget=0x0, aRect=0x0)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:1135
#15 0x00007fffe1a80d93 in mozilla::layers::CompositorBridgeParent::RecvFlushRendering() (this=0x7fffade56800)
    at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:604
#16 0x00007fffe124cd88 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) (this=0x7fffade56800, msg__=..., reply__=@0x7fffc6f66968: 0x0)
    at /home/komat/tmp676-trunk-gtk3/src2/objdir/ipc/ipdl/PCompositorBridgeParent.cpp:1453
#17 0x00007fffe1251365 in mozilla::layers::PCompositorManagerParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) (this=0x7ffff6adae30, msg__=..., reply__=@0x7fffc6f66968: 0x0)
    at /home/komat/tmp676-trunk-gtk3/src2/objdir/ipc/ipdl/PCompositorManagerParent.cpp:271
#18 0x00007fffe0dcc9b2 in mozilla::ipc::MessageChannel::DispatchSyncMessage(IPC::Message const&, IPC::Message*&) (this=0x7fffc3eb1100, aMsg=..., aReply=@0x7fffc6f66968: 0x0) at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:2104
#19 0x00007fffe0dcc652 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) (this=0x7fffc3eb1100, aMsg=...)
    at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:2062
#20 0x00007fffe0dcbca7 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) (this=0x7fffc3eb1100, aTask=...) at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:1912
Assignee: nobody → stransky
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: [Wayland] with layers.acceleration.force-enabled=true, some subsurfaces make firefox freeze → [Wayland] GLContextEGL::SwapBuffers() races with main thread wl_display_* calls.
Attached patch wip (obsolete) — Splinter Review
Wip workaround for the issue.
wl_display_dispatch_queue does _not_ dispatch the default queue. The queue dispatched by Mesa internally is a separate internal queue created only by Mesa and nothing else. It is safe against race conditions.

If Mozilla itself works with the Wayland connection's socket, it needs to use the wl_display_prepare_read() API in order to ensure thread safety.

There was one Mesa bug where swrast would always hang like this, which was due to a typo (dispatching the wrong queue) rather than any fundamental lack of thread safety. That is fixed in later patch release versions for 17.3 (I think 17.3.8 onwards), 18.0 (18.0.2 onwards); 18.1 and 18.2 are fine.
(In reply to Daniel Stone from comment #6)
> wl_display_dispatch_queue does _not_ dispatch the default queue. The queue
> dispatched by Mesa internally is a separate internal queue created only by
> Mesa and nothing else. It is safe against race conditions.
> 
> If Mozilla itself works with the Wayland connection's socket, it needs to
> use the wl_display_prepare_read() API in order to ensure thread safety.
> 
> There was one Mesa bug where swrast would always hang like this, which was
> due to a typo (dispatching the wrong queue) rather than any fundamental lack
> of thread safety. That is fixed in later patch release versions for 17.3 (I
> think 17.3.8 onwards), 18.0 (18.0.2 onwards); 18.1 and 18.2 are fine.

Sure, I'm not an expert there. I uploaded the requested logs at https://bugs.freedesktop.org/show_bug.cgi?id=106753 for further investigation.
Copying my comment from the FDO bug report:

So what it looks like is that:

1. Firefox creates two surfaces, wl_surface@39 and wl_surface@40
2. Firefox turn wl_surface@40 into a desynchronous subsurface on top of wl_surface@39
3. Firefox turn wl_surface@39 into a xdg_toplevel and commits the initial (empty) state
4. The compositor sends the configure event to the xdg_toplevel
5. Firefox replies immediately with ack_configure() without attaching a buffer

This will probably consume any compositor: you just mapped a window, asked to configure it (draw the first frame), but what happened is that no content was posted even though the configure event was acknowledged

6. Firefox asks for a frame callback, attaches a new buffer to the subsurface and commits it

When Firefox tries to do eglSwapBuffers() again

When it does this, it waits for the frame callback, but that will never happen because the toplevel was never mapped, meaning the subsurface never has a chance to be displayed, meaning the frame callback will never be invoked.

The xdg_surface interface states the following:

      For an xdg_surface to be mapped by the compositor, the following
      conditions must be met:
      (1) the client has assigned an xdg_surface-based role to the surface
      (2) the client has set and committed the xdg_surface state and the
	  role-dependent state to the surface
      (3) the client has committed a buffer to the surface

The third condition here was never met.

This raises the question, what is the intended content of the subsurface and what is the intended content of the toplevel?
Comment on attachment 8982142 [details]
Bug 1464823 - [Wayland] Don't paint until we have a visible wl_surface,

https://reviewboard.mozilla.org/r/248164/#review254464
Attachment #8982142 - Flags: review?(lsalzman) → review+
Just to point out, there doesn't seem to be any race condition in Mesa, so the motivation in the patch is incorrect. Mesa can handle multiple threads with a single wl_display just fine. What happens is that since eglSwapInterval(1) is used, the swap-buffers call waits for the VSync-callback of the old buffer (buffer was painted), which won't happen because it has not been mapped yet by Firefox.
(In reply to Jonas Ådahl from comment #10)
> Just to point out, there doesn't seem to be any race condition in Mesa, so
> the motivation in the patch is incorrect. Mesa can handle multiple threads
> with a single wl_display just fine. What happens is that since
> eglSwapInterval(1) is used, the swap-buffers call waits for the
> VSync-callback of the old buffer (buffer was painted), which won't happen
> because it has not been mapped yet by Firefox.

Jonas, Thanks for the info, I need to think it over.
(In reply to Jonas Ådahl from comment #10)
> Just to point out, there doesn't seem to be any race condition in Mesa, so
> the motivation in the patch is incorrect. Mesa can handle multiple threads
> with a single wl_display just fine. What happens is that since
> eglSwapInterval(1) is used, the swap-buffers call waits for the
> VSync-callback of the old buffer (buffer was painted), which won't happen
> because it has not been mapped yet by Firefox.

Yes, the swap-buffers/VSync-callback is called on subsurface owned/managed by Firefox and the parent toplevel surface is owned by Gtk. 

Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.

I also confirmed it, and can avoid it by the attached patch.
But the patch is still too ugly.
How can I know properly whether the surface is committed or not from outside of GDK?
(In reply to Takuro Ashie from comment #13)
> Created attachment 8983659 [details] [diff] [review]
> avoid-freeze-wayland-with-egl.patch
> 
> > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> 
> I also confirmed it, and can avoid it by the attached patch.
> But the patch is still too ugly.

I just updated the patch, Thanks!
Attachment #8982023 - Attachment is obsolete: true
(In reply to Martin Stránský [:stransky] from comment #15)
> (In reply to Takuro Ashie from comment #13)
> > Created attachment 8983659 [details] [diff] [review]
> > avoid-freeze-wayland-with-egl.patch
> > 
> > > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> > 
> > I also confirmed it, and can avoid it by the attached patch.
> > But the patch is still too ugly.
> 
> I just updated the patch, Thanks!

I got SEGV with your patch on my devices(RZ/G1M, R-Car M3). :'(
But SEGV detail is concealed because SEGV occurred within PowerVR driver.
(In reply to Hiroshi Hatake [:cosmo0920] from comment #16)
> (In reply to Martin Stránský [:stransky] from comment #15)
> > (In reply to Takuro Ashie from comment #13)
> > > Created attachment 8983659 [details] [diff] [review]
> > > avoid-freeze-wayland-with-egl.patch
> > > 
> > > > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> > > 
> > > I also confirmed it, and can avoid it by the attached patch.
> > > But the patch is still too ugly.
> > 
> > I just updated the patch, Thanks!
> 
> I got SEGV with your patch on my devices(RZ/G1M, R-Car M3). :'(
> But SEGV detail is concealed because SEGV occurred within PowerVR driver.

Can you attach at least some backtrace of the crash?
Flags: needinfo?(cosmo0920.oucc)
```
(gdb) bt
#0  0xa473c9b8 in ?? () from /usr/lib/libpvrPVR2D_WAYLANDWSEGL.so
#1  0x0001f65c in calloc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
Flags: needinfo?(cosmo0920.oucc)
(In reply to Hiroshi Hatake [:cosmo0920] from comment #18)
> ```
> (gdb) bt
> #0  0xa473c9b8 in ?? () from /usr/lib/libpvrPVR2D_WAYLANDWSEGL.so
> #1  0x0001f65c in calloc ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Please try to disable e10s (set browser.tabs.remote.autostart to false) and attach backtrace of all threads.
Flags: needinfo?(cosmo0920.oucc)
Thread 37 backtrace looks mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&) () infinity printed.

Ashie's patch can work on R-Car&RZ/G1.
Flags: needinfo?(cosmo0920.oucc)
(In reply to Hiroshi Hatake [:cosmo0920] from comment #21)
> Thread 37 backtrace looks
> mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&,
> mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&) ()
> infinity printed.

That bt does not look very useful, looks like some recursion happens on stack which is exhausted. Maybe some code fails when moz_container_get_wl_surface() / moz_container_get_wl_egl_window() return nullptr?
I obtained EGL API calling trace with apitrace.

API calling trace dump says:

---
// process.name = "/usr/lib/firefox/firefox"
0 eglGetDisplay(display_id = 0xb6a21120) = 0x1
1 eglInitialize(dpy = 0x1, major = NULL, minor = NULL) = EGL_TRUE
9 eglChooseConfig(dpy = 0x1, attrib_list = {EGL_SURFACE_TYPE, EGL_WINDOW_BIT, E
GL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT, EGL_RED_SIZE, 8, EGL_GREEN_SIZE, 8, EGL
_BLUE_SIZE, 8, EGL_ALPHA_SIZE, 8, EGL_NONE}, configs = {0x2, 0x1, 0x3}, config_
size = 64, num_config = &3) = EGL_TRUE
14 eglCreateWindowSurface(dpy = 0x1, config = 0x2, win = NULL, attrib_list = {}
) // incomplete
---
Thanks, looks like win = NULL is the problem here, as moz_container_get_wl_egl_window() can return nullptr as an invalid surface. I'll investigate that.
Comment on attachment 8982142 [details]
Bug 1464823 - [Wayland] Don't paint until we have a visible wl_surface,

https://reviewboard.mozilla.org/r/248164/#review256980
Attachment #8982142 - Flags: review?(jhorak) → review+
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/f13d1acd73da
[Wayland] Don't paint until we have a visible wl_surface, r=jhorak,lsalzman
https://hg.mozilla.org/mozilla-central/rev/f13d1acd73da
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
(In reply to Hiroshi Hatake [:cosmo0920] from comment #23)
> I obtained EGL API calling trace with apitrace.
> 
> API calling trace dump says:
> 
> ---
> // process.name = "/usr/lib/firefox/firefox"
> 0 eglGetDisplay(display_id = 0xb6a21120) = 0x1
> 1 eglInitialize(dpy = 0x1, major = NULL, minor = NULL) = EGL_TRUE
> 9 eglChooseConfig(dpy = 0x1, attrib_list = {EGL_SURFACE_TYPE,
> EGL_WINDOW_BIT, E
> GL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT, EGL_RED_SIZE, 8, EGL_GREEN_SIZE, 8,
> EGL
> _BLUE_SIZE, 8, EGL_ALPHA_SIZE, 8, EGL_NONE}, configs = {0x2, 0x1, 0x3},
> config_
> size = 64, num_config = &3) = EGL_TRUE
> 14 eglCreateWindowSurface(dpy = 0x1, config = 0x2, win = NULL, attrib_list =
> {}
> ) // incomplete
> ---

Please file a follow up bug for that - I'll look at it. Thanks.
Flags: needinfo?(cosmo0920.oucc)
Depends on: 1468907
I've filed a followup bug. Thanks.
Flags: needinfo?(cosmo0920.oucc)
Attachment #8983659 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.