Closed
Bug 1464823
Opened 6 years ago
Closed 6 years ago
[Wayland] GLContextEGL::SwapBuffers() races with main thread wl_display_* calls.
Categories
(Core :: Widget: Gtk, defect)
Tracking
()
RESOLVED
FIXED
mozilla62
Tracking | Status | |
---|---|---|
firefox62 | --- | fixed |
People
(Reporter: moz, Assigned: stransky)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(2 files, 2 obsolete files)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Build ID: 20100101
Steps to reproduce:
1. Open firefox-wayland (works from a fresh profile)
2. open about:config
3. change layers.acceleration.force-enabled to true
4. restart firefox-wayland
5. open anything which has a popover, e.g. the menu shown from sandwich button or the library menu or the downloads symbol. It is also being triggered when bookmarking a page or opening the "page actions" ("…") symbol or the lock symbol in url bar.
Actual results:
GUI freezes completely. No interaction possible. No error messages or crashes either. Firefox can be terminated cleanly by sending it a SIGTERM
Expected results:
No freeze. Just behave as without acceleration.
Affected version:
Firefox build on Fedora 28, firefox-60.0.1-3.fc28.x86_64
Additional info:
This does not affect the main menu (which is visible after pressing the Alt key) nor the context menu at any position I checked including the URL bar.
Reporter | ||
Updated•6 years ago
|
Blocks: wayland
Summary: [Wayland] with layers.acceleration.force-enabled=true, any subsurface makes firefox freeze → [Wayland] with layers.acceleration.force-enabled=true, some subsurfaces make firefox freeze
Assignee | ||
Updated•6 years ago
|
Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Assignee | ||
Comment 2•6 years ago
|
||
There's the backtrace from Compositor thread which causes the freeze. It's because mozilla::gl::GLContextEGL::SwapBuffers() calls wl_display_dispatch_queue() on default display/queue as well as the main thread which causes a race condition here.
#0 0x00007ffff6da6589 in poll () at /lib64/libc.so.6
#1 0x00007ffff28776a9 in poll (__timeout=-1, __nfds=1, __fds=0x7fffc6f65ce0) at /usr/include/bits/poll2.h:46
#2 0x00007ffff28776a9 in wl_display_poll (display=display@entry=0x7ffff6a61440, events=events@entry=1)
at src/wayland-client.c:1717
#3 0x00007ffff287913c in wl_display_dispatch_queue (display=0x7ffff6a61440, queue=0x7fffb49ec380)
at src/wayland-client.c:1790
#4 0x00007fffc19650bb in () at /lib64/libEGL_mesa.so.0
#5 0x00007fffc19560ea in eglSwapBuffers () at /lib64/libEGL_mesa.so.0
#6 0x00007fffe18010bd in mozilla::gl::GLLibraryEGL::fSwapBuffers(void*, void*) const (this=0x7fffeb8c5780 <mozilla::gl::sEGLLibrary>, dpy=0x7fffc2db2800, surface=0x7fffade57000) at /home/komat/tmp676-trunk-gtk3/src2/gfx/gl/GLLibraryEGL.h:229
#7 0x00007fffe1811ad8 in mozilla::gl::GLContextEGL::SwapBuffers() (this=0x7fffacc87000)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/gl/GLContextProviderEGL.cpp:501
#8 0x00007fffe18e4dc2 in mozilla::layers::CompositorOGL::EndFrame() (this=0x7fffade6d400)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/opengl/CompositorOGL.cpp:1771
#9 0x00007fffe1a5c0b8 in mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&) (
this=0x7fffaccf5af0, aInvalidRegion=..., aOpaqueRegion=...)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:995
#10 0x00007fffe1a59d8a in mozilla::layers::LayerManagerComposite::UpdateAndRender() (this=0x7fffaccf5af0)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:534
#11 0x00007fffe1a59990 in mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags) (this=0x7fffaccf5af0, aTimeStamp=..., aFlags=mozilla::layers::LayerManager::END_DEFAULT)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/composite/LayerManagerComposite.cpp:464
#12 0x00007fffe1a82127 in mozilla::layers::CompositorBridgeParent::CompositeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade56800, aTarget=0x0, aRect=0x0)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:1062
#13 0x00007fffe1a8a7c1 in mozilla::layers::CompositorVsyncScheduler::ForceComposeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade6d280, aTarget=0x0, aRect=0x0)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorVsyncScheduler.cpp:279
#14 0x00007fffe1a8240b in mozilla::layers::CompositorBridgeParent::ForceComposeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (this=0x7fffade56800, aTarget=0x0, aRect=0x0)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:1135
#15 0x00007fffe1a80d93 in mozilla::layers::CompositorBridgeParent::RecvFlushRendering() (this=0x7fffade56800)
at /home/komat/tmp676-trunk-gtk3/src2/gfx/layers/ipc/CompositorBridgeParent.cpp:604
#16 0x00007fffe124cd88 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) (this=0x7fffade56800, msg__=..., reply__=@0x7fffc6f66968: 0x0)
at /home/komat/tmp676-trunk-gtk3/src2/objdir/ipc/ipdl/PCompositorBridgeParent.cpp:1453
#17 0x00007fffe1251365 in mozilla::layers::PCompositorManagerParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) (this=0x7ffff6adae30, msg__=..., reply__=@0x7fffc6f66968: 0x0)
at /home/komat/tmp676-trunk-gtk3/src2/objdir/ipc/ipdl/PCompositorManagerParent.cpp:271
#18 0x00007fffe0dcc9b2 in mozilla::ipc::MessageChannel::DispatchSyncMessage(IPC::Message const&, IPC::Message*&) (this=0x7fffc3eb1100, aMsg=..., aReply=@0x7fffc6f66968: 0x0) at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:2104
#19 0x00007fffe0dcc652 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) (this=0x7fffc3eb1100, aMsg=...)
at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:2062
#20 0x00007fffe0dcbca7 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) (this=0x7fffc3eb1100, aTask=...) at /home/komat/tmp676-trunk-gtk3/src2/ipc/glue/MessageChannel.cpp:1912
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → stransky
Assignee | ||
Updated•6 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Updated•6 years ago
|
Summary: [Wayland] with layers.acceleration.force-enabled=true, some subsurfaces make firefox freeze → [Wayland] GLContextEGL::SwapBuffers() races with main thread wl_display_* calls.
Assignee | ||
Comment 3•6 years ago
|
||
Wip workaround for the issue.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 5•6 years ago
|
||
Reported at Mesa as https://bugs.freedesktop.org/show_bug.cgi?id=106753
Comment 6•6 years ago
|
||
wl_display_dispatch_queue does _not_ dispatch the default queue. The queue dispatched by Mesa internally is a separate internal queue created only by Mesa and nothing else. It is safe against race conditions.
If Mozilla itself works with the Wayland connection's socket, it needs to use the wl_display_prepare_read() API in order to ensure thread safety.
There was one Mesa bug where swrast would always hang like this, which was due to a typo (dispatching the wrong queue) rather than any fundamental lack of thread safety. That is fixed in later patch release versions for 17.3 (I think 17.3.8 onwards), 18.0 (18.0.2 onwards); 18.1 and 18.2 are fine.
Assignee | ||
Comment 7•6 years ago
|
||
(In reply to Daniel Stone from comment #6)
> wl_display_dispatch_queue does _not_ dispatch the default queue. The queue
> dispatched by Mesa internally is a separate internal queue created only by
> Mesa and nothing else. It is safe against race conditions.
>
> If Mozilla itself works with the Wayland connection's socket, it needs to
> use the wl_display_prepare_read() API in order to ensure thread safety.
>
> There was one Mesa bug where swrast would always hang like this, which was
> due to a typo (dispatching the wrong queue) rather than any fundamental lack
> of thread safety. That is fixed in later patch release versions for 17.3 (I
> think 17.3.8 onwards), 18.0 (18.0.2 onwards); 18.1 and 18.2 are fine.
Sure, I'm not an expert there. I uploaded the requested logs at https://bugs.freedesktop.org/show_bug.cgi?id=106753 for further investigation.
Comment 8•6 years ago
|
||
Copying my comment from the FDO bug report:
So what it looks like is that:
1. Firefox creates two surfaces, wl_surface@39 and wl_surface@40
2. Firefox turn wl_surface@40 into a desynchronous subsurface on top of wl_surface@39
3. Firefox turn wl_surface@39 into a xdg_toplevel and commits the initial (empty) state
4. The compositor sends the configure event to the xdg_toplevel
5. Firefox replies immediately with ack_configure() without attaching a buffer
This will probably consume any compositor: you just mapped a window, asked to configure it (draw the first frame), but what happened is that no content was posted even though the configure event was acknowledged
6. Firefox asks for a frame callback, attaches a new buffer to the subsurface and commits it
When Firefox tries to do eglSwapBuffers() again
When it does this, it waits for the frame callback, but that will never happen because the toplevel was never mapped, meaning the subsurface never has a chance to be displayed, meaning the frame callback will never be invoked.
The xdg_surface interface states the following:
For an xdg_surface to be mapped by the compositor, the following
conditions must be met:
(1) the client has assigned an xdg_surface-based role to the surface
(2) the client has set and committed the xdg_surface state and the
role-dependent state to the surface
(3) the client has committed a buffer to the surface
The third condition here was never met.
This raises the question, what is the intended content of the subsurface and what is the intended content of the toplevel?
Comment 9•6 years ago
|
||
mozreview-review |
Comment on attachment 8982142 [details]
Bug 1464823 - [Wayland] Don't paint until we have a visible wl_surface,
https://reviewboard.mozilla.org/r/248164/#review254464
Attachment #8982142 -
Flags: review?(lsalzman) → review+
Comment 10•6 years ago
|
||
Just to point out, there doesn't seem to be any race condition in Mesa, so the motivation in the patch is incorrect. Mesa can handle multiple threads with a single wl_display just fine. What happens is that since eglSwapInterval(1) is used, the swap-buffers call waits for the VSync-callback of the old buffer (buffer was painted), which won't happen because it has not been mapped yet by Firefox.
Assignee | ||
Comment 11•6 years ago
|
||
(In reply to Jonas Ådahl from comment #10)
> Just to point out, there doesn't seem to be any race condition in Mesa, so
> the motivation in the patch is incorrect. Mesa can handle multiple threads
> with a single wl_display just fine. What happens is that since
> eglSwapInterval(1) is used, the swap-buffers call waits for the
> VSync-callback of the old buffer (buffer was painted), which won't happen
> because it has not been mapped yet by Firefox.
Jonas, Thanks for the info, I need to think it over.
Assignee | ||
Comment 12•6 years ago
|
||
(In reply to Jonas Ådahl from comment #10)
> Just to point out, there doesn't seem to be any race condition in Mesa, so
> the motivation in the patch is incorrect. Mesa can handle multiple threads
> with a single wl_display just fine. What happens is that since
> eglSwapInterval(1) is used, the swap-buffers call waits for the
> VSync-callback of the old buffer (buffer was painted), which won't happen
> because it has not been mapped yet by Firefox.
Yes, the swap-buffers/VSync-callback is called on subsurface owned/managed by Firefox and the parent toplevel surface is owned by Gtk.
Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
Comment 13•6 years ago
|
||
> Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
I also confirmed it, and can avoid it by the attached patch.
But the patch is still too ugly.
How can I know properly whether the surface is committed or not from outside of GDK?
Comment hidden (mozreview-request) |
Assignee | ||
Comment 15•6 years ago
|
||
(In reply to Takuro Ashie from comment #13)
> Created attachment 8983659 [details] [diff] [review]
> avoid-freeze-wayland-with-egl.patch
>
> > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
>
> I also confirmed it, and can avoid it by the attached patch.
> But the patch is still too ugly.
I just updated the patch, Thanks!
Assignee | ||
Updated•6 years ago
|
Attachment #8982023 -
Attachment is obsolete: true
Comment 16•6 years ago
|
||
(In reply to Martin Stránský [:stransky] from comment #15)
> (In reply to Takuro Ashie from comment #13)
> > Created attachment 8983659 [details] [diff] [review]
> > avoid-freeze-wayland-with-egl.patch
> >
> > > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> >
> > I also confirmed it, and can avoid it by the attached patch.
> > But the patch is still too ugly.
>
> I just updated the patch, Thanks!
I got SEGV with your patch on my devices(RZ/G1M, R-Car M3). :'(
But SEGV detail is concealed because SEGV occurred within PowerVR driver.
Assignee | ||
Comment 17•6 years ago
|
||
(In reply to Hiroshi Hatake [:cosmo0920] from comment #16)
> (In reply to Martin Stránský [:stransky] from comment #15)
> > (In reply to Takuro Ashie from comment #13)
> > > Created attachment 8983659 [details] [diff] [review]
> > > avoid-freeze-wayland-with-egl.patch
> > >
> > > > Mesa is called early here and the swap-buffers/VSync-callback is called before Gtk updates content of the toplevel window - that's why we don't have a valid xdg_surface here.
> > >
> > > I also confirmed it, and can avoid it by the attached patch.
> > > But the patch is still too ugly.
> >
> > I just updated the patch, Thanks!
>
> I got SEGV with your patch on my devices(RZ/G1M, R-Car M3). :'(
> But SEGV detail is concealed because SEGV occurred within PowerVR driver.
Can you attach at least some backtrace of the crash?
Flags: needinfo?(cosmo0920.oucc)
Comment 18•6 years ago
|
||
```
(gdb) bt
#0 0xa473c9b8 in ?? () from /usr/lib/libpvrPVR2D_WAYLANDWSEGL.so
#1 0x0001f65c in calloc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
Flags: needinfo?(cosmo0920.oucc)
Assignee | ||
Comment 19•6 years ago
|
||
(In reply to Hiroshi Hatake [:cosmo0920] from comment #18)
> ```
> (gdb) bt
> #0 0xa473c9b8 in ?? () from /usr/lib/libpvrPVR2D_WAYLANDWSEGL.so
> #1 0x0001f65c in calloc ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Please try to disable e10s (set browser.tabs.remote.autostart to false) and attach backtrace of all threads.
Flags: needinfo?(cosmo0920.oucc)
Comment 20•6 years ago
|
||
Comment 21•6 years ago
|
||
Thread 37 backtrace looks mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&) () infinity printed.
Ashie's patch can work on R-Car&RZ/G1.
Flags: needinfo?(cosmo0920.oucc)
Assignee | ||
Comment 22•6 years ago
|
||
(In reply to Hiroshi Hatake [:cosmo0920] from comment #21)
> Thread 37 backtrace looks
> mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&,
> mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&) ()
> infinity printed.
That bt does not look very useful, looks like some recursion happens on stack which is exhausted. Maybe some code fails when moz_container_get_wl_surface() / moz_container_get_wl_egl_window() return nullptr?
Comment 23•6 years ago
|
||
I obtained EGL API calling trace with apitrace.
API calling trace dump says:
---
// process.name = "/usr/lib/firefox/firefox"
0 eglGetDisplay(display_id = 0xb6a21120) = 0x1
1 eglInitialize(dpy = 0x1, major = NULL, minor = NULL) = EGL_TRUE
9 eglChooseConfig(dpy = 0x1, attrib_list = {EGL_SURFACE_TYPE, EGL_WINDOW_BIT, E
GL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT, EGL_RED_SIZE, 8, EGL_GREEN_SIZE, 8, EGL
_BLUE_SIZE, 8, EGL_ALPHA_SIZE, 8, EGL_NONE}, configs = {0x2, 0x1, 0x3}, config_
size = 64, num_config = &3) = EGL_TRUE
14 eglCreateWindowSurface(dpy = 0x1, config = 0x2, win = NULL, attrib_list = {}
) // incomplete
---
Assignee | ||
Comment 24•6 years ago
|
||
Thanks, looks like win = NULL is the problem here, as moz_container_get_wl_egl_window() can return nullptr as an invalid surface. I'll investigate that.
Comment 25•6 years ago
|
||
mozreview-review |
Comment on attachment 8982142 [details]
Bug 1464823 - [Wayland] Don't paint until we have a visible wl_surface,
https://reviewboard.mozilla.org/r/248164/#review256980
Attachment #8982142 -
Flags: review?(jhorak) → review+
Comment 26•6 years ago
|
||
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/f13d1acd73da
[Wayland] Don't paint until we have a visible wl_surface, r=jhorak,lsalzman
Comment 27•6 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox62:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Assignee | ||
Comment 28•6 years ago
|
||
(In reply to Hiroshi Hatake [:cosmo0920] from comment #23)
> I obtained EGL API calling trace with apitrace.
>
> API calling trace dump says:
>
> ---
> // process.name = "/usr/lib/firefox/firefox"
> 0 eglGetDisplay(display_id = 0xb6a21120) = 0x1
> 1 eglInitialize(dpy = 0x1, major = NULL, minor = NULL) = EGL_TRUE
> 9 eglChooseConfig(dpy = 0x1, attrib_list = {EGL_SURFACE_TYPE,
> EGL_WINDOW_BIT, E
> GL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT, EGL_RED_SIZE, 8, EGL_GREEN_SIZE, 8,
> EGL
> _BLUE_SIZE, 8, EGL_ALPHA_SIZE, 8, EGL_NONE}, configs = {0x2, 0x1, 0x3},
> config_
> size = 64, num_config = &3) = EGL_TRUE
> 14 eglCreateWindowSurface(dpy = 0x1, config = 0x2, win = NULL, attrib_list =
> {}
> ) // incomplete
> ---
Please file a follow up bug for that - I'll look at it. Thanks.
Flags: needinfo?(cosmo0920.oucc)
Assignee | ||
Updated•6 years ago
|
Attachment #8983659 -
Attachment is obsolete: true
You need to log in
before you can comment on or make changes to this bug.
Description
•