Closed Bug 1661192 Opened 4 years ago Closed 4 years ago

[Wayland] Autoscroll widget crashes firefox

Categories

(Core :: Widget: Gtk, defect)

defect

Tracking

()

RESOLVED FIXED
83 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox82 --- wontfix
firefox83 --- fixed

People

(Reporter: rpigott, Assigned: jhorak, NeedInfo)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Crash Data

Attachments

(5 files)

$ uname -a
Linux rxps 5.8.3-arch1-1 #1 SMP PREEMPT Fri, 21 Aug 2020 16:54:16 +0000 x86_64 GNU/Linux
$ pacman -Q firefox
firefox 80.0-1
$ swaymsg -pt get_version
sway version 1.5-6991ac8c (Aug 18 2020, branch 'master')

This is a behavior I have observed since at least firefox 78, probably before, and on very many versions of sway.

When enabled via general.Autoscroll and the user middle clicks in an appropriate area, I would expect:

  1. An autoscroll widget appears under the cursor
  2. Moving the cursor outside the autoscroll widget scrolls the page
  3. clicking anywhere dismisses the autoscroll widget and stops scrolling the page

On sway, after middle click when it works (95~98% of clicks):

  1. A popup probably appears. If it does appear, it may be after a modest (< 1 second) delay
  2. The popup is (almost?) never placed under the cursor, instead appearing in an apparently random location. I find the location is most frequently North-West of the cursor location, and the offset typically very large. The location is frequently hugging one of the borders of the screen because of the large offset.
  3. Moving the cursor a short distance begins a scroll, as if the widget had been placed correctly.
  4. Clicking anywhere dismisses the widget and stops scrolling the page.

When it does not work (2~5% of clicks):

  1. firefox closes unexpectedly OR (even rarer) firefox crashes and offers to send a report.

I've attached a snippet of a WAYLAND_DEBUG=1 log in a case where firefox unexpectedly closes. The log reveals that sway rightfully disconnects firefox for a Wayland protocol violation. The relevant portion of the snippet being:

[1366914.119]  -> wl_compositor@53.create_surface(new id wl_surface@112)
[...]
[1366914.301]  -> wl_surface@112.destroy()
[1366914.788]  -> wl_surface@112.attach(wl_buffer@111, 0, 0)
[1366915.102] wl_display@1.delete_id(112)
[1366915.118] wl_display@1.error(wl_display@1, 0, "invalid object 112")

Firefox attempts to attach a buffer to the already deleted surface and it's connection is terminated for this transgression. Although in this case the log shows the delete_id message from the server being received after the attach call, I have other example logs where it shows before, so I don't believe that is relevant.

When the user middle clicks, firefox does create an xdg_popup, which I assume is used for the autoscroll widget. In fact, I observe that firefox repeatedly creates and destroys a new popup every frame (~16ms interval) before it creates the fatal subsurface and is forcefully disconnected. According to the log, the fatal surface is used as a subsurface, not a popup. I don't know it's purpose. When it crashes, even though firefox appears to close right away, it seems that firefox usually succeeds in creating several such popups before being disconnected.

As best I can tell the crash is random. To reproduce, I open firefox and repeatedly middle click, move the cursor to scroll a little bit, and left click to dismiss the widget. I've reproduced the crash about a dozen of times today while collecting logs, and probably more than a hundred times accidentally over the past few months. Using this method I can "reliably" reproduce the crash within 20 - 200 seconds.

See Also: → 1640651, 1638084, 1643993, 1660263

I can confirm the bug exists since firefox 77+ on Sway Wayland

Yep this appears to the same protocol error that's observed in #1660263, just with different repro steps.

I've gathered a different log, same repro steps, attached in full with a different protocol error:

[3896512.330]  -> wl_compositor@53.create_surface(new id wl_surface@84)
[3896512.496]  -> wl_surface@84.destroy()
[3896513.606] wl_display@1.delete_id(84)
[3896513.870]  -> zwp_linux_buffer_params_v1@85.create_immed(new id wl_buffer@84, 32, 32, 875713089, 0)
[3896513.886]  -> wl_surface@84.attach(wl_buffer@84, 0, 0)
[3896520.139] wl_display@1.error(wl_display@1, 1, "invalid method 1, object wl_buffer@84")

This one seems very strange to me, since apparently no method calls are recorded with the offending buffer.

Or actually somehow I missed it when reading, but in this line: wl_surface@84.attach(wl_buffer@84, 0, 0) firefox apparently thinks object 84 is both a surface and a buffer.

Attached file crash reporter log

This is an example log captured where the crash reporter dialog appeared instead of an unexpected exit. Same reproduction steps. I submitted the associated crash report here: https://crash-stats.mozilla.org/report/index/a7e68ac6-0c6f-4b55-8d75-fff1c0200831

The backtrace indicates a segfault in libEGL, and the last logged function name from firefox was:

mozilla::gl::GLContextEGL::MakeCurrentImpl()
in /build/firefox/src/firefox-80.0/gfx/gl/GLContextProviderEGL.cpp:467

So, probably caused by the same wl_surface use-after-free as before, just a matter of which thread encounters the bad object first. In this case the Renderer.

This bug is not limited to sway, it also happens on gnome shell/mutter.

I guess it's probably related to the GPU driver or to having animations disabled.

OK I've now spent a lot of time digging into this one. I have a pretty good understanding of the issues now and I think I can fix them. Below is my best understanding of the problem:

Firefox misunderstands the reported window position in xdg_popup configure events which includes the xdg_positioner offset it initially requests. As a result, it always tries to move the newly created autoscroll widget to match the reported configure coordinates when it is already correctly placed. To fulfill the move requests from firefox, gdk repeatedly recreates the wl_surface and associated xdg_popup and Firefox repeatedly requests to move them. Each new popup is moved relative to the last by the positioner offset, in my case -16px up and left, and they are recreated as rapidly as the compositor can handle the requests and Firefox can receive the configures.

If Firefox survives long enough, the popup position will reach the window boundaries. When that happens the compositor will not offset the popup to honor the required constraints. Then Firefox will be satisfied that the widget is adequately placed and draw it. But sometimes Firefox will not survive this long.

When the widget is being rapidly recreated, gtk sends map and unmap signals for each new surface. When the widget is first mapped the renderer tries to obtain an EGLSurface to render into. Unfortunately the methods in MozContainerWayland that satisfy this release the container lock too early and a very early unmap like generated above can immediately destroy the newly created wl_surface and eglwindow. The timing is random, but if Firefox receives an early unmap signal (before or during the first frame callback) and the main thread resumes and destroys these resources before they are used one of three bad things will happen:

  1. If it happens immediately, Firefox catches it as if there were an EGL error, and backs out of drawing the widget with harmless error messages like "window is null" or "Failed to create EGLSurface".

  2. If it happens later, the renderer will eventually try to call eglSwapBuffers on an EGLSurface whose native wl_surface has already been destroyed. This https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/egl/drivers/dri2/platform_wayland.c#L1102 attach call on the invalid surface prompts the compositor to disconnect firefox's Wayland connection and the window closes without a crash.

  3. If it happens sometime in the middle mesa will crash in either eglDestroySurface or eglCreateSurface since wl_egl_window_destroy in unmap frees the native eglwindow. This is the only case with a crash report.

unmap anytime after SwapBuffers is fine. It looks like number 2 is the same issue reported in 1660263, but in that case the early unmap is caused by rapid user inputs.

This is my first firefox issue and it turned out to be an interesting one! I am working on a patch to address the xdg_popup positioning logic, which I think may be sufficient to close this issue and another to address the "early unmap" crash which I hope will be sufficient to definitively close both this issue and 1660263.

In the mean time, affected users might want to try using the autoscroll widget exclusively in the far left side of their Firefox window, since that will reduce the number of chances for firefox to kill itself before the autoscroll widget settles.

Has Regression Range: --- → yes

I found the patch that introduced the popup offset issue. Here's a patch that just reverts/removes the offending stanzas from the regressed patch.

The cursorOffset bit actually misplaces the popup, which is already centered on the cursor, so I removed it.

The "no need to move" bit causes the autoscroll widget to be placed incorrectly when it intersects a window boundary, so I reverted it as well.

With these changes the autoscroll widget works fine for me.

Jan, Martin, could you take a look at the patch? Removing wayland special-cases sounds appealing though I assume they were there for a reason...

Maybe they were papering over a now-fixed bug? :)

Flags: needinfo?(stransky)
Flags: needinfo?(jhorak)

Yes, what I mean is: I also figure something should probably replace the sections removed/reverted, but I don't know what yet.

If someone knows their intended purpose, then we can replace them with something that implements their intention with the correct Wayland logic.

Sure, Thanks for the patch, we'll look at it!

So with this patch I should mention that what I actually get is the old FF71+ behavior where the Autoscroll widget is placed correctly but clicking within the Autoscroll widget does not dismiss it as it should.

I looked into it briefly and found that

  1. I can see Wayland button events being generated for the xdg_popup surface on click
  2. I don't see button_press_event_cb / nsWindow::OnButtonPressEvent triggering in firefox.

So atm I'm not sure what's up. I actually reported the same thing in a comment here: https://bugzilla.mozilla.org/show_bug.cgi?id=1565313#c8 a while back.

Anyway should I also try to resolve that here, open a new bug report, or use the old report with that comment (which originally reported a different buggy behavior)?

Summary: [Wayland] [sway] Autoscroll widget crashes firefox → [Wayland] Autoscroll widget crashes firefox

I get the same behavior too. That's still miles better than crashing :) so I hope this gets merged ASAP and then we could figure out the click-to-dismiss.

Okay. I'll make something to address the race in the WebRender compositor as well so we can actually fix the crash rather than just avoiding it in this case. Then maybe we can move forward.

Popups like autoscroll does not have anchor rectangle while it has set negative
margin to center the popup under cursor. Adding this offset to the final position
leads recursive repositioning of the popup to upper-left position under Wayland.

On the other hand we have to keep the offset for the popups with given anchor rect
like hamburger menus which use the offset to correctly position their pointing arrow.

Assignee: nobody → jhorak
Pushed by btara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/062970b20339
Don't use popup offset when we don't have anchor rect; r=stransky
Flags: needinfo?(stransky)

Thanks guys.

Towards fixing the crash, I'm thinking I will need to obtain the wl_container->container_lock no later than CreateSurfaceFromNativeWindow in gfx/gl/glContextProviderEGL.cpp and may need to hold it until after the SwapBuffers call in RenderCompositorEGL::EndFrame. GLContextEGL::CreateSurfaceForCompositorWidget looks like a good place to obtain the lock.

I'm not sure what to do with the lock though. I might need to release it on any of the error conditions in that path including but not limited to those in CreateSurfaceFromNativeWindow where (I think) I don't already have a reference from which I can reach the lock, and it seems there is a strong preference in the code to avoid explicit Lock/Unlock calls in favor of automatic locking. Any pointers here?

Status: UNCONFIRMED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch
Regressions: 1669092
Crash Signature: [@ resize_callback]
Crash Signature: [@ resize_callback] → [@ resize_callback] [@ <name omitted> | dri2_wl_query_buffer_age] [@ update_buffers]
Crash Signature: [@ resize_callback] [@ <name omitted> | dri2_wl_query_buffer_age] [@ update_buffers] → [@ resize_callback] [@ <name omitted> | dri2_wl_query_buffer_age] [@ update_buffers] [@ libEGL_mesa.so.0@0x1059f]

Thanks a lot for this fix, autoscroll is now working great, firefox was crashing before quite a lot.

Crash Signature: [@ resize_callback] [@ <name omitted> | dri2_wl_query_buffer_age] [@ update_buffers] [@ libEGL_mesa.so.0@0x1059f] → [@ resize_callback] [@ <name omitted> | dri2_wl_query_buffer_age] [@ update_buffers] [@ libEGL_mesa.so.0@0x1059f]
See Also: → 1783829
Duplicate of this bug: 1634723
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: