Closed Bug 1908825 Opened 2 months ago Closed 1 month ago

Firefox crashes on Wayland with Explicit Sync on Nvidia with egl-wayland-1.1.14 [@ wsi_wl_swapchain_queue_present]

Categories

(Core :: Widget: Gtk, defect, P3)

Firefox 130
Desktop
Linux
defect

Tracking

()

RESOLVED MOVED

People

(Reporter: toadking, Unassigned)

References

(Blocks 2 open bugs, )

Details

(Keywords: topcrash, topcrash-startup)

Crash Data

Attachments

(8 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0

Steps to reproduce:

A follow-up to #1898476. I'm on a nightly build (built from https://hg.mozilla.org/mozilla-central/rev/c9dd3166c8110a3de87805260c7e4edaa27bd9d4, so it should include the fixes from #1898476), and after browsing for 10-15 minutes I'm getting crashes.

Report: https://crash-stats.mozilla.org/report/index/b4775cf5-6e2b-4748-bcc6-51f6b0240719

Terminal output:
[GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1#70: error 4: explicit sync is used, but no acquire point is set

ExceptionHandler::GenerateDump attempting to generate: <minidump path>
ExceptionHandler::GenerateDump cloned child 65755ExceptionHandler::WaitForContinueSignal waiting for continue signal...

ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::GenerateDump minidump generation succeeded
Exiting due to channel error.

(the last line is repeated several times)

Actual results:

Firefox crashes

Expected results:

Firefox doesn't crash

Regressions: 1898476

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

Please run Firefox from a terminal with these env variables set:
WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5"
and attach the log here when it crashes (the last ~2000 lines are enough).
Thanks.
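
For anyone who wants to capture only the tail of such a log instead of the whole output, here is a minimal Python sketch. The helper name run_and_keep_tail is made up for illustration; only the two env variables come from the request above.

```python
import os
import subprocess
from collections import deque


def run_and_keep_tail(cmd, extra_env, keep=2000):
    """Run cmd with extra env variables and keep the last `keep` output lines.

    Hypothetical helper: stdout and stderr are merged, because the Wayland
    debug output and the MOZ_LOG output both matter here.
    """
    env = {**os.environ, **extra_env}
    tail = deque(maxlen=keep)  # older lines fall off automatically
    with subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        errors="replace",
    ) as proc:
        for line in proc.stdout:
            tail.append(line)
    return proc.returncode, list(tail)


# Example call (assumes ./firefox exists in the current directory):
# rc, tail = run_and_keep_tail(
#     ["./firefox"],
#     {"WAYLAND_DEBUG": "1", "MOZ_LOG": "Widget:5 WidgetWayland:5"},
# )
# open("crash.log", "w").writelines(tail)
```

The deque keeps memory use bounded even if the browser runs for a long time before it crashes.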

Blocks: wayland
Flags: needinfo?(toadking)
Priority: -- → P3

Attached the log with the last 2000 lines.

I also hit another weird error while trying to reproduce the crash: the browser window froze for a couple of seconds, followed by my entire desktop freezing. Everything unfroze after about ten seconds. I immediately exited Firefox and tried logging again, but I'll attach that entire log as well in case it's helpful.

Flags: needinfo?(toadking)

Attaching a new log. I looked at the first one and it didn't appear to have the actual crash lines in it so I made a new one and made sure to include enough lines for those.

Attachment #9413745 - Attachment is obsolete: true

There's the related part:

[ 982250.261] {Display Queue} wl_display#1.delete_id(89)
[ 982250.295] {Default Queue} discarded wl_buffer#68.release()
[ 982250.304] {Default Queue} wl_callback#89.done(31369649)
[ 982250.314] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982250.323] {Default Queue}  -> wl_surface#63.commit()
[ 982251.573] {Default Queue}  -> wl_surface#63.attach(wl_buffer#83, 0, 0)

[ 982251.607]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#71, 0, 877)
[ 982251.619]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#84, 0, 188)

[ 982251.630] {Default Queue}  -> wl_surface#63.damage(0, 0, 2560, 1354)
[ 982251.641] {Default Queue}  -> wl_surface#63.commit()
[ 982251.651]  -> wl_display#1.sync(new id wl_callback#59)
[ 982251.779] {Display Queue} wl_display#1.delete_id(59)
[ 982251.795] wl_callback#59.done(16132)
[ 982256.066] {Display Queue} wl_display#1.delete_id(89)
[ 982256.102] {Default Queue} discarded wl_buffer#97.release()
[ 982256.111] {Default Queue} wl_callback#89.done(31369655)
[ 982256.121] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982256.129] {Default Queue}  -> wl_surface#63.commit()
[ 982257.479] {Default Queue}  -> wl_surface#63.attach(wl_buffer#91, 0, 0)

[ 982257.524]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#71, 0, 878)
[ 982257.539]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#90, 0, 186)

[ 982257.548] {Default Queue}  -> wl_surface#63.damage(0, 0, 2560, 1354)
[ 982257.582] {Default Queue}  -> wl_surface#63.commit()
[ 982257.593]  -> wl_display#1.sync(new id wl_callback#59)
[ 982257.717] {Display Queue} wl_display#1.delete_id(59)
[ 982257.736] wl_callback#59.done(16132)
[ 982260.449] {Display Queue} wl_display#1.delete_id(89)
[ 982260.484] {Default Queue} discarded wl_buffer#83.release()
[ 982260.493] {Default Queue} wl_callback#89.done(31369655)
[ 982260.503] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982260.512] {Default Queue}  -> wl_surface#63.commit()
[ 982261.739] {Default Queue}  -> wl_surface#63.attach(wl_buffer#68, 0, 0)

[ 982269.936] {Display Queue} wl_display#1.delete_id(89)
[ 982269.974] {Default Queue} wl_callback#89.done(31369667)

[ 982269.985] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982269.992] {Default Queue}  -> wl_surface#63.commit()
[ 982270.265] {Display Queue} wl_display#1.error(wp_linux_drm_syncobj_surface_v1#70, 4, "explicit sync is used, but no acquire point is set")

So it looks like the set_acquire_point/set_release_point requests are missing after the last wl_surface::attach(). But that code doesn't look like Firefox code; it looks like Mesa. Firefox doesn't use wp_linux_drm_syncobj_surface_v1 at all.
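
The pattern in the excerpt (an attach that reaches a commit with no set_acquire_point in between) can be searched for mechanically in a WAYLAND_DEBUG log. A minimal Python sketch follows. The function name is made up, and it rests on one loud assumption: only one surface in the log uses explicit sync, so every set_acquire_point is credited to the most recent attach. A real tool would have to map each wp_linux_drm_syncobj_surface_v1 object back to its wl_surface.

```python
import re

ATTACH = re.compile(r"-> wl_surface#(\d+)\.attach\(wl_buffer#(\d+)")
ACQUIRE = re.compile(r"-> wp_linux_drm_syncobj_surface_v1#\d+\.set_acquire_point\(")
COMMIT = re.compile(r"-> wl_surface#(\d+)\.commit\(\)")


def find_unsynced_commits(lines):
    """Return (surface id, buffer id) pairs committed without an acquire point.

    Assumption (not from the log format itself): a single surface uses
    explicit sync, so any set_acquire_point belongs to the latest attach.
    """
    pending = {}        # surface id -> [buffer id, acquire point seen?]
    last_surface = None
    problems = []
    for line in lines:
        if m := ATTACH.search(line):
            last_surface = m.group(1)
            pending[last_surface] = [m.group(2), False]
        elif ACQUIRE.search(line) and last_surface in pending:
            pending[last_surface][1] = True
        elif m := COMMIT.search(line):
            state = pending.pop(m.group(1), None)
            if state and not state[1]:
                problems.append((m.group(1), state[0]))
    return problems
```

Fed the excerpt above, this would flag the commit that follows attach(wl_buffer#68), which is the one right before the protocol error.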

Looks like the affected code comes from wsi_wl_swapchain_queue_present() at MESA:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/vulkan/wsi/wsi_common_wayland.c#L2150

So please report it at MESA project:

No longer regressions: 1898476
See Also: → 1898476
Blocks: wr-nv-linux
OS: Unspecified → Linux
Hardware: Unspecified → Desktop

Downgrading egl-wayland on Arch Linux fixes the issue for now: egl-wayland 2:1.1.14-1 -> 2:1.1.13-2

Summary: Firefox _still_ crashes on Wayland with Explicit Sync on Nvidia → Firefox crashes on Wayland with Explicit Sync on Nvidia [@ wsi_wl_swapchain_queue_present]

Isn't this the NVIDIA proprietary driver and Mesa isn't involved at all? The code using linux-drm-syncobj-v1 is in the egl-wayland library.

https://github.com/NVIDIA/egl-wayland/blob/master/src/wayland-eglsurface.c#L228-L238

Looks like the egl-wayland code always sets up explicit sync for a window but then does not set sync points if a wlStreamResource was set when creating the surface context. Looks like a bug, but I can barely grasp the code.

I am seeing crashes since updating egl-wayland to 1.1.14 on Fedora:
https://crash-stats.mozilla.org/report/index/914ab3e8-55e4-4f9b-8207-6cc140240721
https://crash-stats.mozilla.org/report/index/b37192d9-bf2a-47f9-90f4-ac79c0240721
https://crash-stats.mozilla.org/report/index/2297b073-fae0-4402-8bbd-38add0240721
Thunderbird is crashing too but I am not sure how to find the reports.
firefox-128.0-2.fc40.x86_64
egl-wayland-1.1.14-1.fc40.x86_64
xorg-x11-drv-nvidia-555.58.02-1.fc40.x86_64

See Also: → 1909453

After reporting back on the egl-wayland repo, a dev there claims this is still a Firefox issue.

https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2243578903
https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2245324700

Duplicate of this bug: 1909172

(In reply to Michael Lelli from comment #15)

After reporting back on the egl-wayland repo, a dev there claims this is still a Firefox issue.

https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2243578903
https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2245324700

Yeah, and the egl-wayland downgrade fix is just a coincidence :-) Will look at it anyway.

Duplicate of this bug: 1909453

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::widget::WlLogHandler]
Flags: needinfo?(stransky)
Flags: needinfo?(stransky)
Summary: Firefox crashes on Wayland with Explicit Sync on Nvidia [@ wsi_wl_swapchain_queue_present] → Firefox crashes on Wayland with Explicit Sync on Nvidia with egl-wayland-1.1.14 [@ wsi_wl_swapchain_queue_present]
Flags: needinfo?(stransky)

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release (startup)

For more information, please visit BugBot documentation.

Keep in mind that the egl-wayland downgrade is to version 1.1.13, which does not have the explicit sync patches. So while the fix is definitely not a coincidence, it doesn't tell us much here beyond the fact that Firefox does not crash without explicit sync.

I cannot get Firefox to run for more than 3-5 minutes if egl-wayland contains explicit sync support. That includes anything newer than 1.1.13, as well as the version of egl-wayland that is bundled with the 560 nvidia driver (the Fedora packages on rpmfusion currently drop the external egl-wayland and egl-gbm dependencies and bundle the official ones with 560).

Please download, extract and run this build with a new profile:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/L199EHvwQzKeIge0M7l3Ow/runs/0/artifacts/public/build/target.tar.bz2

Run it from a terminal with the WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" env variables set and attach the log here when it crashes. The build contains extra logging needed for debugging.

Thanks.

Duplicate of this bug: 1909059

here we go
fedora 40, xorg-x11-drv-nvidia-555.58.02-1.fc40.x86_64, egl-wayland-1.1.14-1.fc40.x86_64 ( stable with egl-wayland-1.1.13 )
about:config -> media.ffmpeg.vaapi.enabled = true
it seems it takes much longer to force a crash if media.ffmpeg.vaapi.enabled = false

https://crash-stats.mozilla.org/report/index/6dd4e12a-33bb-4dfd-8543-eb5510240731
https://crash-stats.mozilla.org/report/index/ef019403-7f7d-45a1-9c10-dc3000240731
https://crash-stats.mozilla.org/report/index/6909a7d6-3451-4104-8f0d-211250240731 <= vaapi disabled
https://crash-stats.mozilla.org/report/index/4c5eee95-65d5-45e0-81d7-f2a650240731

will attach the logfiles

Attached file firefox-crash1.log.gz
Attached file firefox-crash2.log.gz
Attached file firefox-crash4.log.gz

Attached a crash with Martin's test build and egl-wayland 1.1.14.

Crash report: https://crash-stats.mozilla.org/report/index/8ec0dee7-c83b-47ab-a561-141df0240731

Thanks. According to the log it doesn't look like a Firefox bug. Let's see:

wl_buffer#227 is created here by GL as dmabuf buffer (Firefox doesn't create dmabuf buffers at all) and it has explicit sync set (again - Firefox doesn't set explicit points). This looks like MESA or egl-wayland code:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1025134.831]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#228, fd 247)
[1025134.850]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#214, fd 249)
[1025134.870]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#208, fd 250)
[1025134.889]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#201, fd 251)
[1025136.884] {Default Queue}  -> zwp_linux_dmabuf_v1#49.create_params(new id zwp_linux_buffer_params_v1#236)
[1025136.894] {Default Queue}  -> zwp_linux_buffer_params_v1#236.add(fd 252, 0, 0, 10240, 50331648, 6316052)
[1025136.900] {Default Queue}  -> zwp_linux_buffer_params_v1#236.create_immed(new id wl_buffer#237, 2560, 1382, 875713089, 0)
[1025136.904] {Default Queue}  -> zwp_linux_buffer_params_v1#236.destroy()
[1025136.911] {Default Queue}  -> wl_surface#67.attach(wl_buffer#237, 0, 0)
[1025136.923]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 508)
[1025136.927]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#201, 0, 1)
[1025136.931] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382)
[1025136.934] {Default Queue}  -> wl_surface#67.commit()

So wl_buffer#237 is internal GL buffer used for front/back buffer, the size indicates it too (2560, 1382):

[1024915.497] {Default Queue}  -> zwp_linux_buffer_params_v1#114.create_immed(new id wl_buffer#227, 2560, 1440, 875713089, 0)

Note that a similar buffer (#221) is also allocated for rendering with the same size:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1025149.994] {Default Queue}  -> zwp_linux_dmabuf_v1#49.create_params(new id zwp_linux_buffer_params_v1#193)
[1025150.005] {Default Queue}  -> zwp_linux_buffer_params_v1#193.add(fd 249, 0, 0, 10240, 50331648, 6316052)
[1025150.011] {Default Queue}  -> zwp_linux_buffer_params_v1#193.create_immed(new id wl_buffer#221, 2560, 1382, 875713089, 0)
[1025150.016] {Default Queue}  -> zwp_linux_buffer_params_v1#193.destroy()
[1025150.022] {Default Queue}  -> wl_surface#67.attach(wl_buffer#221, 0, 0)
[1025150.036]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 509)
[1025150.040]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#208, 0, 1)
[1025150.044] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382)
[1025150.047] {Default Queue}  -> wl_surface#67.commit()
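
As a side check, the numeric dmabuf parameters in these excerpts can be decoded. The sketch below is Python with made-up helper names. It assumes the standard DRM conventions: the format argument of create_immed is a little-endian fourcc code, and the top byte of the 64-bit modifier (the top byte of the modifier_hi argument of add) is the vendor id, with 0x03 being NVIDIA in drm_fourcc.h.

```python
def fourcc_to_str(code):
    """Decode a 32-bit DRM fourcc code into its four ASCII characters."""
    return code.to_bytes(4, "little").decode("ascii")


def modifier_vendor(modifier_hi):
    """Return the vendor id stored in the top byte of a 64-bit DRM modifier.

    modifier_hi is the upper 32 bits, as passed to
    zwp_linux_buffer_params_v1.add().
    """
    return (modifier_hi >> 24) & 0xFF


fmt = fourcc_to_str(875713089)      # format from create_immed above
vendor = modifier_vendor(50331648)  # modifier_hi from add above
bytes_per_pixel = 10240 // 2560     # stride from add / width from create_immed
print(fmt, hex(vendor), bytes_per_pixel)
```

Under those assumptions the buffers are 32-bit ARGB ("AR24") with an NVIDIA vendor modifier, which fits the reading that they are allocated by the GL driver stack rather than by Firefox.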

And now the error sequence:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1063978.115] {Default Queue}  -> wl_surface#67.attach(wl_buffer#221, 0, 0)  << attach
[1063978.134]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 5097)
[1063978.140]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#208, 0, 1151) << set sync
[1063978.145] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382) << Set damage
[1063978.149] {Default Queue}  -> wl_surface#67.commit() << commit
[...]
[1063982.605] {Default Queue}  -> wl_surface#67.frame(new id wl_callback#241) << frame callback request
[1063982.609] {Default Queue}  -> wl_surface#67.commit()  << frame callback request commit

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1063989.819] {Default Queue}  -> wl_surface#67.attach(wl_buffer#237, 0, 0) << attach

<< missing sync point set, damage and commit. 

[1063994.651] {Default Queue}  -> wl_surface#67.frame(new id wl_callback#241) << frame callback request
[1063994.656] {Default Queue}  -> wl_surface#67.commit()  << frame callback request commit. 

But that commit also picks up the pending attach(wl_buffer#237), which has no sync point, so kaboom:

[1063994.793] {Display Queue} wl_display#1.error(wp_linux_drm_syncobj_surface_v1#70, 4, "explicit sync is used, but no acquire point is set")
[GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1#70: error 4: explicit sync is used, but no acquire point is set

As we can see, wl_buffer#221 is attached, its damage is set with damage(0, 0, 2560, 1382), and it is committed.
wl_buffer#237 is only attached and nothing else: the sync points, the damage and the commit are all missing.
Only the frame callback commit is performed, and that leads to the protocol error because it also applies the already attached buffer, which has no sync point.

From the log it looks like egl-wayland (or someone else) attaches wl_buffer#237 even if it's not going to be committed which leads to the missing sync point error.
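
The rule behind the error can be modelled in a few lines of Python. This is a toy model with made-up class names, not compositor code. It only covers the acquire point check that fires in the log; the real protocol has further checks (for example for the release point) that are left out here.

```python
class ProtocolError(Exception):
    pass


class SyncedSurface:
    """Toy wl_surface that has a wp_linux_drm_syncobj_surface_v1 attached."""

    def __init__(self):
        self.pending_buffer = None
        self.pending_acquire = None

    def attach(self, buffer):
        self.pending_buffer = buffer

    def set_acquire_point(self, timeline, point):
        self.pending_acquire = (timeline, point)

    def commit(self):
        # Any commit applies ALL pending state, including a buffer that was
        # attached earlier by someone else.
        if self.pending_buffer is not None and self.pending_acquire is None:
            raise ProtocolError("explicit sync is used, but no acquire point is set")
        committed = self.pending_buffer
        self.pending_buffer = None
        self.pending_acquire = None
        return committed


surface = SyncedSurface()
surface.attach(221)
surface.set_acquire_point(72, 5097)
surface.commit()          # fine: buffer and acquire point together
surface.commit()          # fine: frame callback commit, nothing pending
surface.attach(237)       # attach with no sync point and no commit
try:
    surface.commit()      # the frame callback commit now carries buffer 237
except ProtocolError as err:
    print("protocol error:", err)
```

The point of the model is that the commit which fails is not the one issued by whoever attached the buffer. The frame callback commit is harmless on its own and only fails because of the dangling attach.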

Here's the Firefox code where it happens, on the render thread:

https://searchfox.org/mozilla-central/rev/669fac9888b173c02baa4c036e980c0c204dfe02/gfx/webrender_bindings/RenderCompositorEGL.cpp#164

#ifdef MOZ_WIDGET_GTK
  // Rendering on Wayland has to be atomic (buffer attach + commit) and
  // wayland surface is also used by main thread so lock it before
  // we paint at SwapBuffers().
  UniquePtr<MozContainerSurfaceLock> lock;
  if (auto* gtkWidget = mWidget->AsGTK()) {
    lock = gtkWidget->LockSurface();
  }
#endif
  gl()->SwapBuffers();

As you can see we just call eglSwapBuffers().
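
To illustrate what the surface lock in that snippet protects against, here is a small Python model with made-up names. It is not Firefox code. It assumes two threads share one surface: a render thread that does attach + sync point + commit, and a main thread that requests frame callbacks. The lock guarantees that a frame callback commit never lands between the render thread's attach and commit. It cannot help when the library called inside the lock attaches a buffer and then returns without committing, which is what the log above suggests.

```python
import threading


class Surface:
    def __init__(self):
        self.lock = threading.Lock()
        self.pending = []   # requests queued since the last commit
        self.commits = []   # one tuple of requests per commit

    def _commit(self):
        self.commits.append(tuple(self.pending))
        self.pending.clear()

    def swap_buffers(self, buffer):
        # Render thread: attach + sync point + commit as one atomic unit.
        with self.lock:
            self.pending.append(f"attach({buffer})")
            self.pending.append("set_acquire_point")
            self._commit()

    def request_frame(self):
        # Main thread: frame callback request + commit under the same lock.
        with self.lock:
            self.pending.append("frame")
            self._commit()


surface = Surface()
render = threading.Thread(target=lambda: [surface.swap_buffers(i) for i in range(500)])
main = threading.Thread(target=lambda: [surface.request_frame() for _ in range(500)])
render.start(); main.start()
render.join(); main.join()
print(len(surface.commits), "commits, none of them mixed")
```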

So it's definitely not a Firefox bug (unfortunately, as it looks like a clear and simple one that could be fixed quickly if it were in Firefox).

Flags: needinfo?(stransky)

Please report back at NVIDIA/egl-wayland and they hopefully will fix that.

And here is the log of the bad fd crash I am getting. Unsure if it is the same thing.

Attached file bad_fd_crash.tar.gz

NVIDIA developer here, thanks for looking into the wl_surface locking. I do think it's possible this latest crash is a variant of another issue we are looking at recently, which could cause us to attach a surface but fail to set any sync points or commit. This could look like what you're running into above.

The problem is that I can't seem to reproduce the protocol error you're seeing myself. Maybe it's just something about the timing on my machine but I can't trigger the case I mentioned above to see if it matches the protocol error you all see. Instead I get the bad fd crash linked, or some variation of it. I've also seen warnings from IPDL complaining about bad fds too, although I can't seem to trigger that again to copy it here.

I have a prototype fix for this, but due to the above I'm unable to confirm it. Could someone with a proper repro please give it a try and let me know if it avoids the protocol error? This is still a bugfix we will want anyway but given that it could theoretically account for the symptoms in the previous comments I think it would be useful to test.
https://github.com/amshafer/egl-wayland/commit/a5182c7390a78ca2f7986cbcd2e1bf38f6be5f47

I have no clue what the source of the bad fd issues could be, I don't see an obvious way it could affect egl-wayland but I'm not very familiar with firefox internals.

Thanks!

@Austin: I was able to run that egl-wayland commit for about half an hour. It didn't crash on me, but I did notice a couple of cases where Firefox would lock up for about ten seconds, followed by the whole screen locking up for a couple of seconds, and then everything resuming like normal. So this specific issue seems like it may be fixed, but more issues remain. I also encountered a complete Firefox lockup and was forced to terminate it, but I don't know if that's Wayland related or not.

[Parent 334218, Compositor] WARNING: Call to mmap failed: Bad file descriptor: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:515

I don't think it's related to this issue at all. It looks like a bug in the IPC code where SHM is mapped between processes, so it comes from a completely different part of Firefox. Please file a new bug for it (and please also attach any related crashes from about:crashes).

Thanks.

Flags: needinfo?(nodensntt)

(In reply to Michael Lelli from comment #36)

@Austin: I was able to run that egl-wayland commit for about half an hour. It didn't crash on me, but I did notice a couple of cases where Firefox would lock up for about ten seconds, followed by the whole screen locking up for a couple of seconds, and then everything resuming like normal. So this specific issue seems like it may be fixed, but more issues remain. I also encountered a complete Firefox lockup and was forced to terminate it, but I don't know if that's Wayland related or not.

You can run it from a terminal with logging enabled. You'll then see whether Firefox is waiting for a Wayland/widget event and where the potential lockup is. Something like:

WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" ./firefox

may be enough.

I also tried that egl-wayland commit on openSUSE Tumbleweed KDE (git build) with an NVIDIA RTX 3060 Ti on driver v555.58.02, and tested both Firefox stable (v128.0.3) and the latest nightly build.

There were no improvements: the stable version keeps crashing frequently but randomly, and with nightly it's even worse because of the constant freezing issues.

I uploaded logfiles for the Firefox nightly version (link below, as the files were too big to attach), captured with the command mentioned above.

https://www.swisstransfer.com/d/f2db2e21-c9b9-410a-926c-a87fb647a28a

Regarding the crashes, can you please attach crash data from about:crashes?
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Using_Mozilla_crash_reporter
Thanks.

An important note: when testing new egl-wayland builds, you need to reboot your computer after installing them, or at least restart your DE session/window manager.

Thanks. Looking at the crashes, there are various errors/aborts/crashes across different Firefox components; I see only one VSync crash.
The crashes from https://bugzilla.mozilla.org/show_bug.cgi?id=1908825#c13 look similar: random crashes after the update of egl-wayland to 1.1.14.
I wonder if there's an fd management bug or something similar involved.

Can you run Firefox with strace, make it crash and attach the log here?

strace -f ./firefox > run.txt 2>&1

Note that the log may be huge and Firefox may be slow while running under strace.
Thanks.
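
Since such a log can run to hundreds of megabytes, a small filter helps to find the interesting part. Below is a minimal Python sketch with a made-up function name. It assumes the usual strace text format, where a failed call ends in "= -1 EBADF (Bad file descriptor)", optionally prefixed with "[pid N]" when -f is used. Calls split across "unfinished"/"resumed" lines are not handled.

```python
import re
from collections import Counter

# Optional "[pid N]" prefix, then the syscall name, then the EBADF result.
EBADF = re.compile(r"(?:\[pid\s+(\d+)\]\s+)?(\w+)\(.*=\s+-1 EBADF")


def ebadf_by_syscall(lines):
    """Count syscalls that failed with EBADF, grouped by syscall name."""
    counts = Counter()
    for line in lines:
        m = EBADF.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts


# Example use on the file produced by the strace command above:
# with open("run.txt", errors="replace") as f:
#     for name, n in ebadf_by_syscall(f).most_common():
#         print(name, n)
```

Which syscalls show up most often with EBADF gives a first hint about whether fds are being closed too early or reused.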

(In reply to Martin Stránský [:stransky] (ni? me) from comment #37)

[Parent 334218, Compositor] WARNING: Call to mmap failed: Bad file descriptor: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:515

I don't think it's related to this issue at all. It looks like a bug in the IPC code where SHM is mapped between processes, so it comes from a completely different part of Firefox. Please file a new bug for it (and please also attach any related crashes from about:crashes).

Thanks.

Created bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1911281 for it as requested and listed related crash ids.

@Austin:
Now that you mention it, it looks like there may be something timing related involved (a race condition, maybe?), as the only instance where I did not get the bad file descriptor crash was when I ran Martin's extra debug logging build while piping stdout/stderr to tee (2>&1 | tee crash.log). This happened for a single execution and I was unable to replicate it again. It could be completely coincidental, but I thought I'd mention it.

Flags: needinfo?(nodensntt)

Thanks for testing! Regarding the freezes I think I may have reproduced that once but it's so intermittent that it's hard to tell. From what I've seen freezes like that can come from the compositor as well, sometimes one side doesn't signal their timeline point for various reasons and the other side is stuck waiting. iirc I only saw my one freeze on KDE, so comparing behavior across compositors may also be needed. I certainly make no promise that egl-wayland is 100% bug-free especially since explicit sync is such a foundational change, so please continue reporting issues and let me know if there's anything I should test.

Fwiw something we wondered internally is if there is indeed a fd management issue. We aren't sure exactly why that function from the commit (send_explicit_sync_points' eglDupNativeFenceFD) fails in this case but theoretically it could be from running out of available fds. So that might also be something worth checking, how many open fds are laying around. Reproducing the badfd issue happened around 10% of the time for me and wasn't very easy to trigger.
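
Checking the number of open fds is easy to do on Linux through /proc. A minimal Python sketch with made-up function names follows. It assumes the usual /proc/<pid>/fd and /proc/<pid>/limits layout and that the caller is allowed to read them for the target process.

```python
import os


def open_fd_count(pid):
    """Number of fds currently open in the process (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit(pid):
    """Soft 'Max open files' limit of the process, or None if unavailable."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                value = line.split()[3]   # columns: name (3 words), soft, hard, unit
                return None if value == "unlimited" else int(value)
    return None


# Example: watch the main Firefox process (PID taken from ps or pgrep).
# pid = 334218
# print(open_fd_count(pid), "of", soft_fd_limit(pid))
```

Sampling this every few seconds while browsing would show whether the count climbs toward the limit before a crash.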

(In reply to Martin Stránský [:stransky] (ni? me) from comment #43)

Thanks. Looking at the crashes, there are various errors/aborts/crashes across different Firefox components; I see only one VSync crash.
The crashes from https://bugzilla.mozilla.org/show_bug.cgi?id=1908825#c13 look similar: random crashes after the update of egl-wayland to 1.1.14.
I wonder if there's an fd management bug or something similar involved.

Can you run Firefox with strace, make it crash and attach the log here?

strace -f ./firefox > run.txt 2>&1

Note that the log may be huge and Firefox may be slow while running under strace.
Thanks.

Well, it froze up immediately when I launched it with that command and then took a while until it crashed. Here is a 193.5MB log file: https://www.swisstransfer.com/d/a801c73b-9466-4cce-9f65-94b6fa3c5225

(this is with latest tumbleweed updates and latest nightly build and still using the egl-wayland commit, and a fresh reboot)

(In reply to ahjolinna from comment #46)

Well, it froze up immediately when I launched it with that command and then took a while until it crashed. Here is a 193.5MB log file: https://www.swisstransfer.com/d/a801c73b-9466-4cce-9f65-94b6fa3c5225

It's not really frozen, it's just very slow, so it looks frozen/unresponsive. This is a side effect of the overhead of running under strace and is normal.

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end: with the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

(In reply to Faaris from comment #48)

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end: with the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

128 is still affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1898476, so you need a nightly/dev build to properly test

In my case, the NFB v555 NVIDIA driver doesn't cause any crashes with the latest public nightly build released daily by the AUR maintainer heftig (https://aur.archlinux.org/firefox-nightly.git), and it's completely usable for me.

BUT I have a different problem in my case: performance. Using MOZ_ENABLE_WAYLAND=1 (Wayland) with Firefox nightly causes huge performance issues on heavy pages (for example the YouTube home page or Twitter's For You page), accompanied by high CPU usage, frame drops and stutters. Even with hardware-accelerated decoding (VAAPI-NVDEC) the performance issues are noticeable and make for a bad experience using Firefox.

Setting the environment variable MOZ_ENABLE_WAYLAND=0 and using XWayland removes these performance issues, and the frame drops disappear (the YouTube home page and Twitter's For You page are both completely smooth).

So I assume it's a problem related to Wayland, and maybe to explicit sync as well. I'll create a new bug report for this issue.
Here is a Google Drive link to a video demonstrating the problem:
https://drive.google.com/file/d/1jWR8TWwJ7YjSM5eMkqNBmYBVljT9xOsS/view?usp=sharing

(In reply to morguldir from comment #49)

(In reply to Faaris from comment #48)

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end: with the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

128 is still affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1898476, so you need a nightly/dev build to properly test

Thanks for the info. I've tested nightly and am still experiencing the same issue haha.

Any reason why those explicit sync fixes aren't being backported? It's causing a lot of crashes for quite a few users.

(In reply to Juan from comment #50)

In my case, the NFB v555 NVIDIA driver doesn't cause any crashes with the latest public nightly build released daily by the AUR maintainer heftig (https://aur.archlinux.org/firefox-nightly.git), and it's completely usable for me.

You don't mention which egl-wayland version you are using, though. As far as I know, Arch has reverted the 1.1.14 package back to 1.1.13 (like all cutting-edge distros did), which does not have explicit sync support, so Firefox does not crash.

I have been affected by the problem and have reported https://bugzilla.mozilla.org/show_bug.cgi?id=1909172

On Fedora 40 with new package:

$ rpm -qa egl-wayland
egl-wayland-1.1.14-2.20240805gitc439cd5.fc40.x86_64

and Firefox from getfirefox.net, the problem seems to be solved on my computer.

Thank you very much for the support!

I have to retract my last comment, as I have encountered several crashes (though fewer than before; Firefox now seems to be usable for 10 minutes or so):

bp-ed12ddf9-b5d3-42d2-a698-fce980240807 07.08.24, 15:24
bp-cb628faf-517b-4b31-a529-b20b90240807 07.08.24, 15:24
bp-61c465f9-1d71-4aef-b5f2-747ab0240807 07.08.24, 15:24
bp-ac86cbef-dab7-4e34-8850-40a040240807 07.08.24, 14:54
bp-7a3e329f-63ea-4c58-9c02-db7fb0240807 07.08.24, 14:07
bp-96a02672-97a5-4ae7-8604-f25a10240807 07.08.24, 14:07

I updated my openSUSE Tumbleweed (KDE-git) to the latest NVIDIA v560.31.02 driver (with the same egl-wayland commit), and my experience has been much more stable with Firefox v128.0.3. I've only had one crash: https://crash-stats.mozilla.org/report/index/6aad24ef-0225-4d1d-bcc3-457170240808

(In reply to ahjolinna from comment #55)

I updated my openSUSE Tumbleweed (KDE-git) to the latest NVIDIA v560.31.02 driver (with the same egl-wayland commit), and my experience has been much more stable with Firefox v128.0.3. I've only had one crash: https://crash-stats.mozilla.org/report/index/6aad24ef-0225-4d1d-bcc3-457170240808

I can confirm. Firefox 128 doesn't crash anymore for me with the latest NVIDIA drivers.
Writing this on Firefox 128 at the time.

Fedora 40 now ships the akmods nvidia driver 560.31.02. I still encounter crashes with Firefox 129.0 from getfirefox.net:

bp-36b1df09-f69e-4cf7-a52a-b80ec0240808 08.08.24, 09:16
bp-df85a9dc-28d4-4119-80c8-178270240808 08.08.24, 09:16

The latest egl-wayland git master (commit 4480345, previously PR#124) seems to fix this issue. No more crashes for me in a whole day.

See Also: 1909453

On my Fedora 40 (desktop) system, with egl-wayland 1.1.15, nvidia akmods version 560.31.02 and Firefox version 129.0.1 from getfirefox.net, I still encounter crashes, but substantially less frequently:

bp-8cf6629c-cb12-49a7-aec3-4b7640240814 14.08.24, 10:20
bp-99a77787-d1de-4d2b-b6e3-acdb40240814 14.08.24, 13:38

Installed Packages
Name : egl-wayland
Version : 1.1.15
Release : 1.fc40
Architecture : x86_64

I think #1863047 is related.

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes with the following error:

Wayland protocol error: wp_linux_drm_syncobj_surface_v1@72: error 4: No Acquire point provided

Installed Packages
Name : egl-wayland
Version : 1.1.15
Release : 1.fc40
Architecture : x86_64

Installed Packages
Name : akmod-nvidia
Epoch : 3
Version : 555.58.02
Release : 1.fc40

(In reply to Julius Bairaktaris from comment #60)

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes

I don't think ESR 128 has backported the fixes for bug 1898476?

(In reply to Thomas Pasch from comment #59)

On my Fedora 40 (desktop) system, with egl-wayland 1.1.15, nvidia akmods version 560.31.02 and firefox from getfirefox.net version 129.0.1, I still encounter crashes

And neither has Firefox 129. The decision here was "wontfix".

(In reply to Jan Alexander Steffens [:heftig] from comment #61)

(In reply to Julius Bairaktaris from comment #60)

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes

I don't think ESR 128 has backported the fixes for bug 1898476?

The ESR 128 backport was put on hold until the fix is proven to work.

Duplicate of this bug: 1908816

Firefox side of this is fixed by Bug 1898476.
egl-wayland part is fixed by https://github.com/NVIDIA/egl-wayland/commit/448034502fdabc4ec60fb94051f981c5a901103b

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → MOVED

(In reply to Julius Bairaktaris from comment #60)

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes with the following error:
Wayland protocol error: wp_linux_drm_syncobj_surface_v1@72: error 4: No Acquire point provided

Flatpak packages don't use the system egl-wayland; they have their own runtime with a different egl-wayland version.
So you may not be running the fixed egl-wayland-1.1.15 with Thunderbird/Flatpak, and the crash you see here comes from the old egl-wayland.

See Also: 1908816