Bugzilla

Reporter

Updated

•

2 months ago

Regressions: 1898476

Comment 1

•

2 months ago

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk

Product: Firefox → Core

https://crash-stats.mozilla.org/report/index/1281ff0e-cc87-4a6d-8cf9-e29960240718
https://crash-stats.mozilla.org/report/index/2ff93945-580c-4a67-86c2-ba5b50240718

Reporter

Comment 2

•

2 months ago

Two other crashes with different traces:

Comment 3

•

2 months ago

Please run Firefox in a terminal with the following environment variables set:
WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5"
and attach the log here when it crashes (the last ~2000 lines are enough).
Thanks.

Blocks: wayland

Flags: needinfo?(toadking)

Priority: -- → P3

Reporter

Comment 4

•

2 months ago

Attached file WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" MOZ_ENABLE_WAYLAND=1 firefox-nightly (obsolete) — Details

Attached the log with the last 2000 lines.

I also had another weird error when trying to reproduce the crash: the browser window froze up for a couple of seconds, followed by my entire desktop freezing. Everything unfroze after about ten seconds. I immediately exited Firefox and tried logging again, but I'll attach that entire log as well in case it's helpful.

Flags: needinfo?(toadking)

Reporter

Comment 5

•

2 months ago

Attached file Log capturing Firefox + Desktop freeze — Details

Reporter

Comment 6

•

2 months ago

Attached file WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" MOZ_ENABLE_WAYLAND=1 firefox-nightly V2 — Details

Attaching a new log. I looked at the first one and it didn't appear to have the actual crash lines in it so I made a new one and made sure to include enough lines for those.
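
One way to make sure a trimmed log still contains the crash is to cut it at the protocol error instead of at the end of the file. The sketch below is only an illustration, not a Mozilla tool: the function name `tail_through_error` is made up for this example, and the default marker assumes the error appears in the WAYLAND_DEBUG output as a `wl_display#1.error(` line.

```python
from collections import deque


def tail_through_error(lines, marker="wl_display#1.error(", count=2000):
    """Return up to `count` lines ending at the first line that contains `marker`.

    If the marker never appears, the plain tail of the input is returned.
    """
    window = deque(maxlen=count)  # keeps only the most recent `count` lines
    for line in lines:
        window.append(line)
        if marker in line:
            break  # stop at the error so it is the last line kept
    return list(window)
```

With a saved log it could be called as `tail_through_error(open("firefox.log", errors="replace"))`, where the file name is a placeholder.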

Reporter

Updated

•

2 months ago

Attachment #9413745 - Attachment is obsolete: true

Comment 7

•

2 months ago

Here's the relevant part:

[ 982250.261] {Display Queue} wl_display#1.delete_id(89)
[ 982250.295] {Default Queue} discarded wl_buffer#68.release()
[ 982250.304] {Default Queue} wl_callback#89.done(31369649)
[ 982250.314] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982250.323] {Default Queue}  -> wl_surface#63.commit()
[ 982251.573] {Default Queue}  -> wl_surface#63.attach(wl_buffer#83, 0, 0)

[ 982251.607]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#71, 0, 877)
[ 982251.619]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#84, 0, 188)

[ 982251.630] {Default Queue}  -> wl_surface#63.damage(0, 0, 2560, 1354)
[ 982251.641] {Default Queue}  -> wl_surface#63.commit()
[ 982251.651]  -> wl_display#1.sync(new id wl_callback#59)
[ 982251.779] {Display Queue} wl_display#1.delete_id(59)
[ 982251.795] wl_callback#59.done(16132)
[ 982256.066] {Display Queue} wl_display#1.delete_id(89)
[ 982256.102] {Default Queue} discarded wl_buffer#97.release()
[ 982256.111] {Default Queue} wl_callback#89.done(31369655)
[ 982256.121] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982256.129] {Default Queue}  -> wl_surface#63.commit()
[ 982257.479] {Default Queue}  -> wl_surface#63.attach(wl_buffer#91, 0, 0)

[ 982257.524]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#71, 0, 878)
[ 982257.539]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#90, 0, 186)

[ 982257.548] {Default Queue}  -> wl_surface#63.damage(0, 0, 2560, 1354)
[ 982257.582] {Default Queue}  -> wl_surface#63.commit()
[ 982257.593]  -> wl_display#1.sync(new id wl_callback#59)
[ 982257.717] {Display Queue} wl_display#1.delete_id(59)
[ 982257.736] wl_callback#59.done(16132)
[ 982260.449] {Display Queue} wl_display#1.delete_id(89)
[ 982260.484] {Default Queue} discarded wl_buffer#83.release()
[ 982260.493] {Default Queue} wl_callback#89.done(31369655)
[ 982260.503] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982260.512] {Default Queue}  -> wl_surface#63.commit()
[ 982261.739] {Default Queue}  -> wl_surface#63.attach(wl_buffer#68, 0, 0)

[ 982269.936] {Display Queue} wl_display#1.delete_id(89)
[ 982269.974] {Default Queue} wl_callback#89.done(31369667)

[ 982269.985] {Default Queue}  -> wl_surface#63.frame(new id wl_callback#89)
[ 982269.992] {Default Queue}  -> wl_surface#63.commit()
[ 982270.265] {Display Queue} wl_display#1.error(wp_linux_drm_syncobj_surface_v1#70, 4, "explicit sync is used, but no acquire point is set")

So it looks like we're missing the set_acquire_point/set_release_point calls after wl_surface::attach(). But that code doesn't look like Firefox code; it looks more like Mesa. Firefox doesn't use wp_linux_drm_syncobj_surface_v1 at all.
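
The pattern in this log (an attach that reaches a commit without an acquire point) can also be searched for mechanically. The following is a rough sketch under a few assumptions: the function name is hypothetical, it only understands client requests in the `-> interface#id.request(args)` form shown above, and the caller has to supply the ids of the wl_surface and of its wp_linux_drm_syncobj_surface_v1 object (63 and 70 in this excerpt).

```python
import re

# Matches client requests such as:
#   [ 982251.573] {Default Queue}  -> wl_surface#63.attach(wl_buffer#83, 0, 0)
REQUEST = re.compile(r"->\s+(\w+)#(\d+)\.(\w+)\((.*)\)")


def find_unsynced_commits(lines, surface_id, syncobj_id):
    """Return the buffers that were committed without an acquire point."""
    pending_buffer = None  # buffer attached since the last commit
    has_acquire = False    # acquire point set since the last commit
    problems = []
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue  # events and non-protocol log lines are ignored
        interface, object_id, request, args = match.groups()
        object_id = int(object_id)
        if interface == "wl_surface" and object_id == surface_id:
            if request == "attach":
                pending_buffer = args.split(",")[0].strip()
            elif request == "commit":
                if pending_buffer is not None and not has_acquire:
                    problems.append(pending_buffer)
                # a commit consumes the pending state
                pending_buffer = None
                has_acquire = False
        elif (interface == "wp_linux_drm_syncobj_surface_v1"
              and object_id == syncobj_id
              and request == "set_acquire_point"):
            has_acquire = True
    return problems
```

Fed the tail of the excerpt above with `surface_id=63` and `syncobj_id=70`, it should flag wl_buffer#68, the buffer attached just before the protocol error.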

Comment 8

•

2 months ago

Looks like the affected code comes from wsi_wl_swapchain_queue_present() in Mesa:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/vulkan/wsi/wsi_common_wayland.c#L2150

So please report it to the Mesa project.

Reporter

Comment 9

•

2 months ago

MESA issue opened: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11541

Updated

•

2 months ago

No longer regressions: 1898476

Updated

•

2 months ago

Blocks: wr-nv-linux

OS: Unspecified → Linux

Hardware: Unspecified → Desktop

G Julien

Comment 10

•

2 months ago

Downgrading egl-wayland on Arch Linux fixes the issue for now: egl-wayland 2:1.1.14-1 -> 2:1.1.13-2

Updated

•

2 months ago

Summary: Firefox _still_ crashes on Wayland with Explicit Sync on Nvidia → Firefox crashes on Wayland with Explicit Sync on Nvidia [@ wsi_wl_swapchain_queue_present]

https://github.com/NVIDIA/egl-wayland/blob/master/src/wayland-eglsurface.c#L228-L238

Comment 11

•

2 months ago

•

Edited

Isn't this the NVIDIA proprietary driver and Mesa isn't involved at all? The code using linux-drm-syncobj-v1 is in the egl-wayland library.

https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2243578903
https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2245324700

Comment 12

•

2 months ago

Looks like the egl-wayland code always sets up explicit sync for a window but then does not set sync points if a wlStreamResource was set when creating the surface context. Looks like a bug, but I can barely grasp the code.

Julian Sikorski

Comment 13

•

2 months ago

I am seeing crashes since updating egl-wayland to 1.1.14 on Fedora:
https://crash-stats.mozilla.org/report/index/914ab3e8-55e4-4f9b-8207-6cc140240721
https://crash-stats.mozilla.org/report/index/b37192d9-bf2a-47f9-90f4-ac79c0240721
https://crash-stats.mozilla.org/report/index/2297b073-fae0-4402-8bbd-38add0240721
Thunderbird is crashing too but I am not sure how to find the reports.
firefox-128.0-2.fc40.x86_64
egl-wayland-1.1.14-1.fc40.x86_64
xorg-x11-drv-nvidia-555.58.02-1.fc40.x86_64

Julian Sikorski

Comment 14

•

2 months ago

One more:
https://crash-stats.mozilla.org/report/index/a5c41147-85e1-4572-a3e7-598820240721

Comment 15

•

2 months ago

After reporting back on the egl-wayland repo, a dev there claims this is still a Firefox issue.

Updated

•

2 months ago

Duplicate of this bug: 1909172

Comment 17

•

2 months ago

(In reply to Michael Lelli from comment #15)

After reporting back on the egl-wayland repo, a dev there claims this is still a Firefox issue.

https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2243578903
https://github.com/NVIDIA/egl-wayland/issues/118#issuecomment-2245324700

Yeah, and the egl-wayland downgrade fix is just a coincidence :-) Will look at it anyway.

Updated

•

2 months ago

Duplicate of this bug: 1909453

Updated

•

2 months ago

Depends on: 1910468

Comment 19

•

2 months ago

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::widget::WlLogHandler]

Updated

•

2 months ago

Flags: needinfo?(stransky)

Updated

•

2 months ago

Flags: needinfo?(stransky)

Summary: Firefox crashes on Wayland with Explicit Sync on Nvidia [@ wsi_wl_swapchain_queue_present] → Firefox crashes on Wayland with Explicit Sync on Nvidia with egl-wayland-1.1.14 [@ wsi_wl_swapchain_queue_present]

Updated

•

2 months ago

Flags: needinfo?(stransky)

Comment 20

•

2 months ago

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Comment 21

•

2 months ago

The bug is linked to a topcrash signature, which matches the following criteria:

Top 20 desktop browser crashes on release (startup)
Top 20 desktop browser crashes on beta
Top 5 desktop browser crashes on Linux on beta
Top 5 desktop browser crashes on Linux on release (startup)

For more information, please visit BugBot documentation.

Keywords: topcrash, topcrash-startup

Comment 22

•

2 months ago

Keep in mind that the egl-wayland downgrade fix goes back to version 1.1.13, which does not have the explicit sync patches. So while it is definitely not a coincidence, it doesn't actually mean much in this case other than that Firefox does not crash without explicit sync.

I cannot get Firefox to run for more than 3-5 minutes if egl-wayland contains explicit sync support. That includes anything newer than 1.1.13 and the version of egl-wayland that is bundled with the NVIDIA 560 driver (Fedora's rpmfusion packaging has removed the external egl-wayland and egl-gbm dependencies and bundles the official ones with 560).

Comment 23

•

2 months ago

Please download, extract and run this build with a new profile:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/L199EHvwQzKeIge0M7l3Ow/runs/0/artifacts/public/build/target.tar.bz2

Run it in a terminal with the WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" environment variables set and attach the log here when it crashes. The build contains extra logging needed for debugging.

Thanks.

Updated

•

2 months ago

Duplicate of this bug: 1909059

https://crash-stats.mozilla.org/report/index/6dd4e12a-33bb-4dfd-8543-eb5510240731
https://crash-stats.mozilla.org/report/index/ef019403-7f7d-45a1-9c10-dc3000240731
https://crash-stats.mozilla.org/report/index/6909a7d6-3451-4104-8f0d-211250240731 <= vaapi disabled
https://crash-stats.mozilla.org/report/index/4c5eee95-65d5-45e0-81d7-f2a650240731

Comment 25

•

2 months ago

here we go
fedora 40, xorg-x11-drv-nvidia-555.58.02-1.fc40.x86_64, egl-wayland-1.1.14-1.fc40.x86_64 ( stable with egl-wayland-1.1.13 )
about:config -> media.ffmpeg.vaapi.enabled = true
it seems it takes much longer to force a crash if media.ffmpeg.vaapi.enabled = false

will attach the logfiles

Comment 26

•

2 months ago

Attached file firefox-crash1.log.gz — Details

Comment 27

•

2 months ago

Attached file firefox-crash2.log.gz — Details

Comment 28

•

2 months ago

Attached file firefox-crash3-hwdec_off_tail.log.gz — Details

Comment 29

•

2 months ago

Attached file firefox-crash4.log.gz — Details

Reporter

Comment 30

•

2 months ago

Attached file crashlog-8ec0dee7-c83b-47ab-a561-141df0240731.txt.gz — Details

Attached a crash with Martin's test build and egl-wayland 1.1.14.

Crash report: https://crash-stats.mozilla.org/report/index/8ec0dee7-c83b-47ab-a561-141df0240731

https://searchfox.org/mozilla-central/rev/669fac9888b173c02baa4c036e980c0c204dfe02/gfx/webrender_bindings/RenderCompositorEGL.cpp#164

Comment 31

•

2 months ago

Thanks. According to the log it doesn't look like a Firefox bug. Let's see:

wl_buffer#227 is created here by GL as a dmabuf buffer (Firefox doesn't create dmabuf buffers at all) and it has explicit sync set (again, Firefox doesn't set explicit sync points). This looks like Mesa or egl-wayland code:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1025134.831]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#228, fd 247)
[1025134.850]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#214, fd 249)
[1025134.870]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#208, fd 250)
[1025134.889]  -> wp_linux_drm_syncobj_manager_v1#54.import_timeline(new id wp_linux_drm_syncobj_timeline_v1#201, fd 251)
[1025136.884] {Default Queue}  -> zwp_linux_dmabuf_v1#49.create_params(new id zwp_linux_buffer_params_v1#236)
[1025136.894] {Default Queue}  -> zwp_linux_buffer_params_v1#236.add(fd 252, 0, 0, 10240, 50331648, 6316052)
[1025136.900] {Default Queue}  -> zwp_linux_buffer_params_v1#236.create_immed(new id wl_buffer#237, 2560, 1382, 875713089, 0)
[1025136.904] {Default Queue}  -> zwp_linux_buffer_params_v1#236.destroy()
[1025136.911] {Default Queue}  -> wl_surface#67.attach(wl_buffer#237, 0, 0)
[1025136.923]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 508)
[1025136.927]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#201, 0, 1)
[1025136.931] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382)
[1025136.934] {Default Queue}  -> wl_surface#67.commit()

So wl_buffer#237 is an internal GL buffer used as a front/back buffer; the size (2560, 1382) indicates that too:

[1024915.497] {Default Queue}  -> zwp_linux_buffer_params_v1#114.create_immed(new id wl_buffer#227, 2560, 1440, 875713089, 0)

Note that a similar buffer (#221) is also allocated for rendering with the same size:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1025149.994] {Default Queue}  -> zwp_linux_dmabuf_v1#49.create_params(new id zwp_linux_buffer_params_v1#193)
[1025150.005] {Default Queue}  -> zwp_linux_buffer_params_v1#193.add(fd 249, 0, 0, 10240, 50331648, 6316052)
[1025150.011] {Default Queue}  -> zwp_linux_buffer_params_v1#193.create_immed(new id wl_buffer#221, 2560, 1382, 875713089, 0)
[1025150.016] {Default Queue}  -> zwp_linux_buffer_params_v1#193.destroy()
[1025150.022] {Default Queue}  -> wl_surface#67.attach(wl_buffer#221, 0, 0)
[1025150.036]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 509)
[1025150.040]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#208, 0, 1)
[1025150.044] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382)
[1025150.047] {Default Queue}  -> wl_surface#67.commit()

And now the error sequence:

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1063978.115] {Default Queue}  -> wl_surface#67.attach(wl_buffer#221, 0, 0)  << attach
[1063978.134]  -> wp_linux_drm_syncobj_surface_v1#70.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#72, 0, 5097)
[1063978.140]  -> wp_linux_drm_syncobj_surface_v1#70.set_release_point(wp_linux_drm_syncobj_timeline_v1#208, 0, 1151) << set sync
[1063978.145] {Default Queue}  -> wl_surface#67.damage(0, 0, 2560, 1382) << Set damage
[1063978.149] {Default Queue}  -> wl_surface#67.commit() << commit
[...]
[1063982.605] {Default Queue}  -> wl_surface#67.frame(new id wl_callback#241) << frame callback request
[1063982.609] {Default Queue}  -> wl_surface#67.commit()  << frame callback request commit

[Parent 3270: Renderer]: D/WidgetWayland nsWindow::LockSurface()
[1063989.819] {Default Queue}  -> wl_surface#67.attach(wl_buffer#237, 0, 0) << attach

<< missing sync point set, damage and commit. 

[1063994.651] {Default Queue}  -> wl_surface#67.frame(new id wl_callback#241) << frame callback request
[1063994.656] {Default Queue}  -> wl_surface#67.commit()  << frame callback request commit. 

But that commit also picks up attach(wl_buffer#237), which has no sync point, so kaboom:

[1063994.793] {Display Queue} wl_display#1.error(wp_linux_drm_syncobj_surface_v1#70, 4, "explicit sync is used, but no acquire point is set")
[GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1#70: error 4: explicit sync is used, but no acquire point is set

As we can see, wl_buffer#221 is attached, its damage is set with damage(0, 0, 2560, 1382), and it is committed.
wl_buffer#237 is only attached and nothing else; both the damage and the commit are missing.
Only the frame callback commit is performed, and that leads to the protocol error because it also applies the already attached buffer, which has no sync point.

From the log it looks like egl-wayland (or someone else) attaches wl_buffer#237 even if it's not going to be committed which leads to the missing sync point error.

Here's the Firefox code where it happens, in the Render Thread:

#ifdef MOZ_WIDGET_GTK
  // Rendering on Wayland has to be atomic (buffer attach + commit) and
  // wayland surface is also used by main thread so lock it before
  // we paint at SwapBuffers().
  UniquePtr<MozContainerSurfaceLock> lock;
  if (auto* gtkWidget = mWidget->AsGTK()) {
    lock = gtkWidget->LockSurface();
  }
#endif
  gl()->SwapBuffers();

As you can see we just call eglSwapBuffers().

So definitely not a Firefox bug (unfortunately, as it looks like a clear, simple one that could be fixed quickly if it were in Firefox).
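
The reasoning above depends on surface state being pending until the next commit, whichever code issues that commit. The toy model below is a sketch of that behaviour as described in this comment, not compositor or egl-wayland code; the class and method names are invented for illustration, and only the acquire point check is modelled.

```python
class ProtocolError(Exception):
    """Stands in for the fatal wl_display error seen in the log."""


class SyncedSurface:
    """Toy model of a wl_surface that has explicit sync enabled."""

    def __init__(self):
        self.pending_buffer = None
        self.pending_acquire = None

    def attach(self, buffer):
        self.pending_buffer = buffer

    def set_acquire_point(self, point):
        self.pending_acquire = point

    def commit(self):
        # Any commit applies whatever is pending, including a buffer that
        # another code path attached but has not finished setting up.
        if self.pending_buffer is not None and self.pending_acquire is None:
            raise ProtocolError(
                "explicit sync is used, but no acquire point is set")
        applied = self.pending_buffer
        self.pending_buffer = None
        self.pending_acquire = None
        return applied


surface = SyncedSurface()
surface.attach("wl_buffer#221")
surface.set_acquire_point(5097)
surface.commit()                 # fine: buffer and acquire point together
surface.commit()                 # fine: frame callback commit, nothing pending
surface.attach("wl_buffer#237")  # attached, but no sync point follows
try:
    surface.commit()             # the frame callback commit picks it up
except ProtocolError as error:
    print("protocol error:", error)
```

In this model the last commit fails for the same reason as in the log: it is the frame callback commit, but it applies the buffer that was attached without a sync point.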

Flags: needinfo?(stransky)

Comment 32

•

2 months ago

Please report back at NVIDIA/egl-wayland and they hopefully will fix that.

Comment 33

•

2 months ago

And here is the log of the bad fd crash I am getting. Unsure if it is the same thing.

Comment 34

•

2 months ago

Attached file bad_fd_crash.tar.gz — Details

Austin Shafer

Comment 35

•

2 months ago

NVIDIA developer here, thanks for looking into the wl_surface locking. I do think it's possible this latest crash is a variant of another issue we are looking at recently, which could cause us to attach a surface but fail to set any sync points or commit. This could look like what you're running into above.

The problem is that I can't seem to reproduce the protocol error you're seeing myself. Maybe it's just something about the timing on my machine but I can't trigger the case I mentioned above to see if it matches the protocol error you all see. Instead I get the bad fd crash linked, or some variation of it. I've also seen warnings from IPDL complaining about bad fds too, although I can't seem to trigger that again to copy it here.

I have a prototype fix for this, but due to the above I'm unable to confirm it. Could someone with a proper repro please give it a try and let me know if it avoids the protocol error? This is still a bugfix we will want anyway but given that it could theoretically account for the symptoms in the previous comments I think it would be useful to test.
https://github.com/amshafer/egl-wayland/commit/a5182c7390a78ca2f7986cbcd2e1bf38f6be5f47

I have no clue what the source of the bad fd issues could be, I don't see an obvious way it could affect egl-wayland but I'm not very familiar with firefox internals.

Thanks!

Reporter

Comment 36

•

2 months ago

@Austin: I was able to run that egl-wayland commit for about half an hour. It didn't crash on me, but I did notice a couple of cases where Firefox would lock up for about ten seconds, followed by the whole screen locking up a couple of seconds later, and then everything resuming like normal. So this specific issue seems like it may be fixed, but there are still more issues remaining. I also encountered a complete Firefox lockup and was forced to terminate it, but I don't know if that's Wayland related or not.

Comment 37

•

2 months ago

[Parent 334218, Compositor] WARNING: Call to mmap failed: Bad file descriptor: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:515

I don't think it's related to this issue at all. It looks like a bug in the IPC code where SHM is mapped between processes, so this one seems to come from a completely different part of Firefox. Please file a new bug for it (and please attach any related crashes from about:crashes).

Thanks.

Flags: needinfo?(nodensntt)

Comment 38

•

2 months ago

(In reply to Michael Lelli from comment #36)

@Austin: I was able to run that egl-wayland commit for about half an hour. It didn't crash on me, but I did notice a couple of cases where Firefox would lock up for about ten seconds, followed by the whole screen locking up a couple of seconds later, and then everything resuming like normal. So this specific issue seems like it may be fixed, but there are still more issues remaining. I also encountered a complete Firefox lockup and was forced to terminate it, but I don't know if that's Wayland related or not.

You can run it in a terminal with logging enabled. You'll see whether Firefox is waiting for a Wayland/widget event and where the potential lockup is. Something like:

WAYLAND_DEBUG=1 MOZ_LOG="Widget:5 WidgetWayland:5" ./firefox

may be enough.

https://www.swisstransfer.com/d/f2db2e21-c9b9-410a-926c-a87fb647a28a

Comment 39

•

2 months ago

I also tried that egl-wayland commit on openSUSE tumbleweed KDE (git build) with nvidia RTX 3060 Ti on driver v555.58.02, and tested both Firefox stable (v128.0.3) and latest nightly build.

There were no improvements: the stable version keeps crashing frequently but randomly, and with nightly it's even worse because of all the constant freezing issues.

I added log files for the Firefox nightly version (the link below, as the files were too big to attach); I used the mentioned command to debug.

Comment 40

•

2 months ago

According to the crashes, can you please attach crash data from about:crashes?
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Using_Mozilla_crash_reporter
Thanks.

Reporter

Comment 41

•

2 months ago

An important note: when testing new egl-wayland builds, you need to reboot your computer after installing them, or at least restart your DE session/window manager.

Comment 42

•

2 months ago

and yes oc I have rebooted my PC after the egl-wayland build update, as I do with every (system) update

Comment 43

•

2 months ago

Thanks. Looking at the crashes, there are various errors/aborts/crashes across different Firefox components; I see only one VSync crash.
The crashes from https://bugzilla.mozilla.org/show_bug.cgi?id=1908825#c13 look similar: random crashes after updating egl-wayland to 1.1.14.
I wonder if there's an fd management bug or something similar involved.

Can you run Firefox with strace, make it crash and attach the log here?

strace -f ./firefox > run.txt 2>&1

Note that the log may be huge and Firefox may be slow while running under strace.
Thanks.
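
Since such a log can run to hundreds of megabytes, it may help to filter it for failed calls first. The sketch below counts system calls that returned EBADF, grouped by call name. It is a hedged example, not part of any Firefox tooling: the function name is invented, and the regular expression assumes the usual `strace -f` line layout (an optional `[pid N]` prefix, then `call(args) = -1 EBADF (...)` or a `<... call resumed>` continuation).

```python
import re
from collections import Counter

# Examples of lines this is meant to match:
#   close(57) = -1 EBADF (Bad file descriptor)
#   [pid  4242] mmap(NULL, 4096, ...) = -1 EBADF (Bad file descriptor)
#   [pid  4243] <... close resumed>) = -1 EBADF (Bad file descriptor)
EBADF_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?"
    r"(?:<\.\.\.\s+(\w+)\s+resumed>|(\w+)\()"
    r".*=\s+-1\s+EBADF\b"
)


def count_ebadf(lines):
    """Count syscalls that failed with EBADF, keyed by syscall name."""
    counts = Counter()
    for line in lines:
        match = EBADF_CALL.match(line)
        if match:
            counts[match.group(1) or match.group(2)] += 1
    return counts
```

For the command above it could be used as `count_ebadf(open("run.txt", errors="replace"))`. Firefox's own stderr output is interleaved in that file, and lines that do not look like a failed syscall are skipped.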

Comment 44

•

2 months ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #37)

[Parent 334218, Compositor] WARNING: Call to mmap failed: Bad file descriptor: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:515

I don't think it's related to this issue at all. It looks like a bug in the IPC code where SHM is mapped between processes, so this one seems to come from a completely different part of Firefox. Please file a new bug for it (and please attach any related crashes from about:crashes).

Thanks.

Created bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1911281 for it as requested and listed related crash ids.

@Austin:
Now that you mention it, it looks like there may be something timing related involved (a race condition, maybe?), as the only instance where I did not get the bad file descriptor crash was when I ran Martin's extra debug logging build while piping stdout/stderr to tee (2>&1 | tee crash.log). This happened for a single run and I was unable to replicate it again. It could be completely coincidental, but I thought I'd mention it.

Flags: needinfo?(nodensntt)

Austin Shafer

Comment 45

•

2 months ago

Thanks for testing! Regarding the freezes I think I may have reproduced that once but it's so intermittent that it's hard to tell. From what I've seen freezes like that can come from the compositor as well, sometimes one side doesn't signal their timeline point for various reasons and the other side is stuck waiting. iirc I only saw my one freeze on KDE, so comparing behavior across compositors may also be needed. I certainly make no promise that egl-wayland is 100% bug-free especially since explicit sync is such a foundational change, so please continue reporting issues and let me know if there's anything I should test.

Fwiw something we wondered internally is whether there is indeed an fd management issue. We aren't sure exactly why that function from the commit (send_explicit_sync_points' eglDupNativeFenceFD) fails in this case, but theoretically it could be from running out of available fds. So that might also be worth checking: how many open fds are lying around. Reproducing the badfd issue happened around 10% of the time for me and wasn't very easy to trigger.
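
On Linux, the number of open fds can be read from `/proc/<pid>/fd`. The sketch below counts them and groups them by a rough kind taken from the symlink target (for example `socket` for `socket:[12345]`, or the full path for regular files and devices). The function name and the `proc_root` parameter are assumptions made for this example; `proc_root` exists only so that the function can be pointed at a test directory.

```python
import os
from collections import Counter


def open_fds_by_kind(pid="self", proc_root="/proc"):
    """Return a Counter of open file descriptors for `pid`, grouped by kind."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were listing
        # "socket:[123]" -> "socket", "anon_inode:[eventfd]" -> "anon_inode",
        # plain paths such as "/dev/null" are kept as they are
        kinds[target.split(":", 1)[0]] += 1
    return kinds
```

`sum(open_fds_by_kind(pid).values())` gives the total, which can be compared with the "Max open files" row in `/proc/<pid>/limits`.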

Comment 46

•

2 months ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #43)

Thanks. Looking at the crashes, there are various errors/aborts/crashes across different Firefox components; I see only one VSync crash.
The crashes from https://bugzilla.mozilla.org/show_bug.cgi?id=1908825#c13 look similar: random crashes after updating egl-wayland to 1.1.14.
I wonder if there's an fd management bug or something similar involved.

Can you run Firefox with strace, make it crash and attach the log here?
strace -f ./firefox > run.txt 2>&1
Note that the log may be huge and Firefox may be slow while running under strace.
Thanks.

Well, it froze up immediately when I launched it with that command and then took a while until it crashed. Here is a 193.5MB log file: https://www.swisstransfer.com/d/a801c73b-9466-4cce-9f65-94b6fa3c5225

(this is with latest tumbleweed updates and latest nightly build and still using the egl-wayland commit, and a fresh reboot)

Comment 47

•

1 month ago

(In reply to ahjolinna from comment #46)

Well, it froze up immediately when I launched it with that command and then took a while until it crashed. Here is a 193.5MB log file: https://www.swisstransfer.com/d/a801c73b-9466-4cce-9f65-94b6fa3c5225

It's not really frozen; it's just very slow, so it looks frozen/unresponsive. This is a side effect of the overhead of running under strace and is normal.

Faaris

Comment 48

•

1 month ago

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end. With the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

morguldir

Comment 49

•

1 month ago

(In reply to Faaris from comment #48)

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end. With the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

128 is still affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1898476, so you need a nightly/dev build to properly test

Juan

Comment 50

•

1 month ago

In my case, using the NFB v555 NVIDIA driver doesn't cause any crashes with the latest public nightly build released daily by the AUR maintainer heftig (https://aur.archlinux.org/firefox-nightly.git), and it's completely usable for me...

BUT I have a different problem in my case: performance. Using MOZ_ENABLE_WAYLAND=1 (Wayland) on Firefox Nightly causes huge performance issues on heavy pages (for example, the YouTube home page or Twitter's For You page), accompanied by high CPU usage, frame drops, and stutters. Even with hardware-accelerated decoding (VAAPI-NVDEC), the performance issues are very noticeable and make for a bad experience using Firefox.

Setting the environment variable MOZ_ENABLE_WAYLAND=0 and using XWayland removes these performance issues, and the frame drops disappear (the YouTube home page and the Twitter For You page are both completely smooth).

So I assume it's a problem related to Wayland, and maybe explicit sync as well. I'll create a new bug report regarding this issue.
I'll leave a link to Google Drive to download a demonstration video of my problem.
https://drive.google.com/file/d/1jWR8TWwJ7YjSM5eMkqNBmYBVljT9xOsS/view?usp=sharing

Faaris

Comment 51

•

1 month ago

(In reply to morguldir from comment #49)

(In reply to Faaris from comment #48)

I've built egl-wayland from source with the latest commit that A. Shafer mentioned. Sadly it didn't help at all with the crashing for me (Firefox 128.0.3). I have a reproducible way to cause the crash on my end. With the Bitwarden extension installed, pressing Ctrl+Shift+L to open its popup menu outright kills the browser.

128 is still affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1898476, so you need a nightly/dev build to properly test

Thanks for the info. I've tested nightly and am still experiencing the same issue haha.

Any reason why those explicit sync fixes aren't being backported? It's causing a lot of crashes for quite a few users.

Comment 52

•

1 month ago

(In reply to Juan from comment #50)

In my case, using the NFB v555 NVIDIA driver doesn't cause any crashes with the latest public nightly build released daily by the AUR maintainer heftig (https://aur.archlinux.org/firefox-nightly.git), and it's completely usable for me...

You don't mention which egl-wayland version you are using, though. As far as I know, Arch has reverted the 1.1.14 package back to 1.1.13 (like all cutting-edge distros did), which does not have explicit sync support, so Firefox does not crash.

Thomas Pasch

Comment 53

•

1 month ago

I have been affected by the problem and have reported https://bugzilla.mozilla.org/show_bug.cgi?id=1909172

On Fedora 40 with new package:

$ rpm -qa egl-wayland
egl-wayland-1.1.14-2.20240805gitc439cd5.fc40.x86_64

and firefox from getfirefox.net the problem seems to be solved on my computer.

Thank you very much for support!

Thomas Pasch

Comment 54

•

1 month ago

I have to revert my last comment, as I have encountered several crashes (but less than before, firefox seems to be useable for 10 minutes or so now):

bp-ed12ddf9-b5d3-42d2-a698-fce980240807 07.08.24, 15:24
bp-cb628faf-517b-4b31-a529-b20b90240807 07.08.24, 15:24
bp-61c465f9-1d71-4aef-b5f2-747ab0240807 07.08.24, 15:24
bp-ac86cbef-dab7-4e34-8850-40a040240807 07.08.24, 14:54
bp-7a3e329f-63ea-4c58-9c02-db7fb0240807 07.08.24, 14:07
bp-96a02672-97a5-4ae7-8604-f25a10240807 07.08.24, 14:07

Comment 55

•

1 month ago

I updated my openSUSE Tumbleweed (KDE-git) to the latest NVIDIA v560.31.02 driver (with the same egl-wayland commit), my experience has been much more stable with Firefox v128.0.3—I’ve only had one crash. https://crash-stats.mozilla.org/report/index/6aad24ef-0225-4d1d-bcc3-457170240808

Juan

Comment 56

•

1 month ago

(In reply to ahjolinna from comment #55)

I updated my openSUSE Tumbleweed (KDE-git) to the latest NVIDIA v560.31.02 driver (with the same egl-wayland commit), my experience has been much more stable with Firefox v128.0.3—I’ve only had one crash. https://crash-stats.mozilla.org/report/index/6aad24ef-0225-4d1d-bcc3-457170240808

I can confirm. Firefox 128 doesn't crash anymore for me with the latest NVIDIA drivers.
Writing this on Firefox 128 at the time.

Thomas Pasch

Comment 57

•

1 month ago

Fedora 40 now ships the akmods NVIDIA driver 560.31.02. I still encounter crashes with Firefox 129.0 from getfirefox.net:

bp-36b1df09-f69e-4cf7-a52a-b80ec0240808 08.08.24, 09:16
bp-df85a9dc-28d4-4119-80c8-178270240808 08.08.24, 09:16

huyizheng

Comment 58

•

1 month ago

The latest egl-wayland git master (commit 4480345, previously PR#124) seems to fix this issue. No more crashes for me in a whole day.

Comment 59

•

1 month ago

On my Fedora 40 (desktop) system, with egl-wayland 1.1.15, nvidia akmods version 560.31.02 and Firefox version 129.0.1 from getfirefox.net, I still encounter crashes, but substantially less frequently:

bp-8cf6629c-cb12-49a7-aec3-4b7640240814 14.08.24, 10:20
bp-99a77787-d1de-4d2b-b6e3-acdb40240814 14.08.24, 13:38

Installed Packages
Name : egl-wayland
Version : 1.1.15
Release : 1.fc40
Architecture : x86_64

I think #1863047 is related.

Julius Bairaktaris

Comment 60

•

1 month ago

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes with the following error:

Wayland protocol error: wp_linux_drm_syncobj_surface_v1@72: error 4: No Acquire point provided

Installed Packages
Name : egl-wayland
Version : 1.1.15
Release : 1.fc40
Architecture : x86_64

Installed Packages
Name : akmod-nvidia
Epoch : 3
Version : 555.58.02
Release : 1.fc40

Comment 61

•

1 month ago

•

Edited

(In reply to Julius Bairaktaris from comment #60)

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes

I don't think ESR 128 has backported the fixes for bug 1898476?

(In reply to Thomas Pasch from comment #59)

On my Fedora 40 (desktop) system, with egl-wayland 1.1.15, nvidia akmods version 560.31.02 and firefox from getfirefox.net version 129.0.1, I still encounter crashes

And neither has Firefox 129. The decision here was "wontfix".

Comment 62

•

1 month ago

(In reply to Jan Alexander Steffens [:heftig] from comment #61)

(In reply to Julius Bairaktaris from comment #60)

On my Fedora 40 Workstation Thunderbird 128.1.0esr (Flatpak) crashes

I don't think ESR 128 has backported the fixes for bug 1898476?

The ESR 128 backport was put on hold until the fix is proven to work.

Updated

•

1 month ago

Duplicate of this bug: 1908816

Comment 64

•

1 month ago

Firefox side of this is fixed by Bug 1898476.
egl-wayland part is fixed by https://github.com/NVIDIA/egl-wayland/commit/448034502fdabc4ec60fb94051f981c5a901103b

Status: NEW → RESOLVED

Closed: 1 month ago

Resolution: --- → MOVED

Updated

•

1 month ago

URL: https://github.com/NVIDIA/egl-wayland...