Closed Bug 1716006 Opened 4 years ago Closed 4 years ago

Crash in [@ mozilla::detail::MutexImpl::lock | mozilla::layers::NativeSurfaceWayland::FrameCallbackHandler]

Categories

(Core :: Graphics: WebRender, defect, P4)

Tracking

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox89 --- unaffected
firefox90 --- unaffected
firefox91 --- fixed

People

(Reporter: nical, Assigned: rmader)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Attachments

(1 file)

I'm getting a lot of crashes on Nightly with Wayland enabled (MOZ_ENABLE_WAYLAND=1 env var) and the Wayland compositor forced on (gfx.webrender.compositor.force-enabled = true). I'm running on top of GNOME 40.1.0 (Fedora).

I think I hit it every time I used the Gecko Profiler, at the point where the profiler interface is shown.

Crash report: https://crash-stats.mozilla.org/report/index/326a7349-2f62-41ba-a390-1a7bc0210611

MOZ_CRASH Reason: MOZ_CRASH(mozilla::detail::MutexImpl::mutexLock: pthread_mutex_lock failed)

Top 10 frames of crashing thread:

0 firefox-bin mozilla::detail::MutexImpl::lock mozglue/misc/Mutex_posix.cpp:118
1 libxul.so mozilla::layers::NativeSurfaceWayland::FrameCallbackHandler gfx/layers/SurfacePoolWayland.cpp:167
2 libffi.so.6 libffi.so.6@0x6c03 
3 libffi.so.6 libffi.so.6@0x6106 
4 libwayland-client.so.0 libwayland-client.so.0@0x6d0f 
5 libwayland-client.so.0 libwayland-client.so.0@0x742a 
6 libwayland-client.so.0 libwayland-client.so.0@0x761b 
7 libxul.so {virtual override thunk} 
8 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1075
9 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:107

Thanks for testing! For some reason I haven't managed to reproduce that here. Do you happen to know when this condition can happen? Only if the mutex is already gone, or if the current thread already locked the mutex, or both?

I've seen these crashes while using YouTube. (I also have a boatload of pinned tabs, so that might not mean much.)

(In reply to Robert Mader [:rmader] from comment #1)

> Thanks for testing! For some reason I haven't managed to reproduce that here. Do you happen to know when this condition can happen? Only if the mutex is already gone, or if the current thread already locked the mutex, or both?

Attempts to recursively lock a non-recursive mutex can be detected (PTHREAD_MUTEX_ERRORCHECK) but this is only enabled with --enable-debug. Otherwise it will just deadlock.
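For illustration, a minimal sketch of the detection described above, assuming plain POSIX threads rather than Gecko's own MutexImpl wrapper (the function name RelockErrorCheckingMutex is made up for this example). With the PTHREAD_MUTEX_ERRORCHECK type, a second lock from the owning thread is reported as EDEADLK instead of hanging:

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Locks an error-checking mutex twice from the same thread and returns the
// result of the second pthread_mutex_lock call.
// (Hypothetical helper, not part of Gecko.)
int RelockErrorCheckingMutex() {
  pthread_mutexattr_t attr;
  int rv = pthread_mutexattr_init(&attr);
  assert(rv == 0);
  rv = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  assert(rv == 0);

  pthread_mutex_t mutex;
  rv = pthread_mutex_init(&mutex, &attr);
  assert(rv == 0);
  pthread_mutexattr_destroy(&attr);

  rv = pthread_mutex_lock(&mutex);
  assert(rv == 0);

  // Second lock by the owning thread: an error-checking mutex returns
  // EDEADLK here. A default mutex would typically block forever instead.
  int relock = pthread_mutex_lock(&mutex);

  pthread_mutex_unlock(&mutex);
  pthread_mutex_destroy(&mutex);
  return relock;
}
```

A caller that treats any non-zero return from pthread_mutex_lock as fatal would turn that EDEADLK into a crash. A build without the error-checking type would not get that far, which is why a crash in a non-debug build points at something other than a recursive lock.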

(In reply to Jan Alexander Steffens [:heftig] from comment #3)

> Attempts to recursively lock a non-recursive mutex can be detected (PTHREAD_MUTEX_ERRORCHECK) but this is only enabled with --enable-debug. Otherwise it will just deadlock.

Do I understand you correctly that a crash implies use after free then? As a recursive lock would just deadlock, not crash?

(In reply to Robert Mader [:rmader] from comment #4)

> Do I understand you correctly that a crash implies use after free then? As a recursive lock would just deadlock, not crash?

Yes, I suspect a use-after-free here.

Unlike buffer release callbacks, which can't happen after the
corresponding buffer has been destroyed, frame callbacks can apparently
still arrive after the corresponding surface has been destroyed.
This makes sense, as frame callbacks are independent objects
which can be destroyed manually.
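The idea in the patch description can be sketched roughly as follows. This is a self-contained model, not the actual Gecko or libwayland code: FrameCallback, EventQueue, and Surface are hypothetical stand-ins for a frame callback object, the client event queue, and NativeSurfaceWayland. The point is that the surface destroys its still-pending callback in its destructor, so the handler can never run against a freed object:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a frame callback: an object that lives
// independently of the surface that requested it.
struct FrameCallback {
  std::function<void()> handler;
};

// Hypothetical stand-in for the client-side event queue.
class EventQueue {
 public:
  FrameCallback* Request(std::function<void()> handler) {
    auto callback = std::make_unique<FrameCallback>();
    callback->handler = std::move(handler);
    mPending.push_back(std::move(callback));
    return mPending.back().get();
  }

  // Drops a pending callback so that it is never delivered.
  void Destroy(FrameCallback* callback) {
    mPending.erase(
        std::remove_if(mPending.begin(), mPending.end(),
                       [callback](const std::unique_ptr<FrameCallback>& p) {
                         return p.get() == callback;
                       }),
        mPending.end());
  }

  // Delivers every callback that is still pending, then frees them.
  void Dispatch() {
    std::vector<std::unique_ptr<FrameCallback>> pending = std::move(mPending);
    mPending.clear();
    for (auto& callback : pending) {
      callback->handler();
    }
  }

 private:
  std::vector<std::unique_ptr<FrameCallback>> mPending;
};

// Hypothetical stand-in for the surface that owns the handler state.
class Surface {
 public:
  Surface(EventQueue& queue, int& frames) : mQueue(queue), mFrames(frames) {}
  Surface(const Surface&) = delete;
  Surface& operator=(const Surface&) = delete;

  // The fix being modelled: clear the pending callback on destruction.
  ~Surface() { ClearPendingCallback(); }

  void RequestFrame() {
    ClearPendingCallback();
    mCallback = mQueue.Request([this] {
      // Only reachable while this surface is alive and owns the callback.
      assert(mCallback != nullptr);
      mCallback = nullptr;
      ++mFrames;
    });
  }

 private:
  void ClearPendingCallback() {
    if (mCallback) {
      mQueue.Destroy(mCallback);
      mCallback = nullptr;
    }
  }

  EventQueue& mQueue;
  int& mFrames;
  FrameCallback* mCallback = nullptr;
};

// Destroys a surface while its callback is pending, then dispatches.
int FramesDeliveredAfterDestroy() {
  EventQueue queue;
  int frames = 0;
  {
    Surface surface(queue, frames);
    surface.RequestFrame();
  }  // surface destroyed here, pending callback cleared
  queue.Dispatch();
  return frames;
}
```

Without the ClearPendingCallback() call in the destructor, Dispatch() would invoke a handler whose captured `this` points at a destroyed Surface, which is the kind of use-after-free suspected in this bug. The model is single-threaded and ignores the locking the real code needs.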

Assignee: nobody → robert.mader
Status: NEW → ASSIGNED

Jan, Nico: I've still not been able to reproduce the issue but it would make sense if the patch above solves it. Could you try the following try build and check if it fixes the issues for you? That would be awesome!

https://treeherder.mozilla.org/jobs?repo=try&revision=178a6bceccac26c263486707dca7819da79b4dde

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(jan.steffens)

Seems I can get Nightly to crash quite often (but not reliably) by having some animated page open (I used https://www.w3.org/2010/05/video/mediaevents.html with the video playing) and then rapidly opening (Ctrl+T) and closing (Ctrl+W) a new tab.

The try build crashed as well (bp-ae7e2b28-0a4d-4a7d-95cf-599400210618), but in pthread_mutex_destroy.

Flags: needinfo?(jan.steffens)

Thanks for testing! Still unable to reproduce, but maybe I need a bigger screen like 4K or so to trigger it - then we produce way more tiles.

Can I ask you for one more try with the following build? https://treeherder.mozilla.org/jobs?repo=try&revision=541c8cc0ac3e0b81c3a83e007e20b954ce40991e

It waits for the lock in the destructor and thus should avoid crashing in pthread_mutex_destroy.

Flags: needinfo?(nical.bugzilla)

Got a segfault: bp-42508c5e-d3bc-4719-8991-e57a20210619

PS: Haven't managed to reproduce this one yet. The STR from comment #8 no longer seems to work.

> Thanks for testing! Still unable to reproduce, but maybe I need a bigger screen like 4K or so to trigger it - then we produce way more tiles.

You could try reducing gfx.webrender.picture-tile-height and -width?

Values too low make Firefox no longer start, though (Error flushing display: Resource temporarily unavailable). For me (3840×2400@2), 256x256 still works and 128x128 does not. If large tile counts cause crashes, does this mean this could get triggered by the automatic tile subdivision?

@rmader Regarding maybe needing a bigger screen to produce more tiles, I experienced the crash multiple times and I am on FullHD, but use 75% scaling, so maybe scaling can help to reproduce it more reliably? Just a thought, feel free to ignore if not useful!

(In reply to Jan Alexander Steffens [:heftig] from comment #10)

> Got a segfault: bp-42508c5e-d3bc-4719-8991-e57a20210619
>
> PS: Haven't managed to reproduce this one yet. The STR from comment #8 no longer seems to work.

So that's an unrelated/new issue? And the patch seems to fix the issue here?

(In reply to Jan Alexander Steffens [:heftig] from comment #11)

> You could try reducing gfx.webrender.picture-tile-height and -width?
>
> Values too low make Firefox no longer start, though (Error flushing display: Resource temporarily unavailable). For me (3840×2400@2), 256x256 still works and 128x128 does not. If large tile counts cause crashes, does this mean this could get triggered by the automatic tile subdivision?

Thanks for the hint. I think it's more about the fact that more tiles get created and destroyed, making it more likely to hit the issue.

(In reply to pirminbraun16 from comment #12)

> @rmader Regarding maybe needing a bigger screen to produce more tiles, I experienced the crash multiple times and I am on FullHD, but use 75% scaling, so maybe scaling can help to reproduce it more reliably? Just a thought, feel free to ignore if not useful!

Thanks for the hint as well :)

Pushed by robert.mader@posteo.de: https://hg.mozilla.org/integration/autoland/rev/086a82512429 Clear pending callbacks when destroying a NativeSurfaceWayland, r=gfx-reviewers,gw
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Has Regression Range: --- → yes
Keywords: regression
See Also: → 1742990