Closed Bug 1716006 Opened 4 years ago Closed 4 years ago

Crash in [@ mozilla::detail::MutexImpl::lock | mozilla::layers::NativeSurfaceWayland::FrameCallbackHandler]

Tracking

Status:

RESOLVED FIXED

Milestone:

91 Branch

Tracking Flags:

firefox-esr78: tracking ---, status unaffected

firefox89: tracking ---, status unaffected

firefox90: tracking ---, status unaffected

firefox91: tracking ---, status fixed

People

(Reporter: nical, Assigned: rmader)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Attachments

(1 file)

Bug 1716006 - Clear pending callbacks when destroying a NativeSurfaceWayland, r=#gfx-reviewers (4 years ago, Robert Mader [:rmader], 48 bytes, text/x-phabricator-request)

Nicolas Silva [:nical]

Reporter

Description

•

4 years ago

•

Edited

I'm getting a lot of crashes on Nightly with Wayland enabled (MOZ_ENABLE_WAYLAND=1 env var) and the Wayland compositor enabled (gfx.webrender.compositor.force-enabled = true). I'm running on top of GNOME 40.1.0 (Fedora).

I think it happened each time I tried to use the Gecko profiler, when the profiler interface was shown.

Crash report: https://crash-stats.mozilla.org/report/index/326a7349-2f62-41ba-a390-1a7bc0210611

MOZ_CRASH Reason: MOZ_CRASH(mozilla::detail::MutexImpl::mutexLock: pthread_mutex_lock failed)

Top 10 frames of crashing thread:

0 firefox-bin mozilla::detail::MutexImpl::lock mozglue/misc/Mutex_posix.cpp:118
1 libxul.so mozilla::layers::NativeSurfaceWayland::FrameCallbackHandler gfx/layers/SurfacePoolWayland.cpp:167
2 libffi.so.6 libffi.so.6@0x6c03 
3 libffi.so.6 libffi.so.6@0x6106 
4 libwayland-client.so.0 libwayland-client.so.0@0x6d0f 
5 libwayland-client.so.0 libwayland-client.so.0@0x742a 
6 libwayland-client.so.0 libwayland-client.so.0@0x761b 
7 libxul.so {virtual override thunk} 
8 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1075
9 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:107

Robert Mader [:rmader]

Assignee

Updated

•

4 years ago

Blocks: wr-linux-wayland-compositing

Robert Mader [:rmader]

Assignee

Comment 1

•

4 years ago

Thanks for testing! For some reason I haven't managed to reproduce that here. Do you happen to know when this condition can happen? Only if the mutex is already gone, or if the current thread already locked the mutex, or both?

Jan Alexander Steffens [:heftig]

Comment 2

•

4 years ago

I've seen these crashes while using YouTube. (I also have a boatload of pinned tabs, so that might not mean much.)

Jan Alexander Steffens [:heftig]

Comment 3

•

4 years ago

•

Edited

(In reply to Robert Mader [:rmader] from comment #1)

> Thanks for testing! For some reason I haven't managed to reproduce that here. Do you happen to know when this condition can happen? Only if the mutex is already gone, or if the current thread already locked the mutex, or both?

Attempts to recursively lock a non-recursive mutex can be detected (PTHREAD_MUTEX_ERRORCHECK) but this is only enabled with --enable-debug. Otherwise it will just deadlock.
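For reference, the error-checking behaviour mentioned above can be sketched in isolation. This is a minimal standalone example, not Mozilla's MutexImpl; the function name RelockErrorCheckMutex is made up for illustration. It assumes a POSIX system, where a relock of a PTHREAD_MUTEX_ERRORCHECK mutex by its owner returns EDEADLK instead of blocking. Going by the crash reason in the report, a non-zero return from pthread_mutex_lock is what ends up in MOZ_CRASH.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Locks an error-checking mutex twice from the same thread and returns the
// result of the second attempt. With PTHREAD_MUTEX_ERRORCHECK the second
// call fails with EDEADLK; with a default mutex the behaviour is undefined
// and in practice usually a deadlock.
int RelockErrorCheckMutex() {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

  pthread_mutex_t mutex;
  pthread_mutex_init(&mutex, &attr);
  pthread_mutexattr_destroy(&attr);

  int first = pthread_mutex_lock(&mutex);  // first lock is expected to succeed
  assert(first == 0);
  (void)first;

  int second = pthread_mutex_lock(&mutex);  // relock by the owning thread

  pthread_mutex_unlock(&mutex);
  pthread_mutex_destroy(&mutex);
  return second;
}
```

Calling RelockErrorCheckMutex() should return EDEADLK. Locking a mutex whose memory has already been freed is a different case: POSIX leaves that undefined, so it may or may not produce an error code.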

Robert Mader [:rmader]

Assignee

Comment 4

•

4 years ago

(In reply to Jan Alexander Steffens [:heftig] from comment #3)

> Attempts to recursively lock a non-recursive mutex can be detected (PTHREAD_MUTEX_ERRORCHECK) but this is only enabled with --enable-debug. Otherwise it will just deadlock.

Do I understand you correctly that a crash implies use after free then? As a recursive lock would just deadlock, not crash?

Jan Alexander Steffens [:heftig]

Comment 5

•

4 years ago

(In reply to Robert Mader [:rmader] from comment #4)

> Do I understand you correctly that a crash implies use after free then? As a recursive lock would just deadlock, not crash?

Yes, I suspect a use-after-free here.

Robert Mader [:rmader]

Assignee

Comment 6

•

4 years ago

Attached file: Bug 1716006 - Clear pending callbacks when destroying a NativeSurfaceWayland, r=#gfx-reviewers

Unlike buffer release callbacks, which can't fire after the corresponding buffer has been destroyed, frame callbacks can apparently still arrive after the corresponding surface has been destroyed. This makes sense, as frame callbacks are independent objects that can be destroyed manually.
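The idea in the patch description can be illustrated with a small simulation. This is only a sketch under stated assumptions: FakeDisplay and FakeSurface are hypothetical stand-ins, not libwayland or Gecko code, and the actual patch may be structured differently. The point it shows is that a pending callback is tracked separately from the surface, so the surface has to destroy its own pending callbacks before it goes away.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the compositor connection. Pending frame
// callbacks are independent objects with their own ids and stay queued
// until they fire or are destroyed explicitly.
class FakeDisplay {
 public:
  using Callback = std::function<void()>;

  int AddFrameCallback(Callback cb) {
    assert(cb && "callback must be callable");
    int id = mNextId++;
    mPending.emplace(id, std::move(cb));
    return id;
  }

  // Destroying an id that already fired is a harmless no-op.
  void DestroyCallback(int id) { mPending.erase(id); }

  // Fires every callback that is still pending; returns how many ran.
  int Dispatch() {
    std::unordered_map<int, Callback> pending = std::move(mPending);
    mPending.clear();
    for (auto& entry : pending) {
      entry.second();
    }
    return static_cast<int>(pending.size());
  }

 private:
  int mNextId = 1;
  std::unordered_map<int, Callback> mPending;
};

// Hypothetical surface whose frame handler touches its own members.
class FakeSurface {
 public:
  explicit FakeSurface(FakeDisplay& display) : mDisplay(display) {}

  ~FakeSurface() {
    // Clear pending callbacks on destruction. Without this loop, a later
    // Dispatch() would run a handler that captured a dangling `this`.
    for (int id : mCallbackIds) {
      mDisplay.DestroyCallback(id);
    }
  }

  void RequestFrame() {
    mCallbackIds.push_back(
        mDisplay.AddFrameCallback([this] { ++mFramesHandled; }));
  }

  int FramesHandled() const { return mFramesHandled; }

 private:
  FakeDisplay& mDisplay;
  std::vector<int> mCallbackIds;
  int mFramesHandled = 0;
};
```

In this sketch, requesting a frame on a FakeSurface and then destroying the surface leaves nothing for Dispatch() to run. With the destructor loop removed, the queued handler would still fire and write to freed memory, which is the kind of use-after-free suspected earlier in this bug.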

Phabricator Automation

•

4 years ago

Thanks for testing! Still unable to reproduce, but maybe I need a bigger screen like 4K or so to trigger it - then we produce way more tiles.

You could try reducing gfx.webrender.picture-tile-height and -width?

Values too low make Firefox no longer start, though (Error flushing display: Resource temporarily unavailable). For me (3840×2400@2), 256x256 still works and 128x128 does not. If large tile counts cause crashes, does this mean this could get triggered by the automatic tile subdivision?

pirminbraun16

Comment 12

•

4 years ago

@rmader Regarding maybe needing a bigger screen to produce more tiles, I experienced the crash multiple times and I am on FullHD, but use 75% scaling, so maybe scaling can help to reproduce it more reliably? Just a thought, feel free to ignore if not useful!

Robert Mader [:rmader]

Assignee

Comment 13

•

4 years ago

(In reply to Jan Alexander Steffens [:heftig] from comment #10)

> Got a segfault: bp-42508c5e-d3bc-4719-8991-e57a20210619

> PS: Haven't managed to reproduce this one yet. The STR from comment #8 no longer seem to work.

So that's an unrelated/new issue? And the patch seems to fix the issue here?

(In reply to Jan Alexander Steffens [:heftig] from comment #11)

> You could try reducing gfx.webrender.picture-tile-height and -width?

> Values too low make Firefox no longer start, though (Error flushing display: Resource temporarily unavailable). For me (3840×2400@2), 256x256 still works and 128x128 does not. If large tile counts cause crashes, does this mean this could get triggered by the automatic tile subdivision?

Thanks for the hint. I think it's more about the fact that more tiles get created and destroyed, making it more likely to hit the issue.
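As a rough illustration of how the tile size affects the number of tiles, here is a back-of-the-envelope sketch. It assumes the 3840×2400 figure is in device pixels and that tiles only need to cover one screen-sized area; TileCount is a made-up helper, and WebRender's real tiling logic is more involved.

```cpp
#include <cassert>

// Number of tileW x tileH tiles needed to cover a width x height area,
// rounding up partially covered rows and columns.
int TileCount(int width, int height, int tileW, int tileH) {
  assert(width > 0 && height > 0 && tileW > 0 && tileH > 0);
  int columns = (width + tileW - 1) / tileW;
  int rows = (height + tileH - 1) / tileH;
  return columns * rows;
}
```

Under these assumptions, TileCount(3840, 2400, 256, 256) gives 15 × 10 = 150 tiles and TileCount(3840, 2400, 128, 128) gives 30 × 19 = 570, so halving the tile size roughly quadruples the number of surfaces that get created and destroyed.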

(In reply to pirminbraun16 from comment #12)

> @rmader Regarding maybe needing a bigger screen to produce more tiles, I experienced the crash multiple times and I am on FullHD, but use 75% scaling, so maybe scaling can help to reproduce it more reliably? Just a thought, feel free to ignore if not useful!

Thanks for the hint as well :)

Pulsebot

Comment 14

•

4 years ago

Pushed by robert.mader@posteo.de: https://hg.mozilla.org/integration/autoland/rev/086a82512429 Clear pending callbacks when destroying a NativeSurfaceWayland, r=gfx-reviewers,gw

Narcis Beleuzu [:NarcisB]

Comment 15

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/086a82512429

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox91: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 91 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox89: --- → unaffected

status-firefox90: --- → unaffected

status-firefox-esr78: --- → unaffected

Regressed by: 1711244

BMO Automation

Updated

•

4 years ago

Has Regression Range: --- → yes

Keywords: regression

Sean Feng [:sefeng211]

Updated

•

4 years ago