Crash in [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | _cairo_atomic_int_cmpxchg_return_old_impl]
Categories
(Core :: Graphics, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox113 | --- | wontfix |
firefox114 | --- | fixed |
firefox115 | --- | fixed |
People
(Reporter: RyanVM, Assigned: jrmuizel)
References
(Regression)
Details
(Keywords: crash, regression, topcrash, Whiteboard: [win:stability])
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-release+
|
Details | Review |
Only affects Windows 10 and 11. Filing this in Graphics for now, but I can't help but notice the mozilla::widget::ToastNotificationHandler::AsyncSaveImage
further down the stack. A couple of reports from before 112, but it looks like it clearly spiked starting with 112.0.2.
Crash report: https://crash-stats.mozilla.org/report/index/3e394898-6460-428a-956a-bc1d00230518
Reason: EXCEPTION_ACCESS_VIOLATION_WRITE
Top 10 frames of crashing thread:
0 ntdll.dll RtlpWaitOnCriticalSection
1 ntdll.dll RtlpEnterCriticalSectionContended
2 ntdll.dll RtlEnterCriticalSection
3 xul.dll _cairo_atomic_int_cmpxchg_return_old_impl gfx/cairo/cairo/src/cairo-atomic.c:69
3 xul.dll _cairo_surface_allocate_unique_id gfx/cairo/cairo/src/cairo-surface.c:282
3 xul.dll _cairo_surface_init gfx/cairo/cairo/src/cairo-surface.c:418
4 xul.dll _cairo_image_surface_create_for_pixman_image gfx/cairo/cairo/src/cairo-image-surface.c:193
5 xul.dll _cairo_image_surface_create_with_pixman_format gfx/cairo/cairo/src/cairo-image-surface.c:365
6 xul.dll _moz_cairo_image_surface_create_for_data gfx/cairo/cairo/src/cairo-image-surface.c:547
7 xul.dll mozilla::gfx::DrawTargetCairo::Init gfx/2d/DrawTargetCairo.cpp:1914
Reporter | ||
Comment 1•2 years ago
|
||
The timing of this crash spike and the crashing line in ToastNotificationHandler.cpp fits bug 1822817 quite well.
Comment 2•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on beta
For more information, please visit BugBot documentation.
Reporter | ||
Updated•2 years ago
|
Comment 3•2 years ago
|
||
It doesn't appear like we initialize _cairo_atomic_mutex
anywhere. However, it also seems like we must be building cairo without atomic support for this to be executed, which is somewhat surprising. Perhaps someone knows whether we may be misconfiguring cairo during build?
Comment 4•2 years ago
•
|
||
I agree that this crash looks exactly like what would happen if we executed _cairo_surface_init
with an uninitialized _cairo_atomic_mutex
. However we do have a path to initialize _cairo_atomic_mutex
, in _cairo_mutex_initialize
, which, adapting the disassembly into pseudo-code, looks like this:
if (!_cairo_mutex_initialized) {
_cairo_mutex_initialized = TRUE;
InitializeCriticalSection(&_cairo_image_solid_cache_mutex);
InitializeCriticalSection(&_cairo_toy_font_face_mutex);
InitializeCriticalSection(&_cairo_intern_string_mutex);
InitializeCriticalSection(&_cairo_scaled_font_map_mutex);
InitializeCriticalSection(&_cairo_scaled_glyph_page_cache_mutex);
InitializeCriticalSection(&_cairo_scaled_font_error_mutex);
InitializeCriticalSection(&_cairo_glyph_cache_mutex);
InitializeCriticalSection(&_cairo_win32_font_face_mutex);
InitializeCriticalSection(&_cairo_win32_font_dc_mutex);
InitializeCriticalSection(&_cairo_atomic_mutex);
}
_cairo_mutex_initialize
is called at the beginning of _cairo_surface_init
through a macro: CAIRO_MUTEX_INITIALIZE ();
which results in the following disassembly:
xul!_cairo_surface_init [/builds/worker/checkouts/gecko/gfx/cairo/cairo/src/cairo-surface.c @ 407]:
// ...
00007ffd`f3979d28 833d390e3a0500 cmp dword ptr [xul!_cairo_mutex_initialized (00007ffd`f8d1ab68)],0
00007ffd`f3979d2f 0f8462010000 je xul!_cairo_surface_init+0x187 (00007ffd`f3979e97)
00007ffd`f3979d35 // mutexes are initialized, go on
// ...
00007ffd`f3979e97 e884f4ffff call xul!_cairo_mutex_initialize (00007ffd`f3979320)
00007ffd`f3979e9c e994feffff jmp xul!_cairo_surface_init+0x25 (00007ffd`f3979d35)
I think what we have here is a race condition, because _cairo_mutex_initialized = TRUE;
is executed before InitializeCriticalSection(&_cairo_atomic_mutex);
, which implies that a different thread could execute _cairo_surface_init
, read _cairo_mutex_initialized
as TRUE
and thus assume that we are done initializing _cairo_atomic_mutex
while we are not, then crash by trying to use it. This being a race condition would also explain why this crash is tied to a background thread functionality AsyncSaveImage
.
Note that it is not desirable to just move _cairo_mutex_initialized = TRUE;
to the end of _cairo_mutex_initialize
because then the race condition would lead to multiple calls to InitializeCriticalSection
for the same section, which is undefined behavior. If possible, I would recommend that we add a first call to _cairo_mutex_initialize
at a point in time where we are sure that there can be no other thread relying on cairo
.
The way to reproduce this crash is likely to issue a lot of notifications at the same time, all with an associated image, before any other notification (and more generally before executing any other path that could call into _cairo_mutex_initialize
).
Updated•2 years ago
|
Comment 5•2 years ago
|
||
Note that it is not desirable to just move _cairo_mutex_initialized = TRUE; to the end of _cairo_mutex_initialize because then the race condition would lead to multiple calls to InitializeCriticalSection for the same section, which is undefined behavior.
Is there a reason this can't be paired with a guard for initialization started?
Comment 6•2 years ago
•
|
||
(In reply to Nick Rishel [:nrishel] from comment #5)
Is there a reason this can't be paired with a guard for initialization started?
I am not sure if I understand the question correctly, so feel free to correct me. The purpose of _cairo_mutex_initialize
is to initialize the synchronization primitives that are used by the rest of the cairo code. It needs to be called only once, so it would be more optimal to just call it ahead of use in an initialization path, rather than to guard the path to make it thread-safe (which would require to use a synchronization primitive which would itself have to be initialized ahead of use). However, both solutions would fix the crash (assuming the analysis from comment 4 is correct).
Assignee | ||
Updated•2 years ago
|
Comment 7•2 years ago
•
|
||
:yannis I'm taking as a given we'd prefer not to run this in the initialization (startup?) path to keep it as lean as possible. I'm not intimately familiar with this code though so I might be bringing incorrect assumptions.
Comment 8•2 years ago
•
|
||
Here, by initialization path, I mean any path that is guaranteed to run on the main thread before the first task that calls into gfxUtils::EncodeSourceSurface
gets dispatched. I think ToastNotificationHandler::AsyncSaveImage
could meet this requirement?
To clarify, I believe that:
- if
_cairo_mutex_initialize
has not yet been called, thengfxUtils::EncodeSourceSurface
is not thread-safe; - but otherwise,
gfxUtils::EncodeSourceSurface
is thread-safe.
If that really is the case (maybe [:jrmuizel] can confirm?), then we could solve the issue by exposing a new gfxUtils
function that calls into _cairo_mutex_initialize
, and ensuring that we call that new gfxUtils
function from the main thread (e.g. in ToastNotificationHandler::AsyncSaveImage
) before dispatching the faulty task to a background thread. This solution would avoid us the cost of guarding the call to gfxUtils::EncodeSourceSurface
, although doing that would fix the crash as well.
Assignee | ||
Comment 9•2 years ago
|
||
I believe this was regressed by bug 739096 which accidentally dropped support for atomics in Cairo on win32
Assignee | ||
Comment 10•2 years ago
|
||
We can probably largely mitigate this by restoring support for atomics on Win32. However, I also consider the racy _cairo_mutex_initialize to be an upstream bug that we should file.
Assignee | ||
Comment 11•2 years ago
|
||
The old Win32 specific atomics support was accidentally removed in bug 739096.
However, we can just use the existing CXX11 atomic support because we build with
clang-cl.
Comment 12•2 years ago
|
||
Assignee | ||
Comment 13•2 years ago
|
||
I filed an upstream issue about this here: https://gitlab.freedesktop.org/cairo/cairo/-/issues/782
Assignee | ||
Updated•2 years ago
|
Comment 14•2 years ago
|
||
bugherder |
Comment 15•2 years ago
|
||
The patch landed in nightly and beta is affected.
:jrmuizel, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox114
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 16•2 years ago
|
||
Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.
Beta/Release Uplift Approval Request
- User impact if declined: Occasional crashes
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This enables clang's atomics support on Windows. We've been using on other platforms so it should work.
- String changes made/needed:
- Is Android affected?: Yes
Comment 17•2 years ago
|
||
Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.
Fx114 is now in RC, switching uplift request to release
Comment 18•2 years ago
|
||
Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.
Approved for 114 RC 2, thanks.
Comment 19•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Description
•