Closed Bug 1833782 Opened 2 years ago Closed 2 years ago

Crash in [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | _cairo_atomic_int_cmpxchg_return_old_impl]

Tracking

()

Status:

RESOLVED FIXED

Milestone:

115 Branch

Tracking Flags:

Tracking

Status

firefox-esr102

---

unaffected

firefox113

---

wontfix

firefox114

---

fixed

firefox115

---

fixed

People

(Reporter: RyanVM, Assigned: jrmuizel)

References

(Regression)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [win:stability])

Crash Data

Attachments

(1 file)

Bug 1833782. Enable CXX11 atomics on clang-cl. 2 years ago Jeff Muizelaar [:jrmuizel] 48 bytes, text/x-phabricator-request	pascalc : approval-mozilla-release+	Details \| Review

Ryan VanderMeulen [:RyanVM]

Reporter

Description

•

2 years ago

•

Edited

Only affects Windows 10 and 11. Filing this in Graphics for now, but I can't help but notice the mozilla::widget::ToastNotificationHandler::AsyncSaveImage further down the stack. A couple of reports from before 112, but it looks like it clearly spiked starting with 112.0.2.

Crash report: https://crash-stats.mozilla.org/report/index/3e394898-6460-428a-956a-bc1d00230518

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0  ntdll.dll  RtlpWaitOnCriticalSection  
1  ntdll.dll  RtlpEnterCriticalSectionContended  
2  ntdll.dll  RtlEnterCriticalSection  
3  xul.dll  _cairo_atomic_int_cmpxchg_return_old_impl  gfx/cairo/cairo/src/cairo-atomic.c:69
3  xul.dll  _cairo_surface_allocate_unique_id  gfx/cairo/cairo/src/cairo-surface.c:282
3  xul.dll  _cairo_surface_init  gfx/cairo/cairo/src/cairo-surface.c:418
4  xul.dll  _cairo_image_surface_create_for_pixman_image  gfx/cairo/cairo/src/cairo-image-surface.c:193
5  xul.dll  _cairo_image_surface_create_with_pixman_format  gfx/cairo/cairo/src/cairo-image-surface.c:365
6  xul.dll  _moz_cairo_image_surface_create_for_data  gfx/cairo/cairo/src/cairo-image-surface.c:547
7  xul.dll  mozilla::gfx::DrawTargetCairo::Init  gfx/2d/DrawTargetCairo.cpp:1914

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 1

•

2 years ago

The timing of this crash spike and the crashing line in ToastNotificationHandler.cpp fits bug 1822817 quite well.

Flags: needinfo?(nrishel)

Keywords: regression

Regressed by: 1822817

BugBot [:suhaib / :marco/ :calixte]

Comment 2

•

2 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 20 desktop browser crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

2 years ago

status-firefox113: affected → wontfix

Jamie Nicol [:jnicol]

Comment 3

•

2 years ago

It doesn't appear like we initialize _cairo_atomic_mutex anywhere. However, it also seems like we must be building cairo without atomic support for this to be executed, which is somewhat surprising. Perhaps someone knows whether we may be misconfiguring cairo during build?

Blocks: gfx-triage

Yannis Juglaret [:yannis]

Comment 4

•

2 years ago

•

Edited

I agree that this crash looks exactly like what would happen if we executed _cairo_surface_init with an uninitialized _cairo_atomic_mutex. However we do have a path to initialize _cairo_atomic_mutex, in _cairo_mutex_initialize, which, adapting the disassembly into pseudo-code, looks like this:

if (!_cairo_mutex_initialized) {
    _cairo_mutex_initialized = TRUE;

    InitializeCriticalSection(&_cairo_image_solid_cache_mutex);
    InitializeCriticalSection(&_cairo_toy_font_face_mutex);
    InitializeCriticalSection(&_cairo_intern_string_mutex);
    InitializeCriticalSection(&_cairo_scaled_font_map_mutex);
    InitializeCriticalSection(&_cairo_scaled_glyph_page_cache_mutex);
    InitializeCriticalSection(&_cairo_scaled_font_error_mutex);
    InitializeCriticalSection(&_cairo_glyph_cache_mutex);
    InitializeCriticalSection(&_cairo_win32_font_face_mutex);
    InitializeCriticalSection(&_cairo_win32_font_dc_mutex);
    InitializeCriticalSection(&_cairo_atomic_mutex);
}

_cairo_mutex_initialize is called at the beginning of _cairo_surface_init through a macro: CAIRO_MUTEX_INITIALIZE (); which results in the following disassembly:

xul!_cairo_surface_init [/builds/worker/checkouts/gecko/gfx/cairo/cairo/src/cairo-surface.c @ 407]:
// ...
00007ffd`f3979d28 833d390e3a0500  cmp     dword ptr [xul!_cairo_mutex_initialized (00007ffd`f8d1ab68)],0
00007ffd`f3979d2f 0f8462010000    je      xul!_cairo_surface_init+0x187 (00007ffd`f3979e97)
00007ffd`f3979d35 // mutexes are initialized, go on
// ...
00007ffd`f3979e97 e884f4ffff      call    xul!_cairo_mutex_initialize (00007ffd`f3979320)
00007ffd`f3979e9c e994feffff      jmp     xul!_cairo_surface_init+0x25 (00007ffd`f3979d35)

I think what we have here is a race condition, because _cairo_mutex_initialized = TRUE; is executed before InitializeCriticalSection(&_cairo_atomic_mutex);, which implies that a different thread could execute _cairo_surface_init, read _cairo_mutex_initialized as TRUE and thus assume that we are done initializing _cairo_atomic_mutex while we are not, then crash by trying to use it. This being a race condition would also explain why this crash is tied to a background thread functionality AsyncSaveImage.

Note that it is not desirable to just move _cairo_mutex_initialized = TRUE; to the end of _cairo_mutex_initialize because then the race condition would lead to multiple calls to InitializeCriticalSection for the same section, which is undefined behavior. If possible, I would recommend that we add a first call to _cairo_mutex_initialize at a point in time where we are sure that there can be no other thread relying on cairo.

The way to reproduce this crash is likely to issue a lot of notifications at the same time, all with an associated image, before any other notification (and more generally before executing any other path that could call into _cairo_mutex_initialize).

Yannis Juglaret [:yannis]

Updated

•

2 years ago

Whiteboard: [win:stability]

Nick Rishel [:nrishel]

Comment 5

•

2 years ago

Note that it is not desirable to just move _cairo_mutex_initialized = TRUE; to the end of _cairo_mutex_initialize because then the race condition would lead to multiple calls to InitializeCriticalSection for the same section, which is undefined behavior.

Is there a reason this can't be paired with a guard for initialization started?

Flags: needinfo?(nrishel)

Yannis Juglaret [:yannis]

Comment 6

•

2 years ago

•

Edited

(In reply to Nick Rishel [:nrishel] from comment #5)

Is there a reason this can't be paired with a guard for initialization started?

I am not sure if I understand the question correctly, so feel free to correct me. The purpose of _cairo_mutex_initialize is to initialize the synchronization primitives that are used by the rest of the cairo code. It needs to be called only once, so it would be more optimal to just call it ahead of use in an initialization path, rather than to guard the path to make it thread-safe (which would require to use a synchronization primitive which would itself have to be initialized ahead of use). However, both solutions would fix the crash (assuming the analysis from comment 4 is correct).

Jeff Muizelaar [:jrmuizel]

Assignee

Updated

•

2 years ago

Assignee: nobody → jmuizelaar

Nick Rishel [:nrishel]

Comment 7

•

2 years ago

•

Edited

:yannis I'm taking as a given we'd prefer not to run this in the initialization (startup?) path to keep it as lean as possible. I'm not intimately familiar with this code though so I might be bringing incorrect assumptions.

Yannis Juglaret [:yannis]

Comment 8

•

2 years ago

•

Edited

Here, by initialization path, I mean any path that is guaranteed to run on the main thread before the first task that calls into gfxUtils::EncodeSourceSurface gets dispatched. I think ToastNotificationHandler::AsyncSaveImage could meet this requirement?

To clarify, I believe that:

if _cairo_mutex_initialize has not yet been called, then gfxUtils::EncodeSourceSurface is not thread-safe;
but otherwise, gfxUtils::EncodeSourceSurface is thread-safe.

If that really is the case (maybe [:jrmuizel] can confirm?), then we could solve the issue by exposing a new gfxUtils function that calls into _cairo_mutex_initialize, and ensuring that we call that new gfxUtils function from the main thread (e.g. in ToastNotificationHandler::AsyncSaveImage) before dispatching the faulty task to a background thread. This solution would avoid us the cost of guarding the call to gfxUtils::EncodeSourceSurface, although doing that would fix the crash as well.

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 9

•

2 years ago

I believe this was regressed by bug 739096 which accidentally dropped support for atomics in Cairo on win32

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 10

•

2 years ago

We can probably largely mitigate this by restoring support for atomics on Win32. However, I also consider the racy _cairo_mutex_initialize to be an upstream bug that we should file.

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 11

•

2 years ago

Attached file Bug 1833782. Enable CXX11 atomics on clang-cl. — Details

The old Win32 specific atomics support was accidentally removed in bug 739096.
However, we can just use the existing CXX11 atomic support because we build with
clang-cl.

Pulsebot

Comment 12

•

2 years ago

Pushed by jmuizelaar@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fc6056442a0f Enable CXX11 atomics on clang-cl. r=mstange

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 13

•

2 years ago

I filed an upstream issue about this here: https://gitlab.freedesktop.org/cairo/cairo/-/issues/782

Jeff Muizelaar [:jrmuizel]

Assignee

Updated

•

2 years ago

See Also: → https://gitlab.freedesktop.org/cairo/cairo/-/issues/782

Sandor Molnar[:smolnar]

Comment 14

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/fc6056442a0f

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox115: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 115 Branch

BugBot [:suhaib / :marco/ :calixte]

Comment 15

•

2 years ago

The patch landed in nightly and beta is affected.
:jrmuizel, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
If no, please set status-firefox114 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 16

•

2 years ago

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Beta/Release Uplift Approval Request

User impact if declined: Occasional crashes
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This enables clang's atomics support on Windows. We've been using on other platforms so it should work.
String changes made/needed:
Is Android affected?: Yes

Flags: needinfo?(jmuizelaar)

Attachment #9336201 - Flags: approval-mozilla-beta?

Donal Meehan [:dmeehan]

Comment 17

•

2 years ago

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Fx114 is now in RC, switching uplift request to release

Attachment #9336201 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Pascal Chevrel:pascalc

Comment 18

•

2 years ago

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Approved for 114 RC 2, thanks.

Attachment #9336201 - Flags: approval-mozilla-release? → approval-mozilla-release+

Pascal Chevrel:pascalc

Comment 19

•

2 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/2886ae111bee

status-firefox114: affected → fixed

Kelsey Gilbert [:jgilbert]

Updated

•

2 years ago

No longer blocks: gfx-triage

You need to log in before you can comment on or make changes to this bug.