Closed Bug 1833782 Opened 2 years ago Closed 2 years ago

Crash in [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | _cairo_atomic_int_cmpxchg_return_old_impl]

Categories

(Core :: Graphics, defect)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
115 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox113 --- wontfix
firefox114 --- fixed
firefox115 --- fixed

People

(Reporter: RyanVM, Assigned: jrmuizel)

References

(Regression)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [win:stability])

Crash Data

Attachments

(1 file)

Only affects Windows 10 and 11. Filing this in Graphics for now, but I can't help but notice the mozilla::widget::ToastNotificationHandler::AsyncSaveImage further down the stack. A couple of reports from before 112, but it looks like it clearly spiked starting with 112.0.2.

Crash report: https://crash-stats.mozilla.org/report/index/3e394898-6460-428a-956a-bc1d00230518

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0  ntdll.dll  RtlpWaitOnCriticalSection  
1  ntdll.dll  RtlpEnterCriticalSectionContended  
2  ntdll.dll  RtlEnterCriticalSection  
3  xul.dll  _cairo_atomic_int_cmpxchg_return_old_impl  gfx/cairo/cairo/src/cairo-atomic.c:69
3  xul.dll  _cairo_surface_allocate_unique_id  gfx/cairo/cairo/src/cairo-surface.c:282
3  xul.dll  _cairo_surface_init  gfx/cairo/cairo/src/cairo-surface.c:418
4  xul.dll  _cairo_image_surface_create_for_pixman_image  gfx/cairo/cairo/src/cairo-image-surface.c:193
5  xul.dll  _cairo_image_surface_create_with_pixman_format  gfx/cairo/cairo/src/cairo-image-surface.c:365
6  xul.dll  _moz_cairo_image_surface_create_for_data  gfx/cairo/cairo/src/cairo-image-surface.c:547
7  xul.dll  mozilla::gfx::DrawTargetCairo::Init  gfx/2d/DrawTargetCairo.cpp:1914

The timing of this crash spike and the crashing line in ToastNotificationHandler.cpp fits bug 1822817 quite well.

Flags: needinfo?(nrishel)
Keywords: regression
Regressed by: 1822817

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

It doesn't appear like we initialize _cairo_atomic_mutex anywhere. However, it also seems like we must be building cairo without atomic support for this to be executed, which is somewhat surprising. Perhaps someone knows whether we may be misconfiguring cairo during build?

Blocks: gfx-triage

I agree that this crash looks exactly like what would happen if we executed _cairo_surface_init with an uninitialized _cairo_atomic_mutex. However we do have a path to initialize _cairo_atomic_mutex, in _cairo_mutex_initialize, which, adapting the disassembly into pseudo-code, looks like this:

if (!_cairo_mutex_initialized) {
    _cairo_mutex_initialized = TRUE;

    InitializeCriticalSection(&_cairo_image_solid_cache_mutex);
    InitializeCriticalSection(&_cairo_toy_font_face_mutex);
    InitializeCriticalSection(&_cairo_intern_string_mutex);
    InitializeCriticalSection(&_cairo_scaled_font_map_mutex);
    InitializeCriticalSection(&_cairo_scaled_glyph_page_cache_mutex);
    InitializeCriticalSection(&_cairo_scaled_font_error_mutex);
    InitializeCriticalSection(&_cairo_glyph_cache_mutex);
    InitializeCriticalSection(&_cairo_win32_font_face_mutex);
    InitializeCriticalSection(&_cairo_win32_font_dc_mutex);
    InitializeCriticalSection(&_cairo_atomic_mutex);
}

_cairo_mutex_initialize is called at the beginning of _cairo_surface_init through a macro: CAIRO_MUTEX_INITIALIZE (); which results in the following disassembly:

xul!_cairo_surface_init [/builds/worker/checkouts/gecko/gfx/cairo/cairo/src/cairo-surface.c @ 407]:
// ...
00007ffd`f3979d28 833d390e3a0500  cmp     dword ptr [xul!_cairo_mutex_initialized (00007ffd`f8d1ab68)],0
00007ffd`f3979d2f 0f8462010000    je      xul!_cairo_surface_init+0x187 (00007ffd`f3979e97)
00007ffd`f3979d35 // mutexes are initialized, go on
// ...
00007ffd`f3979e97 e884f4ffff      call    xul!_cairo_mutex_initialize (00007ffd`f3979320)
00007ffd`f3979e9c e994feffff      jmp     xul!_cairo_surface_init+0x25 (00007ffd`f3979d35)

I think what we have here is a race condition, because _cairo_mutex_initialized = TRUE; is executed before InitializeCriticalSection(&_cairo_atomic_mutex);, which implies that a different thread could execute _cairo_surface_init, read _cairo_mutex_initialized as TRUE and thus assume that we are done initializing _cairo_atomic_mutex while we are not, then crash by trying to use it. This being a race condition would also explain why this crash is tied to a background thread functionality AsyncSaveImage.

Note that it is not desirable to just move _cairo_mutex_initialized = TRUE; to the end of _cairo_mutex_initialize because then the race condition would lead to multiple calls to InitializeCriticalSection for the same section, which is undefined behavior. If possible, I would recommend that we add a first call to _cairo_mutex_initialize at a point in time where we are sure that there can be no other thread relying on cairo.

The way to reproduce this crash is likely to issue a lot of notifications at the same time, all with an associated image, before any other notification (and more generally before executing any other path that could call into _cairo_mutex_initialize).

Whiteboard: [win:stability]

Note that it is not desirable to just move _cairo_mutex_initialized = TRUE; to the end of _cairo_mutex_initialize because then the race condition would lead to multiple calls to InitializeCriticalSection for the same section, which is undefined behavior.

Is there a reason this can't be paired with a guard for initialization started?

Flags: needinfo?(nrishel)

(In reply to Nick Rishel [:nrishel] from comment #5)

Is there a reason this can't be paired with a guard for initialization started?

I am not sure if I understand the question correctly, so feel free to correct me. The purpose of _cairo_mutex_initialize is to initialize the synchronization primitives that are used by the rest of the cairo code. It needs to be called only once, so it would be more optimal to just call it ahead of use in an initialization path, rather than to guard the path to make it thread-safe (which would require to use a synchronization primitive which would itself have to be initialized ahead of use). However, both solutions would fix the crash (assuming the analysis from comment 4 is correct).

Assignee: nobody → jmuizelaar

:yannis I'm taking as a given we'd prefer not to run this in the initialization (startup?) path to keep it as lean as possible. I'm not intimately familiar with this code though so I might be bringing incorrect assumptions.

Here, by initialization path, I mean any path that is guaranteed to run on the main thread before the first task that calls into gfxUtils::EncodeSourceSurface gets dispatched. I think ToastNotificationHandler::AsyncSaveImage could meet this requirement?

To clarify, I believe that:

  • if _cairo_mutex_initialize has not yet been called, then gfxUtils::EncodeSourceSurface is not thread-safe;
  • but otherwise, gfxUtils::EncodeSourceSurface is thread-safe.

If that really is the case (maybe [:jrmuizel] can confirm?), then we could solve the issue by exposing a new gfxUtils function that calls into _cairo_mutex_initialize, and ensuring that we call that new gfxUtils function from the main thread (e.g. in ToastNotificationHandler::AsyncSaveImage) before dispatching the faulty task to a background thread. This solution would avoid us the cost of guarding the call to gfxUtils::EncodeSourceSurface, although doing that would fix the crash as well.

I believe this was regressed by bug 739096 which accidentally dropped support for atomics in Cairo on win32

We can probably largely mitigate this by restoring support for atomics on Win32. However, I also consider the racy _cairo_mutex_initialize to be an upstream bug that we should file.

The old Win32 specific atomics support was accidentally removed in bug 739096.
However, we can just use the existing CXX11 atomic support because we build with
clang-cl.

Pushed by jmuizelaar@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fc6056442a0f Enable CXX11 atomics on clang-cl. r=mstange

I filed an upstream issue about this here: https://gitlab.freedesktop.org/cairo/cairo/-/issues/782

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 115 Branch

The patch landed in nightly and beta is affected.
:jrmuizel, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox114 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(jmuizelaar)

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Beta/Release Uplift Approval Request

  • User impact if declined: Occasional crashes
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This enables clang's atomics support on Windows. We've been using on other platforms so it should work.
  • String changes made/needed:
  • Is Android affected?: Yes
Flags: needinfo?(jmuizelaar)
Attachment #9336201 - Flags: approval-mozilla-beta?

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Fx114 is now in RC, switching uplift request to release

Attachment #9336201 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Comment on attachment 9336201 [details]
Bug 1833782. Enable CXX11 atomics on clang-cl.

Approved for 114 RC 2, thanks.

Attachment #9336201 - Flags: approval-mozilla-release? → approval-mozilla-release+
No longer blocks: gfx-triage
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: