Closed Bug 1681243 Opened 3 years ago Closed 3 years ago

Crash in [@ RtlInitializeCriticalSectionAndSpinCount | InitializeCriticalSectionAndSpinCount]

Categories

(Core :: Memory Allocator, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- fixed

People

(Reporter: gsvelto, Assigned: away)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/8afae343-b8b9-47ea-96b4-43d230201202

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0 ntdll.dll RtlInitializeCriticalSectionAndSpinCount 
1 kernelbase.dll InitializeCriticalSectionAndSpinCount 
2 mozglue.dll arena_t::arena_t memory/build/mozjemalloc.cpp:3517
3 mozglue.dll ArenaCollection::CreateArena memory/build/mozjemalloc.cpp:3601
4 mozglue.dll moz_create_arena_with_params memory/build/malloc_decls.h:115
5 xul.dll static mozilla::dom::DocGroup::Create dom/base/DocGroup.cpp:127
6 xul.dll mozilla::dom::BrowsingContextGroup::AddDocument docshell/base/BrowsingContextGroup.cpp:327
7 xul.dll mozilla::dom::Document::SetScopeObject dom/base/Document.cpp:6913
8 xul.dll mozilla::dom::Document::SetScriptGlobalObject dom/base/Document.cpp:7093
9 xul.dll nsGlobalWindowOuter::SetNewDocument dom/base/nsGlobalWindowOuter.cpp:2395

This crash is hitting this assertion

The crash volume is low on crashstats but there's a corresponding crash I found on the Microsoft health dashboards with an alarmingly high volume: between 1k and 4k per day crashes depending on the days. The (very poor) stack trace is the following there:

0 	ntdll 	RtlInitializeCriticalSectionAndSpinCount 	0x1C
1 	kernelbase 	InitializeCriticalSectionAndSpinCount 	0xB
2 	mozglue 	arena_t::arena_t 	0x22
3 	mozglue 	ArenaCollection::CreateArena 	0x5D
4 	mozglue 	je_malloc 	0x5A0
5 	mozglue 	moz_xmalloc 	0xD
6 	firefox 	std::basic_string_wchar_t,std::char_traits_wchar_t_,std::allocator_wchar_t_ _::_Reallocate_for_`lambda at /builds/worker/checkouts/gecko/vs2017_15.8.4/VC/include/xstring:2668:35',const wchar_t *_ 	0x8D
7 	firefox 	_GLOBAL__sub_I_permissionsService.cpp 	0x39
8 	ucrtbase 	_initterm 	0x43
9 	firefox 	__scrt_common_main_seh 	0x81

One particularly worrying aspect is that Microsoft documentation states that InitializeCriticalSectionAndSpinCount() always succeeds in recent versions of Windows making this extremely puzzling.

Alright, I figured it out by cracking open a minidump:

  • This allocation of an arena_t object is fallible and is failing, the relevant pointer in the minidump is NULL
  • However we're still calling the arena_t::arena_t constructor (which I find odd, but the intricacies of C++ object construction have often eluded me)
  • The failing statement in these crashes is the first one in the constructor which is why we appear to be crashing in the assertion, but we're not. InitializeCriticalSectionAndSpinCount() is just being called on a NULL pointer

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

  • However we're still calling the arena_t::arena_t constructor (which I find odd, but the intricacies of C++ object construction have often eluded me)

I think that is the key.

Based on https://godbolt.org/z/fn1h5z it looks like clang skips the null check if the allocation function doesn't have noexcept, and our Windows builds don't have noexcept because clang-cl sets _MSC_VER.

Perhaps that _MSC_VER condition dates back to a time when it meant actual MSVC and clang-cl wasn't something we thought about. In the same godbolt link above, it seems that MSVC doesn't care whether you set noexcept, it'll do the null check either way. I suppose it could be argued that this is a failure of clang-cl to be MSVC-compatible, and I'm happy to file it upstream if y'all agree, but in the meantime I think the easiest fix is to remove the ifdef on our side.

Confirmed by disassembling a try build that it did the trick: https://treeherder.mozilla.org/jobs?repo=try&revision=5a88503d62b9f0e0ff971d2562af80c83cbc32f2

If the arena is null, the code will now fall into the error logging code, in which I happened to notice these famous last words:
// In practice, this is an extremely unlikely failure.

In clang-cl builds, thanks to clang-cl's defining of _MSC_VER, this function was not marked noexcept. This led clang to believe that it could call arena_t's constructor without checking for null.

I suppose we could scope the condition down to real MSVC, but since we don't support that anymore, we can just stop checking.

Assignee: nobody → dmajor
Status: NEW → ASSIGNED

I looked for other code that may have this pattern but mozjemalloc.cpp seems to be the only one: https://searchfox.org/mozilla-central/search?q=operator.*fallible&path=&case=true&regexp=true (SharedBuffer.h is using the global operator new from cxxalloc.h, which is already noexcept)

Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3239d2c504d3
Use noexcept on arena_t's fallible allocator, even on Windows r=glandium
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: