Open Bug 1758654 Opened 4 years ago Updated 3 years ago

Stack overflows when writing child minidumps on Windows

Categories

(Toolkit :: Crash Reporting, defect, P2)

Unspecified
Windows
defect

Tracking

()

REOPENED
100 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- unaffected
firefox100 --- wontfix
firefox101 --- fixed

People

(Reporter: gsvelto, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/5d610874-4dff-4fe7-9a06-dcf1f0220308

Reason: EXCEPTION_STACK_OVERFLOW

Top 10 frames of crashing thread:

0 ntdll.dll RtlDuplicateUnicodeString 
1 firefox.exe mozilla::freestanding::patched_LdrLoadDll browser/app/winlauncher/freestanding/DllBlocklist.cpp:354
2 kernelbase.dll LoadLibraryExW 
3 xul.dll google_breakpad::MinidumpGenerator::GetRpcrt4Module toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:545
4 xul.dll google_breakpad::MinidumpGenerator::GetCreateUuid toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:554
5 xul.dll google_breakpad::MinidumpGenerator::GenerateDumpFilePath toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:566
6 xul.dll google_breakpad::MinidumpGenerator::GenerateDumpFile toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:465
7 xul.dll google_breakpad::CrashGenerationServer::GenerateDump toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:979
8 xul.dll google_breakpad::CrashGenerationServer::HandleDumpRequest toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:891
9 xul.dll static google_breakpad::CrashGenerationServer::OnDumpRequest toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:834

This is an interesting issue that cropped up when I looked at proto-signatures that contained the CrashGenerationServer::HandleDumpRequest() function. It seems like we're overflowing the stack in several different and interesting ways, mostly in Microsoft minidump-writing code. It might be time to bump this thread's stack size.

Priority: -- → P1
Summary: Stack overflows when handling child minidumps → Stack overflows when writing child minidumps on Windows

No easy solution here: the thread we're running the generator callback on is part of the thread pool, so we can't control its stack size. A quick Google search turned out the ability to run a thread's code in a fiber (see 1 2 3 4 5 6 and 7). It seems complicated but it should be possible to do the following inside OnDumpRequest():

  1. Convert the current thread to a fiber (if it's not already)
  2. Allocate a fiber with a larger stack than the thread's default stack
  3. Run the minidump generation code in the fiber
  4. Destroy the fiber once we're done
  5. Convert the fiber back to a thread unless the thread had already been converted to a fiber, then I assume we should leave it as it was
Priority: P1 → P2
Assignee: nobody → rkraesig

Implement a strategy suggested by Raymond Chen [0] for avoiding stack
overflow in library code: move the relevant work onto a temporary fiber
with a potentially arbitrary stack size.

[0] https://devblogs.microsoft.com/oldnewthing/20200601-00/?p=103815
The Old New Thing, "How can I expand my thread’s stack at runtime?"
(and several followup posts).

Removed a redundant signature.

Crash Signature: RtlDuplicateUnicodeString | mozilla::freestanding::patched_LdrLoadDll | LoadLibraryExW] [@ Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_… → Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch]

Implement a strategy suggested by Raymond Chen [0] for avoiding stack
overflow in library code: move the relevant work onto a temporary fiber
with a potentially arbitrary stack size.

Additionally, implement a cppunittest to test the fiber code.

[0] https://devblogs.microsoft.com/oldnewthing/20200601-00/?p=103815
The Old New Thing, "How can I expand my thread’s stack at runtime?"
(and several followup posts).

Attachment #9269002 - Attachment is obsolete: true
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5885b9403069 avoid stack overflow by running on a fiber r=gsvelto

This was due to a type-level (but not machine-level) mismatch between size_t and SIZE_T in 32-bit builds, affecting the cppunittest's test output. (Sigh.)

Resubmitting following emendations and successful try build.

Flags: needinfo?(rkraesig)
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0dbe666a3aa6 avoid stack overflow by running on a fiber r=gsvelto
Regressions: 1762302
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a07bb9e2ffe8 avoid stack overflow by running on a fiber r=gsvelto
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 100 Branch

And belatedly clearing the needinfo request.

For reference:
  a) stateless lambda functions ordinarily decay only to __cdecl function pointers on GCC and Clang under x86; and
  b) the fix for bug 1760857 caused my sloppy skip-if condition to error out.

Flags: needinfo?(rkraesig)
Depends on: 1763093
Regressions: 1763093

:apavel: the regression in bug 1763093 should be fixed on Nightly, but the fix wasn't in by merge time (hence 1763675).

Can we get this patch reverted on Beta?

Flags: needinfo?(apavel)

Hi Ray, Dianna is going to take care of this so moving the ni to her.
Thank you.

Flags: needinfo?(apavel) → needinfo?(dsmith)
Flags: needinfo?(dsmith)
No longer depends on: 1763093
No longer regressions: 1763675

We're a version past the alleged fix hitting release, and telemetry doesn't appear to indicate any drop in the linked signature-set.

I think we may want to revert it and reopen the bug. :/

Crash Signature: Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch] → RtlDuplicateUnicodeString | mozilla::freestanding::patched_LdrLoadDll | LoadLibraryExW] [@ Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_…
Crash Signature: RtlDuplicateUnicodeString | mozilla::freestanding::patched_LdrLoadDll | LoadLibraryExW] [@ Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_… → Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch]

(In reply to Ray Kraesig [:rkraesig] from comment #16)

We're a version past the alleged fix hitting release, and telemetry doesn't appear to indicate any drop in the linked signature-set.

Indeed, it seems that these aren't real stack overflows, there's something else going on, possibly corruption of the stack pointer that makes Windows think it run out of stack?

I think we may want to revert it and reopen the bug. :/

Sad to agree with you but yes, better revert the change and investigate further.

The fixes for this bug and for its dependent bug 1763093 appear to have
had no effect on crash rates. Revert them.

Indeed, it seems that these aren't real stack overflows, there's something else going on, possibly corruption of the stack pointer that makes Windows think it run out of stack?

Stack pointer corruption seems unlikely -- EXCEPTION_STACK_OVERFLOW should only occur when the guard page at the end of the stack is hit, but can't be replaced with real memory. (Ordinary pointer corruption should normally just result in ERROR_ACCESS_DENIED (0xC0000005) when the instruction pointer jumps off into Nowhereland.)

In a lot of cases, this looks less like "heavy stack usage" and more like "there's so little memory that Windows won't allocate a stack page". Most of these crashes that I've checked have very short stacks, and many of them have available-page-file values of less than 32MiB. I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party.

(In reply to Ray Kraesig [:rkraesig] from comment #19)

In a lot of cases, this looks less like "heavy stack usage" and more like "there's so little memory that Windows won't allocate a stack page". Most of these crashes that I've checked have very short stacks, and many of them have available-page-file values of less than 32MiB. I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)

Ah, very good catch! So these are OOMs in disguise. The child process OOM'd, the main process tries to grab a minidump and fails because, well, we're out-of-memory already.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/abe87d773698 revert existing patches for this bug r=gsvelto

(In reply to Ray Kraesig [:rkraesig] from comment #19)

I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)

Despite the emoji, this is feasible and perhaps even reasonable (with some flexibility around the "1000ms" figure). We can recover from a stack overflow if there's memory to be had: see MSDN's documentation on _resetstkoflw.

Nonetheless, removing myself as this bug's assignee, since I'm not currently working on it.

Assignee: rkraesig → nobody
Keywords: leave-open

Copying crash signatures from duplicate bugs.

Crash Signature: Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch] → Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch] [@ google_breakpad::CrashGenerationServer::G…
Severity: S2 → S4
Crash Signature: Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch] [@ → Win32LiveSystemProvider::GetThreadContextEx] [@ RtlFindActivationContextSectionString | sxsisol_SearchActCtxForDllName] [@ je_malloc | moz_xmalloc | std::basic_string<T>::_Construct_lv_contents] [@ bsearch] [@

Demoting severity because this is most likely caused by low-memory scenarios and there's probably not much we can do about it.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: