Stack overflows when writing child minidumps on Windows
Categories
(Toolkit :: Crash Reporting, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox-esr91 | --- | unaffected |
| firefox99 | --- | unaffected |
| firefox100 | --- | wontfix |
| firefox101 | --- | fixed |
People
(Reporter: gsvelto, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Attachments
(2 files, 1 obsolete file)
Crash report: https://crash-stats.mozilla.org/report/index/5d610874-4dff-4fe7-9a06-dcf1f0220308
Reason: EXCEPTION_STACK_OVERFLOW
Top 10 frames of crashing thread:
0 ntdll.dll RtlDuplicateUnicodeString
1 firefox.exe mozilla::freestanding::patched_LdrLoadDll browser/app/winlauncher/freestanding/DllBlocklist.cpp:354
2 kernelbase.dll LoadLibraryExW
3 xul.dll google_breakpad::MinidumpGenerator::GetRpcrt4Module toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:545
4 xul.dll google_breakpad::MinidumpGenerator::GetCreateUuid toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:554
5 xul.dll google_breakpad::MinidumpGenerator::GenerateDumpFilePath toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:566
6 xul.dll google_breakpad::MinidumpGenerator::GenerateDumpFile toolkit/crashreporter/breakpad-client/windows/crash_generation/minidump_generator.cc:465
7 xul.dll google_breakpad::CrashGenerationServer::GenerateDump toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:979
8 xul.dll google_breakpad::CrashGenerationServer::HandleDumpRequest toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:891
9 xul.dll static google_breakpad::CrashGenerationServer::OnDumpRequest toolkit/crashreporter/breakpad-client/windows/crash_generation/crash_generation_server.cc:834
This is an interesting issue that cropped up when I looked at proto-signatures that contained the CrashGenerationServer::HandleDumpRequest() function. It seems like we're overflowing the stack in several different and interesting ways, mostly in Microsoft minidump-writing code. It might be time to bump this thread's stack size.
| Reporter | ||
Updated•4 years ago
|
| Reporter | ||
Comment 1•4 years ago
|
||
No easy solution here: the thread we're running the generator callback on is part of the thread pool, so we can't control its stack size. A quick Google search turned out the ability to run a thread's code in a fiber (see 1 2 3 4 5 6 and 7). It seems complicated but it should be possible to do the following inside OnDumpRequest():
- Convert the current thread to a fiber (if it's not already)
- Allocate a fiber with a larger stack than the thread's default stack
- Run the minidump generation code in the fiber
- Destroy the fiber once we're done
- Convert the fiber back to a thread unless the thread had already been converted to a fiber, then I assume we should leave it as it was
| Reporter | ||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 2•4 years ago
|
||
Implement a strategy suggested by Raymond Chen [0] for avoiding stack
overflow in library code: move the relevant work onto a temporary fiber
with a potentially arbitrary stack size.
[0] https://devblogs.microsoft.com/oldnewthing/20200601-00/?p=103815
The Old New Thing, "How can I expand my thread’s stack at runtime?"
(and several followup posts).
| Reporter | ||
Comment 3•3 years ago
|
||
Removed a redundant signature.
Comment 4•3 years ago
|
||
Implement a strategy suggested by Raymond Chen [0] for avoiding stack
overflow in library code: move the relevant work onto a temporary fiber
with a potentially arbitrary stack size.
Additionally, implement a cppunittest to test the fiber code.
[0] https://devblogs.microsoft.com/oldnewthing/20200601-00/?p=103815
The Old New Thing, "How can I expand my thread’s stack at runtime?"
(and several followup posts).
Updated•3 years ago
|
Comment 6•3 years ago
|
||
Backed out changeset 5885b9403069 (Bug 1758654) for causing build bustage CLOSED TREE
Log: https://treeherder.mozilla.org/logviewer?job_id=372700339&repo=autoland&lineNumber=33033
Backout: https://hg.mozilla.org/integration/autoland/rev/1863bae042601b48656d73bb8a221d6a67d03c8a
Comment 7•3 years ago
|
||
This was due to a type-level (but not machine-level) mismatch between size_t and SIZE_T in 32-bit builds, affecting the cppunittest's test output. (Sigh.)
Resubmitting following emendations and successful try build.
Comment 9•3 years ago
|
||
Backed out for causing Bug 1762302
Backout link: https://hg.mozilla.org/integration/autoland/rev/60080026a143c4f4515557c1704b0b3775607001
Comment 10•3 years ago
|
||
Comment 11•3 years ago
|
||
| bugherder | ||
Comment 12•3 years ago
|
||
And belatedly clearing the needinfo request.
For reference:
a) stateless lambda functions ordinarily decay only to __cdecl function pointers on GCC and Clang under x86; and
b) the fix for bug 1760857 caused my sloppy skip-if condition to error out.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 13•3 years ago
|
||
:apavel: the regression in bug 1763093 should be fixed on Nightly, but the fix wasn't in by merge time (hence 1763675).
Can we get this patch reverted on Beta?
Comment 14•3 years ago
|
||
Hi Ray, Dianna is going to take care of this so moving the ni to her.
Thank you.
Comment 15•3 years ago
|
||
Backed out from beta for causing bug 1763093, as requested
Backout link: https://hg.mozilla.org/releases/mozilla-beta/rev/6b94822f9ed0f51e43d5923598da14e3cb0aba16
Updated•3 years ago
|
Comment 16•3 years ago
|
||
We're a version past the alleged fix hitting release, and telemetry doesn't appear to indicate any drop in the linked signature-set.
I think we may want to revert it and reopen the bug. :/
Updated•3 years ago
|
| Reporter | ||
Comment 17•3 years ago
|
||
(In reply to Ray Kraesig [:rkraesig] from comment #16)
We're a version past the alleged fix hitting release, and telemetry doesn't appear to indicate any drop in the linked signature-set.
Indeed, it seems that these aren't real stack overflows, there's something else going on, possibly corruption of the stack pointer that makes Windows think it run out of stack?
I think we may want to revert it and reopen the bug. :/
Sad to agree with you but yes, better revert the change and investigate further.
Comment 18•3 years ago
|
||
The fixes for this bug and for its dependent bug 1763093 appear to have
had no effect on crash rates. Revert them.
Comment 19•3 years ago
|
||
Indeed, it seems that these aren't real stack overflows, there's something else going on, possibly corruption of the stack pointer that makes Windows think it run out of stack?
Stack pointer corruption seems unlikely -- EXCEPTION_STACK_OVERFLOW should only occur when the guard page at the end of the stack is hit, but can't be replaced with real memory. (Ordinary pointer corruption should normally just result in ERROR_ACCESS_DENIED (0xC0000005) when the instruction pointer jumps off into Nowhereland.)
In a lot of cases, this looks less like "heavy stack usage" and more like "there's so little memory that Windows won't allocate a stack page". Most of these crashes that I've checked have very short stacks, and many of them have available-page-file values of less than 32MiB. I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)
Comment 20•3 years ago
|
||
A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party.
| Reporter | ||
Comment 21•3 years ago
|
||
(In reply to Ray Kraesig [:rkraesig] from comment #19)
In a lot of cases, this looks less like "heavy stack usage" and more like "there's so little memory that Windows won't allocate a stack page". Most of these crashes that I've checked have very short stacks, and many of them have available-page-file values of less than 32MiB. I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)
Ah, very good catch! So these are OOMs in disguise. The child process OOM'd, the main process tries to grab a minidump and fails because, well, we're out-of-memory already.
Updated•3 years ago
|
Comment 22•3 years ago
|
||
Comment 23•3 years ago
|
||
| bugherder | ||
Comment 24•3 years ago
|
||
(In reply to Ray Kraesig [:rkraesig] from comment #19)
I don't think we can do much, if anything, about those. (Other than stall for 1000ms, maybe. 🙃)
Despite the emoji, this is feasible and perhaps even reasonable (with some flexibility around the "1000ms" figure). We can recover from a stack overflow if there's memory to be had: see MSDN's documentation on _resetstkoflw.
Nonetheless, removing myself as this bug's assignee, since I'm not currently working on it.
Comment 26•3 years ago
|
||
Copying crash signatures from duplicate bugs.
| Reporter | ||
Updated•3 years ago
|
| Reporter | ||
Comment 27•3 years ago
|
||
Demoting severity because this is most likely caused by low-memory scenarios and there's probably not much we can do about it.
Description
•