Missing unwind information for detoured code in Windows hooking code
Categories
(Core :: mozglue, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox110 | --- | fixed |
People
(Reporter: yannis, Assigned: yannis)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
3.24 KB,
patch
|
Details | Diff | Splinter Review | |
Bug 1800898 - Move UnwindInfo definition to a dedicated WindowsUnwindInfo.h file. r=handyman,mhowell
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
In bug 1798787, some tests are failing but only on the Windows 11 virtual machines used by our CI when testing 64-bit builds. [:gerard-majax] identified that the reason is that crash dumps do not get generated on these machines for 64-builds, then by adding logs we were able to confirm that the vectored exception handler google_breakpad::ExceptionHandler::HandleHeapCorruption
is called but not the unhandled exception filter google_breakpad::ExceptionHandler::HandleException
although it is properly set. This unhandled exception filter is responsible for generating the crash dumps.
The investigation results point to our hooking code in mozglue
as a root cause for the Windows internal code in RtlDispatchException
being unable to find the unhandled exception filter. More precisely, finding the filter requires stack unwinding, and our hooking code can detour instructions to a dynamically allocated area without providing unwinding information for that area. On x64, if the dynamically allocated area contains a call instruction, then we will push on the stack a return address that points to this area, and callees will be unable to do stack unwinding properly and thus unable to reach the unhandled exception filter.
Quoting the related Microsoft documentation:
Table-based exception handling requires a table entry for all functions that allocate stack space or call another function (for example, nonleaf functions). [...]
[...] For dynamically generated functions [{such as} JIT compilers], the runtime to support these functions must either use RtlInstallFunctionTableCallback or RtlAddFunctionTable to provide this information to the operating system. Failure to do so will result in unreliable exception handling and debugging of processes.
The Windows 11 workers in our CI use kernel32.dll
version 10.0.22000.1219 (download link). In this specific version, one of the first instructions of BaseThreadInitThunk
is a call
instruction (starting at the 12th byte):
BaseThreadInitThunk
180015540 48 83 ec 28 SUB RSP,0x28
180015544 85 c9 TEST ECX,ECX
180015546 75 17 JNZ LAB_18001555f
180015548 49 8b c8 MOV RCX,R8
18001554b e8 30 ff 00 00 CALL BaseThreadInitXfgThunk
180015550 8b c8 MOV ECX,EAX
Our hooking code detours instructions that fall within the first 13 bytes. This results in the following code for BaseThreadInitThunk
as dumped in WinDbg on a loaner machine:
0:006> u KERNEL32!BaseThreadInitThunk
KERNEL32!BaseThreadInitThunk:
00007ff8`a99e5540 49bbb05e7188f87f0000 mov r11,offset mozglue!patched_BaseThreadInitThunk (00007ff8`88715eb0)
00007ff8`a99e554a 41ffe3 jmp r11
...
0:006> u mozglue!patched_BaseThreadInitThunk
mozglue!patched_BaseThreadInitThunk [/builds/worker/checkouts/gecko/toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp @ 586]:
...
00007ff8`88715f2f 488b05121d0800 mov rax,qword ptr [mozglue!stub_BaseThreadInitThunk (00007ff8`88797c48)]
00007ff8`88715f36 89f9 mov ecx,edi
00007ff8`88715f38 4889da mov rdx,rbx
00007ff8`88715f3b 4989f0 mov r8,rsi
00007ff8`88715f3e ff15c4900800 call qword ptr [mozglue!__guard_dispatch_icall_fptr (00007ff8`8879f008)]
0:006> u poi(mozglue!stub_BaseThreadInitThunk)
000001d4`d84a0010 4883ec28 sub rsp,28h
000001d4`d84a0014 85c9 test ecx,ecx
000001d4`d84a0016 740e je 000001d4`d84a0026
000001d4`d84a0018 ff2500000000 jmp qword ptr [000001d4`d84a001e]
000001d4`d84a001e 5f559ea9f87f0000 [points to KERNEL32!BaseThreadInitThunk+0x1f (00007ff8`a99e555f)]
000001d4`d84a0026 498bc8 mov rcx,r8
000001d4`d84a0029 ff1502000000 call qword ptr [000001d4`d84a0031]
000001d4`d84a002f eb08 jmp 000001d4`d84a0039
000001d4`d84a0031 80549fa9f87f0000 [points to KERNEL32!BaseThreadInitXfgThunk (00007ff8`a99f5480)]
000001d4`d84a0039 ff2500000000 jmp qword ptr [000001d4`d84a003f]
000001d4`d84a003f 50559ea9f87f0000 [points to KERNEL32!BaseThreadInitThunk+0x10 (00007ff8`a99e5550)]
The final excerpt shows the dynamically allocated area containing a call
instruction, which is there to mimic the original CALL BaseThreadInitXfgThunk
instruction. With this hook set, newly started threads would execute with a return address of 000001d4d84a002f
pushed to their stack, for which no unwinding information is available.
I used the following code to do manual unwinding when reaching google_breakpad::ExceptionHandler::HandleHeapCorruption
and confirm that this is the problem:
void UnwindAndDumpHandlers(EXCEPTION_POINTERS* ExceptionInfo)
{
CONTEXT ContextRecord{};
memcpy(&ContextRecord, ExceptionInfo->ContextRecord, sizeof(CONTEXT));
auto ControlPc = ExceptionInfo->ContextRecord->Rip;
while (ControlPc) {
fprintf(stderr, "RtlLookupFunctionEntry(ControlPc=%p)\n", (void*)ControlPc);
DWORD64 ImageBase = 0;
auto FunctionEntry = RtlLookupFunctionEntry(ControlPc, &ImageBase, nullptr);
if (!FunctionEntry) {
fprintf(stderr, "Failed to retrieve a function entry for ControlPc=%p\n", (void*)ControlPc);
break;
}
fprintf(stderr, "-> FunctionEntry=%p [%p-%p], ImageBase=%p\n", (void*)FunctionEntry, (void*)(ImageBase + (uint64_t)FunctionEntry->BeginAddress), (void*)(ImageBase + (uint64_t)FunctionEntry->EndAddress), (void*)ImageBase);
void* HandlerData = nullptr;
DWORD64 EstablisherFrame = 0;
fprintf(stderr, "RtlVirtualUnwind(1, ImageBase=%p, ControlPc=%p, FunctionEntry=%p)\n", (void*)ImageBase, (void*)ControlPc, (void*)FunctionEntry);
auto ExceptionRoutine = RtlVirtualUnwind(1, ImageBase, ControlPc, FunctionEntry, &ContextRecord, &HandlerData, &EstablisherFrame, nullptr);
ControlPc = ContextRecord.Rip;
fprintf(stderr, "-> ExceptionRoutine=%p, EstablisherFrame=%p, ControlPc=%p\n\n", (void*)ExceptionRoutine, (void*)EstablisherFrame, (void*)ControlPc);
}
}
When the hook is present, it results in the following log:
Reached ExceptionHandler::HandleHeapCorruption, dumping stacked exception handlers.
RtlLookupFunctionEntry(ControlPc=00007FFC72E31F54)
-> FunctionEntry=00007FFC76DEA750 [00007FFC72E31F50-00007FFC72E31F56], ImageBase=00007FFC6B2F0000
RtlVirtualUnwind(1, ImageBase=00007FFC6B2F0000, ControlPc=00007FFC72E31F54, FunctionEntry=00007FFC76DEA750)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000014FA7FFEA0, ControlPc=00000119973F002F
RtlLookupFunctionEntry(ControlPc=00000119973F002F)
Failed to retrieve a function entry for ControlPc=00000119973F002F
Whereas a normal log would manage to unwind the stack properly, eventually finding the unhandled exception filter (00007FFC179D3AA0
in this log):
Reached ExceptionHandler::HandleHeapCorruption, dumping stacked exception handlers.
RtlLookupFunctionEntry(ControlPc=00007FFBD8831F54)
-> FunctionEntry=00007FFBDC7EA750 [00007FFBD8831F50-00007FFBD8831F56], ImageBase=00007FFBD0CF0000
RtlVirtualUnwind(1, ImageBase=00007FFBD0CF0000, ControlPc=00007FFBD8831F54, FunctionEntry=00007FFBDC7EA750)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000027807FFDD0, ControlPc=00007FFC16C65550
RtlLookupFunctionEntry(ControlPc=00007FFC16C65550)
-> FunctionEntry=00007FFC16D05A8C [00007FFC16C65540-00007FFC16C65580], ImageBase=00007FFC16C50000
RtlVirtualUnwind(1, ImageBase=00007FFC16C50000, ControlPc=00007FFC16C65550, FunctionEntry=00007FFC16D05A8C)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000027807FFDE0, ControlPc=00007FFC1794485B
RtlLookupFunctionEntry(ControlPc=00007FFC1794485B)
-> FunctionEntry=00007FFC17AC01EC [00007FFC17944830-00007FFC1794488D], ImageBase=00007FFC17940000
RtlVirtualUnwind(1, ImageBase=00007FFC17940000, ControlPc=00007FFC1794485B, FunctionEntry=00007FFC17AC01EC)
-> ExceptionRoutine=00007FFC179D3AA0, EstablisherFrame=00000027807FFE10, ControlPc=0000000000000000
I believe that the most robust fix here would be to add unwinding information for the dynamically allocated area, but maybe we have other options? For example it seems that we can force a 10 bytes patch; while this may work here it may fail in future versions of kernelbase.dll
if the call instruction reaches the first 10 bytes. What do you think?
Thanks!
Assignee | ||
Comment 1•1 year ago
|
||
We were able to confirm that this missing unwinding information is the root cause for bug 1798787: using the attached patch makes the associated test pass in release and debug. I provide the patch so that it can be used as a temporary fix to assess whether this could be the root cause for other bugs on the Windows 11 workers. I am preparing the actual, more robust fix which will allocate the unwind information together with the detoured code if we identify that a call
lives there.
Assignee | ||
Comment 2•1 year ago
|
||
Writing a proper patch that adds unwinding information for detoured code may take a while. In order to give us more time to prepare this patch, I propose in bug 1798787 a simple optimization that will allow us to bypass the specific instance of the problem that was identified on windows 11 workers, although this will not fix the general problem.
Assignee | ||
Comment 3•1 year ago
|
||
Assignee | ||
Comment 4•1 year ago
|
||
Depends on D163102
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 5•1 year ago
|
||
(In reply to Yannis Juglaret from comment #1)
[...] I am preparing the actual, more robust fix which will allocate the unwind information together with the detoured code if we identify that a
call
lives there.
Some notes about the current patch I propose:
- It adds unwind information even if there is no call being detoured, because we have space anyway (trampolines have a fixed size of 128 bytes and according to tests, unwind information fits for all the functions we hook) and it can help in the (unlikely) event that profiling stops on our detoured code.
- It does not add unwind information when hooking out-of-process, because that is more tricky as
RtlLookupFunctionEntry
andRtlAddFunctionTable
work on the current process.
Assignee | ||
Comment 6•1 year ago
|
||
Depends on D163102
Updated•1 year ago
|
Updated•1 year ago
|
Comment 7•1 year ago
|
||
The severity field is not set for this bug.
:glandium, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 8•1 year ago
|
||
Let's land the patch on Monday when soft code freeze ends.
Pushed by yjuglaret@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/598fae5e1647 Move UnwindInfo definition to a dedicated WindowsUnwindInfo.h file. r=handyman,mhowell https://hg.mozilla.org/integration/autoland/rev/1518003a9003 Add helpers to iterate unwind codes on Windows. r=mhowell,handyman https://hg.mozilla.org/integration/autoland/rev/50d574f1395b Add unwinding information for in-process detoured code. r=handyman,mhowell
Comment 10•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/598fae5e1647
https://hg.mozilla.org/mozilla-central/rev/1518003a9003
https://hg.mozilla.org/mozilla-central/rev/50d574f1395b
Updated•1 year ago
|
Description
•