Closed Bug 1800898 Opened 1 year ago Closed 1 year ago

Missing unwind information for detoured code in Windows hooking code

Categories

(Core :: mozglue, defect)

x86_64
Windows
defect

Tracking

()

RESOLVED FIXED
110 Branch
Tracking Status
firefox110 --- fixed

People

(Reporter: yannis, Assigned: yannis)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

In bug 1798787, some tests are failing but only on the Windows 11 virtual machines used by our CI when testing 64-bit builds. [:gerard-majax] identified that the reason is that crash dumps do not get generated on these machines for 64-builds, then by adding logs we were able to confirm that the vectored exception handler google_breakpad::ExceptionHandler::HandleHeapCorruption is called but not the unhandled exception filter google_breakpad::ExceptionHandler::HandleException although it is properly set. This unhandled exception filter is responsible for generating the crash dumps.

The investigation results point to our hooking code in mozglue as a root cause for the Windows internal code in RtlDispatchException being unable to find the unhandled exception filter. More precisely, finding the filter requires stack unwinding, and our hooking code can detour instructions to a dynamically allocated area without providing unwinding information for that area. On x64, if the dynamically allocated area contains a call instruction, then we will push on the stack a return address that points to this area, and callees will be unable to do stack unwinding properly and thus unable to reach the unhandled exception filter.

Quoting the related Microsoft documentation:

Table-based exception handling requires a table entry for all functions that allocate stack space or call another function (for example, nonleaf functions). [...]

[...] For dynamically generated functions [{such as} JIT compilers], the runtime to support these functions must either use RtlInstallFunctionTableCallback or RtlAddFunctionTable to provide this information to the operating system. Failure to do so will result in unreliable exception handling and debugging of processes.

The Windows 11 workers in our CI use kernel32.dll version 10.0.22000.1219 (download link). In this specific version, one of the first instructions of BaseThreadInitThunk is a call instruction (starting at the 12th byte):

                             BaseThreadInitThunk
       180015540 48 83 ec 28     SUB        RSP,0x28
       180015544 85 c9           TEST       ECX,ECX
       180015546 75 17           JNZ        LAB_18001555f
       180015548 49 8b c8        MOV        RCX,R8
       18001554b e8 30 ff 00 00  CALL       BaseThreadInitXfgThunk
       180015550 8b c8           MOV        ECX,EAX

Our hooking code detours instructions that fall within the first 13 bytes. This results in the following code for BaseThreadInitThunk as dumped in WinDbg on a loaner machine:

0:006> u KERNEL32!BaseThreadInitThunk
KERNEL32!BaseThreadInitThunk:
00007ff8`a99e5540 49bbb05e7188f87f0000 mov r11,offset mozglue!patched_BaseThreadInitThunk (00007ff8`88715eb0)
00007ff8`a99e554a 41ffe3          jmp     r11
...

0:006> u mozglue!patched_BaseThreadInitThunk
mozglue!patched_BaseThreadInitThunk [/builds/worker/checkouts/gecko/toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp @ 586]:
...
00007ff8`88715f2f 488b05121d0800  mov     rax,qword ptr [mozglue!stub_BaseThreadInitThunk (00007ff8`88797c48)]
00007ff8`88715f36 89f9            mov     ecx,edi
00007ff8`88715f38 4889da          mov     rdx,rbx
00007ff8`88715f3b 4989f0          mov     r8,rsi
00007ff8`88715f3e ff15c4900800    call    qword ptr [mozglue!__guard_dispatch_icall_fptr (00007ff8`8879f008)]

0:006> u poi(mozglue!stub_BaseThreadInitThunk)
000001d4`d84a0010 4883ec28        sub     rsp,28h
000001d4`d84a0014 85c9            test    ecx,ecx
000001d4`d84a0016 740e            je      000001d4`d84a0026
000001d4`d84a0018 ff2500000000    jmp     qword ptr [000001d4`d84a001e]
000001d4`d84a001e 5f559ea9f87f0000 [points to KERNEL32!BaseThreadInitThunk+0x1f (00007ff8`a99e555f)]
000001d4`d84a0026 498bc8          mov     rcx,r8
000001d4`d84a0029 ff1502000000    call    qword ptr [000001d4`d84a0031]
000001d4`d84a002f eb08            jmp     000001d4`d84a0039
000001d4`d84a0031 80549fa9f87f0000 [points to KERNEL32!BaseThreadInitXfgThunk (00007ff8`a99f5480)]
000001d4`d84a0039 ff2500000000    jmp     qword ptr [000001d4`d84a003f]
000001d4`d84a003f 50559ea9f87f0000 [points to KERNEL32!BaseThreadInitThunk+0x10 (00007ff8`a99e5550)]

The final excerpt shows the dynamically allocated area containing a call instruction, which is there to mimic the original CALL BaseThreadInitXfgThunk instruction. With this hook set, newly started threads would execute with a return address of 000001d4d84a002f pushed to their stack, for which no unwinding information is available.

I used the following code to do manual unwinding when reaching google_breakpad::ExceptionHandler::HandleHeapCorruption and confirm that this is the problem:

void UnwindAndDumpHandlers(EXCEPTION_POINTERS* ExceptionInfo)
{
  CONTEXT ContextRecord{};
  memcpy(&ContextRecord, ExceptionInfo->ContextRecord, sizeof(CONTEXT));
  auto ControlPc = ExceptionInfo->ContextRecord->Rip;
  while (ControlPc) {
    fprintf(stderr, "RtlLookupFunctionEntry(ControlPc=%p)\n", (void*)ControlPc);
    DWORD64 ImageBase = 0;
    auto FunctionEntry = RtlLookupFunctionEntry(ControlPc, &ImageBase, nullptr);
    if (!FunctionEntry) {
      fprintf(stderr, "Failed to retrieve a function entry for ControlPc=%p\n", (void*)ControlPc);
      break;
    }
    fprintf(stderr, "-> FunctionEntry=%p [%p-%p], ImageBase=%p\n", (void*)FunctionEntry, (void*)(ImageBase + (uint64_t)FunctionEntry->BeginAddress), (void*)(ImageBase + (uint64_t)FunctionEntry->EndAddress), (void*)ImageBase);
    void* HandlerData = nullptr;
    DWORD64 EstablisherFrame = 0;
    fprintf(stderr, "RtlVirtualUnwind(1, ImageBase=%p, ControlPc=%p, FunctionEntry=%p)\n", (void*)ImageBase, (void*)ControlPc, (void*)FunctionEntry);
    auto ExceptionRoutine = RtlVirtualUnwind(1, ImageBase, ControlPc, FunctionEntry, &ContextRecord, &HandlerData, &EstablisherFrame, nullptr);
    ControlPc = ContextRecord.Rip;
    fprintf(stderr, "-> ExceptionRoutine=%p, EstablisherFrame=%p, ControlPc=%p\n\n", (void*)ExceptionRoutine, (void*)EstablisherFrame, (void*)ControlPc);
  }
}

When the hook is present, it results in the following log:

Reached ExceptionHandler::HandleHeapCorruption, dumping stacked exception handlers.
RtlLookupFunctionEntry(ControlPc=00007FFC72E31F54)
-> FunctionEntry=00007FFC76DEA750 [00007FFC72E31F50-00007FFC72E31F56], ImageBase=00007FFC6B2F0000
RtlVirtualUnwind(1, ImageBase=00007FFC6B2F0000, ControlPc=00007FFC72E31F54, FunctionEntry=00007FFC76DEA750)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000014FA7FFEA0, ControlPc=00000119973F002F
RtlLookupFunctionEntry(ControlPc=00000119973F002F)
Failed to retrieve a function entry for ControlPc=00000119973F002F

Whereas a normal log would manage to unwind the stack properly, eventually finding the unhandled exception filter (00007FFC179D3AA0 in this log):

Reached ExceptionHandler::HandleHeapCorruption, dumping stacked exception handlers.
RtlLookupFunctionEntry(ControlPc=00007FFBD8831F54)
-> FunctionEntry=00007FFBDC7EA750 [00007FFBD8831F50-00007FFBD8831F56], ImageBase=00007FFBD0CF0000
RtlVirtualUnwind(1, ImageBase=00007FFBD0CF0000, ControlPc=00007FFBD8831F54, FunctionEntry=00007FFBDC7EA750)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000027807FFDD0, ControlPc=00007FFC16C65550
RtlLookupFunctionEntry(ControlPc=00007FFC16C65550)
-> FunctionEntry=00007FFC16D05A8C [00007FFC16C65540-00007FFC16C65580], ImageBase=00007FFC16C50000
RtlVirtualUnwind(1, ImageBase=00007FFC16C50000, ControlPc=00007FFC16C65550, FunctionEntry=00007FFC16D05A8C)
-> ExceptionRoutine=0000000000000000, EstablisherFrame=00000027807FFDE0, ControlPc=00007FFC1794485B
RtlLookupFunctionEntry(ControlPc=00007FFC1794485B)
-> FunctionEntry=00007FFC17AC01EC [00007FFC17944830-00007FFC1794488D], ImageBase=00007FFC17940000
RtlVirtualUnwind(1, ImageBase=00007FFC17940000, ControlPc=00007FFC1794485B, FunctionEntry=00007FFC17AC01EC)
-> ExceptionRoutine=00007FFC179D3AA0, EstablisherFrame=00000027807FFE10, ControlPc=0000000000000000

I believe that the most robust fix here would be to add unwinding information for the dynamically allocated area, but maybe we have other options? For example it seems that we can force a 10 bytes patch; while this may work here it may fail in future versions of kernelbase.dll if the call instruction reaches the first 10 bytes. What do you think?

Thanks!

Flags: needinfo?(davidp99)
Attached patch quickfix.patchSplinter Review

We were able to confirm that this missing unwinding information is the root cause for bug 1798787: using the attached patch makes the associated test pass in release and debug. I provide the patch so that it can be used as a temporary fix to assess whether this could be the root cause for other bugs on the Windows 11 workers. I am preparing the actual, more robust fix which will allocate the unwind information together with the detoured code if we identify that a call lives there.

Assignee: nobody → yjuglaret
Status: NEW → ASSIGNED
Flags: needinfo?(davidp99)

Writing a proper patch that adds unwinding information for detoured code may take a while. In order to give us more time to prepare this patch, I propose in bug 1798787 a simple optimization that will allow us to bypass the specific instance of the problem that was identified on windows 11 workers, although this will not fix the general problem.

Attachment #9305303 - Attachment description: Bug 1800898 - Move UnwindInfo definition to a dedicated WindowsUnwindInfo.h file. r=handyman,gerard-majax → Bug 1800898 - Move UnwindInfo definition to a dedicated WindowsUnwindInfo.h file. r=handyman,mhowell
Attachment #9305306 - Attachment description: WIP: Bug 1800898 - Add unwinding information for in-process detoured code. r=handyman,gerard-majax → Bug 1800898 - Add unwinding information for in-process detoured code. r=handyman,mhowell

(In reply to Yannis Juglaret from comment #1)

[...] I am preparing the actual, more robust fix which will allocate the unwind information together with the detoured code if we identify that a call lives there.

Some notes about the current patch I propose:

  • It adds unwind information even if there is no call being detoured, because we have space anyway (trampolines have a fixed size of 128 bytes and according to tests, unwind information fits for all the functions we hook) and it can help in the (unlikely) event that profiling stops on our detoured code.
  • It does not add unwind information when hooking out-of-process, because that is more tricky as RtlLookupFunctionEntry and RtlAddFunctionTable work on the current process.
Attachment #9306390 - Attachment description: Bug 1800898 - Add a helper function to help iterate unwind codes on Windows. r=mhowell → Bug 1800898 - Add helpers to iterate unwind codes on Windows. r=mhowell
Attachment #9306390 - Attachment description: Bug 1800898 - Add helpers to iterate unwind codes on Windows. r=mhowell → Bug 1800898 - Add helpers to iterate unwind codes on Windows. r=handyman,mhowell

The severity field is not set for this bug.
:glandium, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mh+mozilla)

Let's land the patch on Monday when soft code freeze ends.

Pushed by yjuglaret@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/598fae5e1647
Move UnwindInfo definition to a dedicated WindowsUnwindInfo.h file. r=handyman,mhowell
https://hg.mozilla.org/integration/autoland/rev/1518003a9003
Add helpers to iterate unwind codes on Windows. r=mhowell,handyman
https://hg.mozilla.org/integration/autoland/rev/50d574f1395b
Add unwinding information for in-process detoured code. r=handyman,mhowell
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 110 Branch
Regressions: 1805331
Blocks: 1805435
Regressions: 1810716
Flags: needinfo?(mh+mozilla)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: