Closed Bug 1798787 Opened 2 years ago Closed 1 year ago

browser_utility_crashReporter.js and browser_utility_audioDecodeCrash.js fail to crash/pass when run on windows 11

Categories

(Core :: IPC, defect, P2)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: gerard-majax)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 obsolete file)

currently we run on windows 10, we are looking to get tests running on windows 11 in the new year. There are some failures that are new and unexpected.

You can see both the errors in this try push.

I am not sure how to decipher this, but I am happy to hack and try other things if there are suggestions.

:jld, any suggestions on how to get more info or fix this?

also there might be a related failure on xpcshell (here). this crashes on shutdown in toolkit/crashreporter/test/unit/test_crash_terminator.js.

Flags: needinfo?(jld)

I don't know much about this, but :gerard-majax is the de facto owner of the Utility process so might have some ideas.

Flags: needinfo?(jld) → needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)

Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIPropertyBag2.getPropertyAsAString]

This suggest something is not right with how processes are crashing: https://searchfox.org/mozilla-central/rev/aa1bb4f0ca8bfda4b117d1befca47b72d5dd6d5d/ipc/glue/test/browser/head.js#266-267 we dont have a dumpID ?

https://searchfox.org/mozilla-central/rev/aa1bb4f0ca8bfda4b117d1befca47b72d5dd6d5d/ipc/glue/UtilityProcessParent.cpp#141,146-148

Assignee: nobody → lissyx+mozillians

oh interesting; Maybe there is something with the OS version, or some expected value from a library/dll when crashing that windows 11 has different or some permissions thing.

if you were going to push to try, you can reproduce it with:
./mach try fuzzy --worker-override="win10-64-2004=gecko-t/win11-64-2009-alpha" --no-artifact -q 'test-windows10' --rebuild 2 ipc/glue/test/browser/

likewise I could poke around a bit tomorrow or test patches out.

I dont know anything about Windows 11, it will take me a while before I can investigate since this requires me to setup a VM ...

Severity: -- → S3
Priority: -- → P2
Depends on: 1800898

After much failed attempts with the TC loaner, I setup a Win11 21H2 local VM and it looks like I can repro the issue. :jmaher was unable to repro, and his VM was 22H2. The TC workers are also on 21H2.

maybe it is worth considering 22H2? :Jonathan?

Flags: needinfo?(jmoss)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #8)

maybe it is worth considering 22H2? :Jonathan?

Working on this today!

Flags: needinfo?(jmoss)

We have identified the root cause for the test failure in bug 1800898 and are working on a fix. Since 21H2 seems to have version-specific bugs, it is quite interesting for us to have the tests run on it at the moment. Could we wait that 21H2-specific bugs are fixed before moving to 22H2?

In order to give us more time to prepare a proper patch for the general problem as stated in bug 1800898, I propose a patch that will bypass the specific instance of the problem that we have identified on our Windows 11 21H2 workers, although this will not fix the general problem.

The patch implements a tail-call optimization for detoured code. When we recognize that the last instruction we hook is a call, instead of using a call instruction that would return to our detoured code, we can simulate a call with the push and jmp instructions, and push a return address that will return directly to the original code.

The problematic hooked function in Windows 11 21H2 workers is as follows:

                             BaseThreadInitThunk
       180015540 48 83 ec 28     SUB        RSP,0x28
       180015544 85 c9           TEST       ECX,ECX
       180015546 75 17           JNZ        LAB_18001555f
       180015548 49 8b c8        MOV        RCX,R8
       18001554b e8 30 ff 00 00  CALL       BaseThreadInitXfgThunk
       180015550 8b c8           MOV        ECX,EAX

Without the optimization, the detoured code looks like this on the Windows 11 21H2 workers:

0:006> u poi(mozglue!stub_BaseThreadInitThunk)
000001d4`d84a0010 4883ec28        sub     rsp,28h
000001d4`d84a0014 85c9            test    ecx,ecx
000001d4`d84a0016 740e            je      000001d4`d84a0026
000001d4`d84a0018 ff2500000000    jmp     qword ptr [000001d4`d84a001e]
000001d4`d84a001e ...             00007ff8`a99e555f (KERNEL32!BaseThreadInitThunk+0x1f)
000001d4`d84a0026 498bc8          mov     rcx,r8
000001d4`d84a0029 ff1502000000    call    qword ptr [000001d4`d84a0031]
000001d4`d84a002f eb08            jmp     000001d4`d84a0039
000001d4`d84a0031 ...             00007ff8`a99f5480 (KERNEL32!BaseThreadInitXfgThunk)
000001d4`d84a0039 ff2500000000    jmp     qword ptr [000001d4`d84a003f]
000001d4`d84a003f ...             00007ff8`a99e5550 (KERNEL32!BaseThreadInitThunk+0x10)

With the optimization, it now looks like this:

0:006> u poi(mozglue!stub_BaseThreadInitThunk)
00000126`bc7c0010 4883ec28        sub     rsp,28h
00000126`bc7c0014 85c9            test    ecx,ecx
00000126`bc7c0016 740e            je      00000126`bc7c0026
00000126`bc7c0018 ff2500000000    jmp     qword ptr [00000126`bc7c001e]
00000126`bc7c001e ...             00007ff8`a99e555f (KERNEL32!BaseThreadInitThunk+0x1f)
00000126`bc7c0026 498bc8          mov     rcx,r8
00000126`bc7c0029 ff3506000000    push    qword ptr [00000126`bc7c0035]
00000126`bc7c002f ff2508000000    jmp     qword ptr [00000126`bc7c003d]
00000126`bc7c0035 ...             00007ff8`a99e5550 (KERNEL32!BaseThreadInitThunk+0x10)
00000126`bc7c003d ...             00007ff8`a99f5480 (KERNEL32!BaseThreadInitXfgThunk)

Because the address pushed on the stack is part of the original code for BaseThreadInitThunk, it benefits from the unwinding information embedded in kernel32.dll. Callees thus manage to do proper stack unwinding and all structured exception handlers get called.

Attachment #9304670 - Attachment description: WIP: Bug 1798787 - Implement a tail-call optimization for detoured code. → Bug 1798787 - Implement a tail-call optimization for detoured code. r=handyman
Attachment #9304670 - Attachment description: Bug 1798787 - Implement a tail-call optimization for detoured code. r=handyman → Bug 1798787 - Implement a tail-call optimization for detoured code in Windows x64 interception. r=handyman

The changes described in comment 12 are actually not compatible with Intel CET's shadow stack security feature. Hopefully I have this feature on my machine which helped me realize this. So it's better to not integrate these changes and fix the general problem instead.

Attachment #9304670 - Attachment is obsolete: true

I'll just leave here some notes from [:gerard-majax] for reproducing the test in a running browser:

start browser, open browser console

const utilityProcessTest = () => {
  return Cc["@mozilla.org/utility-process-test;1"].createInstance(
       Ci.nsIUtilityProcessTest
  );
}
const ProcessTools = Cc["@mozilla.org/processtools-service;1"].getService(
 Ci.nsIProcessToolsService
); 

then open about:processes
then in the console, utilityProcessTest().startProcess(); and you should see a utility process appearing in the about:processes
then issue ProcessTools.crash(x); with x the PID of the process, and it should crash
and about:crashes should have a crash to submit

The patch for bug 1800898 is ready and will fix this bug, it is undergoing code review at the moment.

The code review is over and we can land the patch on Monday when soft code freeze ends.

Depends on: 1805435
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: