Closed Bug 1730033 Opened 3 years ago Closed 3 years ago

Crash in [@ _CreateWindowEx] from mscom::ProcessRuntime::ProcessRuntime

Categories

(Core :: IPC: MSCOM, defect)

x86
Windows 7
defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox93 --- wontfix
firefox94 + wontfix
firefox95 + fixed
firefox96 --- fixed

People

(Reporter: mccr8, Assigned: toshi)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/f061d480-e5e5-49d8-8097-51b870210908

Reason: EXCEPTION_ACCESS_VIOLATION_EXEC

Top 10 frames of crashing thread:

0  @0x77894ec3 
1 ntdll.dll KiUserCallbackDispatcher 
2 ntdll.dll KiUserApcDispatcher 
3 user32.dll _CreateWindowEx 
4 user32.dll CreateWindowExW 
5 ole32.dll InitMainThreadWnd 
6 ole32.dll CoInitializeEx d:\w7rtm\com\ole32\com\class\compobj.cxx:2109
7 xul.dll mozilla::mscom::ProcessRuntime::ProcessRuntime ipc/mscom/ProcessRuntime.cpp:134
8 xul.dll XRE_InitChildProcess toolkit/xre/nsEmbedFunctions.cpp:668
9 xul.dll mozilla::BootstrapImpl::XRE_InitChildProcess toolkit/xre/Bootstrap.cpp:67

Not a ton of these crashes. Kind of similar to bug 1652696.

This is showing up in some child processes in 94 Beta according to crash ping telemetry.

  • Looks to have started in beta 2, on or around October 12th.
  • x86 only
  • Windows 7 only

[Tracking Requested - why for this release]:
New crash signature in 94 beta. In the top ten for some utility processes according to crash ping telemetry.

Interesting note, this is not showing up in the GPU or socket processes currently. Child processes that experience it - rdd, content

Not much we can do for the 94.0 release at this point and Socorro data shows this starting to spike already back in August. Still, we can track it for now to keep an eye on it in case of spikes after launch next week.

94 Beta started on 2021-08-09. The Fission Beta experiment increased from 15% Fission in 93 Beta to 30% Fission in 94 Beta on 2021-08-10. However, I only see

Fission Milestone: --- → MVP
Fission Milestone: MVP → ?

I spent some time on this.

The crash happened when user32!__fnINLPCREATESTRUCT tried to jump to a user-mode callback, which should be user32!DispatchClientMessage in this case, but the jump destination was an invalid address for some reason. The jump destination is provided as the parameter of user32!__fnINLPCREATESTRUCT and it was set in kernel.

Interestingly, the lowest two bytes in the invalid address provided and the expected jump destination are the same (In the output below, they are 77894ec3 and 00714ec3). I don't think it's a coincidence. Another unusual thing is the imagebase of user32.dll is a low address. Since user32.dll is a system DLL, it's usually loaded to a high address.

0:000> .excr
eax=001cefec ebx=00000000 ecx=00000000 edx=0000000a esi=001cefc4 edi=00000000
eip=77894ec3 esp=001cef80 ebp=001cefac iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
77894ec3 ??              ???
0:000> kn2
 # ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 001cef7c 0070e98a     0x77894ec3
01 001cefac 77b56fee     user32!__fnINLPCREATESTRUCT+0x8b
0:000> ub 0070e98a
user32!__fnINLPCREATESTRUCT+0x5d:
0070e972 3bcf            cmp     ecx,edi
0070e974 0f85c8a6ffff    jne     user32!__fnINLPCREATESTRUCT+0x61 (00709042)
0070e97a ff7658          push    dword ptr [esi+58h]
0070e97d 50              push    eax
0070e97e ff7620          push    dword ptr [esi+20h]
0070e981 ff761c          push    dword ptr [esi+1Ch]
0070e984 ff7618          push    dword ptr [esi+18h]
0070e987 ff565c          call    dword ptr [esi+5Ch]
0:000> x user32!DispatchClientMessage
00714ec3          user32!DispatchClientMessage (_DispatchClientMessage@20)
0:000> lm m user32
start    end        module name
00700000 007c9000   user32     (pdb symbols)          d:\symbols\user32.pdb\DD74...

If user32.dll was loaded to 77880000 first and then it was unloaded and reloaded to 00700000, this might happen, but I'm not sure it's even possible. Or, somebody mistakenly modified the stack data, replacing the highest two bytes of the jump destination..?

Looking through the crash reports, many of them have Avast's module aswhook.dll loaded, but not all of them. For example, https://crash-stats.mozilla.org/report/index/99fed8db-fc68-4462-9fba-2bff70211013 does not have any third-party modules.

(In reply to Toshihito Kikuchi [:toshi] from comment #6)

If user32.dll was loaded to 77880000 first and then it was unloaded and reloaded to 00700000, this might happen, but I'm not sure it's even possible.

I think I fully understood this crash. What I wrote above was actually correct. There is a rule in Windows 7 32-bit: For win32k's user-mode callbacks to work, user32.dll must be loaded to a single virtual address among all processes. This is because win32k!gpsi stores user32's callback functions directly. In Windows 7 64-bit, on the other hand, win32k!gpsi points to ntdll's functions like ntdll!NtdllDispatchMessage_W that jump to user32's functions like user32!DispatchClientMessage. So it can handle user32's addresses per process.

Because of our win32k-lockdown efforts, firefox.exe delay-loads user32.dll. If someone (our code or injected code) reserves the expected mapped address of user32.dll (77880000 in the example above) before user32.dll is loaded, user32.dll is loaded to a different address like 00700000. Since win32k does not have a place to store that new address, it invokes user32!DispatchClientMessage, always assuming user32.dll is loaded to 77880000, resulting in this crash.

A workaround would be to explicitly load user32.dll as early as possible if the system is Win7 32-bit.

Clearing the Fission Milestone flag because this crash signature affects Fission and e10s equally. Also, this Windows 7 crash signature first appeared in August, but Fission's Windows 7 crashiness has been a problem since much earlier than that.

Fission Milestone: ? → ---
Attached file poc

Attaching a small program to reproduce this crash. Only works on Win7 32-bit OS.

In Windows 7 32-bit, we need to load user32.dll as early as possible to prevent
user32.dll from being mapped to an address different from the one cached in win32k.
If that happens, win32k's usermode callback triggers execute AV.

The proposed mitigation is to pre-load user32.dll in wmain. Ideally user32.dll
should be loaded as a dependent module of firefox.exe, but we cannot unless we ship
different binaries for platforms that support win32k-lockdown and do not.

If a third-party module is loaded before wmain, this does not guarantee user32.dll
will be mapped to an expected address. We'll keep monitoring crash data.

Assignee: nobody → tkikuchi
Status: NEW → ASSIGNED
Pushed by tkikuchi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ca34b82f2933
Pre-load user32.dll in Windows 7 32-bit. r=bobowen

Toshi, does this crash only happen when win32k-lockdown is enabled? Should we uplift your fix to Beta 95 or let it ride the trains with 96?

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

(In reply to Chris Peterson [:cpeterson] from comment #12)

Toshi, does this crash only happen when win32k-lockdown is enabled? Should we uplift your fix to Beta 95 or let it ride the trains with 96?

No, this happens only on Win7 32-bit, where win32k lockdown does not exist (win32k-lockdown is Win10's feature).

I'll request an uplift to see if this patch really reduces the crash cases. Avast injects their code very early (they hook LdrLoadDll for ntdll.dll, that's long before wmain I patched). If their module steals user32's address intentionally or unintentionally before wmain, this patch does not mitigate the problem.

Flags: needinfo?(tkikuchi)

Comment on attachment 9248978 [details]
Bug 1730033 - Pre-load user32.dll in Windows 7 32-bit. r=bobowen

Beta/Release Uplift Approval Request

  • User impact if declined: On Windows 7 32-bit, Tab process may randomly crash. Avast users are more likely to hit the issue. Since this is a tab process startup crash, enabling Fission increases the chance to hit this issue.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is to load user32.dll if the system is Win7 32-bit, so the risk is very limited and low. The only risk is this change may not fully stop the crash if the crash is caused by a third-party that is injected and runs before wmain.
  • String changes made/needed:
Attachment #9248978 - Flags: approval-mozilla-beta?

Comment on attachment 9248978 [details]
Bug 1730033 - Pre-load user32.dll in Windows 7 32-bit. r=bobowen

Approved for 95 beta 3, thanks.

Attachment #9248978 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

I think that's due to attempts to enable on older Windows, see https://bugzilla.mozilla.org/show_bug.cgi?id=1719212

(needinfo Bob to make sure I'm not missing anything)

Flags: needinfo?(gpascutto) → needinfo?(bobowencode)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #19)

I think that's due to attempts to enable on older Windows, see https://bugzilla.mozilla.org/show_bug.cgi?id=1719212

(needinfo Bob to make sure I'm not missing anything)

This is not to do with win32k lockdown (although it is to do with win32k), it is to do with something else causing user32.dll to be mapped to a different address than the one cached for win32k.
Windows 7 32-bit (and I guess earlier) problem.

Flags: needinfo?(bobowencode)

Some further thoughts:
Maybe we can at least mitigate this issue by acting as if win32k lockdown is enabled for much of the content process even on win7 and so never load user32.dll or call win32k APIs. Of course in this case other injected DLLs are loading it, but it seems to be our calls that are crashing. There is another slight caveat in that we believe that win7 might need to fallback to GDI fonts in some cases, but it would hopefully reduce the problem.

The other idea is whether we could use the code we have to restore the import directory, but instead add in user32.dll on win7 32-bit.
I don't know if it is possible for us to expand the list in this way.

Blocks: 1826400
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: