Closed Bug 1400169 Opened 8 years ago Closed 2 months ago

Crash in CallHookWithSEH

Categories

(External Software Affecting Firefox :: Other, defect, P3)

x86
Windows

Tracking

(firefox-esr52 wontfix, firefox-esr102 wontfix, firefox55 wontfix, firefox56- wontfix, firefox57 wontfix, firefox58 wontfix, firefox113 wontfix, firefox114 wontfix, firefox115 wontfix, firefox149 fixed)

RESOLVED FIXED
149 Branch
Tracking Status
firefox-esr52 --- wontfix
firefox-esr102 --- wontfix
firefox55 --- wontfix
firefox56 - wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox149 --- fixed

People

(Reporter: philipp, Assigned: gstoll)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [main process crash][inj+])

Crash Data

Attachments

(2 files)

[Tracking Requested - why for this release]: this crash signature is starting to rise on windows 10 for users with the creator's update with 32bit versions of firefox since 2017-09-12 (that would be patch tuesday). This bug was filed from the Socorro interface and is report bp-554e999d-b52f-4828-a604-4c3210170913. ============================================================= Crashing Thread (0) Frame Module Signature Source 0 @0x43445a30 1 user32.dll CallHookWithSEH 2 user32.dll __fnHkINLPMSG 3 ntdll.dll KiUserCallbackDispatcher 4 ntdll.dll KiUserApcDispatcher 5 user32.dll IsDialogMessageW 6 user32.dll DialogBox2 7 user32.dll InternalDialogBox 8 user32.dll DialogBoxIndirectParamAorW 9 user32.dll DialogBoxIndirectParamW 10 mozglue.dll mozilla::detail::MutexImpl::unlock() mozglue/misc/Mutex_windows.cpp:33 11 xul.dll nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, nsTimerImpl::Callback&&) xpcom/threads/nsTimerImpl.cpp:232 12 mozglue.dll mozilla::detail::MutexImpl::unlock() mozglue/misc/Mutex_windows.cpp:33 13 xul.dll nsFilePicker::ShowFilePicker(nsString const&) widget/windows/nsFilePicker.cpp:534 14 xul.dll nsFilePicker::ShowW(short*) widget/windows/nsFilePicker.cpp:620 15 xul.dll nsFilePicker::Show(short*) widget/windows/nsFilePicker.cpp:648 16 xul.dll AsyncShowFilePicker::Run() widget/nsBaseFilePicker.cpp:86 17 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1418 18 xul.dll mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:96 19 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:231 20 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:211 21 xul.dll nsBaseAppShell::Run() widget/nsBaseAppShell.cpp:156 22 xul.dll nsAppShell::Run() widget/windows/nsAppShell.cpp:271 23 xul.dll nsAppStartup::Run() toolkit/components/startup/nsAppStartup.cpp:283 24 xul.dll XREMain::XRE_mainRun() toolkit/xre/nsAppRunner.cpp:4567 25 xul.dll XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4747 26 xul.dll XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4842 27 xul.dll mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/Bootstrap.cpp:45 28 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:115 29 firefox.exe __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253 30 kernel32.dll BaseThreadInitThunk 31 ntdll.dll __RtlUserThreadStart 32 ntdll.dll _RtlUserThreadStart
The volume of crashes in 56 seems very low. Track 56- for now. Feel free to nominate again if you don't agree.
tagging this as regression to get more eyeballs on it (though it might not technically be our regression)
Keywords: regression
According to https://support.mozilla.org/en-US/questions/1176159 this is possibly related to Immunet Anti-virus. Worth checking out.
We can try blocklisting. David can you also investigate the issue, seems like a crash related to this antivirus and file picking. Adam, can you also try contacting the company and directing them to this bug?
Flags: needinfo?(davidp99)
Flags: needinfo?(jmathies)
(ni? adam re: Comment #4)
Flags: needinfo?(astevenson)
I believe this is related to DLLs injecting into our process but I think the issue is probably more general than Immunet AV. I haven't been able to come up with anything actionable. First, on Immunet AV, I tried 32-bit Nightly with it installed (Immunet's installer says it chose a 64-bit install). I was unable to get it to crash when spawned from either the chrome process (Ctrl-O) or proxied from the content process (https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/file). I also don't think I see anything related to Immunet (or really AV in general) in the modules or correlations sections of the crash-stats, although maybe they don't appear in crash-stats if they have been unloaded? -- The rest of this comment is deep in the debugging weeds -- As for the actual crash, I believe (pretty strongly but definitely without a proof) that the issue is DLL(s) registering for hooks and not cleaning them up (that is, calling SetWindowsHookEx but freeing the handler they register without calling UnhookWindowsHookEx). I dont have a repro for this bug but I have stepped through the behavior when it _doesn't_ crash and the point of failure seems to be a dynamic call to a function pointer stored in a table. Indeed, it does two lookups and the first one is strong evidence of the second. The first lookup, in KiUserCallbackDispatcher, looks to be bringing up the table of function hooks for a specific hook type given by the idHook parameter of SetWindowHookEx [1]. The killer here is the names of the functions it looks up -- in our crashes the name of the function is "__fnHkINLPMSG", which looks like gibberish, until you see the names of other functions in the table, like "__fnHkINLPMOUSEHOOKSTRUCTEX". This starts to make it pretty clear that the name of the function is a mangled idHook type and our crashes are almost certainly due to hooking the "WH_MSGFILTER" type. If that weren't convincing enough, the docs for WH_MSGFILTER are: "Installs a hook procedure that monitors messages generated as a result of an input event in a dialog box, message box, menu, or scroll bar." I believe it is then crashing calling someones bad WH_MSGFILTER handler. Its more likely that it doesn't exist than that the handler crashed, given the call stack. Another genuine possibility is that the offending application hooked up its handlers with the wrong bitness (ie a 64-bit handler in our 32-bit process). I don't know anything about how this might happen tho. Note that the type of the handler explains why this is happening when we post a dialog -- it may be the rare opportunity for the system to call this handler. Given this explanation, its not a surprise that there are other crashes under CallHookWithSEH that are probably other injections failing to run or to clean up after themselves. For example, [2] is a different crash that seems to be due to a WH_MOUSE handler. [3] is a crash-stats search that looks for __fnHkINLPMSG in the call stack and [4] is the graph showing the spike in this crash on Sep 12. The spike, strangely, is only in release and the esr, despite the release having gone out 2 weeks prior. So I suspect something changed out in the wild -- something like a bad AV release -- I just haven't found it. --- [1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms644990(v=vs.85).aspx [2] https://crash-stats.mozilla.com/report/index/3c3955ed-d66c-4f9b-9cda-971d10170921 [3] https://crash-stats.mozilla.com/signature/?proto_signature=~__fnHkINLPMSG&signature=CallHookWithSEH&date=%3E%3D2017-08-21T23%3A46%3A11.000Z&date=%3C2017-09-21T23%3A46%3A11.000Z [4] https://crash-stats.mozilla.com/signature/?proto_signature=~__fnHkINLPMSG&signature=CallHookWithSEH&date=%3E%3D2017-08-21T23%3A46%3A11.000Z&date=%3C2017-09-21T23%3A46%3A11.000Z#graphs
Flags: needinfo?(davidp99)
Blocks: injecteject
Flags: needinfo?(jmathies)
This is a demo app, hacked up from this github repo [1], that I used to explore SetWindowsHookEx. I was able to use this to demonstrate most of the stuff in my last comment. For example, the method is the callback for the hook for WH_MSGFILTER events and the method is only called when we use dialogs, not menus (specially handled) or scrollbars (due to APZ) or other stuff. What I got wrong was that this handler calls LoadLibrary, so any loaded DLL shouldn't be easily unloaded. However, the crashes seem to happen immediately after the jump -- the topmost valid stack entry is a "call eax" and eax contains the location that we crash at -- so its probably a bad function pointer. And I still don't understand how this stuff works with 32/64-bit inconsistencies. [1] https://github.com/proteansec/visual-studio-projects
Just an update that I've reached out and will update when I hear back.
Flags: needinfo?(astevenson)
They have someone that can help us look into this issue. I've added :jimm to the email chain.
Whiteboard: [main process crash]
This was a known issue and something addressed in their last release 6.0.4, which is available on their website.
Priority: -- → P3
Component: Widget: Win32 → Other
Product: Core → External Software Affecting Firefox
Version: 55 Branch → unspecified
OS: Windows 10 → Windows
Do you know what the issue was? Is there anything we can do on our side?
Flags: needinfo?(jmathies)
Orphaned windows hooks is the current theory, not much to go on here.
Flags: needinfo?(jmathies)
Whiteboard: [main process crash] → [main process crash][inj+]
Severity: critical → S2
See Also: → 1771815

This is pretty low volume now, downgrading to S4.

Severity: S2 → S4

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

:gstoll, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(gstoll)
Keywords: topcrash

This newer batch of crashes seem to be associated with third-party modules published by either Idera, Inc. (which seems to be a database product) or Intuit (seemingly associated with Quickbooks). They are x86-only parent process crashes, and they seem to be stack overflows. Thunderbird has been seeing these for a while I think (see bug 1771815).

I'm not opposed to blocking some things here, but in both of these cases there are a bunch of modules loaded, so we should try to figure out which one is a good one to block.

Severity: S4 → S3
Flags: needinfo?(gstoll)

OK, new theory: these recent round of crashes seem to have started in 147b6, which is what bug 1998650 got uplifted to. :yannis, is it possible that change is allowing some modules to load that were previously not able to?

Flags: needinfo?(yjuglaret)

These are not stack overflows but rather Control Flow Guard fast-fail failures, which use the same error code. The call stack is:

ntdll!RtlFailFast2
ntdll!RtlpHandleInvalidUserCallTarget+0x93
ntdll!RtlDispatchException+0x56c41
ntdll!KiUserExceptionDispatcher+0xf
ntdll!LdrpValidateUserCallTargetBitMapCheck
user32!DispatchHookW+0xcd
user32!CallHookWithSEH+0x59
user32!__fnHkINDWORD+0x26
ntdll!KiUserCallbackDispatcher+0x36
win32u!NtUserPeekMessage+0xc
user32!_PeekMessage+0xeb
user32!PeekMessageW+0x1b1
xul!mozilla::widget::WinUtils::WaitForMessage+0x99 

I am very confident that this crash occurs when a DLL sets up a hook with SetWindowsHookEx and that hook propagates to our threads, if that DLL hasn't been compiled with /cf:guard. Notably, if a program calls it with dwThreadId to 0, that will install the hook on all existing threads running in the same desktop as the calling thread per the documentation. The developers might be doing that without realizing the impact -- in any case, we would need to figure out which DLLs are doing these calls, and block them. So far I have found that pattern being used in evpComponentsXE.bpl and I would thus recommend blocking it (more details about that investigation below).

As for the link to bug 1998650, probably? Perhaps if Firefox 32-bit was previously loading an outdated-enough version of the runtime DLLs from the SysWOW64 directory, that might have resulted in Control Flow Guard getting somehow unenforced for our processes, and now that we load up-to-date DLLs the mitigation would be active again? This would deserve further investigation, but the crash volume does have the pattern of a new issue affecting old software, so likely linked to a change on our side, and that change is indeed a good candidate.

Below are the details for the evpComponentsXE.bpl investigation. I've downloaded the crash dumps and looked for the unusual DLLs being the most present: they are the BPL files signed by Idera, Inc., which are loaded from C:\Program Files (x86)\Sage Evolution and are present in 26 crashes (out of 31) from 8 unique users (based on install_time). But, there are a lot of different BPL files.

Luckily I've found a zip file that contains those BPL files by following this download procedure. The BPL files there are not signed with the same vendor name (maybe due to localization), but they are in the same version, so close enough for investigating. These are Delphi packages, and as suspected they don't have the CF Guard tables like our binaries do, this can be confirmed by comparing the output of dumpbin /loadconfig for mozglue.dll and e.g. evpComponentsXE.bpl.

Then I created the dependency graph between these DLLs and realized that the ones that start with evp are among the most top-level DLLs in that regard. I compared that to the files that import SetWindowsHookExW and confirmed the use of a zero dwThreadId by evpComponentsXE.bpl -- the other .bpl are just its (long list of) dependencies.

Flags: needinfo?(yjuglaret)

(In reply to Yannis Juglaret [:yannis] from comment #17)

I've downloaded the crash dumps and looked for the unusual DLLs being the most present: they are the BPL files signed by Idera, Inc., which are loaded from C:\Program Files (x86)\Sage Evolution and are present in 26 crashes (out of 31) from 8 unique users (based on install_time).

The 5 remaining crashes from 2 unique users are all with Intuit DLLs, so I concurr with comment 15 that Sage Evolution and Intuit are the only impacted software we know of for the moment. We would need to do a similar analysis for Intuit.

I haven't found the 2020 and 2021 versions of the Intuit software (that we have in the crashes), but after looking at the current trial edition, I would say a 90% chance that ui.DLL is the cause (has a call to SetWindowsHookExA with dwThreadId set to 0) and a 10% chance that it is FtuEngine.dll (has calls to SetWindowsHookExA that pass in the result of GetWindowThreadProcessId -- most likely called on their own windows which would be fine). We might want to block both to be sure.

Assignee: nobody → gstoll
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 149 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: