Open Bug 1809655 Opened 2 years ago Updated 2 years ago

Crash in [@ shutdownhang | ... <various> ... | LdrLoadDll ] due to SetWindowsHookEx DLL injection

Categories

(Core :: Widget: Win32, defect, P3)

All
Windows
defect

Tracking

()

People

(Reporter: valentin, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, nightly-community, Whiteboard: [tbird crash][necko-triaged][win:stability])

Crash Data

+++ This bug was initially created as a clone of Bug #1356853 +++

(In reply to Jens Stutte [:jstutte] from bug 1356853 comment #42)

The signature [@ shutdownhang | NtQueryAttributesFile ] shows a fair amount of hangs where we seem to do stub_LdrLoadDll intercepting a PeekMessageW. I cannot really see what kind of DLL we try to load here, but I'd assume, it is not really cache related.

Bug 1356853 was fixed by bug 1786256.
This is to track the remaining crashes.

Summary: Crash in [@ shutdownhang | NtQueryAttributesFile]. CacheFileIOManager because SyncRemoveAllCacheFiles takes too long to Clear Cache on Shutdown for users with clear cache history on close → Crash in [@ shutdownhang | NtQueryAttributesFile] - stub_LdrLoadDll intercepting a PeekMessageW
Blocks: 1682899

The severity field is not set for this bug.
:spohl, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(spohl.mozilla.bugs)
Severity: -- → S3
Flags: needinfo?(spohl.mozilla.bugs)
Priority: -- → P3
Whiteboard: [tbird crash][necko-triaged] → [tbird crash][necko-triaged][win:stability]

:rkraesig, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(rkraesig)

Despite being a topcrasher, it's also a shutdownhang, which as crash bugs go has relatively little impact. P3/S3 will do for now, although we may revisit that.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)

(In reply to Jens Stutte [:jstutte] from bug 1356853 comment #42)

The signature [@ shutdownhang | NtQueryAttributesFile ] shows a fair amount of hangs where we seem to do stub_LdrLoadDll intercepting a PeekMessageW. I cannot really see what kind of DLL we try to load here, but I'd assume, it is not really cache related.

That's not quite what happens. A sample stack:

0 	ntdll.dll 	NtQueryAttributesFile 		context
1 	ntdll.dll 	LdrpGetNtPathFromDosPath 		cfi
2 	ntdll.dll 	LdrpResolveDllName 		cfi
3 	ntdll.dll 	LdrpMapDllFullPath 		cfi
4 	ntdll.dll 	LdrpProcessWork 		cfi
5 	ntdll.dll 	LdrpLoadDllInternal 		cfi
6 	ntdll.dll 	LdrpLoadDll 		cfi
7 	ntdll.dll 	LdrLoadDll 		cfi
8 	firefox.exe 	mozilla::interceptor::FuncHookCrossProcess 	toolkit/xre/dllservices/mozglue/nsWindowsDllInterceptor.h:254 	inlined
8 	firefox.exe 	mozilla::freestanding::patched_LdrLoadDll 	browser/app/winlauncher/freestanding/DllBlocklist.cpp:360 	cfi
9 	KERNELBASE.dll 	LoadLibraryExW 		cfi
10 	user32.dll 	_ClientLoadLibrary 		cfi
11 	ntdll.dll 	KiUserCallbackDispatch 		cfi
12 	win32u.dll 	NtUserPeekMessage 		cfi
13 	user32.dll 	_PeekMessage(tagMSG*, HWND__*, unsigned int, unsigned int, unsigned int, unsigned int, int) 		cfi
14 	user32.dll 	PeekMessageW

The key thing here is the ntdll!KiUserCallbackDispatch. When a thread with a message queue enters certain syscalls (such as NtUserPeekMessage here), Windows may process messages for that thread on that thread: while it's "blocked" in the kernel, the kernel creates a fake stack frame for KiUserCallbackDispatch after the syscall, and then dispatches the message from there.

Usually this would dispatch to a WindowProc; but the target dispatch function here is user32!_ClientLoadLibrary, which is an undocumented Windows function that acts as a landing pad for SetWindowsHookEx injections. Something is trying to load a DLL during shutdown, and apparently is even succeeding in doing so. (Or at least if it is failing, it's not our interceptor causing the failure.)

Given that this is inducing a shutdownhang, I suspect we may have a situation vaguely similar to bug 1823412, where something is causing multiple DLL loads -- but then again, it could just be that shutdown takes too long and that a DLL load is often what pushes it over the edge. Either way, we might be able to get away with just blocking third-party DLL loading during shutdown. :gstoll?

Crash Signature: [@ shutdownhang | NtQueryAttributesFile] [@ shutdownhang | ZwQueryAttributesFile] → [@ shutdownhang | NtQueryAttributesFile] [@ shutdownhang | ZwQueryAttributesFile] [@ shutdownhang | ZwCreateSection] [@ shutdownhang | NtMapViewOfSection] [@ shutdownhang | ZwUnmapViewOfSection]
Flags: needinfo?(rkraesig) → needinfo?(gstoll)
Summary: Crash in [@ shutdownhang | NtQueryAttributesFile] - stub_LdrLoadDll intercepting a PeekMessageW → Crash in [@ shutdownhang | ... <various> ... | LdrLoadDll ] due to SetWindowsHookEx DLL injection

Hmm. I can't think of a reason in theory that blocking DLL loads during shutdown would be a bad thing. (note that we don't know at the time of the load whether they're third-party or not, so we'd have to block all DLLs from loading)

It does make me kind of nervous, so if we want to try this we should do it early in the release cycle so it has plenty of time to see if it reduces shutdownhangs and doesn't cause crashes at shutdown.

Flags: needinfo?(gstoll)

All these hangs seem to be caused by background tasks, and most of the times by backgroundupdate (>99%), though we have also a very few uninstall and removeDirectory.

I think also here bug 1798904 comment 9 applies.

Flags: needinfo?(nalexander)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

(In reply to BugBot [:suhaib / :marco/ :calixte] from comment #6)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Yeah, it seems Fx113 contains something that fixes this already in many cases. I'd assume bug 1832254 could further help for some of the rest.

You need to log in before you can comment on or make changes to this bug.