Deadlock when stackwalking while a DLL is being loaded
Categories
(Core :: Gecko Profiler, defect, P2)
Tracking
()
People
(Reporter: florian, Unassigned)
References
Details
Attachments
(1 file)
5.71 KB,
text/plain
|
Details |
I was profiling startup of today's Nightly on my 2017 reference laptop, and Firefox remained stuck on the Skeleton UI. From the task manager, I clicked "Create dump file". I transfered that file to a more powerful laptop with Visual Studio installed, and that's how I got stacks (see attachment).
I'm not sure if this is actually specific to startup, seeing the stacks it seems to be a deadlock when the sampler thread interrupts the main thread while it is loading a DLL (RtlInsertInvertedFunctionTable
in the main thread, and RtlpxLookupFunctionTable
in the sampler thread that is stack walking).
I thought we disabled stack walking while loading DLLs, so I'm a bit confused. Maybe an edge case that we missed?
Reporter | ||
Comment 1•3 years ago
|
||
mozilla::freestanding::patched_LdrLoadDll line 356 is https://searchfox.org/mozilla-central/rev/c03e8de87cdb0ce0378c0886d3c0ce8bbf9dc44e/browser/app/winlauncher/freestanding/DllBlocklist.cpp#356
mozilla::interceptor::FuncHookCrossProcess<mozilla::interceptor::WindowsDllInterceptor<...>>::operator() line 254 is https://searchfox.org/mozilla-central/rev/c03e8de87cdb0ce0378c0886d3c0ce8bbf9dc44e/mozglue/misc/nsWindowsDllInterceptor.h#254
I wonder if we are missing a AutoSuppressStackWalking suppress;
somewhere.
Searching for AutoSuppressStackWalking
, I see it's used with LdrUnloadDll, LdrResolveDelayLoadAPI, and RtlInstallFunctionTableCallback, but not with LdrLoadDll
... anymore! It was removed in bug 1542830 (part 4), in favor of calling SuppressStackWalking()
in LoaderObserver::OnBeginDllLoad
.
So I guess the question is: Did LoaderObserver
miss the call, or was it not running (yet) during startup?
Should we re-add an explicit AutoSuppressStackWalking
in patched_LdrLoadDll
? (They can be safely nested.)
Aaron, you worked on bug 1542830, what do you think?
Comment 3•3 years ago
|
||
Florian, how recent was your build? There was a regression that was fixed in bug 1685703 that could have affected this. Have you seen this with any builds that include that fix?
Reporter | ||
Comment 4•3 years ago
|
||
In comment 0 I wrote "today's Nightly" on January 19th. So yes, that's more recent than the fix in bug 1685703.
Comment 5•3 years ago
|
||
Whoops, sorry. Sometimes I read a little too fast and miss things.
The LdrLoadDll
hook always calls into LoaderObserver::OnBeginDllLoad
, which always suppresses stack walking.
But looking at the stack walking suppression code, there is potential for racing there:
Suppose T1 is loading a DLL and calls SuppressStackWalking
.
T2 is doing stack walking and its PC has already passed the check.
Then T1 proceeds under the assumption that it is suppressed, but T2 still proceeds to walk the stack. Boom.
Thank you Aaron, your analysis looks correct to me.
So we need a way to block SuppressStackWalking
while WalkStackMain64
is already working. 🤔
A mutex would be the simplest solution, but it would also block simultaneous stack walks.
Something like an multiple-readers-single-writer exclusion thing (with the "readers" being stack walkers, and the "writer" being the AutoSuppressStackWalking
) would be better, though this would prevent simultaneous AutoSuppressStackWalking
sections.
Since we're going to make changes here, we should consider if we can improve on the previous solution, by blocking stack walking only on the specific thread(s) where SuppressStackWalking
is called.
I'll need to think more about this... Please send suggestions if you know of the perfect mechanism for this!
I think I just got the deadlock (win10/x64 local build). I attached a debugger, the main thread was stuck in the interceptor while patched_LdrLoadDll
was calling stub_LdrLoadDll()
.
We could add AutoSuppressStackWalking
in these patched_...
functions. It may not fix the suppression-check race, but it would help remove places where it could happen.
[update 2022-09-02: Moved to bug 1788859, so this bug here doesn't get side-tracked]
And these could also be nice spots to add informative markers. (But not labels, since AutoSuppressStackWalking
would prevent collecting them!)
Note: The suggestion in comment 7 to add AutoSuppressStackWalking around LdrLoadDll calls was moved to new bug 1788859, because there are other things (see comment 5 above) that should stay the focus of this bug here.
Comment 9•6 months ago
•
|
||
(In reply to (No longer employed by Mozilla) Aaron Klotz from comment #5)
But looking at the stack walking suppression code, there is potential for racing there:
Suppose T1 is loading a DLL and calls
SuppressStackWalking
.
T2 is doing stack walking and its PC has already passed the check.Then T1 proceeds under the assumption that it is suppressed, but T2 still proceeds to walk the stack. Boom.
This analysis is correct except that it cannot happen if T2 is suspending/walking T1, because T2 would have necessarily suspended T1 before checking sStackWalkSuppressions
. A third thread T3 and a second lock L2 could be required to explain this situation. T1 holds L1 but waits for L2; T2 waits for L1 to do stackwalking on T3; T3 has been suspended by T2 while holding L2 (because T2 read sStackSuppressions
before T1 acquired L1). Let me detail this more because it is quite hairy.
L1 is LdrpInvertedFunctionTableSRWLock
. L2 is LdrpMrdataLock
. RtlInsertInvertedFunctionTable
acquires both locks one after the other. It looks like this:
void RtlInsertInvertedFunctionTable(...)
{
RtlCaptureImageExceptionValues(...);
RtlAcquireSRWLockExclusive(&LdrpInvertedFunctionTableSRWLock);
LdrProtectMrdata(0);
RtlpInsertInvertedFunctionTableEntry();
LdrProtectMrdata(1);
RtlReleaseSRWLockExclusive(&LdrpInvertedFunctionTableSRWLock);
}
void LdrProtectMrdata(...)
{
RtlAcquireSRWLockExclusive(&LdrpMrdataLock);
// ...
RtlReleaseSRWLockExclusive(&LdrpMrdataLock);
}
There are code paths that acquire LdrpMrdataLock
without acquiring LdrpInvertedFunctionTableSRWLock
before (for example, AddVectoredExceptionHandler
). A deadlock can occur if the profiler (T2) suspends a thread (T3) while it holds LdrpMrdataLock
but not LdrpInvertedFunctionTableSRWLock
, and a different thread (T1) starts loading a DLL before T2 has finished walking the stack of T3.
Here is what the faulty sequence of events would look like:
T1 | T2 | T3
-------------------------------|-----------------------------|-----------------------------
| | AddVectoredExceptionHandler
| | AcquireExclusive(L2)
| | [ holds L2 ]
| SamplerThread::Run |
| SuspendThread(T3) |
| | [ suspended, holds L2 ]
| DoMozStackWalkThread |
| sStackWalkSuppressions == 0 |
| |
| RtlLookupFunctionEntry |
| AcquireShared(L1) |
| [ holds L1 ] |
| ReleaseShared(L1) |
| |
| RtlLookupFunctionEntry |
| AcquireShared(L1) |
| [ holds L1 ] |
| ReleaseShared(L1) |
| |
| ... |
| |
patched_LdrLoadDll | |
++sStackWalkSuppressions | |
LdrLoadDll | |
RtlInsertInvertedFunctionTable | |
AcquireExclusive(L1) | |
[ holds L1 ] | |
AcquireExclusive(L2) | |
[ waits L2, holds L1 ] | |
| RtlLookupFunctionEntry |
| AcquireShared(L1) |
[ waits L2, holds L1 ] | [ waits L1 ] | [ suspended, holds L2 ]
Comment 10•6 months ago
•
|
||
While I think the scenario I described in comment 9 is a real deadlock scenario, I feel like this scenario would be much more unlikely to manifest than [:florian]'s experience of the deadlock. So I tried to reproduce the deadlock myself to investigate, and managed to. When I attached to the main process, the following evidence showed that the scenario from comment 9 (and, more generally, comment 5) is not the one involved here:
- I had similar stacks to the ones from comment 0;
sStackWalkSuppressions
currently had value 0, despite the main thread being loading a DLL;- the main thread had been suspended by the profiler thread (it was the only thread with a suspend count of 2, where 1 was due to the debugger);
- no other thread than the main thread and the profiler thread seemed involved in the deadlock (no T3 from comment 9).
After some analysis with [:florian] and a bit more investigation, I now manage to reproduce the deadlock consistently by introducing artificial delays at strategic points, and I am now confident that I found the proper root cause for this deadlock. When we use a launcher process, we transition gLoaderObserver
from &gDefaultObserver
to &gMozglueLoaderObserver
in DllBlocklist_Initialize
. If a DLL is being loaded in another thread while we are transitioning, LoaderPrivateAPIImp::NotifyBeginDllLoad
can call gDefaultObserver.OnBeginDllLoad()
but LoaderPrivateAPIImp::NotifyEndDllLoad
can match this call with a gMozglueLoaderObserver.OnEndDllLoad()
for that DLL. If that happens, because gDefaultObserver
does not have stack walk suppressions but gMozglueLoaderObserver
has, sStackWalkSuppressions
will reach (size_t)-1
when the DLL is loaded, instead of 0
. This can be confirmed by adding the following assertion, which crashes the process if we are desuppressing with a zero count, resulting in a crash instead of a deadlock in the faulty situation:
void DesuppressStackWalking() {
MOZ_RELEASE_ASSERT(sStackWalkSuppressions--);
}
sStackWalkSuppressions
reaching (size_t)-1
will break all the purpose of the stack walk suppressions:
- when it would be safe to do stack walking, we will not do it (because
sStackWalkSuppressions
is(size_t)-1
instead of0
), which will result in a lot of missing stacks (bug 1686565); - when it would be unsafe to do stack walking, we will do it (because
sStackWalkSuppressions
is0
instead of1
), which can result in deadlocks (bug 1782125).
Here, the problem can only manifest if we are using a launcher process and there are DLLs loading on a non-main thread before we finish executing DllBlocklist_Initialize
. This is does not occur on my machine under normal circumstances, but it can happen:
- if
MOZ_PROFILER_STARTUP=1
, because the profiler thread can start before we finish executingDllBlocklist_Initialize
, and it results in loading a DLL on that thread; - and probably also if third-party software (e.g. an antivirus product) injects a DLL via
CreateRemoteThread
before we finish executingDllBlocklist_Initialize
.
In the case of MOZ_PROFILER_STARTUP=1
, here is the crash stack I get through the assertion from above:
# 5 Id: 23e0.28d8 Suspend: 1 Teb: 000000b9`ad9af000 Unfrozen
# Child-SP RetAddr Call Site
00 (Inline Function) --------`-------- mozglue!AnnotateMozCrashReason+0xe [C:\mozilla-source-alt\mozilla-unified\obj-x86_64-pc-windows-msvc\dist\include\mozilla\Assertions.h @ 43]
01 000000b9`ae9ffa70 00007ffe`d52c9ef5 mozglue!DesuppressStackWalking+0x21 [C:\mozilla-source-alt\mozilla-unified\mozglue\misc\StackWalk.cpp @ 132]
02 000000b9`ae9ffaa0 00007ff6`b20d7878 mozglue!mozilla::glue::LoaderObserver::OnEndDllLoad+0x35 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\LoaderObserver.cpp @ 53]
03 000000b9`ae9ffc00 00007ff6`b20d623d firefox!mozilla::freestanding::LoaderPrivateAPIImp::NotifyEndDllLoad+0x78 [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\LoaderPrivateAPI.cpp @ 210]
04 000000b9`ae9ffc60 00007ff6`b20d600e firefox!mozilla::freestanding::ModuleLoadFrame::~ModuleLoadFrame+0x2d [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\ModuleLoadFrame.cpp @ 44]
05 000000b9`ae9ffca0 00007fff`3c3f6772 firefox!mozilla::freestanding::patched_LdrLoadDll+0x8e [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\DllBlocklist.cpp @ 377]
06 000000b9`ae9ffd70 00007fff`3c7a1843 KERNELBASE!LoadLibraryExW+0x172
07 000000b9`ae9ffde0 00007fff`3c7a9cc9 ucrtbase!try_get_module+0x4b
08 000000b9`ae9ffe10 00007fff`3c7a9324 ucrtbase!__acrt_AppPolicyGetThreadInitializationTypeInternal+0x79
09 000000b9`ae9ffe40 00007fff`3df926ad ucrtbase!thread_start<unsigned int (__cdecl*)(void *),1>+0x54
0a 000000b9`ae9ffe70 00007fff`3efaa9f8 KERNEL32!BaseThreadInitThunk+0x1d
0b 000000b9`ae9ffea0 00000000`00000000 ntdll!RtlUserThreadStart+0x28
To force the faulty situation, I added the following artificial delays:
- In
browser/app/winlauncher/freestanding/LoaderPrivateAPI.cpp
:
class MOZ_ONLY_USED_TO_AVOID_STATIC_CONSTRUCTORS DefaultLoaderObserver final
: public nt::LoaderObserver {
public:
// ...
void OnBeginDllLoad(void** aContext,
PCUNICODE_STRING aRequestedDllName) final {
static DWORD64 sMainTeb = 0;
if (!sMainTeb) {
sMainTeb = __readgsqword(0x30);
}
// If we are not on the main thread
if (__readgsqword(0x30) != sMainTeb) {
Sleep(500);
}
}
// ...
};
- In
toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp
:
MFBT_API void DllBlocklist_Initialize(uint32_t aInitFlags) {
// ...
Sleep(250);
glue::ModuleLoadFrame::StaticInit(&gMozglueLoaderObserver, &gWinLauncher);
Sleep(500);
// ...
}
Description
•