Open Bug 1687510 Opened 3 years ago Updated 5 months ago

Deadlock when stackwalking while a DLL is being loaded

Categories

(Core :: Gecko Profiler, defect, P2)

defect

Tracking

()

People

(Reporter: florian, Unassigned)

References

Details

Attachments

(1 file)

Attached file Stacks

I was profiling startup of today's Nightly on my 2017 reference laptop, and Firefox remained stuck on the Skeleton UI. From the task manager, I clicked "Create dump file". I transfered that file to a more powerful laptop with Visual Studio installed, and that's how I got stacks (see attachment).

I'm not sure if this is actually specific to startup, seeing the stacks it seems to be a deadlock when the sampler thread interrupts the main thread while it is loading a DLL (RtlInsertInvertedFunctionTable in the main thread, and RtlpxLookupFunctionTable in the sampler thread that is stack walking).

I thought we disabled stack walking while loading DLLs, so I'm a bit confused. Maybe an edge case that we missed?

mozilla::freestanding::patched_LdrLoadDll line 356 is https://searchfox.org/mozilla-central/rev/c03e8de87cdb0ce0378c0886d3c0ce8bbf9dc44e/browser/app/winlauncher/freestanding/DllBlocklist.cpp#356

mozilla::interceptor::FuncHookCrossProcess<mozilla::interceptor::WindowsDllInterceptor<...>>::operator() line 254 is https://searchfox.org/mozilla-central/rev/c03e8de87cdb0ce0378c0886d3c0ce8bbf9dc44e/mozglue/misc/nsWindowsDllInterceptor.h#254

I wonder if we are missing a AutoSuppressStackWalking suppress; somewhere.

Searching for AutoSuppressStackWalking, I see it's used with LdrUnloadDll, LdrResolveDelayLoadAPI, and RtlInstallFunctionTableCallback, but not with LdrLoadDll ... anymore! It was removed in bug 1542830 (part 4), in favor of calling SuppressStackWalking() in LoaderObserver::OnBeginDllLoad.

So I guess the question is: Did LoaderObserver miss the call, or was it not running (yet) during startup?
Should we re-add an explicit AutoSuppressStackWalking in patched_LdrLoadDll? (They can be safely nested.)

Aaron, you worked on bug 1542830, what do you think?

Severity: -- → S3
Flags: needinfo?(aklotz)
Priority: -- → P2

Florian, how recent was your build? There was a regression that was fixed in bug 1685703 that could have affected this. Have you seen this with any builds that include that fix?

Flags: needinfo?(aklotz) → needinfo?(florian)

In comment 0 I wrote "today's Nightly" on January 19th. So yes, that's more recent than the fix in bug 1685703.

Flags: needinfo?(florian)

Whoops, sorry. Sometimes I read a little too fast and miss things.

The LdrLoadDll hook always calls into LoaderObserver::OnBeginDllLoad, which always suppresses stack walking.

But looking at the stack walking suppression code, there is potential for racing there:

Suppose T1 is loading a DLL and calls SuppressStackWalking.
T2 is doing stack walking and its PC has already passed the check.

Then T1 proceeds under the assumption that it is suppressed, but T2 still proceeds to walk the stack. Boom.

Flags: needinfo?(gsquelart)

Thank you Aaron, your analysis looks correct to me.

So we need a way to block SuppressStackWalking while WalkStackMain64 is already working. 🤔
A mutex would be the simplest solution, but it would also block simultaneous stack walks.
Something like an multiple-readers-single-writer exclusion thing (with the "readers" being stack walkers, and the "writer" being the AutoSuppressStackWalking) would be better, though this would prevent simultaneous AutoSuppressStackWalking sections.

Since we're going to make changes here, we should consider if we can improve on the previous solution, by blocking stack walking only on the specific thread(s) where SuppressStackWalking is called.

I'll need to think more about this... Please send suggestions if you know of the perfect mechanism for this!

Flags: needinfo?(gsquelart)

I think I just got the deadlock (win10/x64 local build). I attached a debugger, the main thread was stuck in the interceptor while patched_LdrLoadDll was calling stub_LdrLoadDll().

We could add AutoSuppressStackWalking in these patched_... functions. It may not fix the suppression-check race, but it would help remove places where it could happen.
[update 2022-09-02: Moved to bug 1788859, so this bug here doesn't get side-tracked]

And these could also be nice spots to add informative markers. (But not labels, since AutoSuppressStackWalking would prevent collecting them!)

Note: The suggestion in comment 7 to add AutoSuppressStackWalking around LdrLoadDll calls was moved to new bug 1788859, because there are other things (see comment 5 above) that should stay the focus of this bug here.

(In reply to (No longer employed by Mozilla) Aaron Klotz from comment #5)

But looking at the stack walking suppression code, there is potential for racing there:

Suppose T1 is loading a DLL and calls SuppressStackWalking.
T2 is doing stack walking and its PC has already passed the check.

Then T1 proceeds under the assumption that it is suppressed, but T2 still proceeds to walk the stack. Boom.

This analysis is correct except that it cannot happen if T2 is suspending/walking T1, because T2 would have necessarily suspended T1 before checking sStackWalkSuppressions. A third thread T3 and a second lock L2 could be required to explain this situation. T1 holds L1 but waits for L2; T2 waits for L1 to do stackwalking on T3; T3 has been suspended by T2 while holding L2 (because T2 read sStackSuppressions before T1 acquired L1). Let me detail this more because it is quite hairy.

L1 is LdrpInvertedFunctionTableSRWLock. L2 is LdrpMrdataLock. RtlInsertInvertedFunctionTable acquires both locks one after the other. It looks like this:

void RtlInsertInvertedFunctionTable(...)
{
  RtlCaptureImageExceptionValues(...);
  RtlAcquireSRWLockExclusive(&LdrpInvertedFunctionTableSRWLock);
  LdrProtectMrdata(0);
  RtlpInsertInvertedFunctionTableEntry();
  LdrProtectMrdata(1);
  RtlReleaseSRWLockExclusive(&LdrpInvertedFunctionTableSRWLock);
}

void LdrProtectMrdata(...)
{
  RtlAcquireSRWLockExclusive(&LdrpMrdataLock);
  // ...
  RtlReleaseSRWLockExclusive(&LdrpMrdataLock);
}

There are code paths that acquire LdrpMrdataLock without acquiring LdrpInvertedFunctionTableSRWLock before (for example, AddVectoredExceptionHandler). A deadlock can occur if the profiler (T2) suspends a thread (T3) while it holds LdrpMrdataLock but not LdrpInvertedFunctionTableSRWLock, and a different thread (T1) starts loading a DLL before T2 has finished walking the stack of T3.

Here is what the faulty sequence of events would look like:

            T1                 |             T2              |             T3
-------------------------------|-----------------------------|-----------------------------
                               |                             | AddVectoredExceptionHandler
                               |                             |    AcquireExclusive(L2)
                               |                             |  [     holds L2        ]
                               |      SamplerThread::Run     |
                               |      SuspendThread(T3)      |
                               |                             |  [ suspended, holds L2 ]
                               |     DoMozStackWalkThread    |
                               | sStackWalkSuppressions == 0 |
                               |                             |
                               |    RtlLookupFunctionEntry   |
                               |      AcquireShared(L1)      |
                               |   [       holds L1       ]  |
                               |      ReleaseShared(L1)      |
                               |                             |
                               |    RtlLookupFunctionEntry   |
                               |      AcquireShared(L1)      |
                               |   [       holds L1       ]  |
                               |      ReleaseShared(L1)      |
                               |                             |
                               |            ...              |
                               |                             |
      patched_LdrLoadDll       |                             |
  ++sStackWalkSuppressions     |                             |
        LdrLoadDll             |                             |
RtlInsertInvertedFunctionTable |                             |
    AcquireExclusive(L1)       |                             |
  [        holds L1         ]  |                             |
    AcquireExclusive(L2)       |                             |
  [    waits L2, holds L1   ]  |                             |
                               |    RtlLookupFunctionEntry   |
                               |      AcquireShared(L1)      |
  [    waits L2, holds L1   ]  |   [      waits L1       ]   |  [ suspended, holds L2 ]

While I think the scenario I described in comment 9 is a real deadlock scenario, I feel like this scenario would be much more unlikely to manifest than [:florian]'s experience of the deadlock. So I tried to reproduce the deadlock myself to investigate, and managed to. When I attached to the main process, the following evidence showed that the scenario from comment 9 (and, more generally, comment 5) is not the one involved here:

  • I had similar stacks to the ones from comment 0;
  • sStackWalkSuppressions currently had value 0, despite the main thread being loading a DLL;
  • the main thread had been suspended by the profiler thread (it was the only thread with a suspend count of 2, where 1 was due to the debugger);
  • no other thread than the main thread and the profiler thread seemed involved in the deadlock (no T3 from comment 9).

After some analysis with [:florian] and a bit more investigation, I now manage to reproduce the deadlock consistently by introducing artificial delays at strategic points, and I am now confident that I found the proper root cause for this deadlock. When we use a launcher process, we transition gLoaderObserver from &gDefaultObserver to &gMozglueLoaderObserver in DllBlocklist_Initialize. If a DLL is being loaded in another thread while we are transitioning, LoaderPrivateAPIImp::NotifyBeginDllLoad can call gDefaultObserver.OnBeginDllLoad() but LoaderPrivateAPIImp::NotifyEndDllLoad can match this call with a gMozglueLoaderObserver.OnEndDllLoad() for that DLL. If that happens, because gDefaultObserver does not have stack walk suppressions but gMozglueLoaderObserver has, sStackWalkSuppressions will reach (size_t)-1 when the DLL is loaded, instead of 0. This can be confirmed by adding the following assertion, which crashes the process if we are desuppressing with a zero count, resulting in a crash instead of a deadlock in the faulty situation:

void DesuppressStackWalking() {
  MOZ_RELEASE_ASSERT(sStackWalkSuppressions--);
}

sStackWalkSuppressions reaching (size_t)-1 will break all the purpose of the stack walk suppressions:

  • when it would be safe to do stack walking, we will not do it (because sStackWalkSuppressions is (size_t)-1 instead of 0), which will result in a lot of missing stacks (bug 1686565);
  • when it would be unsafe to do stack walking, we will do it (because sStackWalkSuppressions is 0 instead of 1), which can result in deadlocks (bug 1782125).

Here, the problem can only manifest if we are using a launcher process and there are DLLs loading on a non-main thread before we finish executing DllBlocklist_Initialize. This is does not occur on my machine under normal circumstances, but it can happen:

  • if MOZ_PROFILER_STARTUP=1, because the profiler thread can start before we finish executing DllBlocklist_Initialize, and it results in loading a DLL on that thread;
  • and probably also if third-party software (e.g. an antivirus product) injects a DLL via CreateRemoteThread before we finish executing DllBlocklist_Initialize.

In the case of MOZ_PROFILER_STARTUP=1, here is the crash stack I get through the assertion from above:

#  5  Id: 23e0.28d8 Suspend: 1 Teb: 000000b9`ad9af000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     mozglue!AnnotateMozCrashReason+0xe [C:\mozilla-source-alt\mozilla-unified\obj-x86_64-pc-windows-msvc\dist\include\mozilla\Assertions.h @ 43]
01 000000b9`ae9ffa70 00007ffe`d52c9ef5     mozglue!DesuppressStackWalking+0x21 [C:\mozilla-source-alt\mozilla-unified\mozglue\misc\StackWalk.cpp @ 132]
02 000000b9`ae9ffaa0 00007ff6`b20d7878     mozglue!mozilla::glue::LoaderObserver::OnEndDllLoad+0x35 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\LoaderObserver.cpp @ 53]
03 000000b9`ae9ffc00 00007ff6`b20d623d     firefox!mozilla::freestanding::LoaderPrivateAPIImp::NotifyEndDllLoad+0x78 [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\LoaderPrivateAPI.cpp @ 210]
04 000000b9`ae9ffc60 00007ff6`b20d600e     firefox!mozilla::freestanding::ModuleLoadFrame::~ModuleLoadFrame+0x2d [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\ModuleLoadFrame.cpp @ 44]
05 000000b9`ae9ffca0 00007fff`3c3f6772     firefox!mozilla::freestanding::patched_LdrLoadDll+0x8e [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\DllBlocklist.cpp @ 377]
06 000000b9`ae9ffd70 00007fff`3c7a1843     KERNELBASE!LoadLibraryExW+0x172
07 000000b9`ae9ffde0 00007fff`3c7a9cc9     ucrtbase!try_get_module+0x4b
08 000000b9`ae9ffe10 00007fff`3c7a9324     ucrtbase!__acrt_AppPolicyGetThreadInitializationTypeInternal+0x79
09 000000b9`ae9ffe40 00007fff`3df926ad     ucrtbase!thread_start<unsigned int (__cdecl*)(void *),1>+0x54
0a 000000b9`ae9ffe70 00007fff`3efaa9f8     KERNEL32!BaseThreadInitThunk+0x1d
0b 000000b9`ae9ffea0 00000000`00000000     ntdll!RtlUserThreadStart+0x28

To force the faulty situation, I added the following artificial delays:

  • In browser/app/winlauncher/freestanding/LoaderPrivateAPI.cpp:
class MOZ_ONLY_USED_TO_AVOID_STATIC_CONSTRUCTORS DefaultLoaderObserver final
    : public nt::LoaderObserver {
 public:
  // ...
  void OnBeginDllLoad(void** aContext,
                      PCUNICODE_STRING aRequestedDllName) final {
    static DWORD64 sMainTeb = 0;
    if (!sMainTeb) {
      sMainTeb = __readgsqword(0x30);
    }
    // If we are not on the main thread
    if (__readgsqword(0x30) != sMainTeb) {
      Sleep(500);
    }
  }
  // ...
};
  • In toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:
MFBT_API void DllBlocklist_Initialize(uint32_t aInitFlags) {
  // ...
  Sleep(250);
  glue::ModuleLoadFrame::StaticInit(&gMozglueLoaderObserver, &gWinLauncher);
  Sleep(500);
  // ...
}
See Also: → 1836225
You need to log in before you can comment on or make changes to this bug.