Open Bug 1839299 Opened 5 months ago Updated 5 months ago

Use a finer-grained strategy than stack walking suppressions to avoid stack walking deadlock on Windows x64 and aarch64


(Core :: Gecko Profiler, enhancement, P2)






(Reporter: yannis, Assigned: yannis)




(2 files)

Stack walking on Windows x64 and aarch64 relies on RtlLookupFunctionEntry, which acquires up to two ntdll-internal SRW locks as shared, namely (according to PDB symbols):

  • LdrpInvertedFunctionTableSRWLock: always, to match with a function from an executable file;
  • RtlpDynamicFunctionTableLock: if that fails, to match with function tables for dynamic code.

We currently avoid deadlock situations by using stack walk suppressions, which means that we guard some paths, such that if any thread is within such path at stack walk time we do not attempt to walk the stack. It is a coarse-grained strategy that prevents us from profiling some situations (e.g. DLL loading) as best as we could theoretically. We must guard all paths that could end up acquiring the locks mentioned above, and we are likely to miss some of those paths, which has led and can still lead to deadlock when using the Firefox Profiler, or when using Nightly through Background Hang Reporter.

We could try a fine-grained strategy instead, by collecting pointers to the ntdll-internal SRW locks mentioned above, which would let us use AcquireSRWLockShared and TryAcquireSRWLockShared on these locks, which should solve most of our problems.

See also the discussion starting with bug 1836225 comment 2.

We have hooking code for the LEA instruction in x86_64, but not when it
is REX-prefixed. We need the REX-prefixed variant to hook
RtlReleaseSRWLockShared from ntdll.

Depends on D181628

Assignee: nobody → yjuglaret

On 64-bit Windows (x86_64, aarch64), stack walking relies on
RtlLookupFunctionEntry to navigate from one frame to the next. This
function acquires up to two ntdll internal locks when it is called.

The profiler and the background hang monitor both need to walk the
stacks of suspended threads. This can lead to deadlock situations,
which so far we have avoided with stack walk suppressions. We guard some
critical paths to mark them as suppressing stack walk, and we forbid
stack walking when any thread is currently on such path.

While stack walk suppression has helped remove most deadlock situations,
some can remain because it is hard to detect and manually annotate all
the paths that could lead to a deadlock situation. Another drawback is
that stack walk suppression disables stack walking for much larger
portions of code than required. For example, we disable stack walking
for LdrLoadDll, so we cannot collect stacks while we are loading a DLL.
Yet, the lock that could lead to a deadlock situation is only held
during a very small portion of the whole time spent in LdrLoadDll.

This patch addresses these two issues by implementing a finer-grained
strategy to avoid deadlock situations. We hook ntdll lock acquisition
and release functions in order to collect pointers to the two
problematic locks. This allows us to play with these locks, such that
we can guarantee safe stack walking with no deadlock.

If we fail to collect pointers to the locks, we fall back to using stack
walk suppressions like before. This way we get the best of both worlds:
if we are confident that the situation is under control, we will use the
new strategy and get better profiler accuracy and no deadlock; in case
of doubt, we can still use the profiler thanks to stack walk

Depends on D182553

Bug 1823412 was a good example of a bug where capturing the stack during DLL loads could have been helpful. Back then I shared some profiles in bug 1823412 comment 9. Here is what startup profiles would look like for this same bug if taken using the proposed strategy:

Severity: -- → N/A
Priority: -- → P2
You need to log in before you can comment on or make changes to this bug.