Closed Bug 1337499 Opened 7 years ago Closed 7 years ago

Hang at RtlpLookupFunctionEntry when stackwalking in Win64 builds

Categories

(Core :: Gecko Profiler, defect)

x86_64
Windows 10
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: mconley, Assigned: jandem)

References

(Blocks 1 open bug)

Details

(Keywords: hang, reproducible)

Attachments

(2 files)

http://searchfox.org/mozilla-central/rev/f5077ad52f8b90183e73038869f6140f0afbf427/mozglue/misc/StackWalk.cpp#389

STR:

1) In a 64-bit build of Nightly for Windows, install the Gecko Profiler Add-on
2) Enable the Gecko Profiler Add-on, so that you are profiling.
3) Proceed to surf the internet
4) Wait for your browser to eventually hang and never recover.

ER:

Should not hang like that, even when profiling.

AR:

We landed patches to fix something similar to this in bug 1334933, but there's still more work to do here.
Stack from the main thread:

 	ntdll.dll!RtlLookupFunctionEntry()	Unknown
>	mozglue.dll!WalkStackMain64(WalkStackData * aData) Line 391	C++
 	mozglue.dll!MozStackWalk(void(*)(unsigned int, void *, void *, void *) aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned __int64 aThread, void * aPlatformData) Line 612	C++
 	xul.dll!Sampler::doNativeBacktrace(ThreadInfo & aInfo, TickSample * aSample) Line 1108	C++
 	xul.dll!Sampler::InplaceTick(TickSample * sample) Line 1325	C++
 	xul.dll!Sampler::GetBacktrace() Line 1402	C++
 	xul.dll!profiler_get_backtrace() Line 1111	C++
 	xul.dll!mozilla::PresShell::ScheduleReflow() Line 9023	C++
 	xul.dll!mozilla::PresShell::FrameNeedsReflow(nsIFrame * aFrame, nsIPresShell::IntrinsicDirty aIntrinsicDirty, nsFrameState aBitToAdd, nsIPresShell::ReflowRootHandling aRootHandling) Line 2804	C++
 	xul.dll!nsBlockFrame::RemoveFrame(mozilla::layout::FrameChildListID aListID, nsIFrame * aOldFrame) Line 5294	C++
 	xul.dll!nsFrameManager::RemoveFrame(mozilla::layout::FrameChildListID aListID, nsIFrame * aOldFrame) Line 510	C++
 	xul.dll!nsCSSFrameConstructor::ContentRemoved(nsIContent * aContainer, nsIContent * aChild, nsIContent * aOldNextSibling, nsCSSFrameConstructor::RemoveFlags aFlags, bool * aDidReconstruct, nsIContent * * aDestroyedFramesFor) Line 8528	C++
 	xul.dll!mozilla::PresShell::ContentRemoved(nsIDocument * aDocument, nsIContent * aMaybeContainer, nsIContent * aChild, int aIndexInContainer, nsIContent * aPreviousSibling) Line 4498	C++
 	xul.dll!nsNodeUtils::ContentRemoved(nsINode * aContainer, nsIContent * aChild, int aIndexInContainer, nsIContent * aPreviousSibling) Line 226	C++
 	xul.dll!nsINode::doRemoveChildAt(unsigned int aIndex, bool aNotify, nsIContent * aKid, nsAttrAndChildArray & aChildArray) Line 1944	C++
 	xul.dll!mozilla::dom::FragmentOrElement::RemoveChildAt(unsigned int aIndex, bool aNotify) Line 1160	C++
 	xul.dll!nsINode::Remove() Line 1863	C++
 	xul.dll!mozilla::dom::ElementBinding::remove(JSContext * cx, JS::Handle<JSObject *> obj, mozilla::dom::Element * self, const JSJitMethodCallArgs & args) Line 3569	C++
 	xul.dll!mozilla::dom::GenericBindingMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Line 2965	C++
What are the other threads doing? This is between XPCOM/JS startup and shutdown, so the JIT is definitely not the one poking at Win64 function tables or taking the lock here.

There must be a similar issue elsewhere.
Note to self to get stacks for each thread from WinDbg next time I hit this.
Flags: needinfo?(mconley)
I wonder if it's something like:

* Main thread takes some internal Windows lock in RtlLookupFunctionEntry.
* Background Hang Monitor thread (or something else) suspends the main thread to walk its stack.
* It calls RtlLookupFunctionEntry but it can wait forever because the suspended main thread holds the lock.

Are we missing the stack walk lock somewhere? (If it's protecting against this too we can't remove it in bug 1336877..)
I'm pretty sure it must be what I described in comment 4.

The stack walk lock was added to keep the BHM/profiler threads from calling RtlLookupFunctionEntry while the JIT was calling RtlAddFunctionEntry or RtlRemoveFunctionEntry. The JIT no longer does that, but we have another, similar situation: the main thread calling MozStackWalk on itself which ends up calling RtlLookupFunctionEntry as well. See comment 4.

So this no longer involves SpiderMonkey but happens when walking the stack of a thread that's walking its own stack (but got suspended).
Just hit this again - this time, it's just the content process hanging.
Flags: needinfo?(mconley)
Is any of that (comment 6) useful?
Flags: needinfo?(jdemooij)
Yeah it confirms the problem is the main thead calling RtlLookupFunctionEntry (and holding some internal Windows lock), then getting suspended by another thread that calls RtlLookupFunctionEntry as well, so we deadlock.

Possible fix is to add this to the start of Sampler::GetBacktrace:

#ifdef _WIN64
  AcquireStackWalkWorkaroundLock();
#endif

And this at the end, right before we return:

#ifdef _WIN64
  ReleaseStackWalkWorkaroundLock();
#endif

And Sampler.cpp may need an #include:

#ifdef _WIN64
#include "mozilla/StackWalk_windows.h"
#endif

I'm not sure if Sampler::GetBacktrace is the best place for this. Let me know if you want a Try build so you could test a patch for this - I can push something tomorrow.
This looks exactly like the latter part of bug 1263595 comment 53: ScheduleReflow on the main thread wants to log a backtrace, and takes the function table lock while doing so; Hang Monitor thread pauses the main thread in order to walk it, but then the HM thread also wants that function table lock.

In that comment I was unsure about whether this was just a one-off missing lock, or if it's a broader class of problem that we could get into more generally. Jan, any thoughts on that?
(In reply to David Major [:dmajor] from comment #9)
> In that comment I was unsure about whether this was just a one-off missing
> lock, or if it's a broader class of problem that we could get into more
> generally. Jan, any thoughts on that?

The best fix might be to have WalkStackMain64 call TryAcquireStackWalkLock if it's walking the main thread's stack, and if we're unable to take that lock we give up. Then the profiler thread and BHM thread no longer need to take the lock, and it will also work automatically for the main thread (and fix this bug).

The profiler thread currently checks whether the target thread CanInvokeJS(). I think we no longer need that as all the function table stuff now happens on the main thread or on the profiler/BHM threads (which we don't profile/monitor I think).

Does that sound reasonable?
Flags: needinfo?(jdemooij)
ni?ing for comment 10
Flags: needinfo?(dmajor)
NI nathan as well (reviewer of the patch that added the stack walk lock).

Proposal is to take the lock (1) in WalkStackMain64 (and give up if it's taken) and (2) In the SpiderMonkey code that calls RtlAddFunctionTable/RtlDeleteFunctionTable during process startup/shutdown.

That should ensure at most one thread is doing Rtl*FunctionTable stuff.
Flags: needinfo?(nfroyd)
I don't know this code well enough to speak for the design. If it makes the hang go away, and we don't see any other hangs crop up, then no objection from me!
Flags: needinfo?(dmajor)
Not to complicate the discussion, but I'm curious about the priorities here. Should the profiler get priority over whatever WalkStackMain64 is used for, or would we rather have the profiler drop a sample if *it* can't take the lock? I'm mostly worried about the profiler getting in the way of walking the stack during a crash, though I don't know if that could happen with this fix.
(In reply to Jan de Mooij [:jandem] from comment #12)
> Proposal is to take the lock (1) in WalkStackMain64 (and give up if it's
> taken) and (2) In the SpiderMonkey code that calls
> RtlAddFunctionTable/RtlDeleteFunctionTable during process startup/shutdown.

I think this is reasonable, but I am a little concerned about the scenarios Emanuel raises.  Though I don't really know if the profiler would even be functioning once we crashed...
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #15)
> I think this is reasonable, but I am a little concerned about the scenarios
> Emanuel raises.  Though I don't really know if the profiler would even be
> functioning once we crashed...

+1. Crash stacks should have priority over profiler sampling.
I concur - crash stacks have higher priority.

Is that enough information to proceed with a fix?
Flags: needinfo?(jdemooij)
(In reply to Mike Conley (:mconley) from comment #17)
> Is that enough information to proceed with a fix?

I'm unfamiliar with the profiler and stack walking code. I'm happy to try to write a patch for the fix I suggested, but if that's not sufficient because we want to make sure crash stacks are prioritized, I don't know what's the best thing to do here.

Nathan, what do you think? Who's working on the profiler these days?
Flags: needinfo?(nfroyd)
I think mstange has been hacking around in the profiler backend recently, and might have an opinion here.
Flags: needinfo?(mstange)
As Mike said, Markus is the right person to ask.

I don't know a way to make sure the crash reporter has priority.  The profiler should just drop samples if it can't take the appropriate lock(s), definitely; the crashreporter should just wait for the lock if the lock is currently held, I guess?  But what happens if the profiler actually crashes while the lock is held?  That's not going to end well.
Flags: needinfo?(nfroyd)
I don't know how the crash reporter factors into this. However, if that's a problem, wouldn't we already have it? Say we crash while the Background Hang Monitor is walking the stack - what happens in that case? Do we get a crash stack? And do the threads other than the one that crashed keep running while we collect the stack?

(In reply to Jan de Mooij [:jandem] from comment #10)
> The best fix might be to have WalkStackMain64 call TryAcquireStackWalkLock
> if it's walking the main thread's stack, and if we're unable to take that
> lock we give up.

This solution sounds good to me, and should be a strict improvement. I don't completely understand why we treat the main thread specially here, though - is it because that's the only thread that currently tries to synchronously get stacks from itself?
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #21)
> This solution sounds good to me, and should be a strict improvement. I don't
> completely understand why we treat the main thread specially here, though -
> is it because that's the only thread that currently tries to synchronously
> get stacks from itself?

Yeah, but it's probably fine and safer to do that unconditionally for every thread for now.

I'll write a patch for this in a bit.
Attached patch PatchSplinter Review
With this patch we try to take the lock in WalkStackMain64 and give up if we are unable to do so (we have to because the suspended thread may be holding it).

It simplifies the code quite a bit and I don't expect this to have a significant effect on performance.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8838022 - Flags: review?(mstange)
Looks green on Try and seems to work fine locally. Mike, here's a Win64 build in case you have some time to test it:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a533cee08cf8af41e41f95e672530fcf7798ea5b&group_state=expanded
Flags: needinfo?(mconley)
Comment on attachment 8838022 [details] [diff] [review]
Patch

Review of attachment 8838022 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. Returning early here will cause aData->pc_count to stay at zero, which will make MozStackWalk return false, so we're adequately reflecting failure to our callers. The profiler could choose to react by annotating samples that failed to walk the stack, and show that in the UI, for example. Not sure how common that's going to be, though.
Attachment #8838022 - Flags: review?(mstange) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ce8c513421e
Take the Win64 stackwalk lock in WalkStackMain64 to avoid deadlocks. r=mstange
I guess I'll try it once it's on central - it only ever hit me about once per day anyways.
Flags: needinfo?(mconley)
https://hg.mozilla.org/mozilla-central/rev/1ce8c513421e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.