1337499 - Hang at RtlpLookupFunctionEntry when stackwalking in Win64 builds

Reporter

Description

•

9 years ago

http://searchfox.org/mozilla-central/rev/f5077ad52f8b90183e73038869f6140f0afbf427/mozglue/misc/StackWalk.cpp#389 STR: 1) In a 64-bit build of Nightly for Windows, install the Gecko Profiler Add-on 2) Enable the Gecko Profiler Add-on, so that you are profiling. 3) Proceed to surf the internet 4) Wait for your browser to eventually hang and never recover. ER: Should not hang like that, even when profiling. AR: We landed patches to fix something similar to this in bug 1334933, but there's still more work to do here.

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 1

•

9 years ago

Stack from the main thread: ntdll.dll!RtlLookupFunctionEntry() Unknown > mozglue.dll!WalkStackMain64(WalkStackData * aData) Line 391 C++ mozglue.dll!MozStackWalk(void(*)(unsigned int, void *, void *, void *) aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned __int64 aThread, void * aPlatformData) Line 612 C++ xul.dll!Sampler::doNativeBacktrace(ThreadInfo & aInfo, TickSample * aSample) Line 1108 C++ xul.dll!Sampler::InplaceTick(TickSample * sample) Line 1325 C++ xul.dll!Sampler::GetBacktrace() Line 1402 C++ xul.dll!profiler_get_backtrace() Line 1111 C++ xul.dll!mozilla::PresShell::ScheduleReflow() Line 9023 C++ xul.dll!mozilla::PresShell::FrameNeedsReflow(nsIFrame * aFrame, nsIPresShell::IntrinsicDirty aIntrinsicDirty, nsFrameState aBitToAdd, nsIPresShell::ReflowRootHandling aRootHandling) Line 2804 C++ xul.dll!nsBlockFrame::RemoveFrame(mozilla::layout::FrameChildListID aListID, nsIFrame * aOldFrame) Line 5294 C++ xul.dll!nsFrameManager::RemoveFrame(mozilla::layout::FrameChildListID aListID, nsIFrame * aOldFrame) Line 510 C++ xul.dll!nsCSSFrameConstructor::ContentRemoved(nsIContent * aContainer, nsIContent * aChild, nsIContent * aOldNextSibling, nsCSSFrameConstructor::RemoveFlags aFlags, bool * aDidReconstruct, nsIContent * * aDestroyedFramesFor) Line 8528 C++ xul.dll!mozilla::PresShell::ContentRemoved(nsIDocument * aDocument, nsIContent * aMaybeContainer, nsIContent * aChild, int aIndexInContainer, nsIContent * aPreviousSibling) Line 4498 C++ xul.dll!nsNodeUtils::ContentRemoved(nsINode * aContainer, nsIContent * aChild, int aIndexInContainer, nsIContent * aPreviousSibling) Line 226 C++ xul.dll!nsINode::doRemoveChildAt(unsigned int aIndex, bool aNotify, nsIContent * aKid, nsAttrAndChildArray & aChildArray) Line 1944 C++ xul.dll!mozilla::dom::FragmentOrElement::RemoveChildAt(unsigned int aIndex, bool aNotify) Line 1160 C++ xul.dll!nsINode::Remove() Line 1863 C++ xul.dll!mozilla::dom::ElementBinding::remove(JSContext * cx, JS::Handle<JSObject *> obj, mozilla::dom::Element * self, const JSJitMethodCallArgs & args) Line 3569 C++ xul.dll!mozilla::dom::GenericBindingMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Line 2965 C++

Jan de Mooij [:jandem]

Assignee

Comment 2

•

9 years ago

What are the other threads doing? This is between XPCOM/JS startup and shutdown, so the JIT is definitely not the one poking at Win64 function tables or taking the lock here. There must be a similar issue elsewhere.

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 3

•

9 years ago

Note to self to get stacks for each thread from WinDbg next time I hit this.

Flags: needinfo?(mconley)

Jan de Mooij [:jandem]

Assignee

Comment 4

•

9 years ago

I wonder if it's something like: * Main thread takes some internal Windows lock in RtlLookupFunctionEntry. * Background Hang Monitor thread (or something else) suspends the main thread to walk its stack. * It calls RtlLookupFunctionEntry but it can wait forever because the suspended main thread holds the lock. Are we missing the stack walk lock somewhere? (If it's protecting against this too we can't remove it in bug 1336877..)

Jan de Mooij [:jandem]

Assignee

Comment 5

•

9 years ago

I'm pretty sure it must be what I described in comment 4. The stack walk lock was added to keep the BHM/profiler threads from calling RtlLookupFunctionEntry while the JIT was calling RtlAddFunctionEntry or RtlRemoveFunctionEntry. The JIT no longer does that, but we have another, similar situation: the main thread calling MozStackWalk on itself which ends up calling RtlLookupFunctionEntry as well. See comment 4. So this no longer involves SpiderMonkey but happens when walking the stack of a thread that's walking its own stack (but got suspended).

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 6

•

9 years ago

Attached file Stack dumps for hung content process — Details

Just hit this again - this time, it's just the content process hanging.

Flags: needinfo?(mconley)

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 7

•

9 years ago

Is any of that (comment 6) useful?

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Assignee

Comment 8

•

9 years ago

Yeah it confirms the problem is the main thead calling RtlLookupFunctionEntry (and holding some internal Windows lock), then getting suspended by another thread that calls RtlLookupFunctionEntry as well, so we deadlock. Possible fix is to add this to the start of Sampler::GetBacktrace: #ifdef _WIN64 AcquireStackWalkWorkaroundLock(); #endif And this at the end, right before we return: #ifdef _WIN64 ReleaseStackWalkWorkaroundLock(); #endif And Sampler.cpp may need an #include: #ifdef _WIN64 #include "mozilla/StackWalk_windows.h" #endif I'm not sure if Sampler::GetBacktrace is the best place for this. Let me know if you want a Try build so you could test a patch for this - I can push something tomorrow.

(Away)

Comment 9

•

9 years ago

This looks exactly like the latter part of bug 1263595 comment 53: ScheduleReflow on the main thread wants to log a backtrace, and takes the function table lock while doing so; Hang Monitor thread pauses the main thread in order to walk it, but then the HM thread also wants that function table lock. In that comment I was unsure about whether this was just a one-off missing lock, or if it's a broader class of problem that we could get into more generally. Jan, any thoughts on that?

Jan de Mooij [:jandem]

Assignee

Comment 10

•

9 years ago

(In reply to David Major [:dmajor] from comment #9) > In that comment I was unsure about whether this was just a one-off missing > lock, or if it's a broader class of problem that we could get into more > generally. Jan, any thoughts on that? The best fix might be to have WalkStackMain64 call TryAcquireStackWalkLock if it's walking the main thread's stack, and if we're unable to take that lock we give up. Then the profiler thread and BHM thread no longer need to take the lock, and it will also work automatically for the main thread (and fix this bug). The profiler thread currently checks whether the target thread CanInvokeJS(). I think we no longer need that as all the function table stuff now happens on the main thread or on the profiler/BHM threads (which we don't profile/monitor I think). Does that sound reasonable?

Flags: needinfo?(jdemooij)

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 11

•

9 years ago

ni?ing for comment 10

Flags: needinfo?(dmajor)

Jan de Mooij [:jandem]

Assignee

Comment 12

•

9 years ago

NI nathan as well (reviewer of the patch that added the stack walk lock). Proposal is to take the lock (1) in WalkStackMain64 (and give up if it's taken) and (2) In the SpiderMonkey code that calls RtlAddFunctionTable/RtlDeleteFunctionTable during process startup/shutdown. That should ensure at most one thread is doing Rtl*FunctionTable stuff.

Flags: needinfo?(nfroyd)

(Away)

Comment 13

•

9 years ago

I don't know this code well enough to speak for the design. If it makes the hang go away, and we don't see any other hangs crop up, then no objection from me!

Flags: needinfo?(dmajor)

Emanuel Hoogeveen [:ehoogeveen]

Comment 14

•

9 years ago

Not to complicate the discussion, but I'm curious about the priorities here. Should the profiler get priority over whatever WalkStackMain64 is used for, or would we rather have the profiler drop a sample if *it* can't take the lock? I'm mostly worried about the profiler getting in the way of walking the stack during a crash, though I don't know if that could happen with this fix.

Nathan Froyd [:froydnj]

Comment 15

•

9 years ago

(In reply to Jan de Mooij [:jandem] from comment #12) > Proposal is to take the lock (1) in WalkStackMain64 (and give up if it's > taken) and (2) In the SpiderMonkey code that calls > RtlAddFunctionTable/RtlDeleteFunctionTable during process startup/shutdown. I think this is reasonable, but I am a little concerned about the scenarios Emanuel raises. Though I don't really know if the profiler would even be functioning once we crashed...

Flags: needinfo?(nfroyd)

(No longer employed by Mozilla) Aaron Klotz

Comment 16

•

9 years ago

(In reply to Nathan Froyd [:froydnj] from comment #15) > I think this is reasonable, but I am a little concerned about the scenarios > Emanuel raises. Though I don't really know if the profiler would even be > functioning once we crashed... +1. Crash stacks should have priority over profiler sampling.

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 17

•

9 years ago

I concur - crash stacks have higher priority. Is that enough information to proceed with a fix?

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Assignee

Comment 18

•

9 years ago

(In reply to Mike Conley (:mconley) from comment #17) > Is that enough information to proceed with a fix? I'm unfamiliar with the profiler and stack walking code. I'm happy to try to write a patch for the fix I suggested, but if that's not sufficient because we want to make sure crash stacks are prioritized, I don't know what's the best thing to do here. Nathan, what do you think? Who's working on the profiler these days?

Flags: needinfo?(nfroyd)

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 19

•

9 years ago

I think mstange has been hacking around in the profiler backend recently, and might have an opinion here.

Flags: needinfo?(mstange)

Nathan Froyd [:froydnj]

Comment 20

•

9 years ago

As Mike said, Markus is the right person to ask. I don't know a way to make sure the crash reporter has priority. The profiler should just drop samples if it can't take the appropriate lock(s), definitely; the crashreporter should just wait for the lock if the lock is currently held, I guess? But what happens if the profiler actually crashes while the lock is held? That's not going to end well.

Flags: needinfo?(nfroyd)

Markus Stange [:mstange]

Comment 21

•

9 years ago

I don't know how the crash reporter factors into this. However, if that's a problem, wouldn't we already have it? Say we crash while the Background Hang Monitor is walking the stack - what happens in that case? Do we get a crash stack? And do the threads other than the one that crashed keep running while we collect the stack? (In reply to Jan de Mooij [:jandem] from comment #10) > The best fix might be to have WalkStackMain64 call TryAcquireStackWalkLock > if it's walking the main thread's stack, and if we're unable to take that > lock we give up. This solution sounds good to me, and should be a strict improvement. I don't completely understand why we treat the main thread specially here, though - is it because that's the only thread that currently tries to synchronously get stacks from itself?

Flags: needinfo?(mstange)

Jan de Mooij [:jandem]

Assignee

Comment 22

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #21) > This solution sounds good to me, and should be a strict improvement. I don't > completely understand why we treat the main thread specially here, though - > is it because that's the only thread that currently tries to synchronously > get stacks from itself? Yeah, but it's probably fine and safer to do that unconditionally for every thread for now. I'll write a patch for this in a bit.

Jan de Mooij [:jandem]

Assignee

Comment 23

•

9 years ago

Attached patch Patch — Details — Splinter Review

With this patch we try to take the lock in WalkStackMain64 and give up if we are unable to do so (we have to because the suspended thread may be holding it). It simplifies the code quite a bit and I don't expect this to have a significant effect on performance.

Assignee: nobody → jdemooij

Status: NEW → ASSIGNED

Flags: needinfo?(jdemooij)

Attachment #8838022 - Flags: review?(mstange)

Jan de Mooij [:jandem]

Assignee

Comment 24

•

9 years ago

Looks green on Try and seems to work fine locally. Mike, here's a Win64 build in case you have some time to test it: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a533cee08cf8af41e41f95e672530fcf7798ea5b&group_state=expanded

Flags: needinfo?(mconley)

Markus Stange [:mstange]

Comment 25

•

9 years ago

Comment on attachment 8838022 [details] [diff] [review] Patch Review of attachment 8838022 [details] [diff] [review]: ----------------------------------------------------------------- Looks good to me. Returning early here will cause aData->pc_count to stay at zero, which will make MozStackWalk return false, so we're adequately reflecting failure to our callers. The profiler could choose to react by annotating samples that failed to walk the stack, and show that in the UI, for example. Not sure how common that's going to be, though.

Attachment #8838022 - Flags: review?(mstange) → review+

Pulsebot

Comment 26

•

9 years ago

Pushed by jandemooij@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1ce8c513421e Take the Win64 stackwalk lock in WalkStackMain64 to avoid deadlocks. r=mstange

Mike Conley (:mconley) (:⚙️)

Reporter

Comment 27

•

9 years ago

I guess I'll try it once it's on central - it only ever hit me about once per day anyways.

Flags: needinfo?(mconley)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 28

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1ce8c513421e

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Chris Peterson [:cpeterson]

Updated

•

8 years ago

Blocks: win64-rollout

Stack dumps for hung content process 9 years ago Mike Conley (:mconley) (:⚙️) 51.28 KB, text/plain		Details
Patch 9 years ago Jan de Mooij [:jandem] 6.53 KB, patch	mstange : review+	Details \| Diff \| Splinter Review