Closed Bug 1359507 Opened 3 years ago Closed 3 years ago

Mozilla Firefox Nightly 55.0a1 (2017-04-21) (64-bit) goes into unresponsive state and hangs after trying to add attachment on Bugzilla

Categories

(Core :: Gecko Profiler, defect, critical)

55 Branch
x86_64
Windows 7
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 + verified

People

(Reporter: Virtual, Assigned: dmajor)

References

(Blocks 1 open bug)

Details

(Keywords: hang, nightly-community, regression)

Attachments

(2 files)

[Tracking Requested - why for this release]: Regression

STR:
1. Try to add attachment on Bugzilla
2. Enjoy the hang
"Speedy" Regression window (mozilla-central)
Good:
https://ftp.mozilla.org/pub/firefox/nightly/2017/04/2017-04-20-03-03-46-mozilla-central/

Bad:
https://ftp.mozilla.org/pub/firefox/nightly/2017/04/2017-04-21-03-02-41-mozilla-central/

Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=27311156637f9b5d4504373967e01c4241902ae7&tochange=dd530a59750adcaa0d48fa4f69b0cdb52715852a
Summary: Firefox goes into unresponsive state and hangs after trying to add attachment on Bugzilla → Mozilla Firefox Nightly 55.0a1 (2017-04-21) goes into unresponsive state and hangs after trying to add attachment on Bugzilla
It doesn't hang for me. Does this happen on a fresh profile? Can you grab some stacks with WinDbg? https://developer.mozilla.org/en-US/docs/Mozilla/How_to_get_a_stacktrace_with_WinDbg
(the "~*k" step is enough)
Flags: needinfo?(dmajor) → needinfo?(Virtual)
Jim, can you reproduce?
Flags: needinfo?(jmathies)
(In reply to Mike Taylor [:miketaylr] from comment #5)
> Jim, can you reproduce?

Not anymore. The hang was in system code while loading a dll. I think it had something to do with ieframe. I was able to reproduce this over the weekend in a local build but I just checked again and can't now. Haven't seen it in nightly. I did a pull and a clobber yesterday so that might be playing a part here. If it shows back up I'll grab the stack and post but honestly the stack was unactionable for the most part.
Flags: needinfo?(jmathies)
(In reply to David Major [:dmajor] from comment #4)
> It doesn't hang for me.

Maybe it's system dependent,
as it hangs on Windows 7 64-bit with Firefox 64-bit,
Firefox 32-bit is unaffected on Windows 7 64-bit,
and it's not reproducible on Windows 10 64-bit with Firefox 64-bit.


(In reply to David Major [:dmajor] from comment #4)
> Does this happen on a fresh profile?

Yes. Tested on Mozilla Firefox Nightly 55.0a1 (2017-04-21) (64-bit) [Portable] with clean new fresh profile without any addons (extensions, plugins, themes, etc.).


(In reply to David Major [:dmajor] from comment #4)
> Can you grab some stacks with WinDbg?
> https://developer.mozilla.org/en-US/docs/Mozilla/How_to_get_a_stacktrace_with_WinDbg
> (the "~*k" step is enough)

Unfortunately I'm not permitted to install any additional and external software on that system.
Maybe "Create dump file" will do the job?
Flags: needinfo?(Virtual) → needinfo?(dmajor)
Summary: Mozilla Firefox Nightly 55.0a1 (2017-04-21) goes into unresponsive state and hangs after trying to add attachment on Bugzilla → Mozilla Firefox Nightly 55.0a1 (2017-04-21) (64-bit) goes into unresponsive state and hangs after trying to add attachment on Bugzilla
Attached file deadlocked stacks.txt
I caught this in windbg, looks like there's a deadlock occurring occasionally in patched_LdrLoadDll.
Thanks Jim. It's definitely a mutual-wait:

Thread 0:
ntdll!ZwWaitForSingleObject+0xa
ntdll!RtlpWaitOnCriticalSection+0xe8
ntdll!RtlEnterCriticalSection+0xd1     <<<<<<<< waiting on stackwalk lock
mozglue!mozilla::SHA1Sum::SHA1Sum+0x159c
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x18c
ieframe!CRWLock::CRWLock+0x1f
ieframe!CProcessElevationPolicy::CProcessElevationPolicy+0x39
ieframe!_xmm+0xa12
msvcrt!initterm+0x1f
ieframe!DllMainCRTStartup+0x10c
ntdll!LdrpRunInitializeRoutines+0x1fe  <<<<<<< has the loader lock
ntdll!LdrGetProcedureAddressEx+0x2aa
ntdll!LdrGetProcedureAddress+0x11
KERNELBASE!GetProcAddress+0x5c

Thread 25:
ntdll!ZwWaitForSingleObject+0xa
ntdll!RtlpWaitOnCriticalSection+0xe8
ntdll!RtlEnterCriticalSection+0xd1      <<<<<<< waiting on loader lock
ntdll!LdrpLoadDll+0x2c7
ntdll!LdrLoadDll+0xed
mozglue!`anonymous namespace'::patched_LdrLoadDll+0x1a4  <<<<<<< has the stackwalk lock
Assignee: nobody → dmajor
Blocks: 1354611
Flags: needinfo?(dmajor)
> [Tracking Requested - why for this release]: Regression

I will add that this is a new hang and is relatively easy to hit (on 64-bit Win7, at least).

We shouldn't let this escape the nightly channel. I'll back out if I can't fix it quickly.
Thinking about this a bit, nobody should ever have to wait on the stackwalk lock. The intent of the stackwalk lock is to signal "It is not safe to call SuspendThread on me at this time". This ought to be an immediate and infallible operation, and it shouldn't require mutual exclusion. We should even be able to say "don't suspend me" while the profiler is in the middle of tracing another thread!

I'm going to try switching the innards of AcquireStackWalkWorkaroundLock to increment an atomic counter of functions that don't want stackwalking. The profiler can then give up if the counter is nonzero. (In theory this could even be a counter per thread, but I don't want to go there initially.) I discussed something similar with Markus a few weeks ago but ended up not needing it for reasons I don't remember.
Status: NEW → ASSIGNED
Component: Untriaged → DOM: Content Processes
Product: Firefox → Core
Component: DOM: Content Processes → Gecko Profiler
Tracking for 55 since it's a recent regression. Sounds like David has a handle on it already.
Attached patch CounterSplinter Review
This patch replaces {Acquire,Release}StackWalkWorkaroundLock with a global counter of suppressions.

Please closely review my use of Atomic in particular.

Other notes:
- I based the name on JS's AutoSuppressGC.
- dllexport'ing all of struct AutoSuppressStackWalking is a bit odd-looking, but I like that it forces clients to always have their ++ and -- paired up.
- The #ifdef _M_AMD64 was really getting out of hand, so I decided to make this entire family of APIs available only on Win64. Most callers (including the ones yet to land) are already in Win64-only code anyway.
Attachment #8863901 - Flags: review?(nfroyd)
Attachment #8863901 - Flags: review?(mstange)
Comment on attachment 8863901 [details] [diff] [review]
Counter

Review of attachment 8863901 [details] [diff] [review]:
-----------------------------------------------------------------

I can't see anything wrong with this, but I'm not an atomics expert. I'm hoping Nathan can review that part.
Attachment #8863901 - Flags: review?(mstange) → review+
Comment on attachment 8863901 [details] [diff] [review]
Counter

Review of attachment 8863901 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/misc/StackWalk.cpp
@@ +389,5 @@
>  #endif
>  
>  #ifdef _WIN64
> +  // Avoid deadlock; see comment at the definition of stackWalkSupppressions.
> +  if (sStackWalkSuppressions) {

Maybe add another comment to explain how different threads can affect sStackWalkSuppressions between this check and the actual stack walking. Especially: The thread whose stack we'll be walking is currently suspended. If sStackWalkSuppressions has been zero at any point while that thread was suspended, then we know that the suspended thread for sure isn't holding the "lock". Other threads which are not suspended may have incremented sStackWalkSuppressions in the meantime since the check above, but that doesn't matter because we can't deadlock with those.
Comment on attachment 8863901 [details] [diff] [review]
Counter

Review of attachment 8863901 [details] [diff] [review]:
-----------------------------------------------------------------

You know, when efaust did the first patch for the win64 profiler hang, I convinced him that atomics wouldn't work.  Apparently I didn't understand the profiler well enough.

r=me, but Markus is right: this needs a large comment somewhere explaining why atomics are OK and mutexes are not.  Especially why it's OK to only check the atomic variable once before starting stack unwinding, and why we don't have to care about other threads incrementing that variable while we're doing stackwalking.
Attachment #8863901 - Flags: review?(nfroyd) → review+
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab0c15933c74
Replace the stack walk workaround lock with an atomic counter of suppressions. r=mstange,froydnj
https://hg.mozilla.org/mozilla-central/rev/ab0c15933c74
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Looks fixed in Mozilla Firefox Nightly 55.0a1 (2017-05-04) (64-bit).
Thank you very much! \o/
Status: RESOLVED → VERIFIED
Blocks: 1366030
You need to log in before you can comment on or make changes to this bug.