Closed Bug 939231 Opened 8 years ago Closed 7 years ago

Stop requiring trace-malloc for the deadlock detector

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: ehsan.akhgari, Unassigned)

References

Details

Attachments

(2 files, 1 obsolete file)

The existing deadlock detector code uses the trace-malloc stack walking
facilities, which is problematic for a few reasons.

1. It is only available in builds with --enable-trace-malloc, which is
   not in the default build configuration.
2. It tries to capture a symbolicated stack trace every time that a lock
   is acquired or released, which is really slow.

This patch changes the deadlock detector to use the XPCOM stack walking
and symbolification facilities, and avoids the symbolification until the
point where we need to print out the call stack, which makes the
deadlock detector a lot faster than it currently is in trace-malloc
builds.
Assignee: nobody → ehsan
Attachment #833079 - Flags: review?(bent.mozilla)
Attached file stderr output
In order to test this patch, I made this change to my tree:

diff --git a/editor/libeditor/base/nsEditor.cpp b/editor/libeditor/base/nsEditor.cpp
index f91d1de..5666b61 100644
--- a/editor/libeditor/base/nsEditor.cpp
+++ b/editor/libeditor/base/nsEditor.cpp
@@ -129,6 +129,13 @@ extern nsIParserService *sParserService;
 //
 //---------------------------------------------------------------------------
 
+#include "mozilla/Monitor.h"
+
+void lock() {
+  static mozilla::Monitor gM ("test");
+  gM.Lock();
+}
+
 nsEditor::nsEditor()
 :  mPlaceHolderName(nullptr)
 ,  mSelState(nullptr)
@@ -150,6 +157,8 @@ nsEditor::nsEditor()
 ,  mDidPostCreate(false)
 ,  mDispatchInputEvent(true)
 {
+  lock();
+  lock();
 }
 
 nsEditor::~nsEditor()

And then started Firefox, which dumped out the correct stacks.
Attached patch Patch (v2)Splinter Review
This should build on more platforms and should stop the leak with the nsTArray inside of kNone in the previous version of the patch.
Attachment #833079 - Attachment is obsolete: true
Attachment #833079 - Flags: review?(bent.mozilla)
Attachment #8333525 - Flags: review?(bent.mozilla)
Comment on attachment 8333525 [details] [diff] [review]
Patch (v2)

Review of attachment 8333525 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/glue/DeadlockDetector.cpp
@@ +19,5 @@
> +
> +nsTArray<void*>
> +CallStack::GetBacktrace()
> +{
> +  nsTArray<void*> callstack;

You could SetCapacity here to something big-ish couldn't you?

@@ +20,5 @@
> +nsTArray<void*>
> +CallStack::GetBacktrace()
> +{
> +  nsTArray<void*> callstack;
> +  NS_StackWalk(StackCallback, 2, 0, &callstack, 0, NULL);

Nit: nullptr

@@ +25,5 @@
> +  return mozilla::Move(callstack);
> +}
> +
> +void
> +CallStack::Print(FILE* f) const

Please stick an fflush at the exits for this function?

::: xpcom/glue/DeadlockDetector.h
@@ +69,5 @@
> +    static nsTArray<void*> GetBacktrace();
> +
> +    static const CallStack NullCallStack()
> +    {
> +      return CallStack(nsTArray<void*>());

I don't understand why this is needed any more. nsTArray already has a mEmptyHdr that is used by default, and it looks to me like this will still create a temporary. I guess you'll Move() it into place, but that just calls SwapElements... and that doesn't have any kind of fast path for swapping two empty arrays as far as I can tell.

Why not just let OrderingEntry::mFirstSeen use its default constructor, and give ResourceAcquisition another constructor that doesn't take a CallStack arg? There's some stuff in BlockingResourceBase that would need fixing, like adding a ClearAcquisitionContext function to match SetAcquisitionContext maybe...
Comment on attachment 8333525 [details] [diff] [review]
Patch (v2)

Review of attachment 8333525 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/glue/DeadlockDetector.cpp
@@ +19,5 @@
> +
> +nsTArray<void*>
> +CallStack::GetBacktrace()
> +{
> +  nsTArray<void*> callstack;

Sure, but why?  This is pretty fast already, and I don't know if I can come up with a magical number which makes sense all the time.  I'm open to suggestions though!

@@ +20,5 @@
> +nsTArray<void*>
> +CallStack::GetBacktrace()
> +{
> +  nsTArray<void*> callstack;
> +  NS_StackWalk(StackCallback, 2, 0, &callstack, 0, NULL);

Haha, it's ironic you caught *me* adding this.  ;-)

::: xpcom/glue/DeadlockDetector.h
@@ +69,5 @@
> +    static nsTArray<void*> GetBacktrace();
> +
> +    static const CallStack NullCallStack()
> +    {
> +      return CallStack(nsTArray<void*>());

Hmm, but if I do that, a callstack is taken every time we create an object of type OrderingEntry or ResourceAcquisition, right?  The purpose of this null value (as far as I can tell) is to enable us to create CallStack objects without actually walking up the stack.

Also, if I'm reading <http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTArray-inl.h#320> correctly, for an empty array, UseAutoArrayBuffer() will return false, so we'll end up taking that fast path.  Am I missing something?

(And also, I'm not sure why we're talking about these micro optimizations...  This code is debug builds only, and it's orders of magnitude faster than what we currently have!)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #5)
> Sure, but why?  This is pretty fast already, and I don't know if I can come
> up with a magical number which makes sense all the time.  I'm open to
> suggestions though!

I'd do something like 32. It'll be too small for some but will fit the majority, I hope. I'm mostly suggesting this just because it seems unnecessary to take a callstack where we know we're going to malloc a bunch of times in quick succession when we could avoid it.

> Hmm, but if I do that, a callstack is taken every time we create an object
> of type OrderingEntry or ResourceAcquisition, right?  The purpose of this
> null value (as far as I can tell) is to enable us to create CallStack
> objects without actually walking up the stack.

Bah. You're right.

> Also, if I'm reading
> <http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTArray-inl.
> h#320> correctly, for an empty array, UseAutoArrayBuffer() will return
> false, so we'll end up taking that fast path.  Am I missing something?

No, I think you're right, I was just expecting a shortcut for 'this->mHdr == aOther.mHdr' or something and I didn't read much further.

> (And also, I'm not sure why we're talking about these micro optimizations...
> This code is debug builds only, and it's orders of magnitude faster than
> what we currently have!)

Fair point!
Let's please do the fflush too
Attachment #8333525 - Flags: review?(bent.mozilla) → review+
I'll try to investigate this more soon...
OK, so I can't reproduce this locally.  The error in the log is coming from <http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#373>, but the MSDN page for StackWalk64 <http://msdn.microsoft.com/en-us/library/windows/desktop/ms680650%28v=vs.85%29.aspx> says that this function generally doesn't set the last error code, so I'm not sure if the "invalid handle" error is to be trusted.

Benoit, dbaron, bsmedberg, have you guys seen this kind of error before by any chance?
Flags: needinfo?(dbaron)
Flags: needinfo?(bgirard)
Flags: needinfo?(benjamin)
When you say "the error", do you mean the deadlock that we're then crashing in via crashinject? Or Thread 2 in the dump is the worker thread that's doing the backtrace, I think (dbghelp.dll is on the stack):

06:17:12     INFO -  Thread 2
06:17:12     INFO -   0  ntdll.dll + 0xe4f4
06:17:12     INFO -      eip = 0x7c90e4f4   esp = 0x04d4f658   ebp = 0x04d4f668   ebx = 0x092e38a8
06:17:12     INFO -      esi = 0x092e3880   edi = 0x092e38d8   eax = 0x05050000   ecx = 0x00000007
06:17:12     INFO -      edx = 0x7c90e4f4   efl = 0x00000246
06:17:12     INFO -      Found by: given as instruction pointer in context
06:17:12     INFO -   1  mozjs.dll!ParallelSafetyVisitor::visitNewCallObject(js::jit::MNewCallObject *) [ParallelSafetyAnalysis.cpp:4887ddabba31 : 529 + 0x60]
06:17:12     INFO -      eip = 0x01088eb8   esp = 0x04d4f670   ebp = 0x1746d000
06:17:12     INFO -      Found by: previous frame's frame pointer
06:17:12     INFO -   2  xul.dll!`anonymous namespace'::CheckMemAvailable() [AvailableMemoryTracker.cpp:4887ddabba31 : 203 + 0x10]
06:17:12     INFO -      eip = 0x024ae8fb   esp = 0x04d4f694   ebp = 0x1746d000
06:17:12     INFO -      Found by: stack scanning
06:17:12     INFO -   3  xul.dll!`anonymous namespace'::CheckMemAvailable() [AvailableMemoryTracker.cpp:4887ddabba31 : 231 + 0x9]
06:17:12     INFO -      eip = 0x024ae9d0   esp = 0x04d4f698   ebp = 0x1746d000
06:17:12     INFO -      Found by: stack scanning
06:17:12     INFO -   4  xul.dll!`anonymous namespace'::MapViewOfFileHook(void *,unsigned long,unsigned long,unsigned long,unsigned long) [AvailableMemoryTracker.cpp:4887ddabba31 : 275 + 0x4]
06:17:12     INFO -      eip = 0x024aea3a   esp = 0x04d4f6e4   ebp = 0x1746d000
06:17:12     INFO -      Found by: stack scanning
06:17:12     INFO -   5  ntdll.dll + 0xcfdb
06:17:12     INFO -      eip = 0x7c90cfdc   esp = 0x04d4f6f0   ebp = 0x04d4f83c
06:17:12     INFO -      Found by: stack scanning
06:17:12     INFO -   6  0x504ffff
06:17:12     INFO -      eip = 0x05050000   esp = 0x04d4f844   ebp = 0xffffffff
06:17:12     INFO -      Found by: previous frame's frame pointer
06:17:12     INFO -   7  dbghelp.dll + 0x2c37e
06:17:12     INFO -      eip = 0x59a8c37f   esp = 0x04d4f854   ebp = 0xffffffff
06:17:12     INFO -      Found by: stack scanning
06:17:12     INFO -   8  dbghelp.dll + 0x2c49f
06:17:12     INFO -      eip = 0x59a8c4a0   esp = 0x04d4f858   ebp = 0xffffffff
06:17:12     INFO -      Found by: stack scanning

So I wonder if our MapViewOfFile hook is somehow deadlocking via the call to GlobalMemoryStatusEx ?
Flags: needinfo?(benjamin)
No I don't think we're deadlocking here at all, I *think* we're just somehow failing to walk the stack, though I can't really explain how...  Why do you think this is a deadlock?
https://tbpl.mozilla.org/php/getParsedLog.php?id=30892856&tree=Mozilla-Inbound is definitely a mochitest that failed because it stopped responding and we injected a crash in order to get a hang minidump. Whether we are actually deadlocked or are calling stackwalk in an iloop or something else I really don't know.
Flags: needinfo?(bgirard)
Hmm, yeah it seems like this test sits there for 5 minutes before we inject the crash into it.
Ben, for the deadlock detector, can we get away by capturing let's say 20 stack frames max?  Or do we actually need the full stack?
Flags: needinfo?(bent.mozilla)
It doesn't sound like that would be awful but I'd certainly prefer full stacks... How is this relevant to the tinderbox error?
Flags: needinfo?(bent.mozilla)
Nevermind, I think I figured out the problem.  It looks like NS_StackWalk is not thread safe [1], and the log shows that we're trying to call it at the same time from multiple threads.  So this in fact probably NS_StackWalk deadlocking.  (Also see bug 374829 comment 8 which talks about a similar problem.)

Should we do our own locking here?

[1] We create eventStart, and send it to the walker thread, and wait for that thread here: <http://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#548>.  Then, in the walking thread, we wait for eventStart <http://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#429>.  Now if we have multiple of these requests in flight, and end up waiting on them in the wrong order, we deadlock, and I don't see us enforcing any ordering in that code.
Actually there is an implicit ordering because of the thread's event loop which guarantees ordering.  So I guess my deadlock hypothesis in comment 18 is not entirely convincing.
I'm mostly out of ideas here, FWIW.
Assignee: ehsan → nobody
erahm: I think one of your changes removed the deadlock detector's dependence on trace-malloc, right?
Flags: needinfo?(dbaron) → needinfo?(erahm)
Yes, I removed the dependency in bug 1050445. Ehsan might want to review that and follow up if there are more changes he'd like to add. The primary difference is that I left stack traces disabled by default, which I think Ehsan was trying to change.

We can probably close this a duplicate anyhow.
Flags: needinfo?(erahm)
The NS_TRACE_MALLOC-only code is gone from the deadlock detector, so this bug can be closed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.