Closed Bug 973353 Opened 6 years ago Closed 6 years ago

Deadlock when crashing inside IO interposer

Categories

(Core :: General, defect)

x86
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla30

People

(Reporter: ted, Assigned: aklotz)

References

Details

Attachments

(2 files)

My Firefox hung on me, and upon debugging it I determined it was deadlocked because a thread had crashed while holding the IOInterposer lock, and then that thread was waiting on the Breakpad minidump-writing thread, but the minidump-writing thread needs to do IO to write the minidump, so it tried to get the IOInterposer lock...

We should disable the IOInterposer stuff when we hit an exception. Breakpad has a pre-minidump-writing callback that I think we're already using that we can use to do this.
Duplicate of this bug: 973356
"hit an exception" means what exactly, is that a crash scenario? Or an event you might recover from?

Background: MozillaRegisterDebugFD(fd) and MozillaUnRegisterDebugFD(fd) exists to disable the PoisonIOInterposer for specific file descriptors. This serves to avoid recursion and deadlocks like this. So the minidump code should ideally register its file descriptors, just like some logging code does.
But as we now interpose NtCreateFile and NtQueryFullAttributesFile which doesn't take file HANDLE as input, it seems MozillaRegisterDebugFD(fd) won't solve this problem.
(Hmm, I wonder if NtCreateFile and NtQueryFullAttributesFile interposing is causing other problems)

Note, ClearPoisonIOInterposer() and ClearNSPRIOInterposing(), which would definitely do the job, are not safe to call while other threads are running.

So this probably have to be solved by a global boolean that can be flipped to disable IO reporting.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #2)
> "hit an exception" means what exactly, is that a crash scenario? Or an event
> you might recover from?

It's a crash. I wasn't quite able to figure out what the crash was, but it was inside strlen for some reason. Breakpad registers an exception handler that writes a minidump, then we exit the process.

> Background: MozillaRegisterDebugFD(fd) and MozillaUnRegisterDebugFD(fd)
> exists to disable the PoisonIOInterposer for specific file descriptors. This
> serves to avoid recursion and deadlocks like this. So the minidump code
> should ideally register its file descriptors, just like some logging code
> does.
> But as we now interpose NtCreateFile and NtQueryFullAttributesFile which
> doesn't take file HANDLE as input, it seems MozillaRegisterDebugFD(fd) won't
> solve this problem.
> (Hmm, I wonder if NtCreateFile and NtQueryFullAttributesFile interposing is
> causing other problems)

Indeed, Breakpad calls CreateFile, so it doesn't have the file prepared in advance.

> 
> Note, ClearPoisonIOInterposer() and ClearNSPRIOInterposing(), which would
> definitely do the job, are not safe to call while other threads are running.
> 
> So this probably have to be solved by a global boolean that can be flipped
> to disable IO reporting.

Yes. Once we've hit a fatal exception there's no point in worrying about file I/O. We actually already have something in place for the shutdown I/O poisoning:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#852

Is that using the same mechanism? That exact method is only used on non-Windows.
> It's a crash. I wasn't quite able to figure out what the crash was, but it
> was inside strlen for some reason. Breakpad registers an exception handler
> that writes a minidump, then we exit the process.

xul!mozilla::ProfilerIOInterposeObserver::Observe
  const char *filename = (NS_ConvertUTF16toUTF8(aObservation.Filename())).get();

I am guessing the temporary disappeared?
Ah, good catch! It was crashing calling strlen on filename, so that sounds like the actual crash.
So `mozilla::StopLateWriteChecks()` stops late-write-checks, but if other IOInterposeObservers are registered it will not stop the PoisonIOInterposer. Also, it will acquire the lock which is causing a deadlock here.

The easy hack, would be to add and test a boolean called something like `allIOReportingAborted` in `IOInterposer::Report()` before acquiring `mObserverListsLock`. There is no need to lock access to the boolean, as long as it only changes its value from false to true. And this switch is only made when we're crashing.

Oh, and of course fix the disappearing temporary.
We should use Atomic<bool> to suppressing warnings about racing on the bool, but yes, that should be outside the lock to avoid deadlock in this situation.
(In reply to David Major [:dmajor] from comment #4)
> > It's a crash. I wasn't quite able to figure out what the crash was, but it
> > was inside strlen for some reason. Breakpad registers an exception handler
> > that writes a minidump, then we exit the process.
> 
> xul!mozilla::ProfilerIOInterposeObserver::Observe
>   const char *filename =
> (NS_ConvertUTF16toUTF8(aObservation.Filename())).get();
> 
> I am guessing the temporary disappeared?

I specifically pointed out that this temporary was going to be an issue. That's unfortunate that it didn't get fixed.
Filed the crash as bug 973899 so we can fix that separately, keeping this for fixing the deadlock.
This was regressed in bug 902587, and it's making Nightly unusable for me.  I have to back out bug 902587 for now, I'm afraid.
Blocks: 902587
Duplicate of this bug: 973953
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #10)
> This was regressed in bug 902587, and it's making Nightly unusable for me. 
> I have to back out bug 902587 for now, I'm afraid.

We've kept that code in the tree for too long and now it's impossible to back out cleanly.  Aaron said on IRC that he has a very simple fix which will certainly fix this bug so I'll hold off on the backout.
Flags: needinfo?(aklotz)
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #12)
> (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> emailapocalypse) from comment #10)
> > This was regressed in bug 902587, and it's making Nightly unusable for me. 
> > I have to back out bug 902587 for now, I'm afraid.
> 
> We've kept that code in the tree for too long and now it's impossible to
> back out cleanly.  Aaron said on IRC that he has a very simple fix which
> will certainly fix this bug so I'll hold off on the backout.

Landed the crash fix in bug 973899. Obviously we'll want to fix this bug to prevent any future deadlocks.
Flags: needinfo?(aklotz)
(In reply to comment #13)
> (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> emailapocalypse) from comment #12)
> > (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> > emailapocalypse) from comment #10)
> > > This was regressed in bug 902587, and it's making Nightly unusable for me. 
> > > I have to back out bug 902587 for now, I'm afraid.
> > 
> > We've kept that code in the tree for too long and now it's impossible to
> > back out cleanly.  Aaron said on IRC that he has a very simple fix which
> > will certainly fix this bug so I'll hold off on the backout.
> 
> Landed the crash fix in bug 973899. Obviously we'll want to fix this bug to
> prevent any future deadlocks.

Thanks!  Can we please bump the priority of this bug?  This will make it impossible for us to see some crash reports if we hang while collecting them, which seems terrifying.
AFAIK this deadlock only occurs if we crash inside the IO interposer code, with the lock held. We should still definitely fix it, but I don't think it affects other crashes.
(In reply to comment #15)
> AFAIK this deadlock only occurs if we crash inside the IO interposer code, with
> the lock held. We should still definitely fix it, but I don't think it affects
> other crashes.

It would if the crash occurs in another thread racing with this stuff, right?
I'm going to be looking into this bug fairly quickly; I've got a bunch of IOInterposer cleanup in the pipe including a significant reduction in the scope of the lock in question.
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #16)
> It would if the crash occurs in another thread racing with this stuff, right?

I don't *think* so, I don't believe that hitting an exception on one thread immediately pauses other threads. I believe what would happen is that the crashing thread would sit and wait on the minidump-writing thread, which will block on acquiring the IO lock, but whatever thread is currently holding it should continue to run and release the lock.

In any event, this is still terrible and I want it fixed. :)
QA Contact: aklotz
Attached patch Patch v1Splinter Review
This patch adds IOInterposer::Disable() and modifies the CrashReporter filters to use it. We should be calling IOInterposer::Disable() instead of StopLateWriteChecks() because calling the latter does not disable the IOInterposer and its locking.
Attachment #8391446 - Flags: review?(ted)
Attachment #8391446 - Flags: review?(nfroyd)
Attachment #8391446 - Flags: review?(nfroyd) → review+
Oops
Assignee: nobody → aklotz
QA Contact: aklotz
Attachment #8391446 - Flags: review?(ted) → review+
https://hg.mozilla.org/mozilla-central/rev/b8eecfc351b6
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Depends on: 991039
You need to log in before you can comment on or make changes to this bug.