Closed
Bug 973353
Opened 11 years ago
Closed 11 years ago
Deadlock when crashing inside IO interposer
Categories
(Core :: General, defect)
Tracking
()
RESOLVED
FIXED
mozilla30
People
(Reporter: ted, Assigned: bugzilla)
References
Details
Attachments
(2 files)
My Firefox hung on me, and upon debugging it I determined it was deadlocked because a thread had crashed while holding the IOInterposer lock, and then that thread was waiting on the Breakpad minidump-writing thread, but the minidump-writing thread needs to do IO to write the minidump, so it tried to get the IOInterposer lock...
We should disable the IOInterposer stuff when we hit an exception. Breakpad has a pre-minidump-writing callback that I think we're already using that we can use to do this.
Comment 2•11 years ago
|
||
"hit an exception" means what exactly, is that a crash scenario? Or an event you might recover from?
Background: MozillaRegisterDebugFD(fd) and MozillaUnRegisterDebugFD(fd) exists to disable the PoisonIOInterposer for specific file descriptors. This serves to avoid recursion and deadlocks like this. So the minidump code should ideally register its file descriptors, just like some logging code does.
But as we now interpose NtCreateFile and NtQueryFullAttributesFile which doesn't take file HANDLE as input, it seems MozillaRegisterDebugFD(fd) won't solve this problem.
(Hmm, I wonder if NtCreateFile and NtQueryFullAttributesFile interposing is causing other problems)
Note, ClearPoisonIOInterposer() and ClearNSPRIOInterposing(), which would definitely do the job, are not safe to call while other threads are running.
So this probably have to be solved by a global boolean that can be flipped to disable IO reporting.
Reporter | ||
Comment 3•11 years ago
|
||
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #2)
> "hit an exception" means what exactly, is that a crash scenario? Or an event
> you might recover from?
It's a crash. I wasn't quite able to figure out what the crash was, but it was inside strlen for some reason. Breakpad registers an exception handler that writes a minidump, then we exit the process.
> Background: MozillaRegisterDebugFD(fd) and MozillaUnRegisterDebugFD(fd)
> exists to disable the PoisonIOInterposer for specific file descriptors. This
> serves to avoid recursion and deadlocks like this. So the minidump code
> should ideally register its file descriptors, just like some logging code
> does.
> But as we now interpose NtCreateFile and NtQueryFullAttributesFile which
> doesn't take file HANDLE as input, it seems MozillaRegisterDebugFD(fd) won't
> solve this problem.
> (Hmm, I wonder if NtCreateFile and NtQueryFullAttributesFile interposing is
> causing other problems)
Indeed, Breakpad calls CreateFile, so it doesn't have the file prepared in advance.
>
> Note, ClearPoisonIOInterposer() and ClearNSPRIOInterposing(), which would
> definitely do the job, are not safe to call while other threads are running.
>
> So this probably have to be solved by a global boolean that can be flipped
> to disable IO reporting.
Yes. Once we've hit a fatal exception there's no point in worrying about file I/O. We actually already have something in place for the shutdown I/O poisoning:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#852
Is that using the same mechanism? That exact method is only used on non-Windows.
> It's a crash. I wasn't quite able to figure out what the crash was, but it
> was inside strlen for some reason. Breakpad registers an exception handler
> that writes a minidump, then we exit the process.
xul!mozilla::ProfilerIOInterposeObserver::Observe
const char *filename = (NS_ConvertUTF16toUTF8(aObservation.Filename())).get();
I am guessing the temporary disappeared?
Reporter | ||
Comment 5•11 years ago
|
||
Ah, good catch! It was crashing calling strlen on filename, so that sounds like the actual crash.
Comment 6•11 years ago
|
||
So `mozilla::StopLateWriteChecks()` stops late-write-checks, but if other IOInterposeObservers are registered it will not stop the PoisonIOInterposer. Also, it will acquire the lock which is causing a deadlock here.
The easy hack, would be to add and test a boolean called something like `allIOReportingAborted` in `IOInterposer::Report()` before acquiring `mObserverListsLock`. There is no need to lock access to the boolean, as long as it only changes its value from false to true. And this switch is only made when we're crashing.
Oh, and of course fix the disappearing temporary.
Comment 7•11 years ago
|
||
We should use Atomic<bool> to suppressing warnings about racing on the bool, but yes, that should be outside the lock to avoid deadlock in this situation.
Assignee | ||
Comment 8•11 years ago
|
||
(In reply to David Major [:dmajor] from comment #4)
> > It's a crash. I wasn't quite able to figure out what the crash was, but it
> > was inside strlen for some reason. Breakpad registers an exception handler
> > that writes a minidump, then we exit the process.
>
> xul!mozilla::ProfilerIOInterposeObserver::Observe
> const char *filename =
> (NS_ConvertUTF16toUTF8(aObservation.Filename())).get();
>
> I am guessing the temporary disappeared?
I specifically pointed out that this temporary was going to be an issue. That's unfortunate that it didn't get fixed.
Reporter | ||
Comment 9•11 years ago
|
||
Filed the crash as bug 973899 so we can fix that separately, keeping this for fixing the deadlock.
Comment 10•11 years ago
|
||
This was regressed in bug 902587, and it's making Nightly unusable for me. I have to back out bug 902587 for now, I'm afraid.
Blocks: 902587
Comment 12•11 years ago
|
||
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #10)
> This was regressed in bug 902587, and it's making Nightly unusable for me.
> I have to back out bug 902587 for now, I'm afraid.
We've kept that code in the tree for too long and now it's impossible to back out cleanly. Aaron said on IRC that he has a very simple fix which will certainly fix this bug so I'll hold off on the backout.
Flags: needinfo?(aklotz)
Assignee | ||
Comment 13•11 years ago
|
||
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #12)
> (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> emailapocalypse) from comment #10)
> > This was regressed in bug 902587, and it's making Nightly unusable for me.
> > I have to back out bug 902587 for now, I'm afraid.
>
> We've kept that code in the tree for too long and now it's impossible to
> back out cleanly. Aaron said on IRC that he has a very simple fix which
> will certainly fix this bug so I'll hold off on the backout.
Landed the crash fix in bug 973899. Obviously we'll want to fix this bug to prevent any future deadlocks.
Flags: needinfo?(aklotz)
Comment 14•11 years ago
|
||
(In reply to comment #13)
> (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> emailapocalypse) from comment #12)
> > (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> > emailapocalypse) from comment #10)
> > > This was regressed in bug 902587, and it's making Nightly unusable for me.
> > > I have to back out bug 902587 for now, I'm afraid.
> >
> > We've kept that code in the tree for too long and now it's impossible to
> > back out cleanly. Aaron said on IRC that he has a very simple fix which
> > will certainly fix this bug so I'll hold off on the backout.
>
> Landed the crash fix in bug 973899. Obviously we'll want to fix this bug to
> prevent any future deadlocks.
Thanks! Can we please bump the priority of this bug? This will make it impossible for us to see some crash reports if we hang while collecting them, which seems terrifying.
Reporter | ||
Comment 15•11 years ago
|
||
AFAIK this deadlock only occurs if we crash inside the IO interposer code, with the lock held. We should still definitely fix it, but I don't think it affects other crashes.
Comment 16•11 years ago
|
||
(In reply to comment #15)
> AFAIK this deadlock only occurs if we crash inside the IO interposer code, with
> the lock held. We should still definitely fix it, but I don't think it affects
> other crashes.
It would if the crash occurs in another thread racing with this stuff, right?
Assignee | ||
Comment 17•11 years ago
|
||
I'm going to be looking into this bug fairly quickly; I've got a bunch of IOInterposer cleanup in the pipe including a significant reduction in the scope of the lock in question.
Reporter | ||
Comment 18•11 years ago
|
||
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #16)
> It would if the crash occurs in another thread racing with this stuff, right?
I don't *think* so, I don't believe that hitting an exception on one thread immediately pauses other threads. I believe what would happen is that the crashing thread would sit and wait on the minidump-writing thread, which will block on acquiring the IO lock, but whatever thread is currently holding it should continue to run and release the lock.
In any event, this is still terrible and I want it fixed. :)
Assignee | ||
Updated•11 years ago
|
QA Contact: aklotz
Assignee | ||
Comment 19•11 years ago
|
||
This patch adds IOInterposer::Disable() and modifies the CrashReporter filters to use it. We should be calling IOInterposer::Disable() instead of StopLateWriteChecks() because calling the latter does not disable the IOInterposer and its locking.
Attachment #8391446 -
Flags: review?(ted)
Attachment #8391446 -
Flags: review?(nfroyd)
Updated•11 years ago
|
Attachment #8391446 -
Flags: review?(nfroyd) → review+
Reporter | ||
Updated•11 years ago
|
Attachment #8391446 -
Flags: review?(ted) → review+
Assignee | ||
Comment 21•11 years ago
|
||
Comment 22•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
You need to log in
before you can comment on or make changes to this bug.
Description
•