Closed Bug 729536 Opened 12 years ago Closed 12 years ago

Deadlock by xul!nsHttpConnectionMgr::Shutdown

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla19
Tracking Status
firefox13 --- affected
firefox14 --- affected
firefox15 --- affected
firefox16 --- affected
firefox17 + fixed
firefox18 + fixed
firefox19 --- fixed

People

(Reporter: m_kato, Assigned: mayhemer)

References

Details

(Keywords: hang, Whiteboard: [snappy:p1][qa-])

Attachments

(3 files, 1 obsolete file)

When shuting down Firefox Nightly, process isn't terminated.
 
thread 0 and thread 8 has dead lock.  Although thread 0 tries shutdown, lock has thread 8.  and thread 8 tries dispatch runnable object, but thread 0 is waiting lock.


Also, I don't know how to reproduce this.

0:007> ~0k 100
Child-SP          RetAddr           Call Site
00000000`0040e868 000007fe`fdd910dc ntdll!ZwWaitForSingleObject+0xa
00000000`0040e870 000007fe`fafea312 KERNELBASE!WaitForSingleObjectEx+0x79
00000000`0040e910 000007fe`ed396159 nspr4!PR_Wait+0x102
00000000`0040e960 000007fe`ed35a900 xul!mozilla::ReentrantMonitor::Wait+0xd
00000000`0040e990 000007fe`ed35a990 xul!nsHttpConnectionMgr::Shutdown+0xa4
00000000`0040e9c0 000007fe`ed16f919 xul!nsHttpHandler::Observe+0x8c
00000000`0040ea60 000007fe`ed172c81 xul!nsObserverList::NotifyObservers+0x79
00000000`0040eb30 000007fe`ed32977b xul!nsObserverService::NotifyObservers+0xa1
00000000`0040eb60 000007fe`ed340272 xul!nsXREDirProvider::DoShutdown+0x7f
00000000`0040eb90 000007fe`ed27324d xul!ScopedXPCOMStartup::~ScopedXPCOMStartup+
0x56
00000000`0040ebc0 00000001`3f7a17ba xul!XRE_main+0x1261
00000000`0040f0e0 00000001`3f7a1b3a firefox!wmain+0x75a
00000000`0040fc60 00000000`775c652d firefox!__tmainCRTStartup+0x11a
00000000`0040fc90 00000000`776fc521 kernel32!BaseThreadInitThunk+0xd
00000000`0040fcc0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

0:007> ~8k 100
Child-SP          RetAddr           Call Site
00000000`05f7f158 000007fe`fdd910dc ntdll!ZwWaitForSingleObject+0xa
00000000`05f7f160 000007fe`fafe1472 KERNELBASE!WaitForSingleObjectEx+0x79
00000000`05f7f200 000007fe`ed3200ce nspr4!PR_WaitCondVar+0x122
00000000`05f7f240 000007fe`edb1463c xul!mozilla::CondVar::Wait+0xe
00000000`05f7f270 000007fe`edb8f299 xul!mozilla::psm::SyncRunnableBase::DispatchToMainThreadAndWait+0x4c
00000000`05f7f2a0 000007fe`edf55f78 xul!nsNSSSocketInfo::GetPreviousCert+0x59
00000000`05f7f2d0 000007fe`fad9538b xul!HandshakeCallback+0x3f8
00000000`05f7f600 000007fe`fad98938 ssl3!ssl3_FinishHandshake+0x73
00000000`05f7f630 000007fe`fad9b83d ssl3!ssl3_HandleFinished+0x400
00000000`05f7f6b0 000007fe`fad9ba25 ssl3!ssl3_HandleHandshakeMessage+0x2c9
00000000`05f7f740 000007fe`fad9bf63 ssl3!ssl3_HandleHandshake+0xe1
00000000`05f7f770 000007fe`fad9c201 ssl3!ssl3_HandleRecord+0x41f
00000000`05f7f880 000007fe`fad9d7d1 ssl3!ssl3_GatherCompleteHandshake+0x205
00000000`05f7f8e0 000007fe`fada3432 ssl3!ssl_GatherRecord1stHandshake+0x35
00000000`05f7f910 000007fe`fada4518 ssl3!ssl_Do1stHandshake+0x56
00000000`05f7f940 000007fe`fada7fbf ssl3!ssl_SecureSend+0x158
00000000`05f7f980 000007fe`edb8f65c ssl3!ssl_Send+0x6b
00000000`05f7f9b0 000007fe`edb90e11 xul!PSMSend+0x8c
00000000`05f7fa00 000007fe`ed2ae8ae xul!nsSSLIOLayerWrite+0x11
00000000`05f7fa40 000007fe`ed66f8a4 xul!nsSocketOutputStream::Write+0xbe
00000000`05f7fa80 000007fe`ed25701a xul!nsHttpConnection::EnsureNPNComplete+0x406878
00000000`05f7fb60 000007fe`ed269015 xul!nsHttpConnection::OnSocketWritable+0x42
00000000`05f7fba0 000007fe`ed1ff41a xul!nsHttpConnection::OnOutputStreamReady+0x21
00000000`05f7fbd0 000007fe`ed1ff52f xul!nsSocketOutputStream::OnSocketReady+0xa6
00000000`05f7fc00 000007fe`ed1da0f1 xul!nsSocketTransport::OnSocketReady+0xf3
00000000`05f7fc30 000007fe`ed04d417 xul!nsSocketTransportService::DoPollIteration+0x141
00000000`05f7fc90 000007fe`ed0f46c6 xul!nsSocketTransportService::Run+0xc7
00000000`05f7fce0 000007fe`ed194a2c xul!nsThread::ProcessNextEvent+0x1c6
00000000`05f7fdb0 000007fe`fafe2508 xul!nsThread::ThreadFunc+0x8c
00000000`05f7fde0 000007fe`fafe128d nspr4!_PR_NativeRunThread+0x158
00000000`05f7fe50 00000000`73f32fdf nspr4!pr_root+0xd
00000000`05f7fe80 00000000`73f33080 MSVCR90!endthreadex+0x47
00000000`05f7feb0 00000000`775c652d MSVCR90!endthreadex+0xe8
00000000`05f7fee0 00000000`776fc521 kernel32!BaseThreadInitThunk+0xd
00000000`05f7ff10 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
nsHttpConnectionMgr::Shutdown may let the event queue loop while waiting for the monitor (using mon.Wait(e.g. 100ms);).
Severity: normal → critical
What do you guys think about Honza's proposed solution in comment 1?

Previously, we have already established the pattern that, if we want the Socket Transport Thread and the main thread to be synced, then we must first block the socket transport thread and then execute the main thread option. This is what CertErrorRunnableRunnable (not a typo) and other code does.

If nsHttpConnectionMgr were to follow this pattern, it would send an event to the socket transport service thread, and then use SyncRunnableBase (or similar) to execute the stuff on the main thread that must be executed on the main thread.

If Honza's proposed solution will not work, then I have another potential solution that would work, but which would be more effort. This would require all of the following:

(a) Dropping support for main-thread-only (i.e. JS)  nsICertBadCertListener2 and nsISSLErrorListener. Every single nsICertBadCertListener2 and nsISSLErrorListener in mozilla-central and comm-central does "return true;" to suppress the prompts that got turned off by default in bug 682329. We can just stop calling bad cert handlers and SSL error listeners completely without having any notable negative effect on Firefox or Thunderbird themselves. The following addons on AMO would be affected:

     ExQuilla for Microsoft Exchange (for Thunderbird)
     MiTM Me - alternate (click-through) cert error UI
     Lightning for Thunderburd -- emulating
            Thunderbird's unsafe cert error exception
            dialog box.
     KeeFox (A password manager)--AFACT, this one is
             *completely* and silently vulnerable to
            MITM attacks because of how it implements
            nsIBadCertListener2.
     Maybe Chatzilla (didn't read the code)

Obviously, the authors of these extensions want to keep their custom override behavior. At the same time, they are all implementing behavior that we (Mozilla) already rejected as unsafe. (In fact, the interface change from nsIBadCertListener to nsIBadCertListener2 was done explicitly to avoid the behavior that these addons are implementing.)

(b) Dropping support for main-thread-only nsICertOverrideService. Our built-in implementation is already thread-safe. I verified on mxr.mozilla.org/addons that no extensions on AMO would be affected by this. The main effect of making this change is that addons like Convergence and Selenium would have to replace their JS-based implementations with native code implementations. IMO, this is not an unreasonable requirement, but bz objected to it before. (I am willing to literally write the C++ code for them.)

(c) Fixing 679144

(d) Reworking the logic that handles PKCS#11 login, so that we don't need SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with the master password before we start any SSL connections in FIPS mode. (This would solve other problems too, like bug 584014 and related bugs.)
(In reply to Brian Smith (:bsmith) from comment #2)
> If Honza's proposed solution will not work, then I have another potential
> solution that would work, but which would be more effort. This would require
> all of the following:
> 
> (a) Dropping support for main-thread-only (i.e. JS)  nsICertBadCertListener2
> and nsISSLErrorListener.

I have figured out a way to avoid this, I think, that doesn't require changing our support for nsICertBadCertListener2 and nsISSLErrorListener.

> (b) Dropping support for main-thread-only nsICertOverrideService. 

Similarly, I was just wrong about this. It is a non-issue, AFAICT.

> (c) Fixing 679144

This will happen anyway.

> (d) Reworking the logic that handles PKCS#11 login, so that we don't need
> SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
> the master password before we start any SSL connections in FIPS mode. (This
> would solve other problems too, like bug 584014 and related bugs.)

This would be the main remaining issue to solving this by reworking code in PSM.
>> (d) Reworking the logic that handles PKCS#11 login, so that we don't need
>> SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
>> the master password before we start any SSL connections in FIPS mode. (This
>> would solve other problems too, like bug 584014 and related bugs.)

> This would be the main remaining issue to solving this by reworking code
> in PSM.

In English: This would be the main issue blocking the solving of this by reworking code in PSM (vs. spinning the event loop).
(In reply to Brian Smith (:bsmith) from comment #3)
> I have figured out a way to avoid this, I think, that doesn't require
> changing our support for nsICertBadCertListener2 and nsISSLErrorListener.

Bug 754374.

> > (d) Reworking the logic that handles PKCS#11 login, so that we don't need
> > SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with
> > the master password before we start any SSL connections in FIPS mode. (This
> > would solve other problems too, like bug 584014 and related bugs.)
> 
> This would be the main remaining issue to solving this by reworking code in
> PSM.

The PKCS#11 password prompt is a modal dialog that should (AFAICT) block us from shutting down. That is bad UI, but AFAICT it means it is a non-issue for this bug.
OS: Windows 7 → All
Hardware: x86_64 → All
Attached file OS X stack
I'm experiencing what appears to be the same issue on OS X 10.7.4.
Keywords: hang
Whiteboard: [snappy]
I hit this shutting down today.

Can we implement the solution from comment 1 now and rework this code later?
Also doesn't this code need to handle spurious notifications of the condvar?
Whiteboard: [snappy] → [snappy:p1]
I've been hitting this hang again quite a few times lately.  Is someone able to take this snappy:p1?
I saw this several times, with exactly the same OS X stack trace as in MattN's attachment. Most of the times were while shutting down to install Aurora updates, but the most recent time was immediately after waking up from sleep.
I'll try to fix this.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Honza, here's the patch I wrote for this. It removes the "previous cert" caching.

I don't know yet what the performance impact of this would be. The case where it would be problematic is where the top-level document is EV and the session has been resumed. In that case, the SecureBrowserUI and/or browser.js will eventually call "IsExtendedValidation()" on the main thread, which will do a certificate validation, which can block on network I/O and/or disk I/O.

Note that this problematic case can happen anyway, because this optimization only works (if ever) in cases where you navigate from one page to another on the same site. And, it's only for EV, AFAICT.
Attachment #667096 - Flags: review?(honzab.moz)
tracking since this is showing up a lot lately and we'd consider an uplift if the risk/reward ratio was positive.
(In reply to Brian Smith (:bsmith) from comment #15)
> Created attachment 667096 [details] [diff] [review]
> remove GetPreviousCert to avoid most common case of deadlock
> 
> Honza, here's the patch I wrote for this. It removes the "previous cert"
> caching.
> 
> I don't know yet what the performance impact of this would be. The case
> where it would be problematic is where the top-level document is EV and the
> session has been resumed. In that case, the SecureBrowserUI and/or
> browser.js will eventually call "IsExtendedValidation()" on the main thread,
> which will do a certificate validation, which can block on network I/O
> and/or disk I/O.
> 
> Note that this problematic case can happen anyway, because this optimization
> only works (if ever) in cases where you navigate from one page to another on
> the same site. And, it's only for EV, AFAICT.


Sorry, I missed this review requirement first.  

Brian, when you start working on a bug or steal it from somebody (w/o asking him, btw) then at least please change the assignee field to let people around you know.

Then, why aren't you fixing this bug according my comment 1 that would be a general fix?
Attached patch v1Splinter Review
- ::Shutdown() no longer indefinitely waits for monitor notification
- introduced a new event handler called on the main thread that sets a flag ; posted from OnMsgShutdown() handler on the socket thread
- ::Shutdown() loops with ProcessNextEvent until the flag is set

(-w patch for simpler review coming)
Attachment #668452 - Flags: review?(cbiesinger)
(In reply to Honza Bambas (:mayhemer) from comment #17)
> Brian, when you start working on a bug or steal it from somebody (w/o asking
> him, btw) then at least please change the assignee field to let people
> around you know.
> 
> Then, why aren't you fixing this bug according my comment 1 that would be a
> general fix?

This patch is in my queue for an unrelated reason. It just happens to resolve the most common cause of this bug. I just posted it to be helpful. I'm not trying to steal the bug.
(In reply to Brian Smith (:bsmith) from comment #20)
> This patch is in my queue for an unrelated reason. It just happens to
> resolve the most common cause of this bug. I just posted it to be helpful.
> I'm not trying to steal the bug.

Aha, then sorry for my overreaction.  

Do you intend to remove more of this sync stuff?  If so, then I think it would be good to move it to a different bug.  For this bug, so close to release bounce, I think it is safer to take a simpler and more general patch in Necko then touch relatively non-transparent and sensitive PSM code.
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Do you intend to remove more of this sync stuff?  If so, then I think it
> would be good to move it to a different bug.  For this bug, so close to
> release bounce, I think it is safer to take a simpler and more general patch
> in Necko then touch relatively non-transparent and sensitive PSM code.

If you think that spinning the event loop is not going to be problematic, then I agree that is a reasonable short-term solution.

Regardless of this bug, I want to remove GetPreviousCert, because it can actually be bad for performance (due to event dispatching/queuing overhead), because of the nsIDocShell dependency it adds, and because there is a better, more effective way to cache results. However, that better way to cache results requires several more patches to implement.

My intention is to eventually remove all the SyncRunnableBase stuff from PSM. I have a plan for that, which we should discuss. However, it requires make significant changes, including even UI changes for master password. So, definitely we shouldn't block progress on this bug on that.
(In reply to Brian Smith (:bsmith) from comment #22)
> (In reply to Honza Bambas (:mayhemer) from comment #21)
> > Do you intend to remove more of this sync stuff?  If so, then I think it
> > would be good to move it to a different bug.  For this bug, so close to
> > release bounce, I think it is safer to take a simpler and more general patch
> > in Necko then touch relatively non-transparent and sensitive PSM code.
> 
> If you think that spinning the event loop is not going to be problematic,
> then I agree that is a reasonable short-term solution.

I've been checking the code and it is clear we do the event loop spin short after manager shutdown anyway.  During the looping I've introduced the manager is already in an unconditional shutdown state (there is no ref to sts that prevents using the manager mostly).  So I don't feel there would be a regression unless well hidden.

And I also think this way of shutdown is more proper then indefinite waiting for a monitor notification.  PSM may not be the only code that blocks the socket thread by a sync event for the main thread.

> 
> Regardless of this bug, I want to remove GetPreviousCert, because it can
> actually be bad for performance (due to event dispatching/queuing overhead),
> because of the nsIDocShell dependency it adds, and because there is a
> better, more effective way to cache results. However, that better way to
> cache results requires several more patches to implement.
> 
> My intention is to eventually remove all the SyncRunnableBase stuff from
> PSM. I have a plan for that, which we should discuss. However, it requires
> make significant changes, including even UI changes for master password. So,
> definitely we shouldn't block progress on this bug on that.

OK, I agree.
Attachment #667096 - Attachment is obsolete: true
Attachment #667096 - Flags: review?(honzab.moz)
Attachment #668452 - Flags: review?(cbiesinger) → review+
Is the reviewed patch covering the complete fix here?  If so, please get this tested and landed to trunk. To take this fix on Beta we'll want it in earlier so there's time to watch for any regressions.  That means having this assessed for the risks of uplift in the coming week.
https://hg.mozilla.org/mozilla-central/rev/8141e5eb9f0b
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
We're looking for beta uplift here - please nominate asap
Comment on attachment 668452 [details] [diff] [review]
v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): None
User impact if declined: Shutdown deadlock
Testing completed (on m-c, etc.): Spent 6 days of baking on m-c, seems not to cause any trouble
Risk to taking this patch (and alternatives if risky): Low, patch is simple and very straightforward
String or UUID changes made by this patch: None
Attachment #668452 - Flags: approval-mozilla-beta?
Attachment #668452 - Flags: approval-mozilla-aurora?
Attachment #668452 - Flags: approval-mozilla-beta?
Attachment #668452 - Flags: approval-mozilla-beta+
Attachment #668452 - Flags: approval-mozilla-aurora?
Attachment #668452 - Flags: approval-mozilla-aurora+
Thanks for branch-landing this, Ryan.
Is there anything in particular QA can do to verify this is fixed? Any advice on exploratory regression testing we can perform?
Whiteboard: [snappy:p1] → [snappy:p1][qa?]
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #32)
> Is there anything in particular QA can do to verify this is fixed? Any
> advice on exploratory regression testing we can perform?

I don't think this is simple to reproduce.  You could try visit an https page during shutdown, but the race is so close you probably won't be lucky.

I'd just leave this be verified by feedback from users.
Lukas, is there any Beta feedback we can look at to inform the verification? Otherwise there's nothing QA can really do here.
Whiteboard: [snappy:p1][qa?] → [snappy:p1][qa-]
So I still see this issue with Thunderbird 17.0 beta. Today I got an upgrade notice to the latest beta. After applying the downloaded update I triggered a restart and now a thunderbird process is hanging around for about 15 minutes. This is on OS X 10.7.5.
Henrik, if you can attach to a hung Thunderbird with the OS X Activity Monitor and do a "Sample Process", please file a new bug, copied to me, with the process sample attached.
Flags: needinfo?(hskupin)
Yes please, provide stack if you can.  Thanks.
I killed the process already. But I will remember and do it when I see it the next time. It's something which happens occasionally after a couple of days using hibernation mode. So it will take a bit until I can come back with further information.
Flags: needinfo?(hskupin)
Btw. could that be bug 803983?
(In reply to Henrik Skupin (:whimboo) from comment #39)
> Btw. could that be bug 803983?

Yes, let's follow any further Thunderbird shutdown hangs on bug 803983.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: