Last Comment Bug 710176 - Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish
: Socket transport service thread pegs the CPU spinning to send data on a SSL s...
Status: RESOLVED FIXED
[psm-roadblock][patch in attachment 5...
: hang, perf, regression
Product: Core
Classification: Components
Component: Security: PSM (show other bugs)
: 11 Branch
: All All
: P1 blocker with 1 vote (vote)
: mozilla12
Assigned To: Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
:
Mentors:
: 716071 721130 (view as bug list)
Depends on:
Blocks: 542832 674147
  Show dependency treegraph
 
Reported: 2011-12-13 04:24 PST by Olli Pettay [:smaug]
Modified: 2012-06-21 06:42 PDT (History)
28 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed


Attachments
screenshot (153.55 KB, image/png)
2012-01-07 04:13 PST, Olli Pettay [:smaug]
no flags Details
Profile from Andreas Gal from bug 716071 (59.32 KB, text/plain)
2012-01-07 18:04 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details
screenshot (120.92 KB, image/png)
2012-01-22 01:45 PST, Olli Pettay [:smaug]
no flags Details
WIP: Part 1: Only do version intolerance timeout before certificate validation starts (11.72 KB, patch)
2012-01-23 01:56 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
brian: feedback-
Details | Diff | Splinter Review
[NOT FOR CHECKIN] force cert validation to take a very long time so we can reliably reproduce this (1.08 KB, patch)
2012-01-23 01:58 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
WIP: Only do version intolerance timeout before or after certificate validation. (5.88 KB, patch)
2012-01-24 01:00 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing] (2.76 KB, patch)
2012-01-25 21:55 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
kaie: review+
Details | Diff | Splinter Review
Part 1: Do not do TLS intolerance timeout during cert validation [v3] (5.52 KB, patch)
2012-01-31 04:30 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
brian: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
Part 2: Import ssl_Poll fix from bug 542832 [v3] (4.17 KB, patch)
2012-01-31 04:34 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
brian: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
Actually abort the connection on send/receive if handshake timed out (8.52 KB, patch)
2012-02-04 20:31 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
kaie: review+
bugs: feedback+
honzab.moz: feedback+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Olli Pettay [:smaug] 2011-12-13 04:24:17 PST
I think there is some recent regression, since I haven't seen the behavior in FF9 builds:
When I start Firefox with lots of tabs (including few tbpl, and ~100 bugzilla pages, all https) Firefox uses over 100% cpu time and loads pages slowly.

When I got the problem last time few days ago, Zoom profiler showed most of the time
was spent doing something SSL related. This time my Zoom evaluation license was unfortunately expired.

Has there been any changes to how we handle SSL?
Comment 1 Patrick McManus [:mcmanus] 2011-12-13 05:56:28 PST
(In reply to Olli Pettay [:smaug] from comment #0)

> 
> Has there been any changes to how we handle SSL?

smaug, big changes in nightly, yes.

Can you supply about:buildconfig changeset id's for either of the instances? (probably just the recent one I guess).
Comment 2 Josh Aas 2011-12-13 08:49:49 PST
Perhaps this is caused by processing large numbers of SSL connects at the same time? Previously, as I understand it, we could only do one at a time.
Comment 3 Honza Bambas (:mayhemer) 2011-12-13 12:04:37 PST
Could that be bug 705755?
Comment 4 Olli Pettay [:smaug] 2011-12-20 11:38:20 PST
I think I got this twice yesterday
Comment 5 Honza Bambas (:mayhemer) 2011-12-21 02:06:12 PST
(In reply to Olli Pettay [:smaug] from comment #4)
> I think I got this twice yesterday

Can you try patch in bug 705755?
Comment 6 Honza Bambas (:mayhemer) 2011-12-21 02:27:38 PST
Also as Josh suggested, this could be related to bug 695789 (no patch so far).
Comment 7 Honza Bambas (:mayhemer) 2011-12-21 03:10:23 PST
Olli, at least some more detailed data would be appreciated like sites your were browsing, what exact version did you have, retry with a clean profile etc.

Thanks.
Comment 8 Olli Pettay [:smaug] 2011-12-21 05:56:55 PST
Unfortunately I don't have much information to share. I use Nightlies and occasionally
one thread, which is not the main thread, starts to take 100% cpu and page loading stops.
IIRC page loading stops to "Waiting for ...". Also, IRC messages (via Chatzilla) don't get through.
The only way to solve the problem is to restart browser.

When I was using a build of my own (which had symbols) profiler showed that most of the time
was spent in some SSL method. Unfortunately I don't recall which method.
I'll use builds with symbols again and try to get more information.

The sites I have open are tbpl and bugzilla.mozilla.org (lots of tabs, all using https so when FF 
starts, and I switch through the tab there can be hundreds of active https requests from Gecko to Necko, I presume) and I use Chatzilla connected to freenode, irc.w3.org, and SSL version of moznet.
Comment 9 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-21 18:57:17 PST
Please, please, attach a minidump or a stacktrace or whatever you can, the next time you run into this. I have not been able to reproduce it. Thanks in advance.
Comment 10 Honza Bambas (:mayhemer) 2011-12-22 15:09:49 PST
Olli, also if you are e.g. on windows, you can use Process Explorer (just google for it, downloadable from tech net), which is a much better version of the Task Manager.

When you then click a process, you have a tab "Thread", you can see which thread eats cpu and dblclining a thread will give you a stack trace.  Works quit nice.

When you had one page in state Waiting for, did you try to open a new URL (what ever ssl or non-ssl) and did that new page had loaded?
Comment 11 Alex Keybl [:akeybl] 2011-12-27 07:58:07 PST
The uncommon use case, lack of reproducibility, and lack of duplicate reports prevents this from needing to be tracked. If any of these variables change, please re-nominate for tracking.
Comment 12 Olli Pettay [:smaug] 2012-01-07 04:13:46 PST
Created attachment 586656 [details]
screenshot

A it happened again. I had lots of tabs open, some
rather active ones, like tbpl and gmail and facebook, and
I resumed laptop and connected to internet using slow and bad-latency
network. FF UI stayed responsive, but networking basically hanged, and closing FF
didn't work. I had to kill it.
Comment 13 Olli Pettay [:smaug] 2012-01-07 04:15:35 PST
this is a bad regression. Asking tracking FF11/12 again.
Comment 14 Olli Pettay [:smaug] 2012-01-07 08:14:22 PST
Now it happened with a very fast and low latency connection too.
Comment 15 Olli Pettay [:smaug] 2012-01-07 08:15:19 PST
Comment on attachment 586656 [details]
screenshot

I do have the real log of this too, but apparently sysprof creates only non-human readable logs.
Comment 16 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-07 17:53:40 PST
*** Bug 716071 has been marked as a duplicate of this bug. ***
Comment 17 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-07 18:01:57 PST
I think the problem is that the SSL poll implementation is returning "ready to read and/or write" but its read/write methods return "would block" in the case that the it has finished the handshake but the server's certificate has not been validated yet. This seems to cause the socket transport service to spin, trying over and over again to send data on the connection without actually being able to send any data.

I believe the fix is to change the SSL socket's poll method to return "not ready for reading or writing" when it is stuck waiting for SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be called.
Comment 18 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-07 18:04:25 PST
Created attachment 586743 [details]
Profile from Andreas Gal from bug 716071
Comment 19 Honza Bambas (:mayhemer) 2012-01-08 04:35:47 PST
(In reply to Brian Smith (:bsmith) from comment #17)
> I believe the fix is to change the SSL socket's poll method to return "not
> ready for reading or writing" when it is stuck waiting for
> SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be
> called.


Something I was a bit afraid of..  And didn't catch apparently...

When the socket is in state of cert verification process, i.e. is blocking, poll implementation of the ssl socket has to return 0 (nsSSLIOLayerPoll).  That will bypass poll of the TCP (NSPR) socket layer.

However, I was comparing poll implementation with the state before SSL thread had been introduced.  Wasn't there a similar bug before SSL thread had been introduced?
Comment 20 Honza Bambas (:mayhemer) 2012-01-08 04:37:38 PST
(In reply to Honza Bambas (:mayhemer) from comment #19)
> However, I was comparing poll implementation with the state before SSL
> thread had been introduced.  Wasn't there a similar bug before SSL thread
> had been introduced?

There couldn't be!  We were not using async cert verification.  So, this is actually revealing a bug in libssl.

I think the real fix should be made in NSS.
Comment 21 Honza Bambas (:mayhemer) 2012-01-08 04:44:10 PST
Please see comment 17.  I believe the fix should be made in libssl since it is the layer that accepts WOULD_BLOCK from authCertificate callback and waits for resume with SSL_AuthCertificateComplete().
Comment 22 Alex Keybl [:akeybl] 2012-01-09 11:38:41 PST
Sounds like others are now seeing this on Aurora. Tracking for 10/11.
Comment 23 Olli Pettay [:smaug] 2012-01-13 04:25:23 PST
Today I got even the whole UI hang. Something in main thread ended up waiting for
SSL threads, which had a lock and was waiting for something else. CPU usage was 0.
Unfortunately I didn't save the stack.
Comment 24 Olli Pettay [:smaug] 2012-01-15 15:56:46 PST
Any WIP patches to test? I'd be happy try anything which could reduce the likelihood for the
networking hangs.
Comment 25 Olli Pettay [:smaug] 2012-01-17 08:20:58 PST
Adding 'hang', since this is really a hang in networking. Browser UI is working, but you can't do
anything with it.
Comment 26 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-17 12:31:37 PST
Olli, I wonder why you run into this so frequently whereas I never do.

Please try this patch:
https://bug542832.bugzilla.mozilla.org/attachment.cgi?id=589264

This patch should fix this issue and is awaiting review.
Comment 27 Olli Pettay [:smaug] 2012-01-17 12:36:30 PST
Maybe my connection has something to do with it. I use often a 3G connection in a place where
connection is occasionally very unreliable (huge ping times for some time).

Also, I do have 100+ tabs open and most of them are using https. Not all
the tabs are really active, but there are pages like gmail, tbpl (multiple), facebook, twitter, zimbra,
...
Comment 28 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-17 13:08:05 PST
Yes, that is exactly the situation where this fix would would help. Please apply the patch and let me know if you see any similar problem with the patch. (stack trace would be very helpful too.)
Comment 29 Fanolian 2012-01-19 09:58:39 PST
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Olli, also if you are e.g. on windows, you can use Process Explorer (just
> google for it, downloadable from tech net), which is a much better version
> of the Task Manager.
> 
> When you then click a process, you have a tab "Thread", you can see which
> thread eats cpu and dblclining a thread will give you a stack trace.  Works
> quit nice.
> 
> When you had one page in state Waiting for, did you try to open a new URL
> (what ever ssl or non-ssl) and did that new page had loaded?

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120119 Firefox/12.0a1

I may have run into this problem a couple of times since early December. At first I guessed it was related to bug 692260 but I never had that problem before and the workaround in bug 692260 did not work for me.

Anyway, a few times Nightly cannot load webpages suddenly for a minute or so but resumes normal afterwards without restarting. During that period if I open other sites from bookmark (SSL or non-SSL) the new pages do not load as well. However I can load them in Chrome in the mean time. I do not have many tabs opened (<10) but many of them are SSL sites. This usually happens after I restart Nightly/Firefox_Portable rapidly a few times, all with different profiles (for testing), and return to my main profile.

I have recorded 3 instances from Process Explorer during the network hang, but I recorded the one which hogs Nightly only:
1.
TID 3060
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_Poll+0x10a
nspr4.dll!PR_Poll+0x13
xul.dll!??1gfxFontGroup@@UAE@XZ+0xe36
xul.dll!??1gfxFontGroup@@UAE@XZ+0xc7c
xul.dll!?SurfaceDestroyFunc@gfxASurface@@CAXPAX@Z+0x24d7
===================================

2.
TID 596
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x16f5
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf
================================

3.
TID 2416
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_dtoa+0x9d6
nspr4.dll!PR_Unlock+0x32
nspr4.dll!PR_Poll+0x3c9
nspr4.dll!PR_Poll+0x13
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x1876
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x169e
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf
========================================

If my case is related to this bug, what should I, as a common user, record or pay attention to when I run into this next time?
Comment 30 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-21 20:35:15 PST
Fanolian. That you for the detailed info. It seems to confirm what I suspected as the cause of this bug. I think I have a patch that fixes this in bug 542832. I am just waiting for reviews.
Comment 31 Olli Pettay [:smaug] 2012-01-22 01:45:48 PST
Created attachment 590552 [details]
screenshot

Got the problem again with a build which should have the ssl fix.

I applied the patch and compile using make -f client.mk build
I assume nothing unsual is needed in this case.
Comment 32 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-23 01:56:41 PST
Created attachment 590661 [details] [diff] [review]
WIP: Part 1: Only do version intolerance timeout before certificate validation starts

I can reproduce this reliably using the next patch.

We modified libssl's poll method so that it won't return PR_POLL_READ or PR_POLL_WRITE when we are blocked only on cert validation. But, I failed to account for the special cases in PSM's poll method wrapper which can return PR_POLL_READ and/or PR_POLL_WRITE even when libssl's implementation wouldn't, because of the TLS version intolerance processing.

Fortunately, it seems pretty straightforward to improve the version intolerance processing so that it only happens *before* certificate validation; if we get to the point where we are validating the server's certificate, we KNOW the server wasn't version intolerant. Why?: we receive the certificates only after we've successfully parsed and accepted valid server hello; we will never get a valid server hello that libssl would accept for the connection in the version intolerance case, AFAICT.

Not only should this change fix this bug, but it should also cause fewer false positives regarding version intolerance; that is, with a patch like this, we would end up wrongly and unnecessarily dropping from TLS 1.0 to SSL 3.0 in fewer circumstances than before. So, a double-win.
Comment 33 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-23 01:58:31 PST
Created attachment 590662 [details] [diff] [review]
[NOT FOR CHECKIN] force cert validation to take a very long time so we can reliably reproduce this

Honza, with this patch you can connect to any SSL server and reproduce the spinning. Sometimes I had to do a reload and/or open the same site in a second tab to do so.
Comment 34 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-23 23:30:42 PST
Comment on attachment 590661 [details] [diff] [review]
WIP: Part 1: Only do version intolerance timeout before certificate validation starts

This does not account for every type of TLS intolerance. We need to be *very* careful about changing TLS intolerance checking. Will find a new solution.
Comment 35 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-24 01:00:48 PST
Created attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Don't review this too hard. I haven't even tested it yet. Really just looking for feedback on the approach, though I will take an r+ if I somehow got it right on the first (re)try.
Comment 36 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-24 01:07:07 PST
By the way, I know I could have avoided adding two more member variables, but I will need them later for performance measurements (telemetry and maybe for implementing the navigation API).
Comment 37 Honza Bambas (:mayhemer) 2012-01-24 10:02:59 PST
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +1059,1 @@
>            > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS));

I understand the idea - to subtract the time of the verification.  Sounds good, but the code is wrong for case the interval wraps. 

Having a condition "now > (mCertVerificationEnded - mCertVerificationStarted)" must make you feel something is not right ;)

You are ok with:

((now - mHandshakeStartTime) - (mCertVerificationEnded - mCertVerificationStarted))
> PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS));


In general, this looks like a minimalistic hit into the current code.  I'll go through this more deeply and give you more comments soon.
Comment 38 Honza Bambas (:mayhemer) 2012-01-24 12:31:05 PST
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

My description of the cause:
- handshake timed according PSM out (socketInfo->HandshakeTimeout() == true => poll for write | except to push above layer forward
- socket writable
- ssl_Send:ssl3_GatherCompleteHandshake: ss->ssl3.hs.restartTarget != NULL => WOULD_BLOCK since nobody called restart so far (i.e. verification is still pending)

Based on this, I can say the patch is correct with minimal impact.

I also think of resetting the time out start value when we finish cert verification, but that might have a performance negative impact - by prolonging the timeout - or even potential security sensitive impact, not that I would know of right now.  That's for a security review we don't have time for now.
Comment 39 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-25 13:08:29 PST
*** Bug 721130 has been marked as a duplicate of this bug. ***
Comment 40 Steve Fink [:sfink] [:s:] 2012-01-25 13:42:52 PST
Good, I was guessing this bug would already be known. Not that it helps now, but here's a repeating snippet from NSPR_LOG_MODULES='nsSocketTransport:5':

706737920[7f743675b6a0]: poll timeout: 65535
706737920[7f743675b6a0]:     timeout = 65535000 milliseconds
706737920[7f743675b6a0]:     ...returned after 0 milliseconds
706737920[7f743675b6a0]: nsSocketTransport::OnSocketReady [this=77bb6a0 outFlags=6]
706737920[7f743675b6a0]: nsSocketOutputStream::OnSocketReady [this=77bb7f8 cond=0]
706737920[7f743675b6a0]: nsSocketOutputStream::Write [this=77bb7f8 count=396]
706737920[7f743675b6a0]:   calling PR_Write [count=396]
706737920[7f743675b6a0]:   PR_Write returned [n=-1]
706737920[7f743675b6a0]: nsSocketOutputStream::AsyncWait [this=77bb7f8]
706737920[7f743675b6a0]: STS poll iter [1]

...which seemed odd to me, since I knew it was ending up doing a recvfrom(). This bug explains it.

Now I have to go dig up a build from before this was introduced. Any hints? Or is there a way to turn off the async cert validation/SSL thread?
Comment 41 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-25 14:09:37 PST
sfink, for some reason this bug occurs very rarely for me even using the latest Nightly. But, recently I did have it happen to me. Note that when it happens, eventually the hang will "go away" after ~1 minute and everything will work properly again.

The patch that introduced this bug is in bug 674147. So, you would need to find a build from before then.
Comment 42 Steve Fink [:sfink] [:s:] 2012-01-25 17:45:50 PST
This is 100% reproducible for me while on my home network, and I've left it for at least 23 minutes without it resolving itself. But this is with around 300 tabs, maybe a dozen pointing to google properties.

I'm a little confused by which patch is which -- would you like me to try one?

Thanks for the pointer to the bug! I'm trying to track down another major perf problem and this one was masking it. And the other one has been there since at least FF9, so I can easily back up to before bug 674147. But I don't think it's networking-related.
Comment 43 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-25 21:55:03 PST
Created attachment 591698 [details] [diff] [review]
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

Here is the patch that adds the fix for the bug to mozilla-central until the fix can land in the next NSS release. It is a slightly modified version of the patch that Wan-Teh already reviewed.
Comment 44 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-25 21:58:23 PST
(In reply to Steve Fink [:sfink] from comment #42)
> This is 100% reproducible for me while on my home network, and I've left it
> for at least 23 minutes without it resolving itself. But this is with around
> 300 tabs, maybe a dozen pointing to google properties.
> 
> I'm a little confused by which patch is which -- would you like me to try
> one?

Please try the patch that I just attached here.

It isn't surprising that we would find this problem on Google properties, because the Google servers are probably much faster than the OCSP responder.
Comment 45 Steve Fink [:sfink] [:s:] 2012-01-25 22:32:05 PST
Building now, but I thought I should mention that although it appears to be 100% reproducible (for now), it doesn't happen immediately after starting up. It takes about 3 or 4 minutes before it kicks in. I detect it by running

  watch -d 'ps -L <firefox-pid>'

and watching the 'TIME' column. It'll constantly increase for the first entry (the main thread), but none of the other threads will accumulate very much time until the problem kicks in. At that point, the 5th listed thread (which will always be the firefox pid plus 4, which probably only holds true on Linux) will start accumulating time. It makes it easy to spot. (And to verify, I'll run strace -p <problematic-tid> to see lots of recvfrom() calls.)
Comment 46 Steve Fink [:sfink] [:s:] 2012-01-26 09:23:52 PST
I ran with the patch this morning. I'm still getting the problem. It's very slightly different -- previously, it was always mainpid+4 that was the thread that spun on recvfrom(); now it is mainpid+7. Which could be a complete red herring.

It's going against the same server as yesterday. Just that thread has accumulated 55 minutes of runtime now, so I don't think it's going to resolve itself anytime soon.

Anything you'd like me to try? Unfortunately for this, I'm about to drive into work, and I doubt I'll be able to replicate it on the work network. I could try artificially slowing it down -- do you have any ideas about what network characteristics would be relevant? I could use netem to add latency, drop packets, or reorder packets, if any of those seem relevant. (Not that I'm not noticing any dropped or reordered packets from my home network.) It might involve buffer sizes or something rather than simple latency, and my netem-fu is weak.

Wow! That's a total lie! I was basing that on the 1e100.net server that I happened to get when resolving www.google.com. If I switch to the actual one I'm going against, which is pz-in-f84.1e100.net, I'm seeing *horrible* network performance -- 64% packet loss, 34ms ping times. (The first google server I tried had 0% packet loss, 13ms pings.) So I hypothesize that packet loss is the relevant trigger, and perhaps this problem shows up when one of google's servers gets messed up. Let me check what it looks like from the MV network... 0% packet loss, 24ms pings. And yes, it's still bad from my local network: 70% loss, 33-50ms rtt.

Any idea why I keep getting that server when I restore my session? That hostname does not appear in my sessionstore.js. Nor does the IP (74.125.127.84). But every time I restart the browser, it ends up going to that exact server. I hope you don't need to be coming from my public IP to reproduce this...
Comment 47 Kai Engert (:kaie) 2012-01-26 09:26:11 PST
Comment on attachment 591698 [details] [diff] [review]
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

I understand this is for early testing of bug 542832, which has not yet a final review of this patch.

r=kaie for the early testing in mozilla-central, only
Comment 48 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-26 14:46:31 PST
(In reply to Steve Fink [:sfink] from comment #46)
> I ran with the patch this morning. I'm still getting the problem. It's very
> slightly different -- previously, it was always mainpid+4 that was the
> thread that spun on recvfrom(); now it is mainpid+7. Which could be a
> complete red herring.
> 
> It's going against the same server as yesterday. Just that thread has
> accumulated 55 minutes of runtime now, so I don't think it's going to
> resolve itself anytime soon.

Sorry, you need to apply the WIP patch above AND the ssl_Poll patch above. I am going to clean the WIP patch up and check both of them into mozilla-central today.
Comment 49 Dietrich Ayala (:dietrich) 2012-01-31 01:21:01 PST
A bunch of us are seeing this bug (or at least similar characteristics) at the performance work week - possibly the hotel's dodgy connections are highlighting the problem?
Comment 51 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-31 04:30:02 PST
Created attachment 593043 [details] [diff] [review]
Part 1: Do not do TLS intolerance timeout during cert validation [v3]

[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: User will experience infrequent hangs of the browser when connecting to SSL sites
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): See Honza's comment above. This seems like the lowest-risk change that we can make to solve the problem.
String changes made by this patch: none
Comment 52 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-01-31 04:34:01 PST
Created attachment 593044 [details] [diff] [review]
Part 2: Import ssl_Poll fix from bug 542832 [v3]

Kai, you said that this should only land on mozilla-central but mozilla-aurora and mozilla-beta also REALLY need this fix. We should land this on Aurora and Beta ASAP.

[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: Users will sometimes experience long hangs when connecting to SSL sites.
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): There is a moderate amount of risk to this patch, but the alternative is to back out bug 674147 and everything that depends on it, including SPDY, which would be even more risky as some of the SPDY changes also affect non-SPDY HTTP code.
String changes made by this patch: none
Comment 54 Olli Pettay [:smaug] 2012-02-01 12:00:24 PST
Comment on attachment 590552 [details]
screenshot

Got this again today, Gecko/20120201 Firefox/13.0a1
Comment 55 Alex Keybl [:akeybl] 2012-02-02 06:48:18 PST
Comment on attachment 593043 [details] [diff] [review]
Part 1: Do not do TLS intolerance timeout during cert validation [v3]

[Triage Comment]
This has now had a couple of days to bake on m-c. Should prevent browser hangs with SSL. Given where we are in the cycle, approving for both Aurora 12 and Beta 11.
Comment 56 Olli Pettay [:smaug] 2012-02-04 06:51:34 PST
This is really horrible with Fosdem network. Has already happened several times this morning.
Comment 57 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-04 11:20:49 PST
Please try running Firefox with logging enabled and send me the log after it happens again:

    NSPR_LOG_MODULES=pipnss:5,nsSocketTransport:5,timestamp NSPR_LOG_FILE=log.txt

Also, do you have SPDY enabled? If so, after reproducing it once, please disable the SPDY support (network.http.spdy.enabled=false).

I think this may be a difference in behavior between Linux and Windows, because I cannot reproduce this on Windows 7 x64 at all after my fix landed. I will try on Linux now.
Comment 58 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-04 20:31:06 PST
Created attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Olli, do you think you have times where you are waiting 25 seconds or more to connect to a website in these situations? If so, then it might be caused by yet another problem with the handshake timeout that I think I just found. Please try this patch and report back.

If this patch does NOT solve your problem, then try the following:
1. Disable TLS 1.0 (security.enable_tls=true) and enable SSL 3.0 (security.enable_ssl3=true).
2. Disable OCSP (security.OCSP.enabled=0).

And report back the results. (But, try the above patch first).

ON timeout, the PSM poll method returns "go ahead, try to send or receive" so that the send/receive methods can return the timeout error. Except, the send/receive methods never actually check for timeout like the old SSL-thread-based ones did. Oops!

Still, I am having trouble reproducing the busy loop even without this patch, so I am not sure it is the complete solution.
Comment 59 Olli Pettay [:smaug] 2012-02-05 03:27:00 PST
(In reply to Brian Smith (:bsmith) from comment #58)
> Created attachment 594508 [details] [diff] [review]
> Actually abort the connection on send/receive if handshake timed out
> 
> Olli, do you think you have times where you are waiting 25 seconds or more
> to connect to a website in these situations?
Very possible. I've seen to problem usually when I'm using a bad 3G connection or
here @Fosdem where network connection is occasionally slow.


I'll test the patch ASAP.
Comment 60 Olli Pettay [:smaug] 2012-02-05 06:46:18 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

I've been running a build with this patch in a bad network, and
so far no problems. Yesterday I got the hang several times using a 
build without the patch.
Comment 61 Steve Fink [:sfink] [:s:] 2012-02-05 21:14:44 PST
I'm also still seeing this with the 2012-02-03 nightly, from my home DSL network that has some questionable latency to certain sites. I'm recompiling with the patch now, though I don't know if I'll have time to try it out before I go into the work network tomorrow.

Same as smaug, the symptom is a complete hang of any networking, but the UI is responsive. I left it going for about 2 full days and it never made progress on the network connections; it was still spinning on recvfrom() when I killed it.

I'm not sure if this is related, but I'm also getting a permanent hang on shutdown.

And unfortunately, the last time I killed it, it mangled my sessionstore so I can no longer restore that session. Which may make it harder to reproduce, among other things.
Comment 62 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-05 23:33:29 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Kai, the most important part of this patch is the added line:

    socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

That fixes *this* bug, but in a way that still causes TLS intolerance due to handshake timeout to be handled incorrectly. The rest of the patch is about getting TLS intolerance to work correctly in the timeout scenerio.
Comment 63 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-05 23:35:57 PST
Here is the tryserver run, still in progress:
https://tbpl.mozilla.org/?tree=Try&rev=ca57a65c69e6
Comment 64 Steve Fink [:sfink] [:s:] 2012-02-06 11:06:43 PST
My browser seems to be much happier with that patch (I managed to restore my session with mv _closedWindows/* windows/). There's another thread spinning on something else, but that appears to be intentional and not network related.
Comment 65 Olli Pettay [:smaug] 2012-02-06 11:12:09 PST
I haven't got any networking hangs after I started to use the latest patch.
Comment 66 Honza Bambas (:mayhemer) 2012-02-06 11:32:14 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Review of attachment 594508 [details] [diff] [review]:
-----------------------------------------------------------------

Good, but I would be interested in what is the code path that causes the loop w/o this patch, for reference.  Brian, can you please comment on the bug with it?  Thanks.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +2121,4 @@
>      NS_ASSERTION(in_flags & PR_POLL_EXCEPT,
>                   "caller did not poll for EXCEPT (handshake timeout)");
>      *out_flags = in_flags | PR_POLL_EXCEPT;
> +    socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

In genenral, RESET_ERROR can only be used when we haven't sent any application data.  If this happens during rehandshake we may duplicate send of a POST, for instance.  PR_CONNECT_ABORTED_ERROR should then be used.  Doesn't necessarily need to be handled in this bug.
Comment 67 Olli Pettay [:smaug] 2012-02-11 03:58:06 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Approval Request Comment]
Regression caused by (bug #): ?
User impact if declined: FF networking hangs
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky):
String changes made by this patch:
Comment 68 Kai Engert (:kaie) 2012-02-11 04:21:52 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Thanks everyone for your work and testing on this bug.

Regarding Honza's warning related to RESET_ERROR,
is it correct that we only ever allow a handshake-timeout at the very beginning of a new socket?
In other words, we never run through the handshake-timeout code whenever a socket might perform a SSL renegotiation (I believe that would be handled transparently inside libSSL at the NSS level). Under this assumption, if we're still at the initial handshake, then we haven't yet transfered any application data, and setting RESET_ERROR should be fine.

Your calling of checkHandshake seems reasonable from what I remember.

Given that you have thorougly tested this patch and have gotten positive feedback for a very bad regression, I'm fine with this patch.

r=kaie


This is also a reminder that we urgently need to get bug 542832 sorted and get the newer NSS landed into mozilla-beta.
Comment 69 Olli Pettay [:smaug] 2012-02-13 09:08:41 PST
Any chance to get this landed? I just accidentally run a build without the fix and got the
hang immediately when I switched to a bad network.
Comment 70 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-13 13:01:54 PST
(In reply to Honza Bambas (:mayhemer) from comment #66)
> Good, but I would be interested in what is the code path that causes the
> loop w/o this patch, for reference.  Brian, can you please comment on the
> bug with it?  Thanks.

nsSSLIOLayerPoll hits the HandshakeTimeout condition, and tells the application the socket is readable/writable. The application tries to PR_Write, but the PR_Write ends up doing a read on the network socket, because we are waiting for one of the server's hello messages. But, we don't have that message yet, so that read returns "would block" and so PR_Write returns "would block." Then, the application calls poll again, which again hits the HandshakeTimeout condition, and it all repeats.

> In genenral, RESET_ERROR can only be used when we haven't sent any
> application data.

To expand on what Kai said, we should never be timing out after we've sent application data; there is no positive gain in doing TLS intolerance timeouts during a renegotiation.
Comment 71 Marco Bonardo [::mak] (Away 6-20 Aug) 2012-02-14 02:39:57 PST
https://hg.mozilla.org/mozilla-central/rev/fbf93932695e

can this be closed?
Comment 72 Alex Keybl [:akeybl] 2012-02-14 11:21:33 PST
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Triage Comment]
Prevents hangs caused by recent changes in SSL handling. Approved for Aurora 12 and Beta 11.
Comment 73 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-17 14:33:09 PST
https://hg.mozilla.org/releases/mozilla-aurora/rev/25a023e0faf7

https://hg.mozilla.org/releases/mozilla-beta/rev/871a1595b9c2
https://hg.mozilla.org/releases/mozilla-beta/rev/592ef428da70
https://hg.mozilla.org/releases/mozilla-beta/rev/8dfa3f56e4b4

(Note: mozilla-aurora only needed Part 3, because Part 1 and Part 2 landed on mozilla-central before Aurora branched.)
Comment 74 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-02-17 14:34:13 PST
(In reply to Marco Bonardo [:mak] from comment #71)
> https://hg.mozilla.org/mozilla-central/rev/fbf93932695e
> 
> can this be closed?

Yes.
Comment 75 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-02-22 11:46:32 PST
What is the STR for this bug so QA can verify it?
Comment 76 Honza Bambas (:mayhemer) 2012-02-22 12:27:40 PST
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #75)
> What is the STR for this bug so QA can verify it?

I think it has been verified in comment 65.
Comment 77 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-02-22 12:30:03 PST
Hmm...I'm not sure that makes this confirmed verified on Firefox 11 and 12. I'm a little unclear what, if anything QA can do to verify this is fixed across all patched versions.
Comment 78 Phil Ringnalda (:philor, back in August) 2012-02-22 12:44:12 PST
It probably hasn't been verified by anyone on m-a or m-b, though.

STR:

1. Grab a MiFi (or tethering to a phone would probably work).
2. Head out into the country, you want to get down to just a couple of bars of reception, one would be better. Take your pre-patch and post-patch builds along, you don't want to be downloading them over that connection.
3. Load tbpl.mozilla.org. Wouldn't hurt to open several tabs of tbpl to different busy trees, say mozilla-inbound and try and profiling and maple.
4. Watch the CPU, or just listen for the fan, since you're out in the country and it's quiet so you'll hear it spin up. Once it does, verify that no networking is working, that you can open a tab and try to load some unrelated site, but it won't actually load until you kill the tbpl tabs. Then comes the fun part, doing it again with the post-patch builds, and just sitting around watching the clouds go by and listening to the birds, while waiting to not hear the fan spin up. You're welcome :)
Comment 79 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-02-22 22:30:25 PST
Given those steps, QA will be hard-pressed to verify this fix. I think it would be far more reasonable if Olly could verify this on Firefox 11 and 12 as was done for Firefox 13, presumably, in comment 65.

Note You need to log in before you can comment on or make changes to this bug.