Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish

RESOLVED FIXED in Firefox 11

Status

()

Core
Security: PSM
P1
blocker
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: smaug, Assigned: briansmith)

Tracking

({hang, perf, regression})

11 Branch
mozilla12
hang, perf, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox11+ fixed, firefox12+ fixed)

Details

(Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn])

Attachments

(7 attachments, 3 obsolete attachments)

153.55 KB, image/png
Details
59.32 KB, text/plain
Details
120.92 KB, image/png
Details
1.08 KB, patch
Details | Diff | Splinter Review
5.52 KB, patch
briansmith
: review+
Details | Diff | Splinter Review
4.17 KB, patch
briansmith
: review+
Details | Diff | Splinter Review
8.52 KB, patch
kaie
: review+
smaug
: feedback+
mayhemer
: feedback+
Details | Diff | Splinter Review
(Reporter)

Description

6 years ago
I think there is some recent regression, since I haven't seen the behavior in FF9 builds:
When I start Firefox with lots of tabs (including few tbpl, and ~100 bugzilla pages, all https) Firefox uses over 100% cpu time and loads pages slowly.

When I got the problem last time few days ago, Zoom profiler showed most of the time
was spent doing something SSL related. This time my Zoom evaluation license was unfortunately expired.

Has there been any changes to how we handle SSL?
(In reply to Olli Pettay [:smaug] from comment #0)

> 
> Has there been any changes to how we handle SSL?

smaug, big changes in nightly, yes.

Can you supply about:buildconfig changeset id's for either of the instances? (probably just the recent one I guess).

Comment 2

6 years ago
Perhaps this is caused by processing large numbers of SSL connects at the same time? Previously, as I understand it, we could only do one at a time.
Could that be bug 705755?
Keywords: perf
(Reporter)

Comment 4

6 years ago
I think I got this twice yesterday
(Reporter)

Updated

6 years ago
Severity: major → blocker
tracking-firefox11: --- → ?
tracking-firefox12: --- → ?

Updated

6 years ago
Assignee: nobody → honzab.moz
(In reply to Olli Pettay [:smaug] from comment #4)
> I think I got this twice yesterday

Can you try patch in bug 705755?
Also as Josh suggested, this could be related to bug 695789 (no patch so far).
Olli, at least some more detailed data would be appreciated like sites your were browsing, what exact version did you have, retry with a clean profile etc.

Thanks.
(Reporter)

Comment 8

6 years ago
Unfortunately I don't have much information to share. I use Nightlies and occasionally
one thread, which is not the main thread, starts to take 100% cpu and page loading stops.
IIRC page loading stops to "Waiting for ...". Also, IRC messages (via Chatzilla) don't get through.
The only way to solve the problem is to restart browser.

When I was using a build of my own (which had symbols) profiler showed that most of the time
was spent in some SSL method. Unfortunately I don't recall which method.
I'll use builds with symbols again and try to get more information.

The sites I have open are tbpl and bugzilla.mozilla.org (lots of tabs, all using https so when FF 
starts, and I switch through the tab there can be hundreds of active https requests from Gecko to Necko, I presume) and I use Chatzilla connected to freenode, irc.w3.org, and SSL version of moznet.
Please, please, attach a minidump or a stacktrace or whatever you can, the next time you run into this. I have not been able to reproduce it. Thanks in advance.
Olli, also if you are e.g. on windows, you can use Process Explorer (just google for it, downloadable from tech net), which is a much better version of the Task Manager.

When you then click a process, you have a tab "Thread", you can see which thread eats cpu and dblclining a thread will give you a stack trace.  Works quit nice.

When you had one page in state Waiting for, did you try to open a new URL (what ever ssl or non-ssl) and did that new page had loaded?
The uncommon use case, lack of reproducibility, and lack of duplicate reports prevents this from needing to be tracked. If any of these variables change, please re-nominate for tracking.
tracking-firefox11: ? → -
tracking-firefox12: ? → -
(Reporter)

Comment 12

6 years ago
Created attachment 586656 [details]
screenshot

A it happened again. I had lots of tabs open, some
rather active ones, like tbpl and gmail and facebook, and
I resumed laptop and connected to internet using slow and bad-latency
network. FF UI stayed responsive, but networking basically hanged, and closing FF
didn't work. I had to kill it.
(Reporter)

Comment 13

6 years ago
this is a bad regression. Asking tracking FF11/12 again.
tracking-firefox11: - → ?
(Reporter)

Comment 14

6 years ago
Now it happened with a very fast and low latency connection too.
(Reporter)

Comment 15

6 years ago
Comment on attachment 586656 [details]
screenshot

I do have the real log of this too, but apparently sysprof creates only non-human readable logs.
Duplicate of this bug: 716071
I think the problem is that the SSL poll implementation is returning "ready to read and/or write" but its read/write methods return "would block" in the case that the it has finished the handshake but the server's certificate has not been validated yet. This seems to cause the socket transport service to spin, trying over and over again to send data on the connection without actually being able to send any data.

I believe the fix is to change the SSL socket's poll method to return "not ready for reading or writing" when it is stuck waiting for SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be called.
Assignee: honzab.moz → nobody
Blocks: 674147, 542832
Severity: blocker → major
Component: Networking → Security: PSM
QA Contact: networking → psm
Hardware: x86_64 → All
Summary: Something in networking and/or SSL layer takes lots of processing power → Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish
Version: unspecified → 11 Branch
Assignee: nobody → bsmith
Created attachment 586743 [details]
Profile from Andreas Gal from bug 716071
(In reply to Brian Smith (:bsmith) from comment #17)
> I believe the fix is to change the SSL socket's poll method to return "not
> ready for reading or writing" when it is stuck waiting for
> SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be
> called.


Something I was a bit afraid of..  And didn't catch apparently...

When the socket is in state of cert verification process, i.e. is blocking, poll implementation of the ssl socket has to return 0 (nsSSLIOLayerPoll).  That will bypass poll of the TCP (NSPR) socket layer.

However, I was comparing poll implementation with the state before SSL thread had been introduced.  Wasn't there a similar bug before SSL thread had been introduced?
(In reply to Honza Bambas (:mayhemer) from comment #19)
> However, I was comparing poll implementation with the state before SSL
> thread had been introduced.  Wasn't there a similar bug before SSL thread
> had been introduced?

There couldn't be!  We were not using async cert verification.  So, this is actually revealing a bug in libssl.

I think the real fix should be made in NSS.
Please see comment 17.  I believe the fix should be made in libssl since it is the layer that accepts WOULD_BLOCK from authCertificate callback and waits for resume with SSL_AuthCertificateComplete().
Sounds like others are now seeing this on Aurora. Tracking for 10/11.
tracking-firefox11: ? → +
tracking-firefox12: - → +

Updated

6 years ago
Keywords: regression
Whiteboard: [psm-roadblock]
(Reporter)

Comment 23

6 years ago
Today I got even the whole UI hang. Something in main thread ended up waiting for
SSL threads, which had a lock and was waiting for something else. CPU usage was 0.
Unfortunately I didn't save the stack.
(Reporter)

Comment 24

6 years ago
Any WIP patches to test? I'd be happy try anything which could reduce the likelihood for the
networking hangs.
(Reporter)

Comment 25

6 years ago
Adding 'hang', since this is really a hang in networking. Browser UI is working, but you can't do
anything with it.
Severity: major → blocker
Keywords: hang
Olli, I wonder why you run into this so frequently whereas I never do.

Please try this patch:
https://bug542832.bugzilla.mozilla.org/attachment.cgi?id=589264

This patch should fix this issue and is awaiting review.
(Reporter)

Comment 27

6 years ago
Maybe my connection has something to do with it. I use often a 3G connection in a place where
connection is occasionally very unreliable (huge ping times for some time).

Also, I do have 100+ tabs open and most of them are using https. Not all
the tabs are really active, but there are pages like gmail, tbpl (multiple), facebook, twitter, zimbra,
...
Yes, that is exactly the situation where this fix would would help. Please apply the patch and let me know if you see any similar problem with the patch. (stack trace would be very helpful too.)

Comment 29

6 years ago
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Olli, also if you are e.g. on windows, you can use Process Explorer (just
> google for it, downloadable from tech net), which is a much better version
> of the Task Manager.
> 
> When you then click a process, you have a tab "Thread", you can see which
> thread eats cpu and dblclining a thread will give you a stack trace.  Works
> quit nice.
> 
> When you had one page in state Waiting for, did you try to open a new URL
> (what ever ssl or non-ssl) and did that new page had loaded?

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120119 Firefox/12.0a1

I may have run into this problem a couple of times since early December. At first I guessed it was related to bug 692260 but I never had that problem before and the workaround in bug 692260 did not work for me.

Anyway, a few times Nightly cannot load webpages suddenly for a minute or so but resumes normal afterwards without restarting. During that period if I open other sites from bookmark (SSL or non-SSL) the new pages do not load as well. However I can load them in Chrome in the mean time. I do not have many tabs opened (<10) but many of them are SSL sites. This usually happens after I restart Nightly/Firefox_Portable rapidly a few times, all with different profiles (for testing), and return to my main profile.

I have recorded 3 instances from Process Explorer during the network hang, but I recorded the one which hogs Nightly only:
1.
TID 3060
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_Poll+0x10a
nspr4.dll!PR_Poll+0x13
xul.dll!??1gfxFontGroup@@UAE@XZ+0xe36
xul.dll!??1gfxFontGroup@@UAE@XZ+0xc7c
xul.dll!?SurfaceDestroyFunc@gfxASurface@@CAXPAX@Z+0x24d7
===================================

2.
TID 596
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x16f5
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf
================================

3.
TID 2416
Start Address: MSVCR80.dll!_endthreadex+0x61

ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_dtoa+0x9d6
nspr4.dll!PR_Unlock+0x32
nspr4.dll!PR_Poll+0x3c9
nspr4.dll!PR_Poll+0x13
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x1876
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x169e
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf
========================================

If my case is related to this bug, what should I, as a common user, record or pay attention to when I run into this next time?
Whiteboard: [psm-roadblock] → [psm-roadblock][patch in attachment 589264 in bug 542832]
Fanolian. That you for the detailed info. It seems to confirm what I suspected as the cause of this bug. I think I have a patch that fixes this in bug 542832. I am just waiting for reviews.
(Reporter)

Comment 31

6 years ago
Created attachment 590552 [details]
screenshot

Got the problem again with a build which should have the ssl fix.

I applied the patch and compile using make -f client.mk build
I assume nothing unsual is needed in this case.
Attachment #590552 - Attachment is patch: false
Attachment #590552 - Attachment mime type: text/plain → image/png
Created attachment 590661 [details] [diff] [review]
WIP: Part 1: Only do version intolerance timeout before certificate validation starts

I can reproduce this reliably using the next patch.

We modified libssl's poll method so that it won't return PR_POLL_READ or PR_POLL_WRITE when we are blocked only on cert validation. But, I failed to account for the special cases in PSM's poll method wrapper which can return PR_POLL_READ and/or PR_POLL_WRITE even when libssl's implementation wouldn't, because of the TLS version intolerance processing.

Fortunately, it seems pretty straightforward to improve the version intolerance processing so that it only happens *before* certificate validation; if we get to the point where we are validating the server's certificate, we KNOW the server wasn't version intolerant. Why?: we receive the certificates only after we've successfully parsed and accepted valid server hello; we will never get a valid server hello that libssl would accept for the connection in the version intolerance case, AFAICT.

Not only should this change fix this bug, but it should also cause fewer false positives regarding version intolerance; that is, with a patch like this, we would end up wrongly and unnecessarily dropping from TLS 1.0 to SSL 3.0 in fewer circumstances than before. So, a double-win.
Attachment #590661 - Flags: feedback?(honzab.moz)
Created attachment 590662 [details] [diff] [review]
[NOT FOR CHECKIN] force cert validation to take a very long time so we can reliably reproduce this

Honza, with this patch you can connect to any SSL server and reproduce the spinning. Sometimes I had to do a reload and/or open the same site in a second tab to do so.
Comment on attachment 590661 [details] [diff] [review]
WIP: Part 1: Only do version intolerance timeout before certificate validation starts

This does not account for every type of TLS intolerance. We need to be *very* careful about changing TLS intolerance checking. Will find a new solution.
Attachment #590661 - Flags: feedback?(honzab.moz) → feedback-
Created attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Don't review this too hard. I haven't even tested it yet. Really just looking for feedback on the approach, though I will take an r+ if I somehow got it right on the first (re)try.
Attachment #590661 - Attachment is obsolete: true
Attachment #591018 - Flags: review?(honzab.moz)
By the way, I know I could have avoided adding two more member variables, but I will need them later for performance measurements (telemetry and maybe for implementing the navigation API).
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +1059,1 @@
>            > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS));

I understand the idea - to subtract the time of the verification.  Sounds good, but the code is wrong for case the interval wraps. 

Having a condition "now > (mCertVerificationEnded - mCertVerificationStarted)" must make you feel something is not right ;)

You are ok with:

((now - mHandshakeStartTime) - (mCertVerificationEnded - mCertVerificationStarted))
> PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS));


In general, this looks like a minimalistic hit into the current code.  I'll go through this more deeply and give you more comments soon.
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

My description of the cause:
- handshake timed according PSM out (socketInfo->HandshakeTimeout() == true => poll for write | except to push above layer forward
- socket writable
- ssl_Send:ssl3_GatherCompleteHandshake: ss->ssl3.hs.restartTarget != NULL => WOULD_BLOCK since nobody called restart so far (i.e. verification is still pending)

Based on this, I can say the patch is correct with minimal impact.

I also think of resetting the time out start value when we finish cert verification, but that might have a performance negative impact - by prolonging the timeout - or even potential security sensitive impact, not that I would know of right now.  That's for a security review we don't have time for now.
Attachment #591018 - Flags: review?(honzab.moz) → review+
Duplicate of this bug: 721130
Good, I was guessing this bug would already be known. Not that it helps now, but here's a repeating snippet from NSPR_LOG_MODULES='nsSocketTransport:5':

706737920[7f743675b6a0]: poll timeout: 65535
706737920[7f743675b6a0]:     timeout = 65535000 milliseconds
706737920[7f743675b6a0]:     ...returned after 0 milliseconds
706737920[7f743675b6a0]: nsSocketTransport::OnSocketReady [this=77bb6a0 outFlags=6]
706737920[7f743675b6a0]: nsSocketOutputStream::OnSocketReady [this=77bb7f8 cond=0]
706737920[7f743675b6a0]: nsSocketOutputStream::Write [this=77bb7f8 count=396]
706737920[7f743675b6a0]:   calling PR_Write [count=396]
706737920[7f743675b6a0]:   PR_Write returned [n=-1]
706737920[7f743675b6a0]: nsSocketOutputStream::AsyncWait [this=77bb7f8]
706737920[7f743675b6a0]: STS poll iter [1]

...which seemed odd to me, since I knew it was ending up doing a recvfrom(). This bug explains it.

Now I have to go dig up a build from before this was introduced. Any hints? Or is there a way to turn off the async cert validation/SSL thread?
sfink, for some reason this bug occurs very rarely for me even using the latest Nightly. But, recently I did have it happen to me. Note that when it happens, eventually the hang will "go away" after ~1 minute and everything will work properly again.

The patch that introduced this bug is in bug 674147. So, you would need to find a build from before then.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
This is 100% reproducible for me while on my home network, and I've left it for at least 23 minutes without it resolving itself. But this is with around 300 tabs, maybe a dozen pointing to google properties.

I'm a little confused by which patch is which -- would you like me to try one?

Thanks for the pointer to the bug! I'm trying to track down another major perf problem and this one was masking it. And the other one has been there since at least FF9, so I can easily back up to before bug 674147. But I don't think it's networking-related.
Created attachment 591698 [details] [diff] [review]
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

Here is the patch that adds the fix for the bug to mozilla-central until the fix can land in the next NSS release. It is a slightly modified version of the patch that Wan-Teh already reviewed.
Attachment #591698 - Flags: review?(kaie)
(In reply to Steve Fink [:sfink] from comment #42)
> This is 100% reproducible for me while on my home network, and I've left it
> for at least 23 minutes without it resolving itself. But this is with around
> 300 tabs, maybe a dozen pointing to google properties.
> 
> I'm a little confused by which patch is which -- would you like me to try
> one?

Please try the patch that I just attached here.

It isn't surprising that we would find this problem on Google properties, because the Google servers are probably much faster than the OCSP responder.
Building now, but I thought I should mention that although it appears to be 100% reproducible (for now), it doesn't happen immediately after starting up. It takes about 3 or 4 minutes before it kicks in. I detect it by running

  watch -d 'ps -L <firefox-pid>'

and watching the 'TIME' column. It'll constantly increase for the first entry (the main thread), but none of the other threads will accumulate very much time until the problem kicks in. At that point, the 5th listed thread (which will always be the firefox pid plus 4, which probably only holds true on Linux) will start accumulating time. It makes it easy to spot. (And to verify, I'll run strace -p <problematic-tid> to see lots of recvfrom() calls.)
I ran with the patch this morning. I'm still getting the problem. It's very slightly different -- previously, it was always mainpid+4 that was the thread that spun on recvfrom(); now it is mainpid+7. Which could be a complete red herring.

It's going against the same server as yesterday. Just that thread has accumulated 55 minutes of runtime now, so I don't think it's going to resolve itself anytime soon.

Anything you'd like me to try? Unfortunately for this, I'm about to drive into work, and I doubt I'll be able to replicate it on the work network. I could try artificially slowing it down -- do you have any ideas about what network characteristics would be relevant? I could use netem to add latency, drop packets, or reorder packets, if any of those seem relevant. (Not that I'm not noticing any dropped or reordered packets from my home network.) It might involve buffer sizes or something rather than simple latency, and my netem-fu is weak.

Wow! That's a total lie! I was basing that on the 1e100.net server that I happened to get when resolving www.google.com. If I switch to the actual one I'm going against, which is pz-in-f84.1e100.net, I'm seeing *horrible* network performance -- 64% packet loss, 34ms ping times. (The first google server I tried had 0% packet loss, 13ms pings.) So I hypothesize that packet loss is the relevant trigger, and perhaps this problem shows up when one of google's servers gets messed up. Let me check what it looks like from the MV network... 0% packet loss, 24ms pings. And yes, it's still bad from my local network: 70% loss, 33-50ms rtt.

Any idea why I keep getting that server when I restore my session? That hostname does not appear in my sessionstore.js. Nor does the IP (74.125.127.84). But every time I restart the browser, it ends up going to that exact server. I hope you don't need to be coming from my public IP to reproduce this...

Comment 47

6 years ago
Comment on attachment 591698 [details] [diff] [review]
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

I understand this is for early testing of bug 542832, which has not yet a final review of this patch.

r=kaie for the early testing in mozilla-central, only
Attachment #591698 - Flags: review?(kaie) → review+

Updated

6 years ago
Attachment #591698 - Attachment description: Import ssl_Poll fix from bug 542832 → Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]
(In reply to Steve Fink [:sfink] from comment #46)
> I ran with the patch this morning. I'm still getting the problem. It's very
> slightly different -- previously, it was always mainpid+4 that was the
> thread that spun on recvfrom(); now it is mainpid+7. Which could be a
> complete red herring.
> 
> It's going against the same server as yesterday. Just that thread has
> accumulated 55 minutes of runtime now, so I don't think it's going to
> resolve itself anytime soon.

Sorry, you need to apply the WIP patch above AND the ssl_Poll patch above. I am going to clean the WIP patch up and check both of them into mozilla-central today.
A bunch of us are seeing this bug (or at least similar characteristics) at the performance work week - possibly the hotel's dodgy connections are highlighting the problem?
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc9db5c75379
https://hg.mozilla.org/integration/mozilla-inbound/rev/b26ac73e4554
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
Created attachment 593043 [details] [diff] [review]
Part 1: Do not do TLS intolerance timeout during cert validation [v3]

[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: User will experience infrequent hangs of the browser when connecting to SSL sites
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): See Honza's comment above. This seems like the lowest-risk change that we can make to solve the problem.
String changes made by this patch: none
Attachment #591018 - Attachment is obsolete: true
Attachment #593043 - Flags: review+
Attachment #593043 - Flags: approval-mozilla-beta?
Attachment #593043 - Flags: approval-mozilla-aurora?
Created attachment 593044 [details] [diff] [review]
Part 2: Import ssl_Poll fix from bug 542832 [v3]

Kai, you said that this should only land on mozilla-central but mozilla-aurora and mozilla-beta also REALLY need this fix. We should land this on Aurora and Beta ASAP.

[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: Users will sometimes experience long hangs when connecting to SSL sites.
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky): There is a moderate amount of risk to this patch, but the alternative is to back out bug 674147 and everything that depends on it, including SPDY, which would be even more risky as some of the SPDY changes also affect non-SPDY HTTP code.
String changes made by this patch: none
Attachment #591698 - Attachment is obsolete: true
Attachment #593044 - Flags: review+
Attachment #593044 - Flags: approval-mozilla-beta?
Attachment #593044 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/fc9db5c75379
https://hg.mozilla.org/mozilla-central/rev/b26ac73e4554
Status: NEW → RESOLVED
Last Resolved: 6 years ago
status-firefox12: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
(Reporter)

Comment 54

6 years ago
Comment on attachment 590552 [details]
screenshot

Got this again today, Gecko/20120201 Firefox/13.0a1
(Reporter)

Updated

6 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P1
Whiteboard: [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
Comment on attachment 593043 [details] [diff] [review]
Part 1: Do not do TLS intolerance timeout during cert validation [v3]

[Triage Comment]
This has now had a couple of days to bake on m-c. Should prevent browser hangs with SSL. Given where we are in the cycle, approving for both Aurora 12 and Beta 11.
Attachment #593043 - Flags: approval-mozilla-beta?
Attachment #593043 - Flags: approval-mozilla-beta+
Attachment #593043 - Flags: approval-mozilla-aurora?
Attachment #593043 - Flags: approval-mozilla-aurora+

Updated

6 years ago
Attachment #593044 - Flags: approval-mozilla-beta?
Attachment #593044 - Flags: approval-mozilla-beta+
Attachment #593044 - Flags: approval-mozilla-aurora?
Attachment #593044 - Flags: approval-mozilla-aurora+
(Reporter)

Comment 56

6 years ago
This is really horrible with Fosdem network. Has already happened several times this morning.
Please try running Firefox with logging enabled and send me the log after it happens again:

    NSPR_LOG_MODULES=pipnss:5,nsSocketTransport:5,timestamp NSPR_LOG_FILE=log.txt

Also, do you have SPDY enabled? If so, after reproducing it once, please disable the SPDY support (network.http.spdy.enabled=false).

I think this may be a difference in behavior between Linux and Windows, because I cannot reproduce this on Windows 7 x64 at all after my fix landed. I will try on Linux now.
Created attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Olli, do you think you have times where you are waiting 25 seconds or more to connect to a website in these situations? If so, then it might be caused by yet another problem with the handshake timeout that I think I just found. Please try this patch and report back.

If this patch does NOT solve your problem, then try the following:
1. Disable TLS 1.0 (security.enable_tls=true) and enable SSL 3.0 (security.enable_ssl3=true).
2. Disable OCSP (security.OCSP.enabled=0).

And report back the results. (But, try the above patch first).

ON timeout, the PSM poll method returns "go ahead, try to send or receive" so that the send/receive methods can return the timeout error. Except, the send/receive methods never actually check for timeout like the old SSL-thread-based ones did. Oops!

Still, I am having trouble reproducing the busy loop even without this patch, so I am not sure it is the complete solution.
Attachment #594508 - Flags: feedback?(honzab.moz)
Attachment #594508 - Flags: feedback?(bugs)
(Reporter)

Comment 59

6 years ago
(In reply to Brian Smith (:bsmith) from comment #58)
> Created attachment 594508 [details] [diff] [review]
> Actually abort the connection on send/receive if handshake timed out
> 
> Olli, do you think you have times where you are waiting 25 seconds or more
> to connect to a website in these situations?
Very possible. I've seen to problem usually when I'm using a bad 3G connection or
here @Fosdem where network connection is occasionally slow.


I'll test the patch ASAP.
(Reporter)

Comment 60

6 years ago
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

I've been running a build with this patch in a bad network, and
so far no problems. Yesterday I got the hang several times using a 
build without the patch.
Attachment #594508 - Flags: feedback?(bugs) → feedback+
I'm also still seeing this with the 2012-02-03 nightly, from my home DSL network that has some questionable latency to certain sites. I'm recompiling with the patch now, though I don't know if I'll have time to try it out before I go into the work network tomorrow.

Same as smaug, the symptom is a complete hang of any networking, but the UI is responsive. I left it going for about 2 full days and it never made progress on the network connections; it was still spinning on recvfrom() when I killed it.

I'm not sure if this is related, but I'm also getting a permanent hang on shutdown.

And unfortunately, the last time I killed it, it mangled my sessionstore so I can no longer restore that session. Which may make it harder to reproduce, among other things.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Kai, the most important part of this patch is the added line:

    socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

That fixes *this* bug, but in a way that still causes TLS intolerance due to handshake timeout to be handled incorrectly. The rest of the patch is about getting TLS intolerance to work correctly in the timeout scenerio.
Attachment #594508 - Flags: review?(kaie)
Here is the tryserver run, still in progress:
https://tbpl.mozilla.org/?tree=Try&rev=ca57a65c69e6
My browser seems to be much happier with that patch (I managed to restore my session with mv _closedWindows/* windows/). There's another thread spinning on something else, but that appears to be intentional and not network related.
(Reporter)

Comment 65

6 years ago
I haven't got any networking hangs after I started to use the latest patch.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Review of attachment 594508 [details] [diff] [review]:
-----------------------------------------------------------------

Good, but I would be interested in what is the code path that causes the loop w/o this patch, for reference.  Brian, can you please comment on the bug with it?  Thanks.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +2121,4 @@
>      NS_ASSERTION(in_flags & PR_POLL_EXCEPT,
>                   "caller did not poll for EXCEPT (handshake timeout)");
>      *out_flags = in_flags | PR_POLL_EXCEPT;
> +    socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

In genenral, RESET_ERROR can only be used when we haven't sent any application data.  If this happens during rehandshake we may duplicate send of a POST, for instance.  PR_CONNECT_ABORTED_ERROR should then be used.  Doesn't necessarily need to be handled in this bug.
Attachment #594508 - Flags: feedback?(honzab.moz) → feedback+
(Reporter)

Comment 67

6 years ago
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Approval Request Comment]
Regression caused by (bug #): ?
User impact if declined: FF networking hangs
Testing completed (on m-c, etc.): 
Risk to taking this patch (and alternatives if risky):
String changes made by this patch:
Attachment #594508 - Flags: approval-mozilla-beta?
Attachment #594508 - Flags: approval-mozilla-aurora?

Comment 68

6 years ago
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Thanks everyone for your work and testing on this bug.

Regarding Honza's warning related to RESET_ERROR,
is it correct that we only ever allow a handshake-timeout at the very beginning of a new socket?
In other words, we never run through the handshake-timeout code whenever a socket might perform a SSL renegotiation (I believe that would be handled transparently inside libSSL at the NSS level). Under this assumption, if we're still at the initial handshake, then we haven't yet transfered any application data, and setting RESET_ERROR should be fine.

Your calling of checkHandshake seems reasonable from what I remember.

Given that you have thorougly tested this patch and have gotten positive feedback for a very bad regression, I'm fine with this patch.

r=kaie


This is also a reminder that we urgently need to get bug 542832 sorted and get the newer NSS landed into mozilla-beta.
Attachment #594508 - Flags: review?(kaie) → review+
(Reporter)

Comment 69

6 years ago
Any chance to get this landed? I just accidentally run a build without the fix and got the
hang immediately when I switched to a bad network.
(In reply to Honza Bambas (:mayhemer) from comment #66)
> Good, but I would be interested in what is the code path that causes the
> loop w/o this patch, for reference.  Brian, can you please comment on the
> bug with it?  Thanks.

nsSSLIOLayerPoll hits the HandshakeTimeout condition, and tells the application the socket is readable/writable. The application tries to PR_Write, but the PR_Write ends up doing a read on the network socket, because we are waiting for one of the server's hello messages. But, we don't have that message yet, so that read returns "would block" and so PR_Write returns "would block." Then, the application calls poll again, which again hits the HandshakeTimeout condition, and it all repeats.

> In genenral, RESET_ERROR can only be used when we haven't sent any
> application data.

To expand on what Kai said, we should never be timing out after we've sent application data; there is no positive gain in doing TLS intolerance timeouts during a renegotiation.
https://hg.mozilla.org/mozilla-central/rev/fbf93932695e

can this be closed?
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Triage Comment]
Prevents hangs caused by recent changes in SSL handling. Approved for Aurora 12 and Beta 11.
Attachment #594508 - Flags: approval-mozilla-beta?
Attachment #594508 - Flags: approval-mozilla-beta+
Attachment #594508 - Flags: approval-mozilla-aurora?
Attachment #594508 - Flags: approval-mozilla-aurora+

Updated

6 years ago
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1]
https://hg.mozilla.org/releases/mozilla-aurora/rev/25a023e0faf7

https://hg.mozilla.org/releases/mozilla-beta/rev/871a1595b9c2
https://hg.mozilla.org/releases/mozilla-beta/rev/592ef428da70
https://hg.mozilla.org/releases/mozilla-beta/rev/8dfa3f56e4b4

(Note: mozilla-aurora only needed Part 3, because Part 1 and Part 2 landed on mozilla-central before Aurora branched.)
status-firefox11: --- → fixed
(In reply to Marco Bonardo [:mak] from comment #71)
> https://hg.mozilla.org/mozilla-central/rev/fbf93932695e
> 
> can this be closed?

Yes.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → FIXED
What is the STR for this bug so QA can verify it?
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?]
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #75)
> What is the STR for this bug so QA can verify it?

I think it has been verified in comment 65.
Hmm...I'm not sure that makes this confirmed verified on Firefox 11 and 12. I'm a little unclear what, if anything QA can do to verify this is fixed across all patched versions.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-]
It probably hasn't been verified by anyone on m-a or m-b, though.

STR:

1. Grab a MiFi (or tethering to a phone would probably work).
2. Head out into the country, you want to get down to just a couple of bars of reception, one would be better. Take your pre-patch and post-patch builds along, you don't want to be downloading them over that connection.
3. Load tbpl.mozilla.org. Wouldn't hurt to open several tabs of tbpl to different busy trees, say mozilla-inbound and try and profiling and maple.
4. Watch the CPU, or just listen for the fan, since you're out in the country and it's quiet so you'll hear it spin up. Once it does, verify that no networking is working, that you can open a tab and try to load some unrelated site, but it won't actually load until you kill the tbpl tabs. Then comes the fun part, doing it again with the post-patch builds, and just sitting around watching the clouds go by and listening to the birds, while waiting to not hear the fan spin up. You're welcome :)
Given those steps, QA will be hard-pressed to verify this fix. I think it would be far more reasonable if Olly could verify this on Firefox 11 and 12 as was done for Firefox 13, presumably, in comment 65.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn]
You need to log in before you can comment on or make changes to this bug.