Closed Bug 710176 Opened 13 years ago Closed 13 years ago

Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish

Categories

(Core :: Security: PSM, defect, P1)

11 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla12
Tracking Status
firefox11 + fixed
firefox12 + fixed

People

(Reporter: smaug, Assigned: briansmith)

References

Details

(Keywords: hang, perf, regression, Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn])

Attachments

(7 files, 3 obsolete files)

153.55 KB, image/png
Details
59.32 KB, text/plain
Details
120.92 KB, image/png
Details
1.08 KB, patch
Details | Diff | Splinter Review
5.52 KB, patch
briansmith
: review+
Details | Diff | Splinter Review
4.17 KB, patch
briansmith
: review+
Details | Diff | Splinter Review
8.52 KB, patch
KaiE
: review+
smaug
: feedback+
mayhemer
: feedback+
Details | Diff | Splinter Review
I think there is some recent regression, since I haven't seen this behavior in FF9 builds: when I start Firefox with lots of tabs (including a few tbpl and ~100 bugzilla pages, all https), Firefox uses over 100% CPU time and loads pages slowly. When I hit the problem a few days ago, the Zoom profiler showed most of the time was spent doing something SSL-related. This time my Zoom evaluation license had unfortunately expired. Have there been any changes to how we handle SSL?
(In reply to Olli Pettay [:smaug] from comment #0)
> Has there been any changes to how we handle SSL?

smaug, big changes in nightly, yes. Can you supply about:buildconfig changeset IDs for either of the instances? (Probably just the recent one, I guess.)
Perhaps this is caused by processing large numbers of SSL connects at the same time? Previously, as I understand it, we could only do one at a time.
Could that be bug 705755?
Keywords: perf
I think I got this twice yesterday
Severity: major → blocker
Assignee: nobody → honzab.moz
(In reply to Olli Pettay [:smaug] from comment #4)
> I think I got this twice yesterday

Can you try the patch in bug 705755?
Also as Josh suggested, this could be related to bug 695789 (no patch so far).
Olli, at least some more detailed data would be appreciated, such as the sites you were browsing, what exact version you had, a retry with a clean profile, etc. Thanks.
Unfortunately I don't have much information to share. I use Nightlies, and occasionally one thread, which is not the main thread, starts to take 100% CPU and page loading stops. IIRC page loading stops at "Waiting for ...". Also, IRC messages (via Chatzilla) don't get through. The only way to solve the problem is to restart the browser. When I was using a build of my own (which had symbols), the profiler showed that most of the time was spent in some SSL method. Unfortunately I don't recall which method. I'll use builds with symbols again and try to get more information. The sites I have open are tbpl and bugzilla.mozilla.org (lots of tabs, all using https, so when FF starts and I switch through the tabs there can be hundreds of active https requests from Gecko to Necko, I presume), and I use Chatzilla connected to freenode, irc.w3.org, and the SSL version of moznet.
Please, please, attach a minidump or a stacktrace or whatever you can, the next time you run into this. I have not been able to reproduce it. Thanks in advance.
Olli, also if you are e.g. on Windows, you can use Process Explorer (just google for it; downloadable from TechNet), which is a much better version of the Task Manager. When you click a process, there is a "Threads" tab where you can see which thread eats CPU, and double-clicking a thread will give you a stack trace. Works quite nicely. When you had a page stuck in the "Waiting for" state, did you try to open a new URL (whether SSL or non-SSL), and did that new page load?
The uncommon use case, lack of reproducibility, and lack of duplicate reports prevents this from needing to be tracked. If any of these variables change, please re-nominate for tracking.
Attached image screenshot
And it happened again. I had lots of tabs open, some rather active ones, like tbpl and gmail and facebook, and I resumed my laptop and connected to the internet using a slow and bad-latency network. The FF UI stayed responsive, but networking basically hung, and closing FF didn't work. I had to kill it.
This is a bad regression. Asking for tracking FF11/12 again.
Now it happened with a very fast and low latency connection too.
Comment on attachment 586656 [details]
screenshot

I do have the real log of this too, but apparently sysprof creates only non-human-readable logs.
I think the problem is that the SSL poll implementation is returning "ready to read and/or write" but its read/write methods return "would block" in the case where it has finished the handshake but the server's certificate has not been validated yet. This seems to cause the socket transport service to spin, trying over and over again to send data on the connection without actually being able to send any. I believe the fix is to change the SSL socket's poll method to return "not ready for reading or writing" when it is stuck waiting for SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be called.
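A minimal sketch of the fix described above, with stand-in types: the flag values and state fields here are assumptions for illustration, not NSPR's or PSM's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for NSPR's PR_POLL_* flags (values assumed for illustration).
constexpr uint16_t POLL_READ  = 0x1;
constexpr uint16_t POLL_WRITE = 0x2;

struct SSLSocketState {
    bool handshakeDone;          // TLS handshake records fully processed
    bool certValidationPending;  // async certificate validation not yet complete
};

// While the socket is blocked waiting for SSL_AuthCertificateComplete,
// report "not ready" instead of letting the lower (TCP) layer report
// readable/writable; this keeps the socket transport service from spinning.
uint16_t sslPoll(const SSLSocketState& s, uint16_t inFlags) {
    if (s.handshakeDone && s.certValidationPending) {
        return 0;  // no events until certificate validation finishes
    }
    return inFlags;  // otherwise defer to the layers below (simplified)
}
```

The real change lands in libssl's poll implementation (bug 542832); this sketch only mirrors its observable behavior.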
Assignee: honzab.moz → nobody
Blocks: 674147, 542832
Severity: blocker → major
Component: Networking → Security: PSM
QA Contact: networking → psm
Hardware: x86_64 → All
Summary: Something in networking and/or SSL layer takes lots of processing power → Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish
Version: unspecified → 11 Branch
Assignee: nobody → bsmith
(In reply to Brian Smith (:bsmith) from comment #17)
> I believe the fix is to change the SSL socket's poll method to return "not
> ready for reading or writing" when it is stuck waiting for
> SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be
> called.

Something I was a bit afraid of, and apparently didn't catch... When the socket is in the cert verification state, i.e. is blocking, the poll implementation of the SSL socket has to return 0 (nsSSLIOLayerPoll). That bypasses the poll of the TCP (NSPR) socket layer. However, I was comparing the poll implementation with the state before the SSL thread had been introduced. Wasn't there a similar bug before the SSL thread was introduced?
(In reply to Honza Bambas (:mayhemer) from comment #19)
> However, I was comparing poll implementation with the state before SSL
> thread had been introduced. Wasn't there a similar bug before SSL thread
> had been introduced?

There couldn't be! We were not using async cert verification. So this is actually revealing a bug in libssl. I think the real fix should be made in NSS.
Please see comment 17. I believe the fix should be made in libssl since it is the layer that accepts WOULD_BLOCK from authCertificate callback and waits for resume with SSL_AuthCertificateComplete().
Sounds like others are now seeing this on Aurora. Tracking for 10/11.
Keywords: regression
Whiteboard: [psm-roadblock]
Today I even got the whole UI to hang. Something on the main thread ended up waiting for the SSL thread, which held a lock and was waiting for something else. CPU usage was 0. Unfortunately I didn't save the stack.
Any WIP patches to test? I'd be happy to try anything which could reduce the likelihood of the networking hangs.
Adding 'hang', since this is really a hang in networking. The browser UI is responsive, but you can't do anything useful with it.
Severity: major → blocker
Keywords: hang
Olli, I wonder why you run into this so frequently whereas I never do. Please try this patch: https://bug542832.bugzilla.mozilla.org/attachment.cgi?id=589264 This patch should fix this issue and is awaiting review.
Maybe my connection has something to do with it. I often use a 3G connection in a place where the connection is occasionally very unreliable (huge ping times for some time). Also, I do have 100+ tabs open and most of them are using https. Not all the tabs are really active, but there are pages like gmail, tbpl (multiple), facebook, twitter, zimbra, ...
Yes, that is exactly the situation where this fix would help. Please apply the patch and let me know if you see any similar problem with the patch. (A stack trace would be very helpful too.)
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Olli, also if you are e.g. on windows, you can use Process Explorer (just
> google for it, downloadable from tech net), which is a much better version
> of the Task Manager.
>
> When you then click a process, you have a tab "Thread", you can see which
> thread eats cpu and dblclining a thread will give you a stack trace. Works
> quit nice.
>
> When you had one page in state Waiting for, did you try to open a new URL
> (what ever ssl or non-ssl) and did that new page had loaded?

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120119 Firefox/12.0a1

I may have run into this problem a couple of times since early December. At first I guessed it was related to bug 692260, but I never had that problem before and the workaround in bug 692260 did not work for me. Anyway, a few times Nightly suddenly cannot load webpages for a minute or so but resumes normally afterwards without restarting. During that period, if I open other sites from bookmarks (SSL or non-SSL), the new pages do not load either. However, I can load them in Chrome in the meantime. I do not have many tabs open (<10), but many of them are SSL sites. This usually happens after I restart Nightly/Firefox_Portable rapidly a few times, all with different profiles (for testing), and return to my main profile.

I have recorded 3 instances from Process Explorer during the network hang, but I recorded only the thread which hogs Nightly:

1. TID 3060
Start Address: MSVCR80.dll!_endthreadex+0x61
ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_Poll+0x10a
nspr4.dll!PR_Poll+0x13
xul.dll!??1gfxFontGroup@@UAE@XZ+0xe36
xul.dll!??1gfxFontGroup@@UAE@XZ+0xc7c
xul.dll!?SurfaceDestroyFunc@gfxASurface@@CAXPAX@Z+0x24d7

2. TID 596
Start Address: MSVCR80.dll!_endthreadex+0x61
ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731
xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x16f5
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf

3. TID 2416
Start Address: MSVCR80.dll!_endthreadex+0x61
ntoskrnl.exe!memset+0x64a
ntoskrnl.exe!KeWaitForMultipleObjects+0xd52
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!__misaligned_access+0xba4
ntoskrnl.exe!__misaligned_access+0x1821
ntoskrnl.exe!__misaligned_access+0x1a97
nspr4.dll!PR_dtoa+0x9d6
nspr4.dll!PR_Unlock+0x32
nspr4.dll!PR_Poll+0x3c9
nspr4.dll!PR_Poll+0x13
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x1876
xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x169e
xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf

If my case is related to this bug, what should I, as a common user, record or pay attention to when I run into this next time?
Whiteboard: [psm-roadblock] → [psm-roadblock][patch in attachment 589264 in bug 542832]
Fanolian, thank you for the detailed info. It seems to confirm what I suspected as the cause of this bug. I think I have a patch that fixes this in bug 542832. I am just waiting for reviews.
Attached image screenshot
Got the problem again with a build which should have the SSL fix. I applied the patch and compiled using make -f client.mk build. I assume nothing unusual is needed in this case.
Attachment #590552 - Attachment is patch: false
Attachment #590552 - Attachment mime type: text/plain → image/png
I can reproduce this reliably using the next patch. We modified libssl's poll method so that it won't return PR_POLL_READ or PR_POLL_WRITE when we are blocked only on cert validation. But I failed to account for the special cases in PSM's poll method wrapper, which can return PR_POLL_READ and/or PR_POLL_WRITE even when libssl's implementation wouldn't, because of the TLS version intolerance processing. Fortunately, it seems pretty straightforward to improve the version intolerance processing so that it only happens *before* certificate validation; if we get to the point where we are validating the server's certificate, we KNOW the server wasn't version intolerant. Why? We receive the certificates only after we've successfully parsed and accepted a valid server hello; we will never get a valid server hello that libssl would accept for the connection in the version-intolerance case, AFAICT. Not only should this change fix this bug, but it should also cause fewer false positives regarding version intolerance; that is, with a patch like this, we would end up wrongly and unnecessarily dropping from TLS 1.0 to SSL 3.0 in fewer circumstances than before. So, a double win.
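The restriction described above amounts to a small predicate. This is an illustrative sketch under stated assumptions: the phase enum is hypothetical, not PSM's actual state machine.

```cpp
#include <cassert>

// Hypothetical handshake phases (names assumed for illustration).
enum class HandshakePhase { BeforeServerHello, CertValidation, Finished };

// A TLS version-intolerance timeout should only be considered before
// certificate validation starts: receiving the server's certificates
// proves a valid ServerHello was parsed and accepted, so the server
// cannot have been version intolerant.
bool versionIntoleranceTimeoutApplies(HandshakePhase phase) {
    return phase == HandshakePhase::BeforeServerHello;
}
```

Gating the timeout this way is what prevents the wrapper from reporting readable/writable while libssl is blocked on cert validation.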
Attachment #590661 - Flags: feedback?(honzab.moz)
Honza, with this patch you can connect to any SSL server and reproduce the spinning. Sometimes I had to do a reload and/or open the same site in a second tab to do so.
Comment on attachment 590661 [details] [diff] [review]
WIP: Part 1: Only do version intolerance timeout before certificate validation starts

This does not account for every type of TLS intolerance. We need to be *very* careful about changing TLS intolerance checking. Will find a new solution.
Attachment #590661 - Flags: feedback?(honzab.moz) → feedback-
Don't review this too hard. I haven't even tested it yet. Really just looking for feedback on the approach, though I will take an r+ if I somehow got it right on the first (re)try.
Attachment #590661 - Attachment is obsolete: true
Attachment #591018 - Flags: review?(honzab.moz)
By the way, I know I could have avoided adding two more member variables, but I will need them later for performance measurements (telemetry and maybe for implementing the navigation API).
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +1059,1 @@
> >               PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS));

I understand the idea: subtract the time of the verification. Sounds good, but the code is wrong for the case where the interval wraps. Having a condition "now > (mCertVerificationEnded - mCertVerificationStarted)" must make you feel something is not right ;)

You are OK with:

((now - mHandshakeStartTime) - (mCertVerificationEnded - mCertVerificationStarted)) > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS)

In general, this looks like a minimal change to the current code. I'll go through this more deeply and give you more comments soon.
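The wrap-safe form of the check can be demonstrated with plain unsigned arithmetic. `Interval` here is a stand-in for NSPR's PRIntervalTime, whose tick counter wraps around; the parameter names mirror the member names in the review comment.

```cpp
#include <cassert>
#include <cstdint>

using Interval = uint32_t;  // stand-in for NSPR's wrapping PRIntervalTime

// Subtract the time spent in certificate verification from the total
// handshake time, then compare against the timeout. Because all terms
// are unsigned differences of tick samples, the result stays correct
// even if the counter wraps between samples.
bool handshakeTimedOut(Interval now, Interval handshakeStart,
                       Interval certVerifyStart, Interval certVerifyEnd,
                       Interval timeout) {
    Interval elapsed =
        (now - handshakeStart) - (certVerifyEnd - certVerifyStart);
    return elapsed > timeout;
}
```

The key point is comparing a computed elapsed interval against the timeout, never comparing raw tick values directly.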
Comment on attachment 591018 [details] [diff] [review]
WIP: Only do version intolerance timeout before or after certificate validation.

Review of attachment 591018 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

My description of the cause:
- the handshake timed out according to PSM (socketInfo->HandshakeTimeout() == true) => poll for write | except, to push the layer above forward
- socket writable
- ssl_Send: ssl3_GatherCompleteHandshake: ss->ssl3.hs.restartTarget != NULL => WOULD_BLOCK, since nobody has called restart so far (i.e. verification is still pending)

Based on this, I can say the patch is correct with minimal impact. I also thought of resetting the timeout start value when we finish cert verification, but that might have a negative performance impact - by prolonging the timeout - or even a potentially security-sensitive impact, not that I know of one right now. That's for a security review we don't have time for now.
Attachment #591018 - Flags: review?(honzab.moz) → review+
Good, I was guessing this bug would already be known. Not that it helps now, but here's a repeating snippet from NSPR_LOG_MODULES='nsSocketTransport:5':

706737920[7f743675b6a0]: poll timeout: 65535
706737920[7f743675b6a0]: timeout = 65535000 milliseconds
706737920[7f743675b6a0]: ...returned after 0 milliseconds
706737920[7f743675b6a0]: nsSocketTransport::OnSocketReady [this=77bb6a0 outFlags=6]
706737920[7f743675b6a0]: nsSocketOutputStream::OnSocketReady [this=77bb7f8 cond=0]
706737920[7f743675b6a0]: nsSocketOutputStream::Write [this=77bb7f8 count=396]
706737920[7f743675b6a0]: calling PR_Write [count=396]
706737920[7f743675b6a0]: PR_Write returned [n=-1]
706737920[7f743675b6a0]: nsSocketOutputStream::AsyncWait [this=77bb7f8]
706737920[7f743675b6a0]: STS poll iter [1]

...which seemed odd to me, since I knew it was ending up doing a recvfrom(). This bug explains it. Now I have to go dig up a build from before this was introduced. Any hints? Or is there a way to turn off the async cert validation/SSL thread?
sfink, for some reason this bug occurs very rarely for me even using the latest Nightly. But, recently I did have it happen to me. Note that when it happens, eventually the hang will "go away" after ~1 minute and everything will work properly again. The patch that introduced this bug is in bug 674147. So, you would need to find a build from before then.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
This is 100% reproducible for me while on my home network, and I've left it for at least 23 minutes without it resolving itself. But this is with around 300 tabs, maybe a dozen pointing to google properties. I'm a little confused by which patch is which -- would you like me to try one? Thanks for the pointer to the bug! I'm trying to track down another major perf problem and this one was masking it. And the other one has been there since at least FF9, so I can easily back up to before bug 674147. But I don't think it's networking-related.
Here is the patch that adds the fix for the bug to mozilla-central until the fix can land in the next NSS release. It is a slightly modified version of the patch that Wan-Teh already reviewed.
Attachment #591698 - Flags: review?(kaie)
(In reply to Steve Fink [:sfink] from comment #42)
> This is 100% reproducible for me while on my home network, and I've left it
> for at least 23 minutes without it resolving itself. But this is with around
> 300 tabs, maybe a dozen pointing to google properties.
>
> I'm a little confused by which patch is which -- would you like me to try
> one?

Please try the patch that I just attached here. It isn't surprising that we would find this problem on Google properties, because the Google servers are probably much faster than the OCSP responder.
Building now, but I thought I should mention that although it appears to be 100% reproducible (for now), it doesn't happen immediately after starting up. It takes about 3 or 4 minutes before it kicks in. I detect it by running watch -d 'ps -L <firefox-pid>' and watching the 'TIME' column. It'll constantly increase for the first entry (the main thread), but none of the other threads will accumulate very much time until the problem kicks in. At that point, the 5th listed thread (which will always be the firefox pid plus 4, which probably only holds true on Linux) will start accumulating time. It makes it easy to spot. (And to verify, I'll run strace -p <problematic-tid> to see lots of recvfrom() calls.)
I ran with the patch this morning. I'm still getting the problem. It's very slightly different -- previously it was always mainpid+4 that was the thread that spun on recvfrom(); now it is mainpid+7. Which could be a complete red herring. It's going against the same server as yesterday. Just that thread has accumulated 55 minutes of runtime now, so I don't think it's going to resolve itself anytime soon. Anything you'd like me to try? Unfortunately for this, I'm about to drive into work, and I doubt I'll be able to replicate it on the work network. I could try artificially slowing it down -- do you have any ideas about what network characteristics would be relevant? I could use netem to add latency, drop packets, or reorder packets, if any of those seem relevant. (Not that I'm noticing any dropped or reordered packets from my home network.) It might involve buffer sizes or something rather than simple latency, and my netem-fu is weak.

Wow! That's a total lie! I was basing that on the 1e100.net server that I happened to get when resolving www.google.com. If I switch to the actual one I'm going against, which is pz-in-f84.1e100.net, I'm seeing *horrible* network performance -- 64% packet loss, 34ms ping times. (The first google server I tried had 0% packet loss, 13ms pings.) So I hypothesize that packet loss is the relevant trigger, and perhaps this problem shows up when one of google's servers gets messed up. Let me check what it looks like from the MV network... 0% packet loss, 24ms pings. And yes, it's still bad from my local network: 70% loss, 33-50ms rtt.

Any idea why I keep getting that server when I restore my session? That hostname does not appear in my sessionstore.js. Nor does the IP (74.125.127.84). But every time I restart the browser, it ends up going to that exact server. I hope you don't need to be coming from my public IP to reproduce this...
Comment on attachment 591698 [details] [diff] [review]
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

I understand this is for early testing of bug 542832, which does not yet have a final review of this patch. r=kaie for the early testing in mozilla-central only.
Attachment #591698 - Flags: review?(kaie) → review+
Attachment #591698 - Attachment description: Import ssl_Poll fix from bug 542832 → Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]
(In reply to Steve Fink [:sfink] from comment #46)
> I ran with the patch this morning. I'm still getting the problem. It's very
> slightly different -- previously, it was always mainpid+4 that was the
> thread that spun on recvfrom(); now it is mainpid+7. Which could be a
> complete red herring.
>
> It's going against the same server as yesterday. Just that thread has
> accumulated 55 minutes of runtime now, so I don't think it's going to
> resolve itself anytime soon.

Sorry, you need to apply the WIP patch above AND the ssl_Poll patch above. I am going to clean the WIP patch up and check both of them into mozilla-central today.
A bunch of us are seeing this bug (or at least similar characteristics) at the performance work week - possibly the hotel's dodgy connections are highlighting the problem?
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: Users will experience infrequent hangs of the browser when connecting to SSL sites
Testing completed (on m-c, etc.):
Risk to taking this patch (and alternatives if risky): See Honza's comment above. This seems like the lowest-risk change that we can make to solve the problem.
String changes made by this patch: none
Attachment #591018 - Attachment is obsolete: true
Attachment #593043 - Flags: review+
Attachment #593043 - Flags: approval-mozilla-beta?
Attachment #593043 - Flags: approval-mozilla-aurora?
Kai, you said that this should only land on mozilla-central, but mozilla-aurora and mozilla-beta also REALLY need this fix. We should land this on Aurora and Beta ASAP.

[Approval Request Comment]
Regression caused by (bug #): bug 674147
User impact if declined: Users will sometimes experience long hangs when connecting to SSL sites.
Testing completed (on m-c, etc.):
Risk to taking this patch (and alternatives if risky): There is a moderate amount of risk to this patch, but the alternative is to back out bug 674147 and everything that depends on it, including SPDY, which would be even more risky as some of the SPDY changes also affect non-SPDY HTTP code.
String changes made by this patch: none
Attachment #591698 - Attachment is obsolete: true
Attachment #593044 - Flags: review+
Attachment #593044 - Flags: approval-mozilla-beta?
Attachment #593044 - Flags: approval-mozilla-aurora?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
Comment on attachment 590552 [details]
screenshot

Got this again today, Gecko/20120201 Firefox/13.0a1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P1
Whiteboard: [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]
Comment on attachment 593043 [details] [diff] [review]
Part 1: Do not do TLS intolerance timeout during cert validation [v3]

[Triage Comment]
This has now had a couple of days to bake on m-c. Should prevent browser hangs with SSL. Given where we are in the cycle, approving for both Aurora 12 and Beta 11.
Attachment #593043 - Flags: approval-mozilla-beta?
Attachment #593043 - Flags: approval-mozilla-beta+
Attachment #593043 - Flags: approval-mozilla-aurora?
Attachment #593043 - Flags: approval-mozilla-aurora+
Attachment #593044 - Flags: approval-mozilla-beta?
Attachment #593044 - Flags: approval-mozilla-beta+
Attachment #593044 - Flags: approval-mozilla-aurora?
Attachment #593044 - Flags: approval-mozilla-aurora+
This is really horrible with the FOSDEM network. It has already happened several times this morning.
Please try running Firefox with logging enabled and send me the log after it happens again:

NSPR_LOG_MODULES=pipnss:5,nsSocketTransport:5,timestamp
NSPR_LOG_FILE=log.txt

Also, do you have SPDY enabled? If so, after reproducing it once, please disable SPDY support (network.http.spdy.enabled=false). I think this may be a difference in behavior between Linux and Windows, because I cannot reproduce this on Windows 7 x64 at all after my fix landed. I will try on Linux now.
Olli, do you think you have times where you are waiting 25 seconds or more to connect to a website in these situations? If so, then it might be caused by yet another problem with the handshake timeout that I think I just found. Please try this patch and report back. If this patch does NOT solve your problem, then try the following:

1. Disable TLS 1.0 (security.enable_tls=false) and enable SSL 3.0 (security.enable_ssl3=true).
2. Disable OCSP (security.OCSP.enabled=0).

And report back the results. (But try the above patch first.)

On timeout, the PSM poll method returns "go ahead, try to send or receive" so that the send/receive methods can return the timeout error. Except, the send/receive methods never actually check for timeout like the old SSL-thread-based ones did. Oops! Still, I am having trouble reproducing the busy loop even without this patch, so I am not sure it is the complete solution.
Attachment #594508 - Flags: feedback?(honzab.moz)
Attachment #594508 - Flags: feedback?(bugs)
(In reply to Brian Smith (:bsmith) from comment #58)
> Created attachment 594508 [details] [diff] [review]
> Actually abort the connection on send/receive if handshake timed out
>
> Olli, do you think you have times where you are waiting 25 seconds or more
> to connect to a website in these situations?

Very possible. I've seen the problem usually when I'm using a bad 3G connection, or here @FOSDEM where the network connection is occasionally slow. I'll test the patch ASAP.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

I've been running a build with this patch on a bad network, and so far no problems. Yesterday I got the hang several times using a build without the patch.
Attachment #594508 - Flags: feedback?(bugs) → feedback+
I'm also still seeing this with the 2012-02-03 nightly, from my home DSL network that has some questionable latency to certain sites. I'm recompiling with the patch now, though I don't know if I'll have time to try it out before I go into the work network tomorrow. Same as smaug, the symptom is a complete hang of any networking, but the UI is responsive. I left it going for about 2 full days and it never made progress on the network connections; it was still spinning on recvfrom() when I killed it. I'm not sure if this is related, but I'm also getting a permanent hang on shutdown. And unfortunately, the last time I killed it, it mangled my sessionstore so I can no longer restore that session. Which may make it harder to reproduce, among other things.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Kai, the most important part of this patch is the added line:

socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

That fixes *this* bug, but in a way that still causes TLS intolerance due to handshake timeout to be handled incorrectly. The rest of the patch is about getting TLS intolerance to work correctly in the timeout scenario.
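A sketch of what that added line accomplishes, with hypothetical types standing in for PSM's nsNSSSocketInfo and NSPR's error codes: a send/receive that hits "would block" after the handshake has timed out now cancels the socket and surfaces a hard error instead of letting the caller retry forever.

```cpp
#include <cassert>

// Illustrative stand-ins; PSM's real code uses nsNSSSocketInfo and
// PR error codes such as PR_CONNECT_RESET_ERROR.
enum class IOResult { Ok, WouldBlock, ConnectionReset };

struct SocketInfo {
    bool handshakeTimedOut = false;
    bool canceled = false;
};

// Post-processing applied to the lower layer's send/receive result:
// on a timed-out handshake, convert "would block" into a real error.
IOResult checkHandshake(SocketInfo& info, IOResult lowerResult) {
    if (lowerResult == IOResult::WouldBlock && info.handshakeTimedOut) {
        info.canceled = true;              // socketInfo->SetCanceled(...)
        return IOResult::ConnectionReset;  // caller aborts the connection
    }
    return lowerResult;
}
```

Once the socket is canceled, subsequent poll/write calls see a terminal error, which is what breaks the retry cycle.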
Attachment #594508 - Flags: review?(kaie)
Here is the tryserver run, still in progress: https://tbpl.mozilla.org/?tree=Try&rev=ca57a65c69e6
My browser seems to be much happier with that patch (I managed to restore my session with mv _closedWindows/* windows/). There's another thread spinning on something else, but that appears to be intentional and not network related.
I haven't got any networking hangs after I started to use the latest patch.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Review of attachment 594508 [details] [diff] [review]:
-----------------------------------------------------------------

Good, but I would be interested in what the code path is that causes the loop w/o this patch, for reference. Brian, can you please comment on the bug with it? Thanks.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +2121,4 @@
> NS_ASSERTION(in_flags & PR_POLL_EXCEPT,
>              "caller did not poll for EXCEPT (handshake timeout)");
> *out_flags = in_flags | PR_POLL_EXCEPT;
> + socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage);

In general, RESET_ERROR can only be used when we haven't sent any application data. If this happens during a rehandshake we may duplicate the send of a POST, for instance. PR_CONNECT_ABORTED_ERROR should then be used. This doesn't necessarily need to be handled in this bug.
Attachment #594508 - Flags: feedback?(honzab.moz) → feedback+
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Approval Request Comment]
Regression caused by (bug #): ?
User impact if declined: FF networking hangs
Testing completed (on m-c, etc.):
Risk to taking this patch (and alternatives if risky):
String changes made by this patch:
Attachment #594508 - Flags: approval-mozilla-beta?
Attachment #594508 - Flags: approval-mozilla-aurora?
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

Thanks everyone for your work and testing on this bug. Regarding Honza's warning related to RESET_ERROR, is it correct that we only ever allow a handshake timeout at the very beginning of a new socket? In other words, we never run through the handshake-timeout code whenever a socket might perform an SSL renegotiation (I believe that would be handled transparently inside libSSL at the NSS level). Under this assumption, if we're still at the initial handshake, then we haven't yet transferred any application data, and setting RESET_ERROR should be fine. Your calling of checkHandshake seems reasonable from what I remember. Given that you have thoroughly tested this patch and have gotten positive feedback for a very bad regression, I'm fine with this patch. r=kaie

This is also a reminder that we urgently need to get bug 542832 sorted and get the newer NSS landed into mozilla-beta.
Attachment #594508 - Flags: review?(kaie) → review+
Any chance to get this landed? I just accidentally ran a build without the fix and got the hang immediately when I switched to a bad network.
(In reply to Honza Bambas (:mayhemer) from comment #66)
> Good, but I would be interested in what is the code path that causes the
> loop w/o this patch, for reference. Brian, can you please comment on the
> bug with it? Thanks.

nsSSLIOLayerPoll hits the HandshakeTimeout condition and tells the application the socket is readable/writable. The application tries to PR_Write, but the PR_Write ends up doing a read on the network socket, because we are waiting for one of the server's hello messages. But we don't have that message yet, so that read returns "would block" and therefore PR_Write returns "would block." Then the application calls poll again, which again hits the HandshakeTimeout condition, and it all repeats.

> In general, RESET_ERROR can only be used when we haven't sent any
> application data.

To expand on what Kai said, we should never be timing out after we've sent application data; there is no positive gain in doing TLS intolerance timeouts during a renegotiation.
Comment on attachment 594508 [details] [diff] [review]
Actually abort the connection on send/receive if handshake timed out

[Triage Comment]
Prevents hangs caused by recent changes in SSL handling. Approved for Aurora 12 and Beta 11.
Attachment #594508 - Flags: approval-mozilla-beta?
Attachment #594508 - Flags: approval-mozilla-beta+
Attachment #594508 - Flags: approval-mozilla-aurora?
Attachment #594508 - Flags: approval-mozilla-aurora+
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1]
(In reply to Marco Bonardo [:mak] from comment #71)
> https://hg.mozilla.org/mozilla-central/rev/fbf93932695e
>
> can this be closed?

Yes.
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
What is the STR for this bug so QA can verify it?
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?]
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #75)
> What is the STR for this bug so QA can verify it?

I think it has been verified in comment 65.
Hmm... I'm not sure that makes this confirmed as verified on Firefox 11 and 12. I'm a little unclear on what, if anything, QA can do to verify this is fixed across all patched versions.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-]
It probably hasn't been verified by anyone on m-a or m-b, though. STR:

1. Grab a MiFi (or tethering to a phone would probably work).
2. Head out into the country; you want to get down to just a couple of bars of reception, and one would be better. Take your pre-patch and post-patch builds along, since you don't want to be downloading them over that connection.
3. Load tbpl.mozilla.org. It wouldn't hurt to open several tabs of tbpl for different busy trees, say mozilla-inbound and try and profiling and maple.
4. Watch the CPU, or just listen for the fan; since you're out in the country and it's quiet, you'll hear it spin up. Once it does, verify that no networking is working: you can open a tab and try to load some unrelated site, but it won't actually load until you kill the tbpl tabs.

Then comes the fun part: doing it again with the post-patch builds, and just sitting around watching the clouds go by and listening to the birds, while waiting to not hear the fan spin up.

You're welcome :)
Given those steps, QA will be hard-pressed to verify this fix. I think it would be far more reasonable for Olli to verify this on Firefox 11 and 12, as was presumably done for Firefox 13 in comment 65.
Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn]