Closed Bug 710176 Opened 13 years ago Closed 13 years ago

Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla12

Tracking Flags:

Tracking

Status

firefox11

fixed

firefox12

fixed

People

(Reporter: smaug, Assigned: briansmith)

References

Details

(Keywords: hang, perf, regression, Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn])

Attachments

(7 files, 3 obsolete files)

screenshot 13 years ago Olli Pettay [:smaug][bugs@pettay.fi] 153.55 KB, image/png		Details
Profile from Andreas Gal from bug 716071 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 59.32 KB, text/plain		Details
screenshot 13 years ago Olli Pettay [:smaug][bugs@pettay.fi] 120.92 KB, image/png		Details
WIP: Part 1: Only do version intolerance timeout before certificate validation starts 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 11.72 KB, patch	briansmith : feedback-	Details \| Diff \| Splinter Review
[NOT FOR CHECKIN] force cert validation to take a very long time so we can reliably reproduce this 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 1.08 KB, patch		Details \| Diff \| Splinter Review
WIP: Only do version intolerance timeout before or after certificate validation. 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 5.88 KB, patch	mayhemer : review+	Details \| Diff \| Splinter Review
Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing] 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2.76 KB, patch	KaiE : review+	Details \| Diff \| Splinter Review
Part 1: Do not do TLS intolerance timeout during cert validation [v3] 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 5.52 KB, patch	briansmith : review+ akeybl : approval-mozilla-aurora+ akeybl : approval-mozilla-beta+	Details \| Diff \| Splinter Review
Part 2: Import ssl_Poll fix from bug 542832 [v3] 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 4.17 KB, patch	briansmith : review+ akeybl : approval-mozilla-aurora+ akeybl : approval-mozilla-beta+	Details \| Diff \| Splinter Review
Actually abort the connection on send/receive if handshake timed out 13 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 8.52 KB, patch	KaiE : review+ smaug : feedback+ mayhemer : feedback+ akeybl : approval-mozilla-aurora+ akeybl : approval-mozilla-beta+	Details \| Diff \| Splinter Review

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Description

•

13 years ago

I think there is some recent regression, since I haven't seen the behavior in FF9 builds: When I start Firefox with lots of tabs (including few tbpl, and ~100 bugzilla pages, all https) Firefox uses over 100% cpu time and loads pages slowly. When I got the problem last time few days ago, Zoom profiler showed most of the time was spent doing something SSL related. This time my Zoom evaluation license was unfortunately expired. Has there been any changes to how we handle SSL?

Patrick McManus [:mcmanus]

Comment 1

•

13 years ago

(In reply to Olli Pettay [:smaug] from comment #0) > > Has there been any changes to how we handle SSL? smaug, big changes in nightly, yes. Can you supply about:buildconfig changeset id's for either of the instances? (probably just the recent one I guess).

Josh Aas

Comment 2

•

13 years ago

Perhaps this is caused by processing large numbers of SSL connects at the same time? Previously, as I understand it, we could only do one at a time.

Honza Bambas (:mayhemer)

Comment 3

•

13 years ago

Could that be bug 705755?

Wayne Mery (:wsmwk)

Updated

•

13 years ago

Keywords: perf

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 4

•

13 years ago

I think I got this twice yesterday

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Updated

•

13 years ago

Severity: major → blocker

tracking-firefox11: --- → ?

tracking-firefox12: --- → ?

Josh Aas

Updated

•

13 years ago

Assignee: nobody → honzab.moz

Honza Bambas (:mayhemer)

Comment 5

•

13 years ago

(In reply to Olli Pettay [:smaug] from comment #4) > I think I got this twice yesterday Can you try patch in bug 705755?

Honza Bambas (:mayhemer)

Comment 6

•

13 years ago

Also as Josh suggested, this could be related to bug 695789 (no patch so far).

Honza Bambas (:mayhemer)

Comment 7

•

13 years ago

Olli, at least some more detailed data would be appreciated like sites your were browsing, what exact version did you have, retry with a clean profile etc. Thanks.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 8

•

13 years ago

Unfortunately I don't have much information to share. I use Nightlies and occasionally one thread, which is not the main thread, starts to take 100% cpu and page loading stops. IIRC page loading stops to "Waiting for ...". Also, IRC messages (via Chatzilla) don't get through. The only way to solve the problem is to restart browser. When I was using a build of my own (which had symbols) profiler showed that most of the time was spent in some SSL method. Unfortunately I don't recall which method. I'll use builds with symbols again and try to get more information. The sites I have open are tbpl and bugzilla.mozilla.org (lots of tabs, all using https so when FF starts, and I switch through the tab there can be hundreds of active https requests from Gecko to Necko, I presume) and I use Chatzilla connected to freenode, irc.w3.org, and SSL version of moznet.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 9

•

13 years ago

Please, please, attach a minidump or a stacktrace or whatever you can, the next time you run into this. I have not been able to reproduce it. Thanks in advance.

Honza Bambas (:mayhemer)

Comment 10

•

13 years ago

Olli, also if you are e.g. on windows, you can use Process Explorer (just google for it, downloadable from tech net), which is a much better version of the Task Manager. When you then click a process, you have a tab "Thread", you can see which thread eats cpu and dblclining a thread will give you a stack trace. Works quit nice. When you had one page in state Waiting for, did you try to open a new URL (what ever ssl or non-ssl) and did that new page had loaded?

Alex Keybl [:akeybl]

Comment 11

•

13 years ago

The uncommon use case, lack of reproducibility, and lack of duplicate reports prevents this from needing to be tracked. If any of these variables change, please re-nominate for tracking.

tracking-firefox11: ? → -

tracking-firefox12: ? → -

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 12

•

13 years ago

Attached image screenshot — Details

A it happened again. I had lots of tabs open, some rather active ones, like tbpl and gmail and facebook, and I resumed laptop and connected to internet using slow and bad-latency network. FF UI stayed responsive, but networking basically hanged, and closing FF didn't work. I had to kill it.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 13

•

13 years ago

this is a bad regression. Asking tracking FF11/12 again.

tracking-firefox11: - → ?

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 14

•

13 years ago

Now it happened with a very fast and low latency connection too.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 15

•

13 years ago

Comment on attachment 586656 [details] screenshot I do have the real log of this too, but apparently sysprof creates only non-human readable logs.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 17

•

13 years ago

I think the problem is that the SSL poll implementation is returning "ready to read and/or write" but its read/write methods return "would block" in the case that the it has finished the handshake but the server's certificate has not been validated yet. This seems to cause the socket transport service to spin, trying over and over again to send data on the connection without actually being able to send any data. I believe the fix is to change the SSL socket's poll method to return "not ready for reading or writing" when it is stuck waiting for SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be called.

Assignee: honzab.moz → nobody

Blocks: 674147, 542832

Severity: blocker → major

Component: Networking → Security: PSM

QA Contact: networking → psm

Hardware: x86_64 → All

Summary: Something in networking and/or SSL layer takes lots of processing power → Socket transport service thread pegs the CPU spinning to send data on a SSL socket that is blocking waiting for certificate validation to finish

Version: unspecified → 11 Branch

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Updated

•

13 years ago

Assignee: nobody → bsmith

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 18

•

13 years ago

Attached file Profile from Andreas Gal from bug 716071 — Details

Honza Bambas (:mayhemer)

Comment 19

•

13 years ago

(In reply to Brian Smith (:bsmith) from comment #17) > I believe the fix is to change the SSL socket's poll method to return "not > ready for reading or writing" when it is stuck waiting for > SSL_RestartHandshakeAfterAuthCertificate/SSL_AuthCertificateComplete to be > called. Something I was a bit afraid of.. And didn't catch apparently... When the socket is in state of cert verification process, i.e. is blocking, poll implementation of the ssl socket has to return 0 (nsSSLIOLayerPoll). That will bypass poll of the TCP (NSPR) socket layer. However, I was comparing poll implementation with the state before SSL thread had been introduced. Wasn't there a similar bug before SSL thread had been introduced?

Honza Bambas (:mayhemer)

Comment 20

•

13 years ago

(In reply to Honza Bambas (:mayhemer) from comment #19) > However, I was comparing poll implementation with the state before SSL > thread had been introduced. Wasn't there a similar bug before SSL thread > had been introduced? There couldn't be! We were not using async cert verification. So, this is actually revealing a bug in libssl. I think the real fix should be made in NSS.

Honza Bambas (:mayhemer)

Comment 21

•

13 years ago

Please see comment 17. I believe the fix should be made in libssl since it is the layer that accepts WOULD_BLOCK from authCertificate callback and waits for resume with SSL_AuthCertificateComplete().

Alex Keybl [:akeybl]

Comment 22

•

13 years ago

Sounds like others are now seeing this on Aurora. Tracking for 10/11.

tracking-firefox11: ? → +

tracking-firefox12: - → +

Kai Engert (:KaiE:)

Updated

•

13 years ago

Keywords: regression

Whiteboard: [psm-roadblock]

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 23

•

13 years ago

Today I got even the whole UI hang. Something in main thread ended up waiting for SSL threads, which had a lock and was waiting for something else. CPU usage was 0. Unfortunately I didn't save the stack.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 24

•

13 years ago

Any WIP patches to test? I'd be happy try anything which could reduce the likelihood for the networking hangs.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 25

•

13 years ago

Adding 'hang', since this is really a hang in networking. Browser UI is working, but you can't do anything with it.

Severity: major → blocker

Keywords: hang

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 26

•

13 years ago

Olli, I wonder why you run into this so frequently whereas I never do. Please try this patch: https://bug542832.bugzilla.mozilla.org/attachment.cgi?id=589264 This patch should fix this issue and is awaiting review.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 27

•

13 years ago

Maybe my connection has something to do with it. I use often a 3G connection in a place where connection is occasionally very unreliable (huge ping times for some time). Also, I do have 100+ tabs open and most of them are using https. Not all the tabs are really active, but there are pages like gmail, tbpl (multiple), facebook, twitter, zimbra, ...

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 28

•

13 years ago

Yes, that is exactly the situation where this fix would would help. Please apply the patch and let me know if you see any similar problem with the patch. (stack trace would be very helpful too.)

Fanolian

Comment 29

•

13 years ago

(In reply to Honza Bambas (:mayhemer) from comment #10) > Olli, also if you are e.g. on windows, you can use Process Explorer (just > google for it, downloadable from tech net), which is a much better version > of the Task Manager. > > When you then click a process, you have a tab "Thread", you can see which > thread eats cpu and dblclining a thread will give you a stack trace. Works > quit nice. > > When you had one page in state Waiting for, did you try to open a new URL > (what ever ssl or non-ssl) and did that new page had loaded? Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120119 Firefox/12.0a1 I may have run into this problem a couple of times since early December. At first I guessed it was related to bug 692260 but I never had that problem before and the workaround in bug 692260 did not work for me. Anyway, a few times Nightly cannot load webpages suddenly for a minute or so but resumes normal afterwards without restarting. During that period if I open other sites from bookmark (SSL or non-SSL) the new pages do not load as well. However I can load them in Chrome in the mean time. I do not have many tabs opened (<10) but many of them are SSL sites. This usually happens after I restart Nightly/Firefox_Portable rapidly a few times, all with different profiles (for testing), and return to my main profile. I have recorded 3 instances from Process Explorer during the network hang, but I recorded the one which hogs Nightly only: 1. TID 3060 Start Address: MSVCR80.dll!_endthreadex+0x61 ntoskrnl.exe!memset+0x64a ntoskrnl.exe!KeWaitForMultipleObjects+0xd52 ntoskrnl.exe!KeWaitForMutexObject+0x19f ntoskrnl.exe!__misaligned_access+0xba4 ntoskrnl.exe!__misaligned_access+0x1821 ntoskrnl.exe!__misaligned_access+0x1a97 nspr4.dll!PR_Poll+0x10a nspr4.dll!PR_Poll+0x13 xul.dll!??1gfxFontGroup@@UAE@XZ+0xe36 xul.dll!??1gfxFontGroup@@UAE@XZ+0xc7c xul.dll!?SurfaceDestroyFunc@gfxASurface@@CAXPAX@Z+0x24d7 =================================== 2. TID 596 Start Address: MSVCR80.dll!_endthreadex+0x61 ntoskrnl.exe!memset+0x64a ntoskrnl.exe!KeWaitForMultipleObjects+0xd52 ntoskrnl.exe!KeWaitForMutexObject+0x19f ntoskrnl.exe!__misaligned_access+0xba4 ntoskrnl.exe!__misaligned_access+0x1821 ntoskrnl.exe!__misaligned_access+0x1a97 xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731 xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4 xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x3731 xul.dll!?Contains@?$BaseRect@NUgfxRect@@UgfxPoint@@UgfxSize@@UgfxMargin@@@gfx@mozilla@@QBE_NABUgfxRect@@@Z+0x36b4 xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x16f5 xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf ================================ 3. TID 2416 Start Address: MSVCR80.dll!_endthreadex+0x61 ntoskrnl.exe!memset+0x64a ntoskrnl.exe!KeWaitForMultipleObjects+0xd52 ntoskrnl.exe!KeWaitForMutexObject+0x19f ntoskrnl.exe!__misaligned_access+0xba4 ntoskrnl.exe!__misaligned_access+0x1821 ntoskrnl.exe!__misaligned_access+0x1a97 nspr4.dll!PR_dtoa+0x9d6 nspr4.dll!PR_Unlock+0x32 nspr4.dll!PR_Poll+0x3c9 nspr4.dll!PR_Poll+0x13 xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x1876 xul.dll!??0gfxImageSurface@@QAE@ABUnsIntSize@@W4gfxImageFormat@gfxASurface@@@Z+0x169e xul.dll!?GetLastChild@ContainerLayer@layers@mozilla@@UAEPAVLayer@23@XZ+0xaf ======================================== If my case is related to this bug, what should I, as a common user, record or pay attention to when I run into this next time?

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Updated

•

13 years ago

Whiteboard: [psm-roadblock] → [psm-roadblock][patch in attachment 589264 in bug 542832]

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 30

•

13 years ago

Fanolian. That you for the detailed info. It seems to confirm what I suspected as the cause of this bug. I think I have a patch that fixes this in bug 542832. I am just waiting for reviews.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 31

•

13 years ago

Attached image screenshot — Details

Got the problem again with a build which should have the ssl fix. I applied the patch and compile using make -f client.mk build I assume nothing unsual is needed in this case.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Updated

•

13 years ago

Attachment #590552 - Attachment is patch: false

Attachment #590552 - Attachment mime type: text/plain → image/png

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 32

•

13 years ago

Attached patch WIP: Part 1: Only do version intolerance timeout before certificate validation starts (obsolete) — Details — Splinter Review

I can reproduce this reliably using the next patch. We modified libssl's poll method so that it won't return PR_POLL_READ or PR_POLL_WRITE when we are blocked only on cert validation. But, I failed to account for the special cases in PSM's poll method wrapper which can return PR_POLL_READ and/or PR_POLL_WRITE even when libssl's implementation wouldn't, because of the TLS version intolerance processing. Fortunately, it seems pretty straightforward to improve the version intolerance processing so that it only happens *before* certificate validation; if we get to the point where we are validating the server's certificate, we KNOW the server wasn't version intolerant. Why?: we receive the certificates only after we've successfully parsed and accepted valid server hello; we will never get a valid server hello that libssl would accept for the connection in the version intolerance case, AFAICT. Not only should this change fix this bug, but it should also cause fewer false positives regarding version intolerance; that is, with a patch like this, we would end up wrongly and unnecessarily dropping from TLS 1.0 to SSL 3.0 in fewer circumstances than before. So, a double-win.

Attachment #590661 - Flags: feedback?(honzab.moz)

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 33

•

13 years ago

Attached patch [NOT FOR CHECKIN] force cert validation to take a very long time so we can reliably reproduce this — Details — Splinter Review

Honza, with this patch you can connect to any SSL server and reproduce the spinning. Sometimes I had to do a reload and/or open the same site in a second tab to do so.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 34

•

13 years ago

Comment on attachment 590661 [details] [diff] [review] WIP: Part 1: Only do version intolerance timeout before certificate validation starts This does not account for every type of TLS intolerance. We need to be *very* careful about changing TLS intolerance checking. Will find a new solution.

Attachment #590661 - Flags: feedback?(honzab.moz) → feedback-

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 35

•

13 years ago

Attached patch WIP: Only do version intolerance timeout before or after certificate validation. (obsolete) — Details — Splinter Review

Don't review this too hard. I haven't even tested it yet. Really just looking for feedback on the approach, though I will take an r+ if I somehow got it right on the first (re)try.

Attachment #590661 - Attachment is obsolete: true

Attachment #591018 - Flags: review?(honzab.moz)

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 36

•

13 years ago

By the way, I know I could have avoided adding two more member variables, but I will need them later for performance measurements (telemetry and maybe for implementing the navigation API).

Honza Bambas (:mayhemer)

Comment 37

•

13 years ago

Comment on attachment 591018 [details] [diff] [review] WIP: Only do version intolerance timeout before or after certificate validation. Review of attachment 591018 [details] [diff] [review]: ----------------------------------------------------------------- ::: security/manager/ssl/src/nsNSSIOLayer.cpp @@ +1059,1 @@ > > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS)); I understand the idea - to subtract the time of the verification. Sounds good, but the code is wrong for case the interval wraps. Having a condition "now > (mCertVerificationEnded - mCertVerificationStarted)" must make you feel something is not right ;) You are ok with: ((now - mHandshakeStartTime) - (mCertVerificationEnded - mCertVerificationStarted)) > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS)); In general, this looks like a minimalistic hit into the current code. I'll go through this more deeply and give you more comments soon.

Honza Bambas (:mayhemer)

Comment 38

•

13 years ago

Comment on attachment 591018 [details] [diff] [review] WIP: Only do version intolerance timeout before or after certificate validation. Review of attachment 591018 [details] [diff] [review]: ----------------------------------------------------------------- r=honzab My description of the cause: - handshake timed according PSM out (socketInfo->HandshakeTimeout() == true => poll for write | except to push above layer forward - socket writable - ssl_Send:ssl3_GatherCompleteHandshake: ss->ssl3.hs.restartTarget != NULL => WOULD_BLOCK since nobody called restart so far (i.e. verification is still pending) Based on this, I can say the patch is correct with minimal impact. I also think of resetting the time out start value when we finish cert verification, but that might have a performance negative impact - by prolonging the timeout - or even potential security sensitive impact, not that I would know of right now. That's for a security review we don't have time for now.

Attachment #591018 - Flags: review?(honzab.moz) → review+

Steve Fink [:sfink] [:s:]

Comment 40

•

13 years ago

Good, I was guessing this bug would already be known. Not that it helps now, but here's a repeating snippet from NSPR_LOG_MODULES='nsSocketTransport:5': 706737920[7f743675b6a0]: poll timeout: 65535 706737920[7f743675b6a0]: timeout = 65535000 milliseconds 706737920[7f743675b6a0]: ...returned after 0 milliseconds 706737920[7f743675b6a0]: nsSocketTransport::OnSocketReady [this=77bb6a0 outFlags=6] 706737920[7f743675b6a0]: nsSocketOutputStream::OnSocketReady [this=77bb7f8 cond=0] 706737920[7f743675b6a0]: nsSocketOutputStream::Write [this=77bb7f8 count=396] 706737920[7f743675b6a0]: calling PR_Write [count=396] 706737920[7f743675b6a0]: PR_Write returned [n=-1] 706737920[7f743675b6a0]: nsSocketOutputStream::AsyncWait [this=77bb7f8] 706737920[7f743675b6a0]: STS poll iter [1] ...which seemed odd to me, since I knew it was ending up doing a recvfrom(). This bug explains it. Now I have to go dig up a build from before this was introduced. Any hints? Or is there a way to turn off the async cert validation/SSL thread?

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 41

•

13 years ago

sfink, for some reason this bug occurs very rarely for me even using the latest Nightly. But, recently I did have it happen to me. Note that when it happens, eventually the hang will "go away" after ~1 minute and everything will work properly again. The patch that introduced this bug is in bug 674147. So, you would need to find a build from before then.

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]

Steve Fink [:sfink] [:s:]

Comment 42

•

13 years ago

This is 100% reproducible for me while on my home network, and I've left it for at least 23 minutes without it resolving itself. But this is with around 300 tabs, maybe a dozen pointing to google properties. I'm a little confused by which patch is which -- would you like me to try one? Thanks for the pointer to the bug! I'm trying to track down another major perf problem and this one was masking it. And the other one has been there since at least FF9, so I can easily back up to before bug 674147. But I don't think it's networking-related.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 43

•

13 years ago

Attached patch Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing] (obsolete) — Details — Splinter Review

Here is the patch that adds the fix for the bug to mozilla-central until the fix can land in the next NSS release. It is a slightly modified version of the patch that Wan-Teh already reviewed.

Attachment #591698 - Flags: review?(kaie)

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 44

•

13 years ago

(In reply to Steve Fink [:sfink] from comment #42) > This is 100% reproducible for me while on my home network, and I've left it > for at least 23 minutes without it resolving itself. But this is with around > 300 tabs, maybe a dozen pointing to google properties. > > I'm a little confused by which patch is which -- would you like me to try > one? Please try the patch that I just attached here. It isn't surprising that we would find this problem on Google properties, because the Google servers are probably much faster than the OCSP responder.

Steve Fink [:sfink] [:s:]

Comment 45

•

13 years ago

Building now, but I thought I should mention that although it appears to be 100% reproducible (for now), it doesn't happen immediately after starting up. It takes about 3 or 4 minutes before it kicks in. I detect it by running watch -d 'ps -L <firefox-pid>' and watching the 'TIME' column. It'll constantly increase for the first entry (the main thread), but none of the other threads will accumulate very much time until the problem kicks in. At that point, the 5th listed thread (which will always be the firefox pid plus 4, which probably only holds true on Linux) will start accumulating time. It makes it easy to spot. (And to verify, I'll run strace -p <problematic-tid> to see lots of recvfrom() calls.)

Steve Fink [:sfink] [:s:]

Comment 46

•

13 years ago

I ran with the patch this morning. I'm still getting the problem. It's very slightly different -- previously, it was always mainpid+4 that was the thread that spun on recvfrom(); now it is mainpid+7. Which could be a complete red herring. It's going against the same server as yesterday. Just that thread has accumulated 55 minutes of runtime now, so I don't think it's going to resolve itself anytime soon. Anything you'd like me to try? Unfortunately for this, I'm about to drive into work, and I doubt I'll be able to replicate it on the work network. I could try artificially slowing it down -- do you have any ideas about what network characteristics would be relevant? I could use netem to add latency, drop packets, or reorder packets, if any of those seem relevant. (Not that I'm not noticing any dropped or reordered packets from my home network.) It might involve buffer sizes or something rather than simple latency, and my netem-fu is weak. Wow! That's a total lie! I was basing that on the 1e100.net server that I happened to get when resolving www.google.com. If I switch to the actual one I'm going against, which is pz-in-f84.1e100.net, I'm seeing *horrible* network performance -- 64% packet loss, 34ms ping times. (The first google server I tried had 0% packet loss, 13ms pings.) So I hypothesize that packet loss is the relevant trigger, and perhaps this problem shows up when one of google's servers gets messed up. Let me check what it looks like from the MV network... 0% packet loss, 24ms pings. And yes, it's still bad from my local network: 70% loss, 33-50ms rtt. Any idea why I keep getting that server when I restore my session? That hostname does not appear in my sessionstore.js. Nor does the IP (74.125.127.84). But every time I restart the browser, it ends up going to that exact server. I hope you don't need to be coming from my public IP to reproduce this...

Kai Engert (:KaiE:)

Comment 47

•

13 years ago

Comment on attachment 591698 [details] [diff] [review] Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing] I understand this is for early testing of bug 542832, which has not yet a final review of this patch. r=kaie for the early testing in mozilla-central, only

Attachment #591698 - Flags: review?(kaie) → review+

Kai Engert (:KaiE:)

Updated

•

13 years ago

Attachment #591698 - Attachment description: Import ssl_Poll fix from bug 542832 → Import ssl_Poll fix from bug 542832 [NSS patch - mozilla-central testing]

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 48

•

13 years ago

(In reply to Steve Fink [:sfink] from comment #46) > I ran with the patch this morning. I'm still getting the problem. It's very > slightly different -- previously, it was always mainpid+4 that was the > thread that spun on recvfrom(); now it is mainpid+7. Which could be a > complete red herring. > > It's going against the same server as yesterday. Just that thread has > accumulated 55 minutes of runtime now, so I don't think it's going to > resolve itself anytime soon. Sorry, you need to apply the WIP patch above AND the ssl_Poll patch above. I am going to clean the WIP patch up and check both of them into mozilla-central today.

Dietrich Ayala (:dietrich)

Comment 49

•

13 years ago

A bunch of us are seeing this bug (or at least similar characteristics) at the performance work week - possibly the hotel's dodgy connections are highlighting the problem?

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 50

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/fc9db5c75379 https://hg.mozilla.org/integration/mozilla-inbound/rev/b26ac73e4554

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 51

•

13 years ago

Attached patch Part 1: Do not do TLS intolerance timeout during cert validation [v3] — Details — Splinter Review

[Approval Request Comment] Regression caused by (bug #): bug 674147 User impact if declined: User will experience infrequent hangs of the browser when connecting to SSL sites Testing completed (on m-c, etc.): Risk to taking this patch (and alternatives if risky): See Honza's comment above. This seems like the lowest-risk change that we can make to solve the problem. String changes made by this patch: none

Attachment #591018 - Attachment is obsolete: true

Attachment #593043 - Flags: review+

Attachment #593043 - Flags: approval-mozilla-beta?

Attachment #593043 - Flags: approval-mozilla-aurora?

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 52

•

13 years ago

Attached patch Part 2: Import ssl_Poll fix from bug 542832 [v3] — Details — Splinter Review

Kai, you said that this should only land on mozilla-central but mozilla-aurora and mozilla-beta also REALLY need this fix. We should land this on Aurora and Beta ASAP. [Approval Request Comment] Regression caused by (bug #): bug 674147 User impact if declined: Users will sometimes experience long hangs when connecting to SSL sites. Testing completed (on m-c, etc.): Risk to taking this patch (and alternatives if risky): There is a moderate amount of risk to this patch, but the alternative is to back out bug 674147 and everything that depends on it, including SPDY, which would be even more risky as some of the SPDY changes also affect non-SPDY HTTP code. String changes made by this patch: none

Attachment #591698 - Attachment is obsolete: true

Attachment #593044 - Flags: review+

Attachment #593044 - Flags: approval-mozilla-beta?

Attachment #593044 - Flags: approval-mozilla-aurora?

Ed Morley [:emorley]

Comment 53

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/fc9db5c75379 https://hg.mozilla.org/mozilla-central/rev/b26ac73e4554

Status: NEW → RESOLVED

Closed: 13 years ago

status-firefox12: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla12

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 54

•

13 years ago

Comment on attachment 590552 [details] screenshot Got this again today, Gecko/20120201 Firefox/13.0a1

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Updated

•

13 years ago

Priority: -- → P1

Whiteboard: [inbound][psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy]

Alex Keybl [:akeybl]

Comment 55

•

13 years ago

Comment on attachment 593043 [details] [diff] [review] Part 1: Do not do TLS intolerance timeout during cert validation [v3] [Triage Comment] This has now had a couple of days to bake on m-c. Should prevent browser hangs with SSL. Given where we are in the cycle, approving for both Aurora 12 and Beta 11.

Attachment #593043 - Flags: approval-mozilla-beta?

Attachment #593043 - Flags: approval-mozilla-beta+

Attachment #593043 - Flags: approval-mozilla-aurora?

Attachment #593043 - Flags: approval-mozilla-aurora+

Alex Keybl [:akeybl]

Updated

•

13 years ago

Attachment #593044 - Flags: approval-mozilla-beta?

Attachment #593044 - Flags: approval-mozilla-beta+

Attachment #593044 - Flags: approval-mozilla-aurora?

Attachment #593044 - Flags: approval-mozilla-aurora+

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 56

•

13 years ago

This is really horrible with Fosdem network. Has already happened several times this morning.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 57

•

13 years ago

Please try running Firefox with logging enabled and send me the log after it happens again: NSPR_LOG_MODULES=pipnss:5,nsSocketTransport:5,timestamp NSPR_LOG_FILE=log.txt Also, do you have SPDY enabled? If so, after reproducing it once, please disable the SPDY support (network.http.spdy.enabled=false). I think this may be a difference in behavior between Linux and Windows, because I cannot reproduce this on Windows 7 x64 at all after my fix landed. I will try on Linux now.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 58

•

13 years ago

Attached patch Actually abort the connection on send/receive if handshake timed out — Details — Splinter Review

Olli, do you think you have times where you are waiting 25 seconds or more to connect to a website in these situations? If so, then it might be caused by yet another problem with the handshake timeout that I think I just found. Please try this patch and report back. If this patch does NOT solve your problem, then try the following: 1. Disable TLS 1.0 (security.enable_tls=true) and enable SSL 3.0 (security.enable_ssl3=true). 2. Disable OCSP (security.OCSP.enabled=0). And report back the results. (But, try the above patch first). ON timeout, the PSM poll method returns "go ahead, try to send or receive" so that the send/receive methods can return the timeout error. Except, the send/receive methods never actually check for timeout like the old SSL-thread-based ones did. Oops! Still, I am having trouble reproducing the busy loop even without this patch, so I am not sure it is the complete solution.

Attachment #594508 - Flags: feedback?(honzab.moz)

Attachment #594508 - Flags: feedback?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 59

•

13 years ago

(In reply to Brian Smith (:bsmith) from comment #58) > Created attachment 594508 [details] [diff] [review] > Actually abort the connection on send/receive if handshake timed out > > Olli, do you think you have times where you are waiting 25 seconds or more > to connect to a website in these situations? Very possible. I've seen to problem usually when I'm using a bad 3G connection or here @Fosdem where network connection is occasionally slow. I'll test the patch ASAP.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 60

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out I've been running a build with this patch in a bad network, and so far no problems. Yesterday I got the hang several times using a build without the patch.

Attachment #594508 - Flags: feedback?(bugs) → feedback+

Steve Fink [:sfink] [:s:]

Comment 61

•

13 years ago

I'm also still seeing this with the 2012-02-03 nightly, from my home DSL network that has some questionable latency to certain sites. I'm recompiling with the patch now, though I don't know if I'll have time to try it out before I go into the work network tomorrow. Same as smaug, the symptom is a complete hang of any networking, but the UI is responsive. I left it going for about 2 full days and it never made progress on the network connections; it was still spinning on recvfrom() when I killed it. I'm not sure if this is related, but I'm also getting a permanent hang on shutdown. And unfortunately, the last time I killed it, it mangled my sessionstore so I can no longer restore that session. Which may make it harder to reproduce, among other things.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 62

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out Kai, the most important part of this patch is the added line: socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage); That fixes *this* bug, but in a way that still causes TLS intolerance due to handshake timeout to be handled incorrectly. The rest of the patch is about getting TLS intolerance to work correctly in the timeout scenerio.

Attachment #594508 - Flags: review?(kaie)

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 63

•

13 years ago

Here is the tryserver run, still in progress: https://tbpl.mozilla.org/?tree=Try&rev=ca57a65c69e6

Steve Fink [:sfink] [:s:]

Comment 64

•

13 years ago

My browser seems to be much happier with that patch (I managed to restore my session with mv _closedWindows/* windows/). There's another thread spinning on something else, but that appears to be intentional and not network related.

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 65

•

13 years ago

I haven't got any networking hangs after I started to use the latest patch.

Honza Bambas (:mayhemer)

Comment 66

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out Review of attachment 594508 [details] [diff] [review]: ----------------------------------------------------------------- Good, but I would be interested in what is the code path that causes the loop w/o this patch, for reference. Brian, can you please comment on the bug with it? Thanks. ::: security/manager/ssl/src/nsNSSIOLayer.cpp @@ +2121,4 @@ > NS_ASSERTION(in_flags & PR_POLL_EXCEPT, > "caller did not poll for EXCEPT (handshake timeout)"); > *out_flags = in_flags | PR_POLL_EXCEPT; > + socketInfo->SetCanceled(PR_CONNECT_RESET_ERROR, PlainErrorMessage); In genenral, RESET_ERROR can only be used when we haven't sent any application data. If this happens during rehandshake we may duplicate send of a POST, for instance. PR_CONNECT_ABORTED_ERROR should then be used. Doesn't necessarily need to be handled in this bug.

Attachment #594508 - Flags: feedback?(honzab.moz) → feedback+

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 67

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out [Approval Request Comment] Regression caused by (bug #): ? User impact if declined: FF networking hangs Testing completed (on m-c, etc.): Risk to taking this patch (and alternatives if risky): String changes made by this patch:

Attachment #594508 - Flags: approval-mozilla-beta?

Attachment #594508 - Flags: approval-mozilla-aurora?

Kai Engert (:KaiE:)

Comment 68

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out Thanks everyone for your work and testing on this bug. Regarding Honza's warning related to RESET_ERROR, is it correct that we only ever allow a handshake-timeout at the very beginning of a new socket? In other words, we never run through the handshake-timeout code whenever a socket might perform a SSL renegotiation (I believe that would be handled transparently inside libSSL at the NSS level). Under this assumption, if we're still at the initial handshake, then we haven't yet transfered any application data, and setting RESET_ERROR should be fine. Your calling of checkHandshake seems reasonable from what I remember. Given that you have thorougly tested this patch and have gotten positive feedback for a very bad regression, I'm fine with this patch. r=kaie This is also a reminder that we urgently need to get bug 542832 sorted and get the newer NSS landed into mozilla-beta.

Attachment #594508 - Flags: review?(kaie) → review+

Olli Pettay [:smaug][bugs@pettay.fi]

Reporter

Comment 69

•

13 years ago

Any chance to get this landed? I just accidentally run a build without the fix and got the hang immediately when I switched to a bad network.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 70

•

13 years ago

(In reply to Honza Bambas (:mayhemer) from comment #66) > Good, but I would be interested in what is the code path that causes the > loop w/o this patch, for reference. Brian, can you please comment on the > bug with it? Thanks. nsSSLIOLayerPoll hits the HandshakeTimeout condition, and tells the application the socket is readable/writable. The application tries to PR_Write, but the PR_Write ends up doing a read on the network socket, because we are waiting for one of the server's hello messages. But, we don't have that message yet, so that read returns "would block" and so PR_Write returns "would block." Then, the application calls poll again, which again hits the HandshakeTimeout condition, and it all repeats. > In genenral, RESET_ERROR can only be used when we haven't sent any > application data. To expand on what Kai said, we should never be timing out after we've sent application data; there is no positive gain in doing TLS intolerance timeouts during a renegotiation.

Marco Bonardo [:mak]

Comment 71

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/fbf93932695e can this be closed?

Alex Keybl [:akeybl]

Comment 72

•

13 years ago

Comment on attachment 594508 [details] [diff] [review] Actually abort the connection on send/receive if handshake timed out [Triage Comment] Prevents hangs caused by recent changes in SSL handling. Approved for Aurora 12 and Beta 11.

Attachment #594508 - Flags: approval-mozilla-beta?

Attachment #594508 - Flags: approval-mozilla-beta+

Attachment #594508 - Flags: approval-mozilla-aurora?

Attachment #594508 - Flags: approval-mozilla-aurora+

(dormant account)

Updated

•

13 years ago

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1]

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 73

•

13 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/25a023e0faf7 https://hg.mozilla.org/releases/mozilla-beta/rev/871a1595b9c2 https://hg.mozilla.org/releases/mozilla-beta/rev/592ef428da70 https://hg.mozilla.org/releases/mozilla-beta/rev/8dfa3f56e4b4 (Note: mozilla-aurora only needed Part 3, because Part 1 and Part 2 landed on mozilla-central before Aurora branched.)

status-firefox11: --- → fixed

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Assignee

Comment 74

•

13 years ago

(In reply to Marco Bonardo [:mak] from comment #71) > https://hg.mozilla.org/mozilla-central/rev/fbf93932695e > > can this be closed? Yes.

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

u279076

Comment 75

•

13 years ago

What is the STR for this bug so QA can verify it?

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?]

Honza Bambas (:mayhemer)

Comment 76

•

13 years ago

(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #75) > What is the STR for this bug so QA can verify it? I think it has been verified in comment 65.

u279076

Comment 77

•

13 years ago

Hmm...I'm not sure that makes this confirmed verified on Firefox 11 and 12. I'm a little unclear what, if anything QA can do to verify this is fixed across all patched versions.

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa?] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-]

Phil Ringnalda (:philor)

Comment 78

•

13 years ago

It probably hasn't been verified by anyone on m-a or m-b, though. STR: 1. Grab a MiFi (or tethering to a phone would probably work). 2. Head out into the country, you want to get down to just a couple of bars of reception, one would be better. Take your pre-patch and post-patch builds along, you don't want to be downloading them over that connection. 3. Load tbpl.mozilla.org. Wouldn't hurt to open several tabs of tbpl to different busy trees, say mozilla-inbound and try and profiling and maple. 4. Watch the CPU, or just listen for the fan, since you're out in the country and it's quiet so you'll hear it spin up. Once it does, verify that no networking is working, that you can open a tab and try to load some unrelated site, but it won't actually load until you kill the tbpl tabs. Then comes the fun part, doing it again with the post-patch builds, and just sitting around watching the clouds go by and listening to the birds, while waiting to not hear the fan spin up. You're welcome :)

u279076

Comment 79

•

13 years ago

Given those steps, QA will be hard-pressed to verify this fix. I think it would be far more reasonable if Olly could verify this on Firefox 11 and 12 as was done for Firefox 13, presumably, in comment 65.

Honza Bambas (:mayhemer)

Updated

•

13 years ago

Whiteboard: [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-] → [psm-roadblock][patch in attachment 589264 in bug 542832][Snappy:P1][qa-][http-conn]

You need to log in before you can comment on or make changes to this bug.