Closed Bug 340359 Opened 15 years ago Closed 14 years ago

SSL Server stalls on v3 hello using TLS hello extensions

Categories

(Core :: Security: PSM, defect)

1.8 Branch
x86
All
defect
Not set
major

RESOLVED FIXED
mozilla1.8.1

People

(Reporter: sgautherie, Assigned: KaiE)

Details

(Keywords: fixed1.8.1, regression, Whiteboard: [at risk][drivers: see comment 38])

Attachments

(2 files)

MOZILLA_1_8_BRANCH has a regression, which adds a long delay before the page starts to load.

(I'll file a separate bug about the certificate selection...)

[Microsoft Internet Explorer, version 6.0.2800.1106 (128b, SP1 + Q889669)] (W98SE)

Immediate:
Authentication dialog (to select a certificate).

Then <https://cfspart.impots.gouv.fr/portal/ICSLogin/?"https://cfspart.impots.gouv.fr/portal/dgi/public/perso?pageId=pna2par&sfid=30">.

This seems 100% fine.


[Netscape® Communicator 4.8 : en-20020722] (W98SE)

Immediate:
'No User Certificate' dialog.

Then <https://cfspart.impots.gouv.fr/portal/ICSLogin/?"https://cfspart.impots.gouv.fr/portal/dgi/public/perso?pageId=pna2par&sfid=30">.

This seems 100% fine.


[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7.13) Gecko/20060414] (release) (W98SE)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.0.4) Gecko/20060516 SeaMonkey/1.0.2] (release) (W98SE)

Immediate,
but the most recent of my 3 certificates is automatically selected.

This is 50% fine: it seems that I should be asked to select one of my certificates.


[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060603 SeaMonkey/1.1a] (nightly) (W98SE)

Long delay (like 2 minutes)...

Then the most recent of my 3 certificates is automatically selected.

This is 0% fine: There should be no delay, and it seems that I should be asked to select one of my certificates.


[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.0.4) Gecko/20060604 Firefox/1.5.0.4] (nightly) (W98SE)

Immediate...

Then <https://cfspart.impots.gouv.fr/portal/ICSLogin/?"https://cfspart.impots.gouv.fr/portal/dgi/public/perso?pageId=pna2par&sfid=30">.

This seems 100% fine.


[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060604 BonEcho/2.0a3] (nightly) (W98SE)

Long delay (like 2 minutes)...

Then <https://cfspart.impots.gouv.fr/portal/ICSLogin/?"https://cfspart.impots.gouv.fr/portal/dgi/public/perso?pageId=pna2par&sfid=30">.

This is 50% fine: There should be no delay.
(In reply to comment #0)
> (I'll file a separate bug about the certificate selection...)

I filed bug 340360.
Version: Trunk → 1.8 Branch
The delay seems to be "exactly" 2 minutes.

I have now tested with the 'Ask Every Time' option, and the delay happens before the certificate selection dialog appears.
Summary: Long delay before HTTPS/SSL page start to load, on a site which uses certificate authentication → Long (2 mn) delay before HTTPS/SSL page start to load, on a site which uses certificate authentication
When did this regress? Could there be some relationship to bug 336944 ?
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8) Gecko/20060502 SeaMonkey/1.1a] (nightly) (W98SE)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a2) Gecko/20060510 SeaMonkey/1.1a] (nightly) (W98SE)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a2) Gecko/20060514 SeaMonkey/1.1a] (nightly) (W98SE)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a2) Gecko/20060515 SeaMonkey/1.1a] (nightly) (W98SE)

Regressed between '2006-05-15-00' and '2006-05-16-14'.
"Security"-only list:
<http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=mozilla%2Fsecurity&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2006-05-15+00&maxdate=2006-05-16+15&cvsroot=%2Fcvsroot>
Full list:
<http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2006-05-15+00&maxdate=2006-05-16+15&cvsroot=%2Fcvsroot>

[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a2) Gecko/20060516 SeaMonkey/1.1a] (nightly) (W98SE)
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a2) Gecko/20060518 SeaMonkey/1.1a] (nightly) (W98SE)

*****

SeaMonkey CPU usage stays at 2-5%;
OCSP and Proxy settings make no difference.
It seems this problem can be induced by rapidly attempting to open multiple bug pages or multiple queries, like from a buglist, opening two or more in new tabs quickly with a middle click each.

I can only guess the following may be related: On https://launchpad.net/ bug tracker each time I try to do a page load or submit, on the first try I get a modal error window, and have to retry once in order to succeed. Next time it happens I'll capture the exact error message. The last URL I updated there was: https://launchpad.net/distros/ubuntu/+bug/48310

Currently using: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.8.1a3) Gecko/20060609 SeaMonkey/1.1a
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060630 SeaMonkey/1.1a] (nightly) (W98SE) [2006-06-30-05-mozilla1.8]

(Bug still present.)
OS: Windows 98 → All
this should be the same as bug 336944, fixed yesterday on 1.8 branch.

could you please try 2006-07-01-01-mozilla1.8, or preferably a later build (to ensure the build really has the fix)



*** This bug has been marked as a duplicate of 336944 ***
Status: NEW → RESOLVED
Closed: 15 years ago
Component: Networking → Security: PSM
Resolution: --- → DUPLICATE
Yes, this is why I posted comment 6 (using the nightly before the fix).

[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060701 SeaMonkey/1.1a] (nightly) (W98SE) [2006-07-01-01-mozilla1.8]

The fix should be in the new nightly, since the build before it.
<http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8-SeaMonkey&hours=24&maxdate=1151752828&legend=0>
<http://tinderbox.mozilla.org/bonsai/cvsquery.cgi?module=MozillaTinderboxAll&branch=MOZILLA_1_8_BRANCH&date=explicit&mindate=1151721780&maxdate=1151731859>

Yet, this bug is still there.
After reading comment 3, I thought the current bug might be different as I was not seeing the 100% CPU usage.

I'll try again with tomorrow's nightly...

PS: If I cancel the dialog, a new attempt to load the page seems to immediately remember that and does not ask again.
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060702 SeaMonkey/1.1a] (nightly) (W98SE) [2006-07-02-02-mozilla1.8]

I confirm my comment 8: this bug is still present.

Reopening !
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1a3) Gecko/20060703 SeaMonkey/1.1a] (nightly) (W98SE)

No better (with unrelated Bug 331977 and Bug 343230 fixes).
Flags: blocking1.8.1?
Flags: blocking1.8.1? → blocking1.8.1+
Target Milestone: --- → mozilla1.8.1beta1
-> kai
Assignee: nobody → kengert
Status: REOPENED → NEW
Target Milestone: mozilla1.8.1beta1 → mozilla1.8.1beta2
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b1) Gecko/20060728 SeaMonkey/1.1a] (nightly) (W98SE)

(Bug still there.)
I see this each time I try to open https://bugzilla.novell.com/show_bug.cgi?id=194715 on OS/2 and Linux. 2 minutes is the delay before connecting changes to connected on the statusbar. 2 more minutes pass before the statusbar shows done.
(In reply to comment #13)
> I see this each time I try to open
> https://bugzilla.novell.com/show_bug.cgi?id=194715 on OS/2 and Linux. 2 minutes
> is the delay before connecting changes to connected on the statusbar. 2 more
> minutes pass before the statusbar shows done.
> 

I see exactly the same thing on OS X 10.4 with 1.8.1 and trunk. After both delays, an nsHostResolver thread spins up.
Kai - do you have any cycles to look into this?
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b1) Gecko/20060815 SeaMonkey/1.1a] (nightly) (W98SE)

(Bug still there.)
Whiteboard: [at risk]
I will look at this today or tomorrow.
A quick reproduction of comment #0 with a sniffer shows that we send a packet of TLS1.0 ClientHello data immediately after connection, it gets ACK'd by the server, and then nothing happens for ~2 minutes -- at which point the server closes the connection with a FIN. We then reconnect, send a SSL3.0 ClientHello, and get the error page.

Do victims of bug 307271 behave this way?
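The sequence seen with the sniffer (a TLS 1.0 ClientHello, no answer, then a retry with SSL 3.0) can be pictured with a small simulation. This is only a sketch of the retry order: `connect_with_fallback`, `stalling_server`, and the protocol labels are hypothetical names made up for illustration, not NSS or PSM code, and the sketch says nothing about what the server returns once the retry gets through.

```python
def connect_with_fallback(try_handshake, protocols=("TLS1.0", "SSL3.0")):
    """Try each protocol, newest first; move on when a handshake times out.

    try_handshake is a caller-supplied function (hypothetical) that raises
    TimeoutError when the server never answers the ClientHello.
    Returns (protocol_that_got_an_answer_or_None, list_of_attempts).
    """
    attempts = []
    for protocol in protocols:
        try:
            try_handshake(protocol)
        except TimeoutError:
            # The server went silent: treat it as intolerant of this hello.
            attempts.append((protocol, "timeout"))
            continue
        attempts.append((protocol, "answered"))
        return protocol, attempts
    return None, attempts


def stalling_server(protocol):
    """Stand-in for a server that never replies to a TLS hello (assumption)."""
    if protocol == "TLS1.0":
        raise TimeoutError("no ServerHello before the server closed the connection")


chosen, attempts = connect_with_fallback(stalling_server)
print(chosen, attempts)
```

The cost of the first attempt is whatever timeout applies to it, which is why the length of that timeout matters so much for the user.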
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b1) Gecko/20060816 SeaMonkey/1.1a] (nightly) (W98SE)

The issue seems to be with TLSv1 rather than with SSLv2:
[
security.enable_ssl2 only:
Immediate, "SeaMonkey can't connect securely to {site} because the SSL protocol has been disabled."

security.enable_ssl3 only:
Connects in 1-2 seconds.

security.enable_tls only:
2 mn delay, "SeaMonkey can't connect securely to {site} because the SSL protocol has been disabled."
]
[Microsoft Internet Explorer, version 6.0.2800.1106 (128b, SP1 + Q889669)] (W98SE)

SSLv2 only: immediate, error = "can't connect at all". [expected]
SSLv3 only, or SSLv3+TLSv1: immediate, connects. [I guess MsIE tries SSLv3 first ?]
TLSv1: 5mn delay, error = "can't connect at all".

From my user point of view:
*It seems that this site doesn't support TLSv1 :-(
*But NSS seems to have regressed at handling this situation :-((
Extract from
<http://toolbar.netcraft.com/site_report?url=https://cfspart.impots.gouv.fr>
[
Cipher	DES-CBC3-SHA
SSL version	SSLv3
Public key length	2048
Server	unknown
Signature algorithm	md5WithRSAEncryption
]
Kai, we are holding the beta for blocking bugs. Can you let us know if:

 a) you have an idea of how to fix this and when such a fix would be ready?
 b) the fix you have in mind would be safe for a final Release Candidate as opposed to a Beta (ie: zero-risk of regression)
I am now able to reproduce this bug.
It appears when SSL v2 is not enabled (our new default) and TLS v1 is enabled (our default).

Our SSL engine tries the newest protocols first, which is TLS v1.
I confirm that when TLS v1 is enabled, we stall for a long time on both of these URLs:
  https://bugzilla.novell.com
  https://cfspart.impots.gouv.fr

When TLS v1 is disabled, with only SSL v3 left enabled, it works fine.

I confirm, using other browsers, e.g. konqueror on Linux, it works fine with SSL v2 disabled, v3 + TLS enabled.

I will use the ssltap tool next to see what is going on at the SSL protocol level.
Here is what I see, which confirms what Justin said in comment 18.

The Mozilla code starts by sending a SSL handshake message, which contains a ClientHelloV3 that lists the ciphers supported by the client. In addition, the ClientHelloV3 includes "extensions".

When comparing this to the connection of a working client to the server, I see that konqueror also sends a ClientHelloV3; however, it does not include any extensions.

I suspect the servers listed in the previous comment get confused?

Nelson, is there anything we can do at the SSL handshake level to keep the server from stalling?

(The NSS snapshot currently used by Mozilla trunk and 1.8 branch is NSS_3_11_20060731_TAG)
Summary: Long (2 mn) delay before HTTPS/SSL page start to load, on a site which uses certificate authentication → SSL Server stalls on v3 hello using TLS extensions
Attached file ssltap output
Behold: a new form of TLS intolerance.  These are seriously broken servers.

This is precisely why I ensured that NSS would not send TLS hello extensions 
in SSL3 client hellos (even though these extensions are also explicitly 
permitted by the SSL3 spec).  I knew that our retry logic, which works around 
TLS intolerant servers, would (eventually) succeed with these servers, despite their TLS intolerance.

Shortening the timeout on the handshake is the only thing I can suggest.
Summary: SSL Server stalls on v3 hello using TLS extensions → SSL Server stalls on v3 hello using TLS hello extensions
(In reply to comment #26)
> Behold: a new form of TLS intolerance.  These are seriously broken servers.

But I guess we need to find a workaround on the client side.


> I knew that our retry logic, which works around 
> TLS intolerant servers, would (eventually) succeed with these servers, despite
> their TLS intolerance.

Comment 13 suggests the timeout occurs multiple times.
However, I just tested myself, and as soon as I hit the first 2 minutes timeout, we automatically assume a TLS intolerant server, fall back to SSL v3, and the connection succeeds.


> Shortening the timeout on the handshake is the only thing I can suggest.

This sounds tricky to do.
I believe the timeout is specified within the netwerk code.

Darin, would the following make sense?
- whenever Necko tries to start a SSL connection, it specifies a small timeout value for the initial timeout.
- as soon as PSM detects the handshake is moving on, PSM calls a function to set the timeout back to the original value.
(In reply to comment #27)
> Comment 13 suggests the timeout occurs multiple times.
> However, I just tested myself, and as soon as I hit the first 2 minutes
> timeout, we automatically assume a TLS intolerant server, fall back to SSL v3,
> and the connection succeeds.

I believe we see the 2-minute delay twice on the novell server because two different hosts are involved in displaying the page, and both behave in the same incompatible way.
Use a short timeout to send the request.  Once the request is sent, use a 
longer timeout to wait for the response.  

Two Novell servers does not constitute a crisis for mozilla, IMO.
Oh, and when Vista comes out, the crisis will be on the non-compliant servers,
not on mozilla.
(In reply to comment #29)
> Use a short timeout to send the request.  Once the request is sent, use a 
> longer timeout to wait for the response.  

This sounds like a good idea for https, but we probably need a general purpose solution that works with all application protocols, including mail protocols and those, where the server speaks first.


(In reply to comment #30)
> Oh, and when Vista comes out, the crisis will be on the non-compliant servers,
> not on mozilla.

Could somebody with a Beta of Vista try what happens when you use IE 7 to connect to the sites listed in comment 23? Do you get a long delay, too?
FWIW, I sent a message about <https://cfspart.impots.gouv.fr>,
both on the (main) site and by direct email (addresses taken from <http://www.afnic.fr/outils/whois/impots.gouv.fr>):
I hope they'll respond...
> Darin, would the following make sense?
> - whenever Necko tries to start a SSL connection, it specifies a small timeout
> value for the initial timeout.
> - as soon as PSM detects the handshake is moving on, PSM calls a function to
> set the timeout back to the original value.

Can this be done by PSM without participation from Necko?
> Two Novell servers does not constitute a crisis for mozilla, IMO.

Perhaps Novell would even be responsive to fixing their servers since they do ship Mozilla as part of their desktop product.
> Perhaps Novell would even be responsive to fixing their servers since they do
> ship Mozilla as part of their desktop product.

Never mind.  I'm sure they don't have admin access to the other domains ;-)
Tried IE7 in XP and both are fine.  Is the behavior different for IE7 here for vista vs xp?

(In reply to comment #31)
> (In reply to comment #29)
> > Use a short timeout to send the request.  Once the request is sent, use a 
> > longer timeout to wait for the response.  
> 
> This sounds like a good idea for https, but we probably need a general purpose
> solution that works with all application protocols, including mail protocols
> and those, where the server speaks first.
> 
> 
> (In reply to comment #30)
> > Oh, and when Vista comes out, the crisis will be on the non-compliant servers,
> > not on mozilla.
> 
> Could somebody with a Beta of Vista try what happens when you use IE 7 to
> connect to the sites listed in comment 23? Do you get a long delay, too?
> 

yes.  Vista != IE7.
Both of these sites are Novell iChain 2.3 servers. This could mean it is more practical for Novell iChain sites to upgrade their servers than for us to hurt security or experience (timeout tweaks) by working around the error.
Whiteboard: [at risk] → [at risk][drivers: see comment 38]
(In reply to comment #31)
> > Use a short timeout to send the request.  Once the request is sent, use a 
> > longer timeout to wait for the response.  
> 
> This sounds like a good idea for https, but we probably need a general purpose
> solution that works with all application protocols, including mail protocols
> and those, where the server speaks first.

I used to think that way, but now I think that browsing (which is always 
client speaks first) is fundamentally different from mail/news/ldap (where the 
server generally speaks first), in that accounts for mail/news/ldap are always 
preconfigured by the user, whereas browsing often involves contact with 
previously unknown (and unheard-of) servers.  So, we need a highly adaptive
algorithm for browsing, but can use a more static preconfigured one for mail,
news and ldap.  So I think it's OK to configure mail/news/ldap servers with 
knowledge about whether they should attempt to use TLS, or not, with their 
configured servers.  And for browsers, which (fortunately) are client-speaks-first, we can do the initial write (which does the handshake) 
with a reasonably short timeout, and do the subsequent read of the response 
with a longer one.  I think that will solve the problem for the servers cited
herein.
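The two-phase timeout described above (short for the handshake, longer for the response) could look roughly like this with Python's standard `ssl` module. This is a sketch under assumptions, not what NSS, PSM, or Necko do: the two timeout values and the helper names `handshake_with_short_timeout` and `open_https` are invented for illustration.

```python
import socket
import ssl

HANDSHAKE_TIMEOUT = 10.0   # seconds; illustrative value only
RESPONSE_TIMEOUT = 120.0   # seconds; illustrative value only


def handshake_with_short_timeout(tls_sock,
                                 handshake_timeout=HANDSHAKE_TIMEOUT,
                                 response_timeout=RESPONSE_TIMEOUT):
    """Run the handshake under a short timeout, then relax it for reads."""
    tls_sock.settimeout(handshake_timeout)
    # Raises a timeout error if the server stalls on the ClientHello.
    tls_sock.do_handshake()
    # Handshake finished: later reads may wait longer for the response.
    tls_sock.settimeout(response_timeout)
    return tls_sock


def open_https(host, port=443):
    """Connect and handshake; the caller then sends the request and reads."""
    raw = socket.create_connection((host, port), timeout=HANDSHAKE_TIMEOUT)
    context = ssl.create_default_context()
    # Defer the handshake so it can be run under its own timeout.
    tls_sock = context.wrap_socket(raw, server_hostname=host,
                                   do_handshake_on_connect=False)
    try:
        return handshake_with_short_timeout(tls_sock)
    except Exception:
        tls_sock.close()
        raise
```

A caller that catches the timeout from `open_https` would be the place to retry with an older protocol, as the fallback logic discussed in this bug does.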
Reporter/Outsider's questions (from the discussion):

While this (server) bug can be painful, it can be lived with (in my case at least): the main pitfall is that 2 mn leaves plenty of time for the user to simply give up.

If it could be expected that this is the "only" server type that is affected, and that it will hopefully be fixed, we may simply wait for that.

When the Tls1WithExtensions initial attempt fails, could a Tls1WithoutExtensions attempt be made (if worth it), or is that just what the Ssl3 attempt does ?
Attached patch Patch v1
This patch seems to work for me on Linux. It reduces the timeout down to 10-15 seconds, depending on how often Necko calls poll.

Could somebody please test this patch on Mac or Windows?
Expected behaviour is:
- without patch: two minute delay until something happens
- with patch: 10-15 seconds delay
Attachment #234441 - Flags: superreview?(darin)
Attachment #234441 - Flags: review?(rrelyea)
Description of the patch, taken from a comment inside:

+  // Additional comment added in August 2006:
+  // When we began to use TLS hello extensions, we encountered a new class of
+  // broken servers, which simply stall for a very long time.
+  // We would like to shorten the timeout, but limit this shorter timeout 
+  // to the handshake phase.
+  // When we arrive here for the first time (for a given socket),
+  // we know the connection is established, and the application code
+  // tried the first read or write. This triggers the beginning of the
+  // SSL handshake phase at the SSL FD level.
+  // We'll make a note of the current time,
+  // and use this to measure the elapsed time since handshake begin.
+
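The bookkeeping that the patch comment describes (note the time of the first read or write on a socket, then measure elapsed time while the handshake is still pending) can be sketched as follows. This is not the patch itself: the class name `HandshakeTimer`, its methods, and the 10-second limit are assumptions made for illustration, and the clock is injectable only so the behavior can be checked without waiting.

```python
import time


class HandshakeTimer:
    """Tracks how long a socket has spent in the SSL handshake phase."""

    def __init__(self, limit_seconds=10.0, clock=time.monotonic):
        self._limit = limit_seconds      # illustrative limit, not from the patch
        self._clock = clock
        self._started_at = None          # set on the first read or write
        self._handshake_done = False

    def on_first_io(self):
        """Call on every read/write; only the first call records the time."""
        if self._started_at is None:
            self._started_at = self._clock()

    def on_handshake_complete(self):
        """After this, the short handshake limit no longer applies."""
        self._handshake_done = True

    def timed_out(self):
        """Intended to be checked from the poll loop while waiting."""
        if self._handshake_done or self._started_at is None:
            return False
        return self._clock() - self._started_at > self._limit
```

Because the check only runs when the socket is polled, the effective delay is the limit plus however long it takes until the next poll, which fits the "10-15 seconds, depending on how often Necko calls poll" observation.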
Comment on attachment 234441
Patch v1

Darin, are you able to help with this review? (Bob is away)
Attachment #234441 - Flags: review?(rrelyea)
Moving target milestone to 1.8.1Final as we'd want to get this into the trunk for some good baking before taking on the 1.8 branch.
Target Milestone: mozilla1.8.1beta2 → mozilla1.8.1
Attachment #234441 - Flags: superreview?(darin) → superreview+
Note that I discussed this with Bob R, and he agreed to introduce a reasonable delay.

fixed on trunk
Status: NEW → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → FIXED
Comment on attachment 234441
Patch v1

This has been baking for a while. Nominating.
Attachment #234441 - Flags: approval1.8.1?
[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b2) Gecko/20060906 SeaMonkey/1.1b] (nightly) (W98SE)

(Bug still there.)
Comment on attachment 234441
Patch v1

a=beltzner on behalf of 181drivers

Robert/Kai, can we get you to test this on branch (and test a bunch of other SSL/TLS sites as well?) after it lands?
Attachment #234441 - Flags: approval1.8.1? → approval1.8.1+
(In reply to comment #48)
> (From update of attachment 234441)
> a=beltzner on behalf of 181drivers

Thanks for approving.


> Robert/Kai, can we get you to test this on branch (and test a bunch of other
> SSL/TLS sites as well?) after it lands?

I'll do some tests with tomorrow's nightly build and post an update here.

Keywords: fixed1.8.1
Tonight's Linux build 20060908 from ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.8/ works fine for me.

I tested the 2 sites mentioned in comment 23, and after a ~10 second delay each site loads. I also used https://mail.google.com/mail/ and an online banking site successfully, concurrently.

I'll do another test on Windows.
The Windows build gives me the same behaviour.
(re)Testing with <https://cfspart.impots.gouv.fr/>.

[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1b1) Gecko/20060801 SeaMonkey/1.1a] (nightly) (W98SE)

Proxy: Direct, or Manual (+ "Pass-Thru").
2 mn delay: that was this bug.

[Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.1) Gecko/20061008 SeaMonkey/1.1b] (nightly) (W98SE)

Proxy: Direct.
20 s "only" delay: (still a little long but) much better, as expected. (My computer is "slow".)

Proxy: Manual.
+/- 2 mn 30 s delay, and getting a SeaMonkey Page Load Error
[
Connection Interrupted
The document contains no data.
The network link was interrupted while negotiating a connection. Please try again.
]
My client proxy is
[Proxomitron v. Naoko 4.5 (2003-6-1), + _OpenSSL 0.9.8 05 Jul 2005_] (W98SE)
in "Pass-Thru" mode.


If I switch my proxy to OpenSSL mode, the connection is "immediate", with the two builds.
I guess the Proxy uses SSLv3 only, not TLSv1, (at least with the Web site).

***

Could something be done to "un-regress" my "Proxy + Pass-Thru" case ?

(Sorry for not checking this before.)
Other than this site and new case, I haven't noticed any trouble since this patch was checked in.
(In reply to comment #29)
 
> Two Novell servers does not constitute a crisis for mozilla, IMO.

Touché! :-)

Novell's public-facing iChain servers are now fixed. Visiting https://bugzilla.novell.com (for example) no longer triggers this timeout.