Closed Bug 754356 Opened 8 years ago Closed 6 years ago

Remove TLS intolerance timeout logic

Categories

(Core :: Security: PSM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla27
Tracking Status
firefox24 --- wontfix
firefox25 --- wontfix
firefox26 - fixed
firefox27 --- fixed
firefox-esr17 --- wontfix
firefox-esr24 --- wontfix
b2g18 --- wontfix
b2g-v1.1hd --- wontfix
b2g-v1.2 --- fixed

People

(Reporter: briansmith, Assigned: briansmith)

References

Details

Attachments

(1 file, 1 obsolete file)

The TLS intolerance timeout is needed for servers that do not respond AT ALL to our client hello. I do not know how common such servers are, or whether the timeout logic is still necessary. That should be determined first.

If it is still necessary, then there is the following problem: We should avoid doing the timeout as soon as we've seen the ServerHello or perhaps any part of it. But, libssl only tells us when we've received the Certificate or Finished messages. It would be better to either have libssl tell us when it has received (some part of the) ServerHello, or for us to detect it in some other way (e.g. we could push an IO layer underneath libssl's IO layer, and keep track of whether we've received any data in that layer, if we don't need to know that we've received a complete ServerHello.)
Honza, I think we should add some probes that measure the effects of this patch on compatibility before we check this in.
Assignee: nobody → bsmith
Attachment #673306 - Flags: review?(honzab.moz)
(In reply to Brian Smith (:bsmith) from comment #1)
> Honza, I think we should add some probes that measure the effects of this
> patch on compatibility before we check this in.

Definitely we have to think before we remove this failsafe code.  I don't think it is going to be that simpler to just drop this code, but probes will do.

Based on that we first need to do some research, I think it is to early do reviews here, right?

Brian, when do you plan to write a telemetry patch?
Comment on attachment 673306 [details] [diff] [review]
Remove timeout logic for TLS intolerance because of too many false positives

Dropping r based on comment 2.

Brian, if you can have a telemetry patch first, then let's get some data before we remove this.

If not, then I'll review the patch - just rerequest.
Attachment #673306 - Flags: review?(honzab.moz)
Summary: Improve or remove TLS intolerance timeout logic → Remove TLS intolerance timeout logic
Target Milestone: --- → mozilla23
Honza, I verified that neither Chrome nor IE are doing this kind of fallback based on timeout, so there should be minimal negative compatibility impact by making this change. And, there should be a positive compatibility impact because we won't ever fall back from TLS 1.0 to SSL 3.0 for servers that properly support TLS 1.0 but which may not support SSL 3.0. Also, various new TLS extensions to improve security require TLS 1.0 or later, as do extensions like the NPN extension that lets us negotiate SPDY, so it is important that we negotiate TLS 1.0 as often as possible.
Attachment #673306 - Attachment is obsolete: true
Attachment #747752 - Flags: review?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #3)
> Brian, if you can have a telemetry patch first, then let's get some data
> before we remove this.

Unfortunately, there is not a good way to do telemetry, because in order for the telemetry to be useful, we have to know whether we *correctly* or *incorrectly* did the timeout-based fallback. However, we can't calculate whether we are doing the correct thing or not in the code.
Comment on attachment 747752 [details] [diff] [review]
Remove timeout-based TLS intolerance fallback logic

Review of attachment 747752 [details] [diff] [review]:
-----------------------------------------------------------------

No way to test this...

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +1156,5 @@
>        PR_SetError(PR_INVALID_STATE_ERROR, 0);
>        return SECFailure;
>    }
>        
> +  socketInfo->SetFullHandshake();

maybe fix whitespaces when here

any particular reason to put this line here and not leave at the old place?

On a failure we call badcert callback.  That could potentially bypass the error (when reimplemented by an extension or from what ever reason).  The flag then may be set wrong.  It should be set always when we get authcert callback, IMO.

::: security/manager/ssl/src/nsNSSIOLayer.h
@@ +56,4 @@
>    void SetHandshakeCompleted(bool aResumedSession);
>  
> +  // Note that this is only valid *during* a handshake; at the end of the handshake,
> +  // it gets reset back to false.

Nice comment, but what this attribute means?  When is it set and to what values?  It's not completely clear just from the name of the member/s/getter.
Attachment #747752 - Flags: review?(honzab.moz) → review+
Comment on attachment 747752 [details] [diff] [review]
Remove timeout-based TLS intolerance fallback logic

Review of attachment 747752 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the review.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +1156,5 @@
>        PR_SetError(PR_INVALID_STATE_ERROR, 0);
>        return SECFailure;
>    }
>        
> +  socketInfo->SetFullHandshake();

> Any particular reason to put this line here and not leave at the old place

The only reason I moved this was to avoid the redundant check for "if (socketInfo)".

::: security/manager/ssl/src/nsNSSIOLayer.h
@@ +56,4 @@
>    void SetHandshakeCompleted(bool aResumedSession);
>  
> +  // Note that this is only valid *during* a handshake; at the end of the handshake,
> +  // it gets reset back to false.

This term "full handshake" is from the TLS specification (e.g. rfc5246). It is the complement of "resumed handshake" and this member is used to indicate whether we resumed a session or started a new one.
The telemetry I added for this shows Firefox 21 using SSL3 3.2% of the time based on saved-session telemetry. I don't have a breakdown of how much of that is due to the timeout. I presume a lot.

Interestingly 22 = 2.8%, 23 and 24 show just ~0.5%. 23 and 24 also have 30x fewer samples and very different user base. Other than statistical skew (which seems plausible) does anyone have any suggestion on why those channels perform better?

If you use the idle-ping telemetry (which is thinner) the numbers are a bit worse across the board.
Assignee: brian → nobody
Depends on: 839310
Depends on: 777354
No longer depends on: 883739
The failure on test_update.js is still there. The tests seem to run fine, but the attempt to stop the httpd.js HTTP server doesn't result in the server actually getting stopped, and so do_test_finished does not get called.

I suspect that bug 777354 is the culprit. I will comment in bug 777354 too.
I'm not sure I understand why you think this should be removed.  There are known servers that stop talking after sending a too big ClientHello.  TLS 1.2 has more ciphers and is known to have this as effect.  The only way to deal with them is using a timeout.
(In reply to Kurt Roeckx from comment #11)
> I'm not sure I understand why you think this should be removed.  There are
> known servers that stop talking after sending a too big ClientHello.  TLS
> 1.2 has more ciphers and is known to have this as effect.  The only way to
> deal with them is using a timeout.

We have found, even on our own properties (*.mozilla.org) that the timeout logic causes too many false positives. So, I'd like to find alternatives to the timeout logic if at all possible. I believe we can solve the problem by limiting the size of our client hello, at least for now. I have some ideas for how to do this.

I will share a proposal on dev-tech-crypto about what cipher suites we will support going forward; it will be a net reduction in cipher suites supported compared to what we support now. Please subscribe to that mailing list if you are interested in participating in that conversation.
https://hg.mozilla.org/mozilla-central/rev/bb4b199648f3
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
What version will this be in? Did it make 27?
(In reply to Florian Bender from comment #15)
> What version will this be in? Did it make 27?

It is currently in 27 and I am hoping to uplift it to 26 after it has been on m-c for a few more days.
See Also: → 901718
Target Milestone: mozilla23 → mozilla27
Note that 27 is branching tomorrow, at which point 26 will become Beta (with corresponding stricter uplift criteria).
We wouldn't block shipping the release on this, as we've left it in the last two (at least), but please do nominate for uplift consideration and we can base uplift on the risk/reward assessment.
Brian: Are you going to ever get around to nominating this for Firefox 26, or are you going to keep sitting around doing nothing all day?
Flags: needinfo?(brian)
Comment on attachment 747752 [details] [diff] [review]
Remove timeout-based TLS intolerance fallback logic

[Approval Request Comment]
Bug caused by (feature/regressing bug #): none.

User impact if declined: We will continue to fall back to SSL 3.0 from more secure versions of TLS unnecessarily about ~1-2% of the time. At least in theory this makes SSL less secure for these users when it happens because restricts us to the lower end of the cipher suites we can use when the error condition occurs. It also means that we risk compatibility with servers that require support for the Server Name Indication (SNI) extension, which hurts compatibility. It also means we can't negotiate SPDY when this happens, which hurts performance in these cases.

Testing completed (on m-c, etc.): 
Been on mozilla-central for a while without any bug reports (AFAICT) and also I tested this manually. (I could have written a test for this but it would be testing that we don't implement a removed feature, which seems silly. Also, the test would take at least 30 seconds to run, which would slow down the test suite unnecessarily.)

Risk to taking this patch (and alternatives if risky): The main risk is that there may still be servers that require the timeout logic. However, we are the only browser that implements the timeout logic now (others have long dropped it), so it is likely this is buying us approximately 0% compatibility and ~1-2% pain. The risk that there is a bug in this patch is very low since it is primarily deleting code.

String or IDL/UUID changes made by this patch: none.
Attachment #747752 - Flags: approval-mozilla-beta?
Attachment #747752 - Flags: approval-mozilla-aurora?
Clearing the NEEDINFO I set on myself.
Flags: needinfo?(brian)
Attachment #747752 - Flags: approval-mozilla-beta?
Attachment #747752 - Flags: approval-mozilla-beta+
Attachment #747752 - Flags: approval-mozilla-aurora?
Attachment #747752 - Flags: approval-mozilla-aurora+
Ryan, this patch requires the patch for bug 777354 to be uplifted too. The patch for bug 777354 is test-only so I don't think it requires approval. Can you land it and then reland this? If not, I can do so. Thanks.
Flags: needinfo?(brian) → needinfo?(ryanvm)
Sure thing!
No longer depends on: 936818
Depends on: 937459
FYI, I read in another bug that this was added for Novell iChain, but then IE7 refused to add this fallback and Novell quickly issued an update.
You need to log in before you can comment on or make changes to this bug.