Closed Bug 368126 Opened 18 years ago Closed 17 years ago

client abandons SSL connection during bad cert dialogs

Categories

(Core :: Security: PSM, defect)

1.8 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: nelson, Assigned: KaiE)

References

(Blocks 1 open bug, )

Details

(Keywords: verified1.8.1.4)

Attachments

(3 files)

www.openssl.org has a self-signed cert.  When you visit it, the usual
cert dialogs pop-up.  Unless you dismiss (OK) ALL those dialogs within
about 5 seconds, the browser drops the connection to the server, AND 
decides that the server is TLS-intolerant.  ssltap clearly shows that 
the client, not the server, drops the connection first.  

That's just wrong, for two reasons:

1) The connection timeout timer should not be running while the user is 
seeing bad cert dialogs and is deciding what to do about them.  

2) any error that occurs while the user is looking at bad cert dialogs
does NOT indicate TLS-intolerance.  

This issue MAY (or may not) be related to bug 365898.
Nelson, also see comment that I just added to bug 365898.

Regarding your question, whether this bug might be related to 365898, I just asked the reporters.

Besides that, I have to agree with your statement.
The timeout should not count while showing the UI prompt.
Attached patch Patch v1Splinter Review
This patch fixes both mentioned issues for me. In detail the changes are:

- Handshake timeout will be prevented while bad cert is shown.

- Handshake start time will be reset at the time the bad cert UI goes away.

- Handshake timeout increased to 25 seconds, in order to help with bad servers, as requested in bug 365898.

- Do not use a timeout when we have already disabled TLS because of detected intolerance.

Patch applies with both trunk and 1.8 branch for Firefox 2.0.0.x.
I did successful testing with the branch.
Attachment #253712 - Flags: review?(nelson)
Attachment #253712 - Flags: superreview?(rrelyea)
Blocks: 365898
Blocks: 367801
Comment on attachment 253712 [details] [diff] [review]
Patch v1

r+ with the caveat that you get agreement that 25 seconds is a reasonable timeout.

the rest of the patch is fine.
Attachment #253712 - Flags: superreview?(rrelyea) → superreview+
Given the fact that Nelson was the only one who was concerned about a timeout that long - see bug 365898 comment 6 - I'm proposing I land this patch on the trunk of Mozilla, so we can get some testing.

I will make sure we will get an agreement before landing on any stable branches.
I checked this patch in to the trunk.

I'm not marking this patch as fixed yet, because of the pending timeout discussion.
Comment on attachment 253712 [details] [diff] [review]
Patch v1

Despite my misgivings about the number 25, I'm giving r+ to this patch
because I think it solves the problem of timing out and dropping a 
connection while cert dialogs are up.  

I'm also concerned about another issue related to bad cert dialogs.
This is the second point in comment 0.

Suppose that (with this patch in place) you visit a server and get a 
bad cert dialog, and while you are contemplating that dialog, the 
server times you out and drops the TCP connection.  Then you choose
to override the cert error, and click OK (or continue, or whatever).
Then what happens?  Does the connection get reattempted immediately?
Does the user need to take some action to re-launch the request?
and (here's the point): will PSM have marked the server as TLS-intolerant
because the connection was lost while in the cert dialogs?  

If the answer is: no, PSM will not mark the server as TLS-intolerant
if the server drops the connection during cert dialogs, then all is well.
Otherwise, should I file a separate bug on that?  Or do you want to try
to fix that in a new patch for this bug, too?
Attachment #253712 - Flags: review?(nelson) → review+
(In reply to comment #6)
> Despite my misgivings about the number 25, I'm giving r+ to this patch
> because I think it solves the problem of timing out and dropping a 
> connection while cert dialogs are up.  

Thanks a lot. And maybe my comments in bug 365898 comment 26, 27, 28 can convince you that the timeout is not that bad as you might think, because we only run into in whenever the server stalls, but not on other TLS intolerance error codes that we will continue to handle immediately.


> I'm also concerned about another issue related to bad cert dialogs.
> This is the second point in comment 0.
> 
> Suppose that (with this patch in place) you visit a server and get a 
> bad cert dialog, and while you are contemplating that dialog, the 
> server times you out and drops the TCP connection.  Then you choose
> to override the cert error, and click OK (or continue, or whatever).
> Then what happens?  Does the connection get reattempted immediately?

This depends on the error code we get and also on the code above PSM's I/O layer.

PSM will detect the closed connection, and it will look at the error code returned by libssl.

If the error code is in the list of codes we treat as TLS intolerance, then yes,
- PSM will treat the server as intolerant on later connections
- PSM will set PR_CONNECT_RESET_ERROR to hint the upper layer to retry


While trying to answer your questions, reading the existing code, and talking to Wan-Teh about possible error codes on an early connection close, I probably find that our handling is incomplete.

So thanks Nelson for asking these questions.

A return value 0 from PR_Read/PR_Recv means a socket close.
Our existing code that tries to detect TLS intolerance never detects that case.

We possibly run into this scenario with mail protocols where the client reads first.

As PSM's TLS intolerance detection does not look for zero-returned-by-read, we do not signal a connect_reset, and this might explain why there is no automatic retry.

I have experienced scenarios like this myself, when the mail application seemed to be unable to fetch mail on the first connection attempt. And we might actually have an bug open about this, IIRC.

So Nelson, your question might have helped finding the cause of that other bug. Let's hope.


> Does the user need to take some action to re-launch the request?

It is the individual decision of any application code that calls into PSM's I/O functions whether it retries automatically by starting a new connection. But as we learn by my previous comment, PSM might fail to signal a retry is necessary in some scenarios.


> and (here's the point): will PSM have marked the server as TLS-intolerant
> because the connection was lost while in the cert dialogs?  

After reading the code and doing some testing: Yes, this can happen. This is because we treat PR_CONNECT_RESET / END_OF_FILE as potential TLS intolerance errors, and we might get these error after a server timeout.


> If the answer is: no, PSM will not mark the server as TLS-intolerant
> if the server drops the connection during cert dialogs, then all is well.

So unfortunately, there are cases where not all is well, and we should fix it.


> Otherwise, should I file a separate bug on that?  Or do you want to try
> to fix that in a new patch for this bug, too?

I think it's OK to continue to use this bug.
During the last hours I have worked on a new patch.
I'm happy with the code now, and I will attach it now, but it needs careful review.

Here is an incremental patch after patch v1.

In patch v1 I had introduced a boolean flag to track whether we are currently showing bad cert UI. I changed that to a three-state enum, so we know whether cert UI was shown on this socket. We'll need it for deciding later, whether the simle socket-closed errors should be treated as a TLS intolerance (as we did in the past), or whether they might be just the consequence of having shown bad cert UI and having blocked the socket too long. No TLS intolerance decision in the latter case, but a simple retry only.


The new logic needs to be added to function "checkHandshake", which is what this function does. The function gets called after each SSL read/write, it decides whether the combination of
- return value
- error code
- current phase of SSL connection
should result in a TLS intolerance assumption and all required actions around it.

As I'd like to handle the additional case "read returned zero", I've added a param to signal whether we were reading or writing, and updated the callers accordingly.

I created a new function isClosedConnectionAfterBadCertUIWasShown.

The code that reacts to a return value < 0 has the added check for the bad-cert-ui-shown scenario, where it will trigger a retry, but not TLS intolerance assumption. This is the only change in this if branch, besides some code factoring.

The new code that reacts to "reading returned zero during a handshake" can now trigger a retry, too. It will do a simple retry, no TLS intolerance, when it's one of the simple connect_reset / end_of_file errors.

But if no bad-cert-ui was shown, then possibly assume TLS intolerance and retry. We of course only retry when TLS is currently in use. And based on the existing treatment of connect_reset, we only retry if the socket does not have an initial cleartext phase (no proxy, no starttls).


This final statement leads to yet another discussion.

Our existing code includes a special handling for PR_CONNECT_RESET_ERROR on a socket that includes a cleartext phase, either when using a proxy or with STARTTLS. I did not change that code yet. I isolated it into a new function isNonSSLErrorThatWeAllowToRetry. It currently says:

- PR_CONNECT_RESET_ERROR on a socket that immediately uses SSL/TLS shall be retried and (possibly, see above) be treated as TLS intolerance

- PR_CONNECT_RESET_ERROR on a socket with initial cleartext shall not be retried, and shall not be treated as a TLS intolerance.

This is historical code that we've been using for a long time.
I wonder if that distinction makes sense.
We might want to rethink it.
(We also have the option to further distinguish the second case, distinguish proxy and STARTTLS.)

Would it be reasonable to retry in all cases where we see PR_CONNECT_RESET_ERROR?
Maybe this special handling got introduced, because the upper application protocol code gets confused by this error in the middle of a connection?
Or maybe the past thinking was, a retry would require a replay of the socket's cleartext phase, and it wasn't sure whether it is a good idea.

Should we decide that both connect_reset and end_of_file shall always result in a retry, independent of whether the socket has an initial cleartext phase, this should apply to the read-returned-zero scenario as well.
Attachment #255315 - Flags: superreview?(rrelyea)
Attachment #255315 - Flags: review?(nelson)
> The new code that reacts to "reading returned zero during a handshake" can now
> trigger a retry, too. It will do a simple retry, no TLS intolerance, when it's
> one of the simple connect_reset / end_of_file errors.

replace 
> when it's one of the simple connect_reset / end_of_file errors.

with
> if we have shown bad cert UI.
Comment on attachment 255315 [details] [diff] [review]
Incremental patch v2

I'm not sure I understand PSM well enough to review it, but I see no obvious mistakes here.
r=nelson
Attachment #255315 - Flags: review?(nelson) → review+
Comment on attachment 255315 [details] [diff] [review]
Incremental patch v2

r+ = relyea
Attachment #255315 - Flags: superreview?(rrelyea) → superreview+
I checked in the additional patch to the trunk, marking bug fixed.

Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
This is a merged patch of v1 and v2, which applies to 1.8 branch.
Attachment #257720 - Flags: superreview+
Attachment #257720 - Flags: review+
Comment on attachment 257720 [details] [diff] [review]
Patch: Merged v1+v2 for 1.8 branch

Summarizing:

This patch is not trivial.
But I believe we want to land it on the stable branch for Firefox 2.0.0.x and Thunderbird 2.0.0.x

Why?

1)
When we released Firefox 2, we had introduced a new timeout, to deal with a new class of broken, stalling servers. We learned the timeout is too short (bug 365898)

2)
In addition to the timeout being too short, the same timeout is being used on every SSL connection attempt, even after falling back to an older SSL mode. We must only use the timeout once.

3)
When we introduced the timeout, we introduced a regression, this bug.
We unnecessarily switch back to an older SSL protocol (pre-TLS) whenever we experience a cert problem.

4)
In addition, an unexpected socket close on read is not being correctly reported up to the application layer (bug 367801), which prevents the application code from detecting an automatic retry is required.


The patch fixes the above issues.
But I propose we give it thorough testing before landing on the 1.8 branch.

You can test by using a trunk build, please do if you can.
And I will test the patch in my daily work using a 1.8 branch build with this patch applied.
Attachment #257720 - Flags: approval1.8.1.4?
Flags: blocking1.8.1.4+
Comment on attachment 257720 [details] [diff] [review]
Patch: Merged v1+v2 for 1.8 branch

Just a thought, but wouldn't it be better if these timeouts were user-adjustable prefs? Makes it easier to tune.

approved for 1.8.1.4, a=dveditz for release-drivers
Attachment #257720 - Flags: approval1.8.1.4? → approval1.8.1.4+
Dan: Using a pref has been proposed, but I was concerned a patch that introduces a new pref would not get approved for the stable branch, would it?
Merged patch checked in to 1.8 branch, fixed1.8.1.4

Keywords: fixed1.8.1.4
v.fixed on 1.8 branch with 2.0.0.4 rc2 build Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.4) Gecko/20070509 Firefox/2.0.0.4, https://www.openssl.org loaded fine after dismissing cert dialogs over 10-20 seconds (didn't timeout after 5 seconds as originally reported).

Nelson:  If there is a better way to verify this bug, please let me know.  Thanks!
Jay,  Since the timeout time was raised to 25 seconds, it will be necessary 
to wait longer than 25 seconds to test this fix.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: