Frequent error message: OCSP server error

RESOLVED FIXED

Status

defect
13 years ago
13 years ago

People

(Reporter: kaie, Assigned: kaie)

Tracking

(Blocks 1 bug, {fixed1.8.1})

1.8 Branch
x86
All
Details

Attachments

(3 attachments, 1 obsolete attachment)

(Assignee)

Description

13 years ago
On Windows, a recent Firefox build from the Mozilla 1.8 branch with OCSP turned on frequently shows an OCSP error message when using SSL. The message reports that a server certificate could not be verified and that the cause is a server error.

To reproduce:
- use a Windows Firefox build, e.g. 2.0 beta
- enable OCSP (tools, options, advanced, encryption, verification, middle option)
- access SSL sites that use a server cert that lists an OCSP responder

Sample SSL sites are:
  https://mail.google.com/mail/
  https://us.etrade.com/
  https://www.wellsfargo.com/
  https://www.paypal.com/

If you log in to Google mail and leave it open, you usually get the OCSP error soon. If not, try to increase your SSL activity and open additional sites in separate tabs.

The problem is:
When Firefox reports this error, it stops loading the page.


Is this a server error?
If it is a server error, why does it occur on Windows only?
The problem was never seen on Linux.


It appears unlikely that this is a client software error.

My analysis shows that Mozilla's networking library stops loading with an NS_ERROR_CONNECTION_REFUSED or NS_ERROR_NET_RESET.

I learned that multiple NSPR error codes are mapped to the above error codes. I could take a more detailed look, so we know which NSPR error codes trigger it.

But I believe it is obvious that the problem is at the network level.

I have a theory.
Maybe the OCSP server is not able / not willing to deal with the amount of requests the client sends.
Maybe it is necessary to give the server more time, or even retry.

I implemented such an automatic retry of OCSP requests for the above scenario.
When one of the two errors is seen on an OCSP HTTP request, the code in the patch sleeps for a fraction of a second and retries.
In a short series of tests, the first retry succeeded in 80% of the failures. In 20% of the failures the second retry succeeded.

However, once during my testing, I saw an error message -8187, SEC_ERROR_INVALID_ARGS
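
The retry described above can be sketched roughly as follows. This is an illustration in Python, not the code from the patch: `OcspRequestError`, `fetch_with_retry`, the fixed 0.3 s delay and the retry limit are all assumptions made for the example, and the two error names are used only as string labels.

```python
import time

# String labels standing in for the two network errors named above.
RETRYABLE = {"NS_ERROR_CONNECTION_REFUSED", "NS_ERROR_NET_RESET"}


class OcspRequestError(Exception):
    """Hypothetical error raised by the request function; carries an error name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def fetch_with_retry(send_request, max_retries=5, delay_s=0.3, sleep=time.sleep):
    """Call send_request(); on a retryable error, sleep briefly and try again."""
    retries = 0
    while True:
        try:
            return send_request()
        except OcspRequestError as err:
            # Any other error, or too many retries: report it to the caller.
            if err.code not in RETRYABLE or retries >= max_retries:
                raise
        retries += 1
        sleep(delay_s)
```

Any error outside the retryable set, such as the SEC_ERROR_INVALID_ARGS case mentioned above, is passed straight through to the caller without a retry.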
(Assignee)

Comment 1

13 years ago
I left Firefox running over night, with Gmail left open.
The log file shows that OCSP requests failed about 50 times (with connection refused or net reset) and were automatically retried.

Twice it was necessary to make 4 retries until it finally worked.

No error dialog was shown during that time.

The code currently behaves like this:
- sleep 300 ms before first retry
- sleep 600 ms before second retry
- sleep 900 ms before third retry
- sleep 1200 ms before fourth retry
- sleep 1500 ms before fifth retry
- give up

Does this sound reasonable?
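
The schedule listed above grows by 300 ms per attempt, so the worst-case total sleep before giving up is 4.5 seconds. A small Python sketch of that arithmetic (the function and constant names are made up for illustration and do not come from the patch):

```python
STEP_MS = 300      # increment taken from the list above
MAX_RETRIES = 5    # number of retries before giving up


def retry_delays_ms(step_ms=STEP_MS, max_retries=MAX_RETRIES):
    """Delay before retry n is n * step_ms, for n = 1..max_retries."""
    return [n * step_ms for n in range(1, max_retries + 1)]


delays = retry_delays_ms()
print(delays)       # [300, 600, 900, 1200, 1500]
print(sum(delays))  # 4500
```

This counts only the sleeps; the time each failed request takes on the network comes on top of it.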
(Assignee)

Comment 2

13 years ago
Now I'm able to reproduce on Linux, too...!

- start Firefox
- open the 4 URLs from comment 0, and in addition https://kuix.de in the 5th tab
- create a bookmark for this group of tabs
- quit firefox
- start firefox
- open the group bookmark, which will open all 5 sites in parallel

When I do this, my log file shows 34 (thirty four) OCSP requests until all pages are finally loaded.
OS: Windows XP → All
(Assignee)

Comment 3

13 years ago
Posted patch: logfile
This is a logfile of such a session, produced on Linux.

I don't understand why I saw many more requests on Windows.

On Linux I get just 12 requests.
I was curious whether NSS would produce unnecessary duplicate requests, but at least on Linux this is not the case.

Comment 4

13 years ago
Kai, I've been told that OCSP checking will be enabled by default in Vista.
It may be that ALL the OCSP responders for the big-name CAs are going to 
get pushed a LOT harder when that releases, and failures of OCSP will 
become more of a problem.  In that case, retries may help, but they might
also exacerbate the problem.  Already I've heard one suggestion that we
should fail as if the OCSP check had succeeded,  because this is equivalent 
to what will happen if people turn off OCSP to get around this problem. :(

Comment 5

13 years ago
When Vista turns OCSP on, they also do the following:
  1) cache OCSP requests. 
  2) treat failure to contact the OCSP responder as non-fatal.

I don't think they retry (I think they believe that retrying from Vista will only accelerate bringing down the OCSP server).
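
For illustration, the two behaviours listed above (caching, and treating an unreachable responder as non-fatal) could look roughly like this in Python. This is a sketch under assumptions, not a description of how Vista or NSS implement it: the class and exception names are hypothetical, and the fixed cache lifetime is a simplification (a real client would derive it from the validity period in the OCSP response).

```python
import time


class ResponderUnreachable(Exception):
    """Hypothetical error: the OCSP responder could not be contacted."""


class CachingOcspChecker:
    def __init__(self, query_responder, ttl_s=3600.0, clock=time.monotonic):
        self._query = query_responder   # callable: cert_id -> "good" / "revoked"
        self._ttl_s = ttl_s             # assumed cache lifetime, not from the source
        self._clock = clock
        self._cache = {}                # cert_id -> (status, expiry time)

    def check(self, cert_id):
        now = self._clock()
        cached = self._cache.get(cert_id)
        if cached and cached[1] > now:
            return cached[0]            # 1) answered from cache, no network request
        try:
            status = self._query(cert_id)
        except ResponderUnreachable:
            return "unknown"            # 2) non-fatal: the caller may continue
        self._cache[cert_id] = (status, now + self._ttl_s)
        return status
```

Note that a failed lookup is not cached here, so the next connection asks the responder again.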

Comment 6

13 years ago
https://bugzilla.mozilla.org/attachment.cgi?id=226876 shows I've been seeing this on 1.8.1 branch on OS/2 for more than 2 months.
(Assignee)

Comment 7

13 years ago
As of today, OCSP is not enabled by default.

Our choices are:

a) keep our current behaviour, warn on server overload, and give up

or

b) retry on failures, at the risk of putting too much load on servers

or

c) allow the user to continue on an OCSP server failure


In my opinion, the behaviour needs to be configurable.
The user (or administrator) should be able to choose between b) and c), and the default should probably be c)

However, I believe we must combine c) with a UI hint.

Whenever the OCSP responder can not be reached, the UI's status bar should show a warning icon indicating that the site's cert could not be verified.
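
A rough Python sketch of what such a configurable choice could look like. All names here are hypothetical and only mirror options a), b) and c) above; the return values are placeholder labels, not real Firefox behaviour or preferences.

```python
from enum import Enum


class OcspFailurePolicy(Enum):
    GIVE_UP = "a"          # warn and stop loading
    RETRY = "b"            # retry; stop only if the retries fail too
    ALLOW_CONTINUE = "c"   # keep loading, but show a warning in the UI


def decide(policy=OcspFailurePolicy.ALLOW_CONTINUE, retry_fetch=None):
    """Decide what to do after an OCSP request failed at the network level.

    retry_fetch is a hypothetical callable that returns True if a retry succeeded.
    """
    if policy is OcspFailurePolicy.RETRY and retry_fetch is not None and retry_fetch():
        return "load"
    if policy is OcspFailurePolicy.ALLOW_CONTINUE:
        return "load_with_warning"
    return "block_with_error"
```

The default is set to c) here only because that is the default suggested above.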


Short term, which is Firefox 2, we can not do c)
No UI changes for Firefox 2 are possible.
The best we could do is b)


Given that OCSP is not enabled in Firefox 2 by default, is it worth doing b) for Firefox 2? If you think that's reasonable, I'm willing to try getting it in.


I propose to implement b) and c) plus UI on the main development trunk of Firefox and target enabling OCSP by default for Firefox 3.
(Assignee)

Comment 8

13 years ago
Let's begin by implementing the automatic-sleep-and-retry on failure.

As explained in comment 1 this approach helps me.
As OCSP is not yet enabled by default, this would also be a reasonable fix for Firefox 2.
(Assignee)

Comment 9

13 years ago
Posted patch: Patch v1 (obsolete)
Attachment #236326 - Flags: review?(rrelyea)

Comment 10

13 years ago
Comment on attachment 236326
Patch v1

r+ for the trunk.

Please fix the following, however:

retryable_error should be initialized to false before dropping into the do loop, or you will get spurious log messages.

Other things to think about: while the overall sleep time before failing is only 4.5 seconds (assuming PR_MillisecondsToInterval is correct), the delay before failure could be longer depending on the network timeout on the socket (I know different platforms seem to have different reactions to non-responding network sockets).

Let's see how this looks on several platforms.

bob
Attachment #236326 - Flags: review?(rrelyea) → review+
(Assignee)

Comment 11

13 years ago
Posted patch: Patch v2
This patch addresses Bob's reviewer comment. Carrying forward r+
Attachment #236326 - Attachment is obsolete: true
Attachment #236431 - Flags: review+
(Assignee)

Comment 12

13 years ago
Fix checked in to trunk.

We should wait until we have had some testing on trunk, and then nominate for Firefox 2.
Status: NEW → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → FIXED
(Assignee)

Comment 13

13 years ago
Comment on attachment 236431
Patch v2

Requesting approval to add this to MOZILLA_1_8_BRANCH for Firefox 2.

This patch only affects users who have manually enabled OCSP.

For those users who enabled OCSP, this will allow them to work with secure sites without constantly being annoyed by error messages.
Attachment #236431 - Flags: approval1.8.1?

Comment 14

13 years ago
Comment on attachment 236431
Patch v2

a=mconnor on behalf of drivers, assuming that this only affects codepaths used with OCSP enabled, and has zero effect on the default codepath that default settings will follow.
Attachment #236431 - Flags: approval1.8.1? → approval1.8.1+
(Assignee)

Comment 15

13 years ago
(In reply to comment #14)
> (From update of attachment 236431)
> a=mconnor on behalf of drivers, assuming that this only affects codepaths used
> with OCSP enabled, and has zero effect on the default codepath that default
> settings will follow.

I confirm this is correct.

Patch checked in to 1.8 branch, thanks for approving.

Keywords: fixed1.8.1
(Assignee)

Comment 16

13 years ago
For completeness, this is the version of the patch checked in to 1.8 branch.
It differs from Patch v2 in context only.