Closed Bug 929617 Opened 11 years ago Closed 11 years ago

validity period for stapled OCSP responses is too short (SEC_ERROR_OCSP_OLD_RESPONSE with nginx)

Categories

(Core :: Security: PSM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox27 --- fixed
firefox28 --- fixed

People

(Reporter: keeler, Assigned: keeler)

References

Details

Attachments

(1 file, 3 obsolete files)

As far as I can tell, the validity period for an OCSP response is 24 hours[1] (in terms of its "thisUpdate" time[2]). This means that servers have to update their stapled responses once a day. We have evidence that sites like twitter, tumblr, and soundcloud aren't doing this (bug 929063, bug 929068). Given that we follow the spec[3] and terminate connections when we receive an "unsatisfactory" stapled response, this means these sites are broken when ocsp stapling is enabled.

[1] https://mxr.mozilla.org/mozilla-central/source/security/nss/lib/certhigh/ocsp.c#4333
[2] http://tools.ietf.org/html/rfc6960#section-2.4
[3] http://tools.ietf.org/html/rfc6066#section-8
Blocks: 929524
Slight amendment: the validity period for an OCSP response with no "nextUpdate" value is 24 hours. The validity period for an OCSP response with both "thisUpdate" and "nextUpdate" is that time interval {thisUpdate, nextUpdate} (with a little wiggle room for clock skew).
According to http://tools.ietf.org/html/rfc5019#section-2.2.4, "When pre-producing OCSPResponse messages, the responder MUST set the thisUpdate, nextUpdate, and producedAt times", so there's that.
I think we misunderstood the requirement from section-2.2.4: "When pre-producing OCSPResponse messages, the responder MUST set the thisUpdate, nextUpdate, and producedAt times as follows:" What happens when not pre-producing, but rather producing on demand?

Section 4 is the authoritative place:
http://tools.ietf.org/html/rfc5019#section-4

"Clients MUST check for the existence of the nextUpdate field and MUST
 ensure the current time, expressed in GMT time as described in
 Section 2.2.4, falls between the thisUpdate and nextUpdate times. If
 the nextUpdate field is absent, the client MUST reject the response.

"If the nextUpdate field is present, the client MUST ensure that it is
 not earlier than the current time.  If the current time on the client
 is later than the time specified in the nextUpdate field, the client
 MUST reject the response as stale."

It is not clear whether stapled OCSP responses must conform to RFC 5019's requirements. Also it is unclear whether CAs in our program are required to conform to RFC 5019. The baseline requirements say "OCSP responses MUST conform to RFC2560 and/or RFC5019." The phrase "and/or" is problematic and should be replaced with "and."

Because of this lack of clarify, it is likely that some server implementations of OCSP stapling will be too liberal in trusting OCSP responses where nextUpdate is missing--i.e. they may not refresh these OCSP responses as frequently as we require them to be refreshed. This creates a big interoperability issue. We should ask our CA partners to ensure that they set nextUpdate on all of their OCSP responses. I suggest that we recommend to them that the set nextUpdate = thisUpdate + 48 hours. This balances freshness against potential DoS issues.
Depends on: 938805
No longer depends on: 923304
Attached patch patch (obsolete) — Splinter Review
This patch ignores stapled OCSP responses that are expired (meaning that we fall back to OCSP fetching). It depends on the patch in bug 938805 that is near landing.
Assignee: nobody → dkeeler
Status: NEW → ASSIGNED
Attachment #8336398 - Flags: review?(brian)
Comment on attachment 8336398 [details] [diff] [review]
patch

Review of attachment 8336398 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +880,5 @@
>      if (rv != SECSuccess) {
> +      // Due to buggy servers that will staple expired OCSP responses
> +      // (see for example http://trac.nginx.org/nginx/ticket/425),
> +      // don't terminate the connection if the stapled response is expired.
> +      // We will fall back to actively fetching revocation information.

better to(?):

Will fall back to the non-stapled revocation setup
Attachment #8336398 - Flags: review?(cviecco) → review+
Comment on attachment 8336398 [details] [diff] [review]
patch

Review of attachment 8336398 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +880,5 @@
>      if (rv != SECSuccess) {
> +      // Due to buggy servers that will staple expired OCSP responses
> +      // (see for example http://trac.nginx.org/nginx/ticket/425),
> +      // don't terminate the connection if the stapled response is expired.
> +      // We will fall back to actively fetching revocation information.

s/actively //.

@@ +888,5 @@
> +      }
> +
> +      // Reset the error state, because we're going to continue the connection.
> +      rv = SECSuccess;
> +      PR_SetError(savedError, 0);

You don't need these lines that set rv and call PR_SetError. Nobody will read rv or PR_GetError() below.

::: security/manager/ssl/tests/unit/head_psm.js
@@ +65,5 @@
>                             .getService(Ci.nsINSSErrorsService);
>    return nssErrorsService.getXPCOMFromNSSError(statusNSS);
>  }
>  
> +function _getLibraryFunction(functionName, libraryName) {

Name this with _getLibraryFunctionWithNoArguments?

@@ +101,5 @@
> +                                                  "ssl3");
> +  if (SSL_ClearSessionCache() != 0) {
> +    throw "Failed to clear SSL session cache";
> +  }
> +}

Should we be calling this function in test_ocsp_stapling.js like we call clearOCSPCache?

::: security/manager/ssl/tests/unit/test_ocsp_stapling.js
@@ +30,5 @@
> +  // Server Error is actually faster than attempting to connect to a closed
> +  // port.
> +  let fakeOCSPResponder = new HttpServer();
> +  fakeOCSPResponder.registerPrefixHandler("/", function(request, response) {
> +    response.setStatusLine(request.httpVersion, 500, "Internal Server Error");

I haven't thought about it deeply for all cases, but I suspect that this OCSP responder should never be accessed in these tests. Let me know if I am misunderstanding something here.

For example, let's say the server stapled a "malformed" response. We shouldn't be re-verifying the certificate in the cert error runnable because we aren't going to let the user override that malformed response error anyway. In CreateCertErrorRunnable or similar, there is already a case where we bail out early in the error was CERT_ERROR_REVOKED_CERTIFICATE. I suspect we need to add some more cases there.

If what I'm saying above is true, then really what we should be doing here is verifying that the OCSP responder is never accessed, for each test case.

::: security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js
@@ +16,5 @@
> +                       aExpectedOCSPRequests) {
> +  add_connection_test(aHost, aExpectedResult,
> +    function() {
> +      clearOCSPCache();
> +      clearSessionCache();

Shouldn't be be calling clearSessionCache similarly in test_ocsp_stapling.js too?

@@ +21,5 @@
> +      gCurrentOCSPResponse = aOCSPResponseToServe;
> +      gOCSPRequestCount = 0;
> +    },
> +    function() {
> +      do_check_eq(gOCSPRequestCount, 1);

Why would the request count be 1 for the first test case, where the OCSP response is valid? Shouldn't it be zero for that case?

@@ +32,5 @@
> +  let args = [ ["good", "localhostAndExampleCom", "unused" ],
> +               ["expiredresponse", "localhostAndExampleCom", "unused"],
> +               ["oldvalidperiod", "localhostAndExampleCom", "unused"],
> +               ["revoked", "localhostAndExampleCom", "unused"] ];
> +  let ocspResponses = generateOCSPResponses(args, "tlsserver");

*** see below.

@@ +45,5 @@
> +  ocspResponder.start(8080);
> +
> +  add_tls_server_setup("OCSPStaplingServer");
> +  add_ocsp_test("ocsp-stapling-expired.example.com", Cr.NS_OK,
> +                ocspResponses[0]);

Can we clarify in the names that it is the *OCSP response* that is expired, and not any cert? It is very unclear what these test cases are actually testing. It might be helpful to add a comment describing the scenerio.

Instead of referring to these responses as ocspResponses[0], ocspResponses[1], could we name them at the point I labeled "***" above? e.g.:

const goodResponse = ocspResponses[0];
const expiredResponse = ocspResponses[1];
...

This would make the test much easier to review and much easier for us to modify/debug later.
Attachment #8336398 - Flags: review?(brian) → review-
Attached patch patch v2 (obsolete) — Splinter Review
Thanks for the review. I've made the changes you pointed out. I have one concern in particular that I'll point out as a comment on the patch.
Attachment #8336398 - Attachment is obsolete: true
Attachment #8337993 - Flags: review?(brian)
Comment on attachment 8337993 [details] [diff] [review]
patch v2

Review of attachment 8337993 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +448,5 @@
>    
> +  // The certificate was revoked or there was an OCSP error, so
> +  // don't do anything else
> +  if (defaultErrorCodeToReport == SEC_ERROR_REVOKED_CERTIFICATE ||
> +      defaultErrorCodeToReport == SEC_ERROR_BAD_DATABASE ||

This is the code we get when we try to verify an OCSP response signed by a certificate not in the database. I imagine there are other reasons defaultErrorCodeToReport could be SEC_ERROR_BAD_DATABASE - is this behavior correct in those cases as well as for the OCSP one?
Comment on attachment 8337993 [details] [diff] [review]
patch v2

Review of attachment 8337993 [details] [diff] [review]:
-----------------------------------------------------------------

Please make the blacklist/whitelist change I suggest below, unless that breaks something.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +456,5 @@
> +      defaultErrorCodeToReport == SEC_ERROR_OCSP_TRY_SERVER_LATER ||
> +      defaultErrorCodeToReport == SEC_ERROR_OCSP_REQUEST_NEEDS_SIG ||
> +      defaultErrorCodeToReport == SEC_ERROR_OCSP_UNAUTHORIZED_REQUEST ||
> +      defaultErrorCodeToReport == SEC_ERROR_OCSP_UNKNOWN_CERT ||
> +      defaultErrorCodeToReport == SEC_ERROR_OCSP_MALFORMED_RESPONSE) {

Instead of having this be a blacklist, should we make it a whitelist of the error codes that we will allow a cert error override for?:

      case SEC_ERROR_UNKNOWN_ISSUER:
      case SEC_ERROR_CA_CERT_INVALID:
      case SEC_ERROR_UNTRUSTED_ISSUER:
      case SEC_ERROR_EXPIRED_ISSUER_CERTIFICATE:
      case SEC_ERROR_UNTRUSTED_CERT:
      case SEC_ERROR_INADEQUATE_KEY_USAGE:
      case SEC_ERROR_CERT_SIGNATURE_ALGORITHM_DISABLED:
      case SSL_ERROR_BAD_CERT_DOMAIN:
      case SEC_ERROR_EXPIRED_CERTIFICATE:

It seems like the original call to CertVerifier::VerifyCert must have returned one of these error codes in order for us to be able to successfully allow the cert override. (And, if we wouldn't allow the cert override in the first place, there's no reason to call CertVerifier::VerifyCert again.) I think this would resolve your issue with SEC_ERROR_BAD_DATABASE more generally.

To answer your question more directly though, SEC_ERROR_BAD_DATABASE isn't an overridable error, so we can short-circuit in that case, as you have done.

Nit: If it is not too much trouble, this change should be done in its own bug or at least its own patch. It doesn't need its own tests.
Attachment #8337993 - Flags: review?(brian) → review+
Attached patch patch v2.1 (obsolete) — Splinter Review
Thanks for the quick review. Spun off bug 943115, carrying over r+.
https://tbpl.mozilla.org/?tree=Try&rev=6decd0019e12
Attachment #8337993 - Attachment is obsolete: true
Attachment #8338107 - Flags: review+
Attached patch patch v2.2Splinter Review
Had to tweak the library loading because of a difference on windows. Carrying over r+.
Here's a recent try run: https://tbpl.mozilla.org/?tree=Try&rev=21c85ab20d15
(the android xpcshell failure has also been addressed in this patch)
Attachment #8338107 - Attachment is obsolete: true
Attachment #8338892 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/fc7b76d9da1e
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Comment on attachment 8338892 [details] [diff] [review]
patch v2.2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): needed to uplift bug 887321 (OCSP stapling telemetry) to beta
User impact if declined: no OCSP stapling telemetry on beta
Testing completed (on m-c, etc.): landed in 28, no problems since then
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #8338892 - Flags: approval-mozilla-beta?
Summary: validity period for stapled OCSP responses is too short → validity period for stapled OCSP responses is too short (SEC_ERROR_OCSP_OLD_RESPONSE with nginx)
(In reply to David Keeler (:keeler) from comment #15)
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): needed to uplift bug 887321 (OCSP
> stapling telemetry) to beta

Besides the OCSP stapling telemetry, this bug fixes a regression from bug 700698, because some websites now occasionally stop working in cases where they were consistently working previously. I, and several users, have run into websites that have this issue. The problem is widespread because it is a bug in nginx, a very widely-deployed web server.

> User impact if declined: no OCSP stapling telemetry on beta

Also, some websites that were working fine in Firefox 25 will work less consistently in Firefox 27+ (and Firefox 26).

> Testing completed (on m-c, etc.): landed in 28, no problems since then

Also, we have automated unit tests for this fix and related functionality to ensure that it is safe. (See the patch.)

> Risk to taking this patch (and alternatives if risky): low
> String or IDL/UUID changes made by this patch: none

I agree with these two.
Blocks: 700698
Attachment #8338892 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: in-testsuite?
I did uninstall the old firefox and download new one without any change. And mozilla homepage and wikipedia should load correctly. but no SEC_ERROR_OCSP_OLD_RESPONSE
(In reply to sigvard from comment #18)
> I did uninstall the old firefox and download new one without any change. And
> mozilla homepage and wikipedia should load correctly. but no
> SEC_ERROR_OCSP_OLD_RESPONSE

Please file a new bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Security%3A%20PSM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: