OCSP cache does not compare validity period of responses consistently with mozilla::pkix

RESOLVED FIXED in mozilla36

Status

()

Core
Security: PSM
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: briansmith, Assigned: keeler)

Tracking

(Blocks: 3 bugs)

Trunk
mozilla36
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

+++ This bug was initially created as a clone of Bug #1078108 +++

In bug 1078108, I found that if I increase the default validity period of OCSP responses (increase the default value of nextUpdate in pkixtestutil.cpp), the rate of intermittent test failures for OCSP-related tests on Mac OS X debug builds goes down. I found that if I increase it by an extreme amount (many days), the intermittent failures on tryserver seem to stop completely. It seems like the patch in bug 1078108 will reduce the rate to an acceptable-ish level. However, it is concerning that this should even be an issue at all, because we already compare OCSP times with a considerable amount of slop (one day). Thus, differences of a few seconds to an hour shouldn't matter at all, when a test runs in less than a minute.

Possibly, the time comparison logic in pkixocsp.cpp is broken. That would be a cause for concern. And/or, possible the GeneralizedTime encoding code in pkixtestutil.cpp is broken. If it is the pkixtestutil.cpp code, it makes sense that this issue would be platform-specific, because the time encoding code is one of the few places we rely on platform-specific code.

Regardless, I think this is something that should be investigated promptly. It would be much more efficient for somebody with a Mac to do so, because then they could run the OCSP-related xpcshell tests locally.

To investigate this issue, I'd start by identifying (on tryserver, and through bugzilla) the specific tests that cause the intermittent oranges most frequently. Then, I would reverse the effects of patch bug 1078108 by replacing the one-hour difference between thisUpdate and nextUpdate with a one-*second* gap. Because of the slop in the pkixocsp.cpp comparison, a one second gap is supposed to work just fine, IIUC, but is actually broken in some way due to this bug.
See Also: → bug 1078108
Summary: The default validity period for test OCSP responses is too short → Various OCSP-related xpcshell tests fail intermittently
Additional things to consider:

* Is the mozilla::pkix::Time arithmetic correct?
* Is there an off-by-one error in the pkixtestutil time encoding code (esp. in the hours field)?
* Is the assumption that pkixtestutil.cpp makes about time_t being denominated in seconds, as the POSIX scriptures dictate, incorrect for Mac?

IIRC, this also happens on B2G emulator debug builds sometimes.
No longer depends on: 1078108
It seems the problem is that OCSPCache.cpp and/or the caching-related parts of NSSTrustDomain do not take the slop into consideration when determining if a cached entry should be used.

The default validity period of most test OCSP responses is only 10 seconds, and in most cases we generate the OCSP responses for a test at the beginning. That means that these tests are likely to fail whenever they take longer than 10 seconds to execute.

Consider test_ev_certs.js:

    ocspResponder.stop(function () {
      // without net it must be able to EV verify
      let failingOcspResponder = failingOCSPResponder();
      let cert = certdb.findCertByNickname(null, "ev-valid");
      let hasEVPolicy = {};
      let verifiedChain = {};
      let flags = Ci.nsIX509CertDB.FLAG_LOCAL_ONLY |
                  Ci.nsIX509CertDB.FLAG_MUST_BE_EV;

      let error = certdb.verifyCertNow(cert, certificateUsageSSLServer,
                                       flags, verifiedChain, hasEVPolicy);
      do_check_eq(hasEVPolicy.value, isDebugBuild);
      do_check_eq(error,
                  isDebugBuild ? 0 : SEC_ERROR_POLICY_VALIDATION_FAILED);
      failingOcspResponder.stop(run_next_test);
    });
  });

See how this logic can only work if the cached OCSP response is valid?

The ultimate root cause is the design of mozilla::pkix::Time. We never should have exposed the simple comparison operators operator>, operator<, operator==, etc. Instead, we should have exposed methods like:

    IsNotAfter(Time t, unsigned int slopInSeconds);
    IsNotBefore(Time t, unsigned int slopInSeconds);

In other words, we should treat mozilla::pkix::Time similarly to an IEEE floating point number. This way, every time we compare times we'd be forced to consider whether or not we need to apply some slop value.

Additionally, VerifyEncodedOCSPResponse has some Time output parameters. The documentation for VerifyEncodedOCSPResponse needs to be expanded so that callers know whether or not to apply slop, or whether the return value already has the slop applied. And/or, pkix.h should expose a function for comparing a nextUpdate time to a Time that represents the current time, so that callers are guaranteed to do a consistent comparison.
OS: Mac OS X → All
Summary: Various OCSP-related xpcshell tests fail intermittently → OCSP cache does not compare validity period of responses consistently with mozilla::pkix
Created attachment 8501476 [details] [diff] [review]
reduce-test-ocsp-response-validity-periods-to-reproduce-cache-bug.patch

This patch can be used to more easily reproduce the problems locally on platforms that usually don't get the failures on tryserver. In my case, I was able to consistently reproduce test failures in two tests on Windows 8.1 w/ Visual Studio 2013 debug builds.

Since this bug is in NSSTrustDomain/OCSPCache, I'm not planning to work on it further.
Blocks: 959105
Blocks: 1029775
Blocks: 1065640
Blocks: 1024244
When this bug is fixed, please undo as many of the exclusions from xpcshell.ini as possible. I believe that many of those platform-specific exclusions (especially Android and the Windows one) are likely only needed to work around this bug.
Thanks for taking the time to investigate this, Brian. I'll probably use one of the approaches you outline in comment 2 to fix this.
Assignee: nobody → dkeeler
Created attachment 8515295 [details] [diff] [review]
patch

Brian, let me know what you think of this approach. Here's what I wrote in the commit message by way of an explanation of what's going on:

validThrough should now be the time through which, if passed in as the given
time to validate an OCSP response at, VerifyEncodedOCSPResponse will still
consider it trustworthy. After that time, it will be expired. This makes it
so the OCSP cache compares validity period responses consistently with
mozilla::pkix.
Attachment #8515295 - Flags: review?(brian)
Forgot to mention: this looked pretty good on try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=0b57089481e1 (the starred failures seem to be unrelated to this)

Updated

3 years ago
Blocks: 1089305
Comment on attachment 8515295 [details] [diff] [review]
patch

Review of attachment 8515295 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/pkix/include/pkix/pkix.h
@@ +114,5 @@
>  // trustworthy. If the response is not trustworthy, thisUpdate will be 0.
>  // Similarly, the optional parameter validThrough will be the time through
> +// which the encoded response is considered trustworthy (that is, as long as
> +// the given time at which to validate is less than or equal to validThrough,
> +// the response will be considered trustworthy).

Please also add that validThrough has already been adjusted to deal with clock skew.

If validThrough is adjusted for clock skew, shouldn't thisUpdate also be adjusted for clock skew?

::: security/pkix/lib/pkixocsp.cpp
@@ +635,1 @@
>    if (rv != Success) {

// This could only happen if we're dealing with times beyond the year
      // 10,000AD.

::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp
@@ +285,5 @@
> +              VerifyEncodedOCSPResponse(trustDomain, *endEntityCertID,
> +                                        Now(), END_ENTITY_MAX_LIFETIME_IN_DAYS,
> +                                        response, expired, nullptr,
> +                                        &validThrough));
> +    ASSERT_FALSE(expired);

I think there should also be an assertion that, at least, validThrough > oneDayAfterNow (in particular, validThrough != oneDayAfterNow), and a comment explaining that assertion.

Also, I think that this should be a different test (separate TEST_F), because good_byKey is testing that we understand "Good" responses. It gets too confusing it we use the same test case to test multiple things, especially when we need to change the tests later. In particular, somebody might understand *one* reason for the test, but overlook the second reason, and modify the test so it doesn't make sense for both reasons any more.
Attachment #8515295 - Flags: review?(brian) → review-
Thanks for the review.

(In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment #8)
> ::: security/pkix/include/pkix/pkix.h
> @@ +114,5 @@
> >  // trustworthy. If the response is not trustworthy, thisUpdate will be 0.
> >  // Similarly, the optional parameter validThrough will be the time through
> > +// which the encoded response is considered trustworthy (that is, as long as
> > +// the given time at which to validate is less than or equal to validThrough,
> > +// the response will be considered trustworthy).
> 
> Please also add that validThrough has already been adjusted to deal with
> clock skew.

I'm not sure that's useful to include in the documentation. My thinking here was that instead of talking about the internal determination of what validThrough will end up being (i.e. for one thing it depends on if there's a nextUpdate value in the response), we just say what properties it will have. In this case, the property is "as long as the given time at which to validate is less than or equal to validThrough [i.e. the value of validThrough from a previous call with the same response], the response will be considered trustworthy".

> If validThrough is adjusted for clock skew, shouldn't thisUpdate also be
> adjusted for clock skew?

I don't think it makes sense to change thisUpdate. My understanding is its meaning is "this is the thisUpdate time that was indicated in the response". In any case, it's only ever compared to other thisUpdate times, so local clock skew doesn't affect it.

> ::: security/pkix/lib/pkixocsp.cpp
> @@ +635,1 @@
> >    if (rv != Success) {
> 
> // This could only happen if we're dealing with times beyond the year
>       // 10,000AD.

Added.

> ::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp
> @@ +285,5 @@
> > +              VerifyEncodedOCSPResponse(trustDomain, *endEntityCertID,
> > +                                        Now(), END_ENTITY_MAX_LIFETIME_IN_DAYS,
> > +                                        response, expired, nullptr,
> > +                                        &validThrough));
> > +    ASSERT_FALSE(expired);
> 
> I think there should also be an assertion that, at least, validThrough >
> oneDayAfterNow (in particular, validThrough != oneDayAfterNow), and a
> comment explaining that assertion.

Sounds good.

> Also, I think that this should be a different test (separate TEST_F),
> because good_byKey is testing that we understand "Good" responses. It gets
> too confusing it we use the same test case to test multiple things,
> especially when we need to change the tests later. In particular, somebody
> might understand *one* reason for the test, but overlook the second reason,
> and modify the test so it doesn't make sense for both reasons any more.

Good call - I fixed this.
Created attachment 8517695 [details] [diff] [review]
patch v2
Attachment #8515295 - Attachment is obsolete: true
Attachment #8517695 - Flags: review?(brian)

Updated

3 years ago
No longer blocks: 1089305
Comment on attachment 8517695 [details] [diff] [review]
patch v2

Review of attachment 8517695 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. Good work.

::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp
@@ +381,5 @@
>                                        response, expired));
>    ASSERT_FALSE(expired);
>  }
>  
> +TEST_F(pkixocsp_VerifyEncodedResponse_successful, check_validThrough)

Please add reference to this bug: // Bug 1079436

Also, please add a note explaining what we're testing: That the output variable validThrough represents the last time for which VerifyEncodedOCSPResponse will succeed, which is different than the nextUpdate time in the OCSP response due to the slop we add for comparisons to deal with clock skew.
Attachment #8517695 - Flags: review?(brian) → review+
Thanks! I added comments before landing.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b379f1bc58e1
https://hg.mozilla.org/mozilla-central/rev/b379f1bc58e1
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.