1079436 - OCSP cache does not compare validity period of responses consistently with mozilla::pkix

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Description

•

11 years ago

+++ This bug was initially created as a clone of Bug #1078108 +++ In bug 1078108, I found that if I increase the default validity period of OCSP responses (increase the default value of nextUpdate in pkixtestutil.cpp), the rate of intermittent test failures for OCSP-related tests on Mac OS X debug builds goes down. I found that if I increase it by an extreme amount (many days), the intermittent failures on tryserver seem to stop completely. It seems like the patch in bug 1078108 will reduce the rate to an acceptable-ish level. However, it is concerning that this should even be an issue at all, because we already compare OCSP times with a considerable amount of slop (one day). Thus, differences of a few seconds to an hour shouldn't matter at all, when a test runs in less than a minute. Possibly, the time comparison logic in pkixocsp.cpp is broken. That would be a cause for concern. And/or, possible the GeneralizedTime encoding code in pkixtestutil.cpp is broken. If it is the pkixtestutil.cpp code, it makes sense that this issue would be platform-specific, because the time encoding code is one of the few places we rely on platform-specific code. Regardless, I think this is something that should be investigated promptly. It would be much more efficient for somebody with a Mac to do so, because then they could run the OCSP-related xpcshell tests locally. To investigate this issue, I'd start by identifying (on tryserver, and through bugzilla) the specific tests that cause the intermittent oranges most frequently. Then, I would reverse the effects of patch bug 1078108 by replacing the one-hour difference between thisUpdate and nextUpdate with a one-*second* gap. Because of the slop in the pkixocsp.cpp comparison, a one second gap is supposed to work just fine, IIUC, but is actually broken in some way due to this bug.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

Blocks: 1059883, 1003622, 967444, 964673

Comment 1

•

11 years ago

Additional things to consider: * Is the mozilla::pkix::Time arithmetic correct? * Is there an off-by-one error in the pkixtestutil time encoding code (esp. in the hours field)? * Is the assumption that pkixtestutil.cpp makes about time_t being denominated in seconds, as the POSIX scriptures dictate, incorrect for Mac? IIRC, this also happens on B2G emulator debug builds sometimes.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

No longer depends on: 1078108

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Comment 2

•

11 years ago

It seems the problem is that OCSPCache.cpp and/or the caching-related parts of NSSTrustDomain do not take the slop into consideration when determining if a cached entry should be used. The default validity period of most test OCSP responses is only 10 seconds, and in most cases we generate the OCSP responses for a test at the beginning. That means that these tests are likely to fail whenever they take longer than 10 seconds to execute. Consider test_ev_certs.js: ocspResponder.stop(function () { // without net it must be able to EV verify let failingOcspResponder = failingOCSPResponder(); let cert = certdb.findCertByNickname(null, "ev-valid"); let hasEVPolicy = {}; let verifiedChain = {}; let flags = Ci.nsIX509CertDB.FLAG_LOCAL_ONLY | Ci.nsIX509CertDB.FLAG_MUST_BE_EV; let error = certdb.verifyCertNow(cert, certificateUsageSSLServer, flags, verifiedChain, hasEVPolicy); do_check_eq(hasEVPolicy.value, isDebugBuild); do_check_eq(error, isDebugBuild ? 0 : SEC_ERROR_POLICY_VALIDATION_FAILED); failingOcspResponder.stop(run_next_test); }); }); See how this logic can only work if the cached OCSP response is valid? The ultimate root cause is the design of mozilla::pkix::Time. We never should have exposed the simple comparison operators operator>, operator<, operator==, etc. Instead, we should have exposed methods like: IsNotAfter(Time t, unsigned int slopInSeconds); IsNotBefore(Time t, unsigned int slopInSeconds); In other words, we should treat mozilla::pkix::Time similarly to an IEEE floating point number. This way, every time we compare times we'd be forced to consider whether or not we need to apply some slop value. Additionally, VerifyEncodedOCSPResponse has some Time output parameters. The documentation for VerifyEncodedOCSPResponse needs to be expanded so that callers know whether or not to apply slop, or whether the return value already has the slop applied. And/or, pkix.h should expose a function for comparing a nextUpdate time to a Time that represents the current time, so that callers are guaranteed to do a consistent comparison.

OS: Mac OS X → All

Summary: Various OCSP-related xpcshell tests fail intermittently → OCSP cache does not compare validity period of responses consistently with mozilla::pkix

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Comment 3

•

11 years ago

Attached patch reduce-test-ocsp-response-validity-periods-to-reproduce-cache-bug.patch — Details — Splinter Review

This patch can be used to more easily reproduce the problems locally on platforms that usually don't get the failures on tryserver. In my case, I was able to consistently reproduce test failures in two tests on Windows 8.1 w/ Visual Studio 2013 debug builds. Since this bug is in NSSTrustDomain/OCSPCache, I'm not planning to work on it further.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

Blocks: 959105

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

Blocks: 1029775

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

Blocks: 1065640

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Updated

•

11 years ago

Blocks: 1024244

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Comment 4

•

11 years ago

When this bug is fixed, please undo as many of the exclusions from xpcshell.ini as possible. I believe that many of those platform-specific exclusions (especially Android and the Windows one) are likely only needed to work around this bug.

Dana Keeler (she/her) [:keeler]

Assignee

Comment 5

•

11 years ago

Thanks for taking the time to investigate this, Brian. I'll probably use one of the approaches you outline in comment 2 to fix this.

Assignee: nobody → dkeeler

Dana Keeler (she/her) [:keeler]

Assignee

Comment 6

•

11 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Brian, let me know what you think of this approach. Here's what I wrote in the commit message by way of an explanation of what's going on: validThrough should now be the time through which, if passed in as the given time to validate an OCSP response at, VerifyEncodedOCSPResponse will still consider it trustworthy. After that time, it will be expired. This makes it so the OCSP cache compares validity period responses consistently with mozilla::pkix.

Attachment #8515295 - Flags: review?(brian)

Dana Keeler (she/her) [:keeler]

Assignee

Comment 7

•

11 years ago

Forgot to mention: this looked pretty good on try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=0b57089481e1 (the starred failures seem to be unrelated to this)

:Cykesiopka

Updated

•

11 years ago

Blocks: 1089305

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Comment 8

•

11 years ago

Comment on attachment 8515295 [details] [diff] [review] patch Review of attachment 8515295 [details] [diff] [review]: ----------------------------------------------------------------- ::: security/pkix/include/pkix/pkix.h @@ +114,5 @@ > // trustworthy. If the response is not trustworthy, thisUpdate will be 0. > // Similarly, the optional parameter validThrough will be the time through > +// which the encoded response is considered trustworthy (that is, as long as > +// the given time at which to validate is less than or equal to validThrough, > +// the response will be considered trustworthy). Please also add that validThrough has already been adjusted to deal with clock skew. If validThrough is adjusted for clock skew, shouldn't thisUpdate also be adjusted for clock skew? ::: security/pkix/lib/pkixocsp.cpp @@ +635,1 @@ > if (rv != Success) { // This could only happen if we're dealing with times beyond the year // 10,000AD. ::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp @@ +285,5 @@ > + VerifyEncodedOCSPResponse(trustDomain, *endEntityCertID, > + Now(), END_ENTITY_MAX_LIFETIME_IN_DAYS, > + response, expired, nullptr, > + &validThrough)); > + ASSERT_FALSE(expired); I think there should also be an assertion that, at least, validThrough > oneDayAfterNow (in particular, validThrough != oneDayAfterNow), and a comment explaining that assertion. Also, I think that this should be a different test (separate TEST_F), because good_byKey is testing that we understand "Good" responses. It gets too confusing it we use the same test case to test multiple things, especially when we need to change the tests later. In particular, somebody might understand *one* reason for the test, but overlook the second reason, and modify the test so it doesn't make sense for both reasons any more.

Attachment #8515295 - Flags: review?(brian) → review-

Dana Keeler (she/her) [:keeler]

Assignee

Comment 9

•

11 years ago

Thanks for the review. (In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment #8) > ::: security/pkix/include/pkix/pkix.h > @@ +114,5 @@ > > // trustworthy. If the response is not trustworthy, thisUpdate will be 0. > > // Similarly, the optional parameter validThrough will be the time through > > +// which the encoded response is considered trustworthy (that is, as long as > > +// the given time at which to validate is less than or equal to validThrough, > > +// the response will be considered trustworthy). > > Please also add that validThrough has already been adjusted to deal with > clock skew. I'm not sure that's useful to include in the documentation. My thinking here was that instead of talking about the internal determination of what validThrough will end up being (i.e. for one thing it depends on if there's a nextUpdate value in the response), we just say what properties it will have. In this case, the property is "as long as the given time at which to validate is less than or equal to validThrough [i.e. the value of validThrough from a previous call with the same response], the response will be considered trustworthy". > If validThrough is adjusted for clock skew, shouldn't thisUpdate also be > adjusted for clock skew? I don't think it makes sense to change thisUpdate. My understanding is its meaning is "this is the thisUpdate time that was indicated in the response". In any case, it's only ever compared to other thisUpdate times, so local clock skew doesn't affect it. > ::: security/pkix/lib/pkixocsp.cpp > @@ +635,1 @@ > > if (rv != Success) { > > // This could only happen if we're dealing with times beyond the year > // 10,000AD. Added. > ::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp > @@ +285,5 @@ > > + VerifyEncodedOCSPResponse(trustDomain, *endEntityCertID, > > + Now(), END_ENTITY_MAX_LIFETIME_IN_DAYS, > > + response, expired, nullptr, > > + &validThrough)); > > + ASSERT_FALSE(expired); > > I think there should also be an assertion that, at least, validThrough > > oneDayAfterNow (in particular, validThrough != oneDayAfterNow), and a > comment explaining that assertion. Sounds good. > Also, I think that this should be a different test (separate TEST_F), > because good_byKey is testing that we understand "Good" responses. It gets > too confusing it we use the same test case to test multiple things, > especially when we need to change the tests later. In particular, somebody > might understand *one* reason for the test, but overlook the second reason, > and modify the test so it doesn't make sense for both reasons any more. Good call - I fixed this.

Dana Keeler (she/her) [:keeler]

Assignee

Comment 10

•

11 years ago

Attached patch patch v2 — Details — Splinter Review

Attachment #8515295 - Attachment is obsolete: true

Attachment #8517695 - Flags: review?(brian)

:Cykesiopka

Updated

•

11 years ago

No longer blocks: 1089305

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Reporter

Comment 11

•

11 years ago

Comment on attachment 8517695 [details] [diff] [review] patch v2 Review of attachment 8517695 [details] [diff] [review]: ----------------------------------------------------------------- Sorry for the delay. Good work. ::: security/pkix/test/gtest/pkixocsp_VerifyEncodedOCSPResponse.cpp @@ +381,5 @@ > response, expired)); > ASSERT_FALSE(expired); > } > > +TEST_F(pkixocsp_VerifyEncodedResponse_successful, check_validThrough) Please add reference to this bug: // Bug 1079436 Also, please add a note explaining what we're testing: That the output variable validThrough represents the last time for which VerifyEncodedOCSPResponse will succeed, which is different than the nextUpdate time in the OCSP response due to the slop we add for comparisons to deal with clock skew.

Attachment #8517695 - Flags: review?(brian) → review+

Dana Keeler (she/her) [:keeler]

Assignee

Comment 12

•

11 years ago

Thanks! I added comments before landing. https://hg.mozilla.org/integration/mozilla-inbound/rev/b379f1bc58e1

Carsten Book [:Tomcat]

Comment 13

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/b379f1bc58e1

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla36

reduce-test-ocsp-response-validity-periods-to-reproduce-cache-bug.patch 11 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2.01 KB, patch		Details \| Diff \| Splinter Review
patch 11 years ago Dana Keeler (she/her) [:keeler] 7.31 KB, patch	briansmith : review-	Details \| Diff \| Splinter Review
patch v2 11 years ago Dana Keeler (she/her) [:keeler] 7.79 KB, patch	briansmith : review+	Details \| Diff \| Splinter Review