Closed Bug 950240 Opened 11 years ago Closed 10 years ago

OCSP response is fetched even when the same valid Good OCSP response has been stapled

Categories

(Core :: Security: PSM, defect)

29 Branch
x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla29
Tracking Status
firefox28 --- fixed
firefox29 --- fixed

People

(Reporter: mcmanus, Assigned: keeler)

References

(Blocks 1 open bug)

Details

(Keywords: main-thread-io, perf, Whiteboard: [snappy][main thread disk AND network I/O])

Attachments

(5 files, 6 obsolete files)

10.00 KB, application/octet-stream
Details
7.81 KB, image/png
Details
2.75 KB, patch
Details | Diff | Splinter Review
11.49 KB, patch
keeler
: review+
Details | Diff | Splinter Review
18.33 KB, patch
keeler
: review+
Details | Diff | Splinter Review
Attached file ilya.capture
Ilya reports a case where a stapled ocsp response is being ignored and the cert is being validated the old fashioned way.. but to me the two responses look identical. (the resource loads fine).

Right now this is totally reproducible for me. Earlier it was intermittently reproducible, I am honestly not sure whether the problem is intermittent or whether that was user-error in the earlier runs, and I don't have a time machine to check.

this is nightly 29.

STR
1 - place "173.230.151.99 www.igvita.com" in /etc/hosts (or whatever the moral equivalent of a cname for origin.igvita.com is when you read this)
2 - start a packet capture
3 - load https://www.igvita.com/favicon.ico

What I see
 1] a favicon loaded just fine
 2] a server hello with a stapled OCSP response for serial 745685
 3] firefox generated OCSP query to ocsp.startssl.com for serial 745685
 4] a "good" OCSP response for 745685.. it appears to be exactly eh same response that was stapled onto [2] - with the same produced at.
"now" falls between "thisUpdate" and "nextUpdate". 

I'm really not familiar with raw OCSP, so I'm relying more heavily on wireshark here to interpret than I normally do. So there's a good chance I've misunderstood something - but from the trace it looks like firefox doesn't care for the stapled response but accepts the exact same thing from the OCSP responder.

I will attach a packet capture from my test.
David, if you can diagnose the cause of this in the next few work days, we can make a decision about whether to hold the NSS 3.15.4 release for a fix. That would be very helpful.
Assignee: nobody → dkeeler
Target Milestone: --- → mozilla29
I can't reproduce this (I tried last week and again today, but no dice). One thing to note is that if firefox sees an expired stapled response, it will attempt to fetch a response itself. If that response is also expired, I'm pretty sure it'll continue the connection (see https://mxr.mozilla.org/mozilla-central/source/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js#69 ). Going by the dates on the packet capture and when you posted it, however, that's not what's going on.
This could be a WPT issue.. I can reliably reproduce it there:
- User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0 PTST/151
- http://www.webpagetest.org/result/131218_CH_f66b6bd96db11323af85d413244c1987/1/details/

3G profile + "ignore cert errors" (under advanced) checked.
Attached image igvita-waterfall.png
Check out the connection view in the WPT run that was linked to in comment 3, which I have attached here for future reference.

Notice that the DNS lookup + request to the OCSP responder was not even done until AFTER the https://igvita.com/ HTML page was received. In other words, the OCSP request wasn't triggered by the initial request on the initial connection, but instead it was triggered by the initial request on the SECOND connection. Also notice that BOTH connections appear to be blocked on the OCSP request.

To me, this indicates that there is something causing a certificate verification *ON THE MAIN THREAD* in Firefox, where that certificate verification is somehow bypassing the OCSP cache.

Does anybody know how the "ignore cert errors" feature of WPT is implemented? I wonder if it is implemented with an addon (or similar) that verifies the certificate and then adds an exception for it. If so, I could see that all happening on the main thread.

However, from Patrick's description, Patrick's test was done locally and so I need to investigate that separately.
Attachment #8347487 - Attachment filename: ilya.capture → ilya.pcap
(In reply to Patrick McManus [:mcmanus] from comment #0)
> I don't have a time machine to check.

Patrick, could you please verify that the system date and time on your machine is correct? And, what time was this capture done? I am wondering if there is any possibility that the OCSP response was expired. When a stapled OCSP response is expired, we will do a fallback OCSP fetch to work around an nginx bug (nginx will staple an expired OCSP response before replacing it with a fresh one).

The problem with Patrick's pcap is that it filtered too much stuff out. I would have liked to see where the application data sent/received was in relation to the OCSP fetch.
Flags: needinfo?(mcmanus)
> Does anybody know how the "ignore cert errors" feature of WPT is
> implemented? I wonder if it is implemented with an addon (or similar) that
> verifies the certificate and then adds an exception for it. If so, I could
> see that all happening on the main thread.

"WPT forces nspr to think all certs are valid by overriding Firefox's cert validation to always return true: https://github.com/WPO-Foundation/webpagetest/blob/master/agent/wpthook/hook_nspr.cc#L224" [1]

[1] https://plus.google.com/u/0/+PatrickMcManusDucksong/posts/GDYqqEtm89Y
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #5)
> (In reply to Patrick McManus [:mcmanus] from comment #0)
> > I don't have a time machine to check.
> 
> Patrick, could you please verify that the system date and time on your
> machine is correct? 

yep, it is.

The capture file still has the original filesystem timestamp on it: 13:32:18.502544443 -0500 .. (that matches the UTC stamps on the captures 18:32:18.502544443 +0000)

it looks like the ocsp response is good from the 13th at 06:35 UTC for 48 hours.. so it should be valid.

And, what time was this capture done? I am wondering if
> there is any possibility that the OCSP response was expired. When a stapled
> OCSP response is expired, we will do a fallback OCSP fetch to work around an
> nginx bug (nginx will staple an expired OCSP response before replacing it
> with a fresh one).
> 
> The problem with Patrick's pcap is that it filtered too much stuff out. I
> would have liked to see where the application data sent/received was in
> relation to the OCSP fetch.

I think its weird that it isn't in there too. Maybe an HTTP cache issue? I'll go see if I can still reproduce.

(this was all desktop nightly with a fresh profile for this bug - but it wasn't fresh for the captured run).
Flags: needinfo?(mcmanus)

(In reply to Patrick McManus [:mcmanus] from comment #7)

> I'll go see if I can still reproduce.
> 
> (this was all desktop nightly with a fresh profile for this bug - but it
> wasn't fresh for the captured run).

I cannot reproduce this morning. I've tried both www (with local /etc/hosts mapping) and origin (the latter of which correctly makes me make a cert exception). Neither of them generate any port 80 traffic.

Is it possible to have some kind of error in the date processing (e.g. comparing a local date to a utc date or something?)
This patch helped diagnose the problem.

Ilya, the information you provided about how WPT skips SSL errors is extremely helpful.

When we load/navigate, some events fire which cause the UI to wonder whether the certificate for the connection is EV or not. In order to satisfy its curiosity, the UI ends up asking for the certificate to be verified on the main thread:

 	xul.dll!mozilla::psm::CertVerifier::VerifyCert
>	xul.dll!nsNSSCertificate::hasValidEVOidTag
 	xul.dll!nsNSSCertificate::getValidEVOidTag
 	xul.dll!nsNSSCertificate::GetIsExtendedValidation
 	xul.dll!nsSSLStatus::GetIsExtendedValidation
 	xul.dll!nsSecureBrowserUIImpl::EvaluateAndUpdateSecurityState
 	xul.dll!nsSecureBrowserUIImpl::OnLocationChange

This bug shows up more readily on WPT because WPT replaces AuthCertificateHook with its own hook. The extraction of the stapled OCSP response happens within AuthCertificateHook, so that means we'll always ignore the stapled OCSP response in WPT.

My suspicion is that the bug shows up in non-WPT Firefox due to a race condition between the main thread's attempt to verify the certificate and the certificate verification thread's attempt to verify the certificate. If the main thread wins, then the stapled OCSP response will not be in the OCSP cache, but if the certificate verification thread wins then the cached OCSP response will be in the OCSP cache and so the badness of what the main thread is doing is lessened.

Note that the main thread doing certificate verification will cause jank--often MASSIVE jank--because the main thread will be blocked not only on disk I/O but often *network* I/O. This is the thing that we should fix straight away and uplift to Firefox 28 and probably even 27. David, I have some ideas for doing this and I'm happy to talk to you about them if you want some suggestions.

In addition to the Firefox bug, I'd like to talk to whoever is most directly responsible for the WPT hook. We can and should build a better way for WPT to ignore certificate errors. Right now, WPT's measurements with the "ignore certificate errors" feature enabled are not representative of what Firefox normally does. In addition, the behavior of not executing all the code in Firefox's AuthCertificateHook is undefined. It does more than just verify the certificate. Also, WPT's assumption that SSL_SetURL will be called after we call SSL_AuthCertificateHook is fragile. We can build something better for WPT that more accurately represents what Firefox would do. Selenium and other tools need such a mechanism anyway, and such a mechanism would allow us to access the cert override service off the main thread, which would greatly simplify the certificate error handling logic.
Keywords: main-thread-io
Whiteboard: [snappy][main thread disk AND network I/O]
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #9)

> In addition to the Firefox bug, I'd like to talk to whoever is most directly
> responsible for the WPT hook.

Patrick Meenan <pmeenan@webpagetest.org>
David, it might be better to keep this bug as a tracking bug for the overall issue and file blocking bugs for the individual causes, like the extended validation check I mentioned above. It may be the case that the extended validation check isn't the only cause of this problem.
Keywords: perf
Summary: OCSP Stapling not working for igvita → OCSP response is fetched even when the same valid Good OCSP response has been stapled
Ilya forwarded the bug to me as well so feel free to ping me here or there (for WebPagetest).  I'd be happy to change the way the cert error ignoring is plumbed (I think there used to be a pref that was removed so I had to do it through hooks). Github project is here: https://github.com/WPO-Foundation/webpagetest if you prefer to create an issue and track it there.

Looks like when I hacked the new method in I'm doing it all of the time and not just when the ignore flag is set so fixing that right now.
Attached patch cache mCachedEVStatus on disk (obsolete) — Splinter Review
As far as I can tell, we cache the certificate associated with a host but not its EV status. So, when loading from the cache, the browser tries to update its security UI based on the security properties of the connection. This includes querying the certificate for EV status. Since that status isn't cached, the certificate gets verified to see if it is EV or not (on the main thread). As a result, this can cause an OCSP request to race with a stapled response on a concurrent connection to that host (I'm not really sure why that would happen if we're loading from the cache, but maybe we're fetching an uncached sub-resource).
Attachment #8358561 - Flags: review?(brian)
I modified Brian's patch so as to diagnose the problem with a more firefox-like environment. It causes certificate verification to be very slow for network connections and crashes if a certificate is verified on the main thread.
Without the EV status caching patch, if I load https://www.igvita.com once to get it into the cache, close and reopen firefox, and load the site again, firefox will crash. With the patch, firefox will not.
Comment on attachment 8358561 [details] [diff] [review]
cache mCachedEVStatus on disk

Review of attachment 8358561 [details] [diff] [review]:
-----------------------------------------------------------------

This seems like a good change, assuming that the cross-version compatibility issues are not a problem.

However, I am concerned that this may not be enough. I'm not convinced that this problem only occurs with cached entries, but that is the only case that this patch fixes. isExtendedValidation is using the LOCAL_ONLY flag to avoid OCSP fetching on all the cases, but due to limitations of the NSS classic validation API, the classic code cannot enforce LOCAL_ONLY. So, I think that we still should have a new flag to tell CertVerifier::VerifyCert to skip the DV fallback, to avoid the issue.

::: security/manager/ssl/src/nsNSSCertificate.cpp
@@ +1724,5 @@
> +  if (NS_FAILED(rv)) {
> +    return rv;
> +  }
> +
> +  return aStream->Write32(static_cast<uint32_t>(mCachedEVStatus));

The serialization of nsNSSCertificate is used by TransportSecurityInfo::Write and TransportSecurityInfo::Read, which in turn are used by the HTTP cache to serialize this information into the HTTP cache.

We have a rule of sorts that after the user upgrades to Firefox version X, then if they downgrade to version X-1, then Firefox X-1 will work correctly with any data written to the user profile by version X. Obviously, Firefox X must be able to correctly consume data written by Firefox X-1.

@@ +1751,5 @@
>  
> +  uint32_t cachedEVStatus;
> +  rv = aStream->Read32(&cachedEVStatus);
> +  if (NS_FAILED(rv)) {
> +    return rv;

What *should* happen is that the HTTP cache will see this failure and throw away the cached entry, loading the resource from the network. Is that what is happening?

One thing we have to be careful of is that we don't cause the HTTP cache to see the failure, NOT throw away the cached entry, and then load the resource from the network, FOREVER. I.e. if we're going to throw away the old entries without this information, we need to verify (manually, at least) that those entries will get thrown away.

::: security/manager/ssl/src/nsNSSCertificate.h
@@ +73,1 @@
>    } mCachedEVStatus;

Document that these values are written to disk and so their values should not be changed.
Attachment #8358561 - Flags: review?(brian) → review+
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #15)
> However, I am concerned that this may not be enough. I'm not convinced that
> this problem only occurs with cached entries, but that is the only case that
> this patch fixes. isExtendedValidation is using the LOCAL_ONLY flag to avoid
> OCSP fetching on all the cases, but due to limitations of the NSS classic
> validation API, the classic code cannot enforce LOCAL_ONLY. So, I think that
> we still should have a new flag to tell CertVerifier::VerifyCert to skip the
> DV fallback, to avoid the issue.

I'm actually not sure what you mean by this - wouldn't that mean VerifyCert wouldn't validate the certificate?

> We have a rule of sorts that after the user upgrades to Firefox version X,
> then if they downgrade to version X-1, then Firefox X-1 will work correctly
> with any data written to the user profile by version X. Obviously, Firefox X
> must be able to correctly consume data written by Firefox X-1.

As far as I can tell from manual testing, when going between versions, the deserialization fails, so the object effectively isn't in the cache, but it doesn't cause Firefox to crash or become unusable.

> What *should* happen is that the HTTP cache will see this failure and throw
> away the cached entry, loading the resource from the network. Is that what
> is happening?

I believe this is what I'm observing. It would be great to have some assistance (QA?) in verifying this, however.
Flags: needinfo?(brian)
(In reply to David Keeler (:keeler) from comment #16)
> I'm actually not sure what you mean by this - wouldn't that mean VerifyCert
> wouldn't validate the certificate?

Right. You could have VerifyCert return SECSuccess with the *evOidPolicy = SEC_OID_UNKNOWN if the flag is set.
Flags: needinfo?(brian)
How about this approach: have nsNSSCertificate only ever use the EV OID it receives when being created. I think this will have the same effect as adding a flag to VerifyCert that will only be used when asking about EV status. While there are other cases of certificate verification on the main thread (which is a problem, but a larger one that should be dealt with separately), I think this is the only one that actually races with a TLS connection and can cause spurious OCSP requests (unless I'm missing something).
Attachment #8360535 - Flags: review?(brian)
Comment on attachment 8360535 [details] [diff] [review]
only use mCachedEVStatus / mCachedEVOidTag

Review of attachment 8360535 [details] [diff] [review]:
-----------------------------------------------------------------

My concern with this approach is as follows:

1. When we do a full handshake, we will set the EV state for the nsNSSCertificate object as part of certificate verification in SSLServerCertVerification/nsNSSIOLayer/nsNSSCallbacks.
2. When we do a resumed handshake, we DO NOT set the EV state for the nsNSSCertificate object in SSLServerCertVerification/nsNSSIOLayer/nsNSSCallbacks.

Consequently, I suspect that this patch will cause the EV indicator to stop showing up for an EV site if we close the initial connection and then reload the page on a new, resumed, connection. True/false?

Now, one obvious solution to this problem would be to set the EV state for resumed connections in HandshakeCallback--i.e. basically call GetIsExtendedValidation pre-emptively in HandshakeCallback, so that it won't be calculated on the main thread. But, this would also require that the DV fallback be disabled, because we can't be doing OCSP on the socket transport thread either! (It's actually even worse than the main thread, as we've seen!)

One drawback of this patch is that it looks risky to uplift it. It seems like the change I mentioned in the preceding paragraph (actually, perhaps just the change in the second sentence of that paragraph) is necessary and sufficient to resolve this problem. I actually like the idea of this patch but I think it may be better put in a bug like bug 867475 or a new one blocking bug 775698, that we wouldn't pulift.
Flags: needinfo?(dkeeler)
Attached patch no dv fallback patch (obsolete) — Splinter Review
Ok - this makes sense to me now (at least, I think). Let me know if this is what you had in mind.
Attachment #8358561 - Attachment is obsolete: true
Attachment #8358563 - Attachment is obsolete: true
Attachment #8360535 - Attachment is obsolete: true
Attachment #8360535 - Flags: review?(brian)
Attachment #8360731 - Flags: review?(brian)
Flags: needinfo?(dkeeler)
Comment on attachment 8360731 [details] [diff] [review]
no dv fallback patch

Review of attachment 8360731 [details] [diff] [review]:
-----------------------------------------------------------------

We should add a test that nsIX509Cert::isExtendedValidation doesn't make OCSP requests for either EV certs nor DV certs.

Note that nsIX509Cert::isExtendedValidation will return false ("not EV") for an EV cert that hasn't been previously validated with OCSP fetching enabled, because there won't be any OCSP responses in the local OCSP cache for it to look at. This is expected, though unfortunate, behavior.

::: security/manager/ssl/src/CertVerifier.h
@@ +20,5 @@
>  public:
>    NS_INLINE_DECL_THREADSAFE_REFCOUNTING(CertVerifier)
>  
>    typedef unsigned int Flags;
> +  // XXX: The localonly flag is ignored in the classic verification case

s/The localonly flag/FLAG_LOCAL_ONLY/
Attachment #8360731 - Flags: review?(brian) → review+
Attached patch updated patch + test (obsolete) — Splinter Review
Attachment #8360731 - Attachment is obsolete: true
Attachment #8361428 - Flags: review?(brian)
Comment on attachment 8361428 [details] [diff] [review]
updated patch + test

Review of attachment 8361428 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/public/nsIX509CertDB.idl
@@ +327,5 @@
>     */
>    int32_t /*PRErrorCode*/
>      verifyCertNow(in nsIX509Cert aCert,
>                    in int64_t /*SECCertificateUsage*/ aUsage,
> +                  in uint32_t aFlags,

I guess we should update the UUID for this change.

::: security/manager/ssl/tests/unit/test_ev_certs.js
@@ +179,5 @@
> +// it hasn't already been verified (e.g. on the verification thread when
> +// connecting to a site.)
> +add_test(function() {
> +  clearOCSPCache();
> +  for (let nickname of ["ev-valid", "non-ev-root", "no-ocsp-url-cert"]) {

This style doesn't work great when there is a failure, because it is hard to tell which certificate failed to validate.

This style makes it easier to debug failures using the line numbers in the stack traces:

let testNoOCSPRequests = function() { /* body of your loop */ };
testNoOCSPRequests("ev-valid");
testNoOCSPRequests("non-ev-root");
testNoOCSPRequests("no-ocsp-url-cert");

@@ +187,5 @@
> +    let verifiedChain = {};
> +    let flags = Ci.nsIX509CertDB.FLAG_LOCAL_ONLY |
> +                Ci.nsIX509CertDB.FLAG_NO_DV_FALLBACK_FOR_EV;
> +    let error = certdb.verifyCertNow(cert, certificateUsageSSLServer, flags,
> +                                     verifiedChain, hasEVPolicy);

This is good to test.

Could we also test that nxIX509Cert.isExtendedValidation doesn't cause any OCSP requests?
Attachment #8361428 - Flags: review?(brian) → review+
(In reply to Brian Smith (:briansmith, :bsmith; NEEDINFO? for response) from comment #23)
> Review of attachment 8361428 [details] [diff] [review]:
> ::: security/manager/ssl/public/nsIX509CertDB.idl
> I guess we should update the UUID for this change.

Good catch - done.

> ::: security/manager/ssl/tests/unit/test_ev_certs.js
> This style doesn't work great when there is a failure, because it is hard to
> tell which certificate failed to validate.

Sounds good - fixed it.

> Could we also test that nxIX509Cert.isExtendedValidation doesn't cause any
> OCSP requests?

Yep.

Thanks for the reviews. Carrying over r+. Sent to inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b887641e3983
Attachment #8361428 - Attachment is obsolete: true
Attachment #8361840 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/b887641e3983
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment on attachment 8361840 [details] [diff] [review]
patch + test v1.1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 813418
User impact if declined: Jank (user interface freezes for up to a few seconds), poor performance.
Testing completed (on m-c, etc.): This landed on 2014-01-18 on mozilla-central; see comment 25. This has automated tests that will also be uplifted.
Risk to taking this patch (and alternatives if risky): If we keep the IDL/UUID changes, then there is a small risk that the UUID change may cause some problems. Otherwise, the patch is safe and we're probably more stable without the patch than with it.

String or IDL/UUID changes made by this patch: nsIX509CertDB was modified so it has a new UUID, and the signature to nsIX509CertDB::verifyCert was changed. However, it is possible for us to create a new version of the patch that doesn't have these IDL/UUID changes, if that is a concern.
Attachment #8361840 - Flags: approval-mozilla-aurora?
Attachment #8361840 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #27)
> https://hg.mozilla.org/releases/mozilla-aurora/rev/82bc631a9e53

Backed out in https://hg.mozilla.org/releases/mozilla-aurora/rev/095aa30cb6ef for XPCshell failures: https://tbpl.mozilla.org/php/getParsedLog.php?id=33766358&tree=Mozilla-Aurora

I was going to uplift the test fixes from bug 923304 as test-only, but it looks like those depend on changes from bug 950129, which is still pending approval.
Attached patch patch + test for aurora (obsolete) — Splinter Review
The original patch incidentally depended on test-only changes from bug 923304. I folded those changes in with it (the only differences should be in test_ev_certs.js and a minor rebasing change in CertVerifier.cpp).
https://tbpl.mozilla.org/?tree=Try&rev=576bf1bdf0b2
Attachment #8369110 - Flags: review?(brian)
Attachment #8369110 - Flags: review?(brian) → review+
Looks like the failures are opt-only. Debug builds are green.
Opt does things differently depending on if the testing EV root is present, so I had to fix that up.
Try looked good: https://tbpl.mozilla.org/?tree=Try&rev=4b384f558662
Checkin: https://hg.mozilla.org/releases/mozilla-aurora/rev/3d9d920ca43b
Attachment #8369110 - Attachment is obsolete: true
Attachment #8369301 - Flags: review+
Depends on: 995801
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: