Closed Bug 1056341 Opened 10 years ago Closed 6 years ago

mozilla::pkix spends too much time attempting to build a valid path when there are many possible paths

Categories

(Core :: Security: PSM, defect, P1)

Version: 32 Branch
Hardware: All
OS: Other
Type: defect

Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: miken32, Assigned: keeler)

References

(Blocks 1 open bug)

Details

(Whiteboard: [psm-assigned])

Attachments

(5 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2

Steps to reproduce:

Attempt to load web interface of a pfSense router using default (self-signed) HTTPS certificate


Actual results:

Page hangs on loading, using 100% CPU, process has to be terminated


Expected results:

Certificate warning
I don't have an installation open to the internet for testing against, but if this isn't a known issue and testing needs to be done, I can set one up. Setting security.use_mozillapkix_verification to false allows the page to load successfully (with a security warning).

See also https://forum.pfsense.org/index.php?topic=79656
Can you please attach the pfSense certificates being used? (Email is OK if you want to keep them private.)
Flags: needinfo?(miken32)
Having a public test installation would also be useful, if possible.
Here is the default certificate; I will work on getting a test box with public internet access tomorrow.
Flags: needinfo?(miken32)
The cert contains CA:True, which means we should talk to pfSense to modify the way they generate their cert.
(In reply to Camilo Viecco (:cviecco) from comment #6)
> The cert contains CA:True, which means we should talk to pfSense to modify
> the way they generate their cert.

why is the cert verification spinning rather than just failing?
Test box is now online at https://184.66.147.162/
Hi Michael - thanks for setting that up. However, I can't connect to that IP (using firefox or otherwise). Is there a firewall that needs to open a port or something?
Flags: needinfo?(miken32)
Allow all rule has been added to the WAN port, sorry about that.
Flags: needinfo?(miken32)
Michael, I can connect now (using Firefox 32), but all I see is the untrusted connection page. After adding an override, I can visit the site. Is this still an issue?
Flags: needinfo?(miken32)
Yes, still experiencing the problem on Firefox 33 beta 1, starting with all addons disabled. I will try with a fresh profile to confirm it's still happening in that case as well.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:33.0) Gecko/20100101 Firefox/33.0
Flags: needinfo?(miken32)
Ok I have an update on this (and apologies for not testing this before filing!) The problem seems to be happening because I already had accepted the certificate previously. Once I deleted that certificate from my database I was able to access the page, and accept the security certificate. The problem doesn't recur, though the certificate is back in the database.

I have tried creating a new profile in FF 23 and 18, accepting the certificate, and then loading it up in FF 33, but cannot reproduce the problem. However by copying over my existing cert8.db file to a new profile I can reproduce the problem.

I'd be inclined to close it as WORKSFORME and write it off as a corrupt file, except for the fact that it seems to be affecting lots of other pfSense users. I don't want to attach my cert8.db to the report, but please confirm I can email it to dkeeler for testing and I will do so.
Flags: needinfo?(dkeeler)
You can email your cert8.db to me if you feel comfortable with that. My understanding is it will have a list of every intermediate certificate you've encountered while browsing as well as root CAs you've added manually. It will also have the public half of any client certificates you have (the private keys should all be in key3.db, so don't send that, of course).
Flags: needinfo?(dkeeler)
Email has been sent, thanks.
I think I figured out why that profile takes forever to verify the certificate sent by that server. The server sends a certificate with issuer "E=Email Address,CN="Common Name (eg, YOUR name)",OU="Organizational Unit Name (eg, section)",O=CompanyName,L=Somecity,ST=Somewhere,C=US". There are 16 different certificates in that profile with that same issuer/subject. Since each of these certificates could have in theory issued the others, and since (it looks like) signature verification happens starting from the root backwards, mozilla::pkix basically explores something like every possible ordering of every 7-element subset of the 16 certificates, which is 7! * 16 choose 7 = 57657600. In other words, it's trying to do around 57 million signature verifications, which understandably takes some time. If we checked signatures the other way, I think we would prevent this, but I have an inkling it's not safe to do that for some reason (although I'm not sure why, since intuitively if every signature is checked, it ends up being the same either way).
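As a sanity check on the arithmetic above, the count can be reproduced directly (this is just the combinatorics; it is not code from mozilla::pkix):

```python
import math

# Ordered 7-element chains drawn from 16 interchangeable certificates:
# choose 7 of the 16, then try every ordering of the chosen subset.
chains = math.factorial(7) * math.comb(16, 7)
print(chains)  # 57657600
```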
Brian, unless I'm misunderstanding pkixbuild.cpp, it seems like building a chain without first verifying signatures like we're doing is at least inefficient, if not just wrong. In particular, things like comment 16 happen. Also, pkixbuild.cpp will call trustDomain.IsChainValid on a chain where the signatures haven't actually been validated (although it will validate them before returning an ultimate success). Can you shed some light on this? Thanks.
Flags: needinfo?(brian)
(In reply to David Keeler (:keeler) [use needinfo?] from comment #17)
> Brian, unless I'm misunderstanding pkixbuild.cpp, it seems like building a
> chain without first verifying signatures like we're doing is at least
> inefficient, if not just wrong.

The signature verification is the most expensive check, so we try to do it after we've verified everything else is OK. Consider:

     - B <- C <- D
    /
A <--- E <- F <- G
    \
     - H <- I <- J


Let's say that only J is a trusted CA. If you do the signature verification on the way up the tree, then you will do 9 signature verifications. If you do signature verification on the way down the tree, then you will only do 3. 3 < 9.
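The cost difference can be illustrated with a toy model of the tree above; the structure and function names here are hypothetical, not mozilla::pkix code:

```python
# Toy model of the diagram: A's candidate issuers are B, E, and H,
# each heading a chain of length 3; only J is a trust anchor.
issuers = {"A": ["B", "E", "H"], "B": ["C"], "C": ["D"],
           "E": ["F"], "F": ["G"], "H": ["I"], "I": ["J"]}
trusted = {"J"}

def eager(cert, verifications=0):
    # Verify every signature while exploring toward the root.
    for issuer in issuers.get(cert, []):
        verifications += 1  # one signature check per edge explored
        if issuer in trusted:
            return True, verifications
        found, verifications = eager(issuer, verifications)
        if found:
            return True, verifications
    return False, verifications

def lazy(cert, depth=0):
    # Chain by name first; count only the signatures on the chain
    # that actually reaches a trust anchor.
    for issuer in issuers.get(cert, []):
        if issuer in trusted:
            return True, depth + 1
        found, sigs = lazy(issuer, depth + 1)
        if found:
            return True, sigs
    return False, 0

print(eager("A"))  # (True, 9): all three branches fully verified
print(lazy("A"))   # (True, 3): only the successful A <- H <- I <- J chain
```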

> In particular, things like comment 16
> happen.

> Also, pkixbuild.cpp will call trustDomain.IsChainValid on a chain
> where the signatures haven't actually been validated (although it will
> validate them before returning an ultimate success).

Again, this is a good thing, right? Because if you reject a key as "not pinned" then you're trading an expensive signature verification for a cheap hash comparison.

Consider this:

A <- A <- A <- A <- A <- A <- A <- A

You have 8 certificates, all with the same subject name. Further, only one of them (the last one) is a trust anchor. If you try really hard, it is possible to put the certificates in the right order so that you get a valid chain. But, trying really hard takes a lot of time, as you saw.

Obviously, it isn't good to have an O(n!) algorithm. But, in fact, there really are O(n!) possible chains sometimes, and sometimes one (and only one) is a valid chain. So, if we want to guarantee that you spend less than O(n!) time in the worst case, we'd need some kind of heuristic.

Anyway, I agree this is a bug. We shouldn't make the normal, good case of a certificate that chains to a trusted CA significantly slower in favor of more quickly handling self-signed certificates, but we should find a way to better handle this case. Not sure it should be a high priority, though.
Flags: needinfo?(brian)
See the discussion here:
https://forum.pfsense.org/index.php?topic=82295.0
OS: Mac OS X → Other
Hardware: x86 → All
We at pfSense have committed changes for the next pending version that will generate new certificates in a more correct and unique way, so they will work better for new deployments or for those who manually regenerate the self-signed certificate (no CA flag, server set, a unique identifier in the CN field generated using the configured hostname, and so on).

However, that won't help the very large number of installations already out there with the less-than-ideal certificates that overlap and tickle this bug (well over 200,000), not to mention other products and sites that may have similar dodgy self-signed certificates. Now that the ability to disable PKIX has been stripped out of Firefox, we have to tell people hitting this bug to use other browsers, such as Chrome, that don't have issues processing the certificates aside from the usual warnings. At least in the current version of Firefox the CPU usage dies down once the offending tab is closed.

I feel this should be given some priority if at all possible.
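As a rough illustration, the certificate shape described above corresponds to an OpenSSL x509v3 extension section along these lines; the section name and hostname are placeholders, and pfSense's actual generation code may differ:

```
[ v3_server ]
basicConstraints = critical, CA:FALSE
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = DNS:configured-hostname.example
```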
Status: UNCONFIRMED → NEW
Ever confirmed: true
There are test services in bug 1147544, which David has marked as a duplicate of this one.
(I'm making a comment regarding the case in duplicate bug 1147544, I haven't checked if it's exactly the same scenario.)

After a discussion amongst several NSS developers, the recommendation is to stop searching for other issuer certificates if the normalized subject and issuer names of the certificate match.

In all non-pathological scenarios, this should sufficiently indicate that it's a self-signed certificate, and searching for potential other issuers is unnecessary.

If you would like to be extra careful, then in addition to checking for identical subject and issuer names, you could also check that the signature contained in certificate A was indeed created using the key contained in certificate A, and if it was, assume it's fine to stop searching for other issuers.

It was mentioned that there might be other scenarios, but they are seen as impractical for the purposes of Internet PKI, and Firefox shouldn't need to support them.
There are multiple bugs here:

1. When Gecko stores certificate error overrides for a hostname, it stores the certificate for which it added the certificate error override, but it never cleans up any old certificates that were stored for previous certificate error overrides.

2. In general, PKIX path building is a combinatorial problem. Some heuristic is needed to reduce it to a problem with a lower performance upper bound. mozilla::pkix's heuristic is to limit the cert chain length, but this is not sufficient in some situations.

Because of #1, when cert error overrides are used for self-signed/issued certificates, #2 is triggered more commonly. However, self-issued certificates are not the only thing that can cause a large number of possible paths to be traversed. The proposed solutions mentioned in comment 23 would wallpaper over #2 for this most-common case, without actually solving the general problem. That doesn't seem like a good approach. Instead, we should find a solution that works for all cases of #2, that doesn't cause false positive failures for real-world use cases. Additionally, another bug for #1 should be filed, and that bug should be fixed.

(In reply to Kai Engert (:kaie) from comment #23)
> After a discussion amonst several NSS developers, the recommendation is to
> stop searching for other issuer certificates if the normalized subject and
> issuer names of the certificate match.

This would effectively remove the ability for any self-issued certificates to work in Gecko. That seems likely to cause real-world compatibility problems.

> If you would like to be extra careful, then in addition for checking for
> identical subject and issuer names, you could also check that the signature
> contained in certificate A was indeed created using the key contained in the
> certificate A, and if it was was, assume it's fine to stop searching for
> other issues.

This is better, but it will not be compatible with (hopefully) upcoming changes to how signature verification is done in mozilla::pkix, so I'd like to avoid it.

Keep in mind that an attacker that is interested in forcing this DoS situation could do so by using self-issued certificates that aren't self-signed, or by sending us certificates where none are self-issued. We need a solution that prevents this DoS attack.

A better solution would be to limit the total number of calls to IssuerChecker::Check() per invocation of BuildCertChain to a certain value. That would be analogous to libpkix's "if (numIterations++ > 250) PKIX_ERROR(PKIX_TIMECONSUMEDEXCEEDSRESOURCELIMITS);" check. See 597618 comment 2 and 597618 comment 7. Considering that the maximum cert chain length in mozilla::pkix is 8 (1 + MAX_SUBCA_COUNT + 1), 250 should be more than enough, and our experience with libpkix shows that 250 seems to be web compatible.
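A minimal sketch of the proposed mitigation, with a counter threaded through the recursion (the names and the Python rendering are illustrative; mozilla::pkix is C++ and its actual API differs):

```python
# Sketch: decrement a shared budget on every issuer check and abort the
# search when it is exhausted, analogous to libpkix's numIterations cap.
class BudgetExceeded(Exception):
    pass

def build_chain(cert, issuers_of, trusted, budget, depth=0, max_depth=8):
    if depth >= max_depth:  # mozilla::pkix caps chain length at 8
        return None
    for candidate in issuers_of(cert):
        budget[0] -= 1      # one IssuerChecker::Check()-style call
        if budget[0] < 0:
            raise BudgetExceeded("treat as ERROR_UNKNOWN_ISSUER")
        if candidate in trusted:
            return [cert, candidate]
        chain = build_chain(candidate, issuers_of, trusted, budget,
                            depth + 1, max_depth)
        if chain is not None:
            return [cert] + chain
    return None

# 16 interchangeable self-issued certs: every cert is a candidate issuer
# for every other, so the search space explodes without a cap.
certs = ["cert%d" % i for i in range(16)]
budget = [250]
try:
    build_chain("cert0", lambda c: [x for x in certs if x != c],
                trusted=set(), budget=budget)
except BudgetExceeded as exc:
    print("search stopped:", exc)
```

With no trust anchor among the 16, the uncapped search would explore a tree of roughly 15^8 nodes; the budget turns that into a fast, bounded failure.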
Summary: pfSense interface hangs with mozilla::pkix enabled → mozilla::pkix spends too much time attempting to build a valid path when there are many possible paths
CC'ing Wan-Teh since he fixed the same (or very similar, at least) bug in libpkix in bug 597618. 

The comment references above should be "bug 597618 comment 2" and "bug 597618 comment 7."
David, in the bug you just dup'd to this one, you said that this is an architectural issue that is unlikely to change. However, above, I mentioned a very simple workaround that is pretty easy to implement and which probably won't cause much harm.
Oh - for some reason I missed that the first time around. That sounds like a reasonable solution.
OK, so I can confirm that the problem arises with other self-signed certificates; see: https://bugzilla.mozilla.org/show_bug.cgi?id=1240548

Too bad the title of the bug didn't let me find it when I was searching for an existing bug before opening that one. And actually, I don't think anybody without a deep understanding of Mozilla internals would search for "mozilla::pkix" and "path", but rather for terms like "self-signed certificate", "hang / stall", etc.
Whiteboard: [psm-backlog]
Me too!

This bug happens very often here because I test an application that has such certificates.
Every time it happens, I know I have to open Chromium to access the application. With Firefox it is just impossible.
Rather than just "me too!", can you attach a certificate chain that demonstrates this problem? Can you provide details about the product or CA that generates it? The most useful path to finding a good resolution is to include additional debug and diagnostic data about the problem, describing with as much detail as you can the environment that tickles it.

To date, this seems to be exceptional and only when some PKI products have taken long-known suboptimal paths, but if there is room for improvement, the NSS team is happy to explore.
Ryan,

I could attach, but it's very simple to reproduce:

- Generate a self-signed certificate for a single hostname (like: webserver1, no full FQDN) and use it with a webserver. Point the browser to the webserver.
- Close the browser
- Generate another certificate with the same data, same hostname (webserver1), put it on the webserver, and point the browser to the webserver.
- Do this until the browser starts to hang when you try to load that website (I think 5-10 times would be enough).
That's not sufficient, nor does it reproduce. More likely, you're doing something like generating a self-signed cert with the same distinguished name and extensions, but a new key, but that's precisely why I pointed out the need for more info.

Good bugs get fixed.
(In reply to Claudiu N. CISMARU from comment #34)
> I could attach, but it's very simple to reproduce:
> 
> - Generate a self signed single hostname certificate (like: webserver1, no
> full FQDN) and used it with a webserver. Point the browser to the webserver.
> - Close the browser
> - Generate another certificate, with the same data, same hostname:
> webserver1, put in on the webserver, point the browser to the webserver.
> - Do this until the browser starts to hang when you try to load that website
> (I think 5-10 times would be enough).


Can you clarify how you generate the certificate?  I see you specify "self-signed", but there are several different ways to generate a certificate.  Alternatively, can you attach two certificates generated using the method you describe?  Please do not include the private keys, just the certificates.
Uploaded 2 certs from 2 different devices. They are generated during the installation process.
These examples are very helpful. Each of these is a self-signed CA certificate, and each uses the exact same Distinguished Name. They have different SPKIs and SKIDs.

Are you importing all of these (one per device) as a trusted CA into Firefox?
Yes. Usually I'm adding exceptions, as these are testing devices which I use on a daily basis.

However, I thought you understood where the problem is; that's why I never bothered to add the certs here.
I think there's no need for further confirmation that the bug exists; this was confirmed more than 2 years ago. Given that nothing has happened since then, I'm not holding my breath for a fix. Clear out your old certificates and ensure that your certs have unique distinguished names and you should be fine.
This is still broken in Firefox 52.0 ESR.

Is there a reason why previous certificates for the same server are kept in cert8.db when a new certificate exception is stored? Shouldn't the new certificate replace the old one? In Firefox's Certificate Manager there is only one entry for each server, even when I've stored multiple exceptions for the same server, as I would have expected. In cert8.db, on the other hand, all previous certificates are still in place.

I'm testing a device with a self-signed certificate that is regenerated every time the configuration of the device is wiped. As a result, I have dozens of certificates with the same name in cert8.db. Even when deleting the exception from Certificate Manager, certificates remain stored in cert8.db.

This workaround from Red Hat Bugzilla fixes the problem until the certificates pile up again:

https://bugzilla.redhat.com/show_bug.cgi?id=1204670#c34
> certutil -d dbm:. -D -n localhost.localdomain
> multiple times; it will delete one cert with each execution.
> You can run the -L command again to check how many you have left.
> When you have deleted all of them, try firefox again, and it should be quick again.
Still broken in 55.0b2. I was just bitten badly by this (came here via bug 1240548).
I use a VPN connection into the office. It died while I was in the process of logging in to a system at work. I restarted the VPN connection but Firefox didn't get anywhere. Restarting Firefox didn't help - it hung hard in the TLS handshake (displaying that in the status bar is nice!)
I deleted all certificates that were related to that system, restarted Firefox and I could log in normally after going through the usual security dialog.
Priority: -- → P3
Assignee: nobody → dkeeler
Priority: P3 → P1
Whiteboard: [psm-backlog] → [psm-assigned]
Comment on attachment 8962915 [details]
bug 1056341 - introduce a budget for path searching in mozilla::pkix to avoid unbounded search

https://reviewboard.mozilla.org/r/231722/#review237420

Patch looks good. Only a couple minor things.

::: security/pkix/lib/pkixbuild.cpp:201
(Diff revision 1)
>      return RecordResult(rv, keepGoing);
>    }
>  
> +  // If we've ran out of budget, stop searching.
> +  if (budget == 0) {
> +    Result savedRv = RecordResult(Result::ERROR_UNKNOWN_ISSUER, keepGoing);

I think we should introduce a new error code. We can treat it as an `UNKNOWN_ISSUER` on the front-end to begin with. But it would make adding a different error message easier and we'd get telemetry on how widespread the issue is (also see the budget pref comment).

::: security/pkix/lib/pkixbuild.cpp:410
(Diff revision 1)
> +  // when exhausted, terminates the search with the result
> +  // Result::ERROR_UNKNOWN_ISSUER. The current value appears to be a good
> +  // balance between finding a path when one exists (when the space isn't too
> +  // large) and timing out quickly enough when the space is too large or there
> +  // is no valid path to a trust anchor.
> +  unsigned int budget = 200000;

Should this maybe be a pref so we can experiment with the exact value?

::: security/pkix/lib/pkixbuild.cpp:410
(Diff revision 1)
> +  // when exhausted, terminates the search with the result
> +  // Result::ERROR_UNKNOWN_ISSUER. The current value appears to be a good
> +  // balance between finding a path when one exists (when the space isn't too
> +  // large) and timing out quickly enough when the space is too large or there
> +  // is no valid path to a trust anchor.
> +  unsigned int budget = 200000;

Maybe we should make the name a little more specific? `budget` for what?

::: security/pkix/test/gtest/pkixbuild_tests.cpp:842
(Diff revision 1)
> +  Result IsChainValid(const DERArray&, Time, const CertPolicyId&) override
> +  {
> +    return Success;
> +  }
> +
> +  std::vector<ByteString> certs;

I'm a fan of `mMemberVariable` or `my_member_` names. That doesn't seem to be the style in this file though.
Attachment #8962915 - Flags: review?(franziskuskiefer)
Comment on attachment 8962915 [details]
bug 1056341 - introduce a budget for path searching in mozilla::pkix to avoid unbounded search

https://reviewboard.mozilla.org/r/231722/#review237420

Agreed.

> I think we should introduce a new error code. We can treat it as an `UNKNOWN_ISSUER` on the front-end to begin with. But it would make adding a different error message easier and we'd get telemetry on how widespread the issue is (also see the budget pref comment).

I think this is also a good idea, but: Will we want to uplift this patch at all? If so, then let's do the new error code + telemetry in a follow-on.

> Should this maybe be a pref so we can experiment with the exact value?

FWIW I'd be fine with this being a follow-on, since prefs here are so awful.

> Maybe we should make the name a little more specific? `budget`for what?

`searchIterationBudget` maybe?
Comment on attachment 8962915 [details]
bug 1056341 - introduce a budget for path searching in mozilla::pkix to avoid unbounded search

https://reviewboard.mozilla.org/r/231722/#review237420

> I think this is also a good idea, but: Will we want to uplift this patch at all? If so, then let's do the new error code + telemetry in a follow-on.

A new error code would be good to do. Bug 1448787 is still in flight and would conflict, though, so I filed bug 1449668 to add it so we can do these in parallel. FWIW I don't think we need to uplift this - this is a long-standing issue.

> FWIW I'd be fine with this being a follow-on, since prefs here are so awful.

I don't think adding a pref would be worth the work.

> `searchIterationBudget` maybe?

I called it `buildForwardCallBudget` since that's what it is right now.

> I'm a fan of `mMemberVariable` or `my_member_` names. That doesn't seem to be the style in this file though.

Yeah I just stayed with the existing style here.
Comment on attachment 8962915 [details]
bug 1056341 - introduce a budget for path searching in mozilla::pkix to avoid unbounded search

https://reviewboard.mozilla.org/r/231722/#review237682
Attachment #8962915 - Flags: review?(jjones) → review+
Comment on attachment 8962915 [details]
bug 1056341 - introduce a budget for path searching in mozilla::pkix to avoid unbounded search

https://reviewboard.mozilla.org/r/231722/#review237808

Thanks!
Attachment #8962915 - Flags: review?(franziskuskiefer) → review+
Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9dcd9f186bbf
introduce a budget for path searching in mozilla::pkix to avoid unbounded search r=fkiefer,jcj
Thank you for the effort; however, I'm not sure if adding a budget will solve the issue; won't it simply prevent access to some sites if the certificate takes too long to find?
To give some metrics: having 9 self-signed certificates for the same site is slow, and 10 makes the site unreachable (Firefox 58.0.2). Will this patch make Firefox fail faster, or find the certificate faster? Removing 3 certificates makes it nearly instantaneous again. (I have so many self-signed certificates for the same system because I'm destroying and recreating VMs with a webapp reachable only via HTTPS.)

In the code that is patched, I see "// TODO(perf): This probably can and should be optimized in some way." I'm not sure of the actual implementation details, but the way it gets slower with the number of certificates leads me to believe the cost is at least polynomial - but I might be completely wrong.

Thank you again
https://hg.mozilla.org/mozilla-central/rev/9dcd9f186bbf
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Yes, this should make path searching quit before exploring all possible paths for 10 certificates with the same issuer/subject DN but different SPKIs. For now you'll get the error "unknown issuer" which is presumably what you're getting anyway, so you should be able to add an override and access the site (please try it out and let me know if it's not working for you). The perf TODO you noted is unrelated.

The work done in this bug addresses part 2 of comment 24, but we should probably still address part 1 somehow (i.e. make it so profiles never get in the state of having a bunch of these certificates that create a pessimal search space). Note that simply removing replaced certificate entries when a new override gets added won't work for the case where e.g. a user is visiting ip-A, adding an override, then visiting ip-B, adding an override... (I don't know how common that is, but I'm assuming one use-case is visiting a bunch of network devices that all have different IP addresses, in which case previous overrides are never overwritten, so none of the certificates will be removed.)
For people still having the problem with FF60, you can still do:

cd  ~/.mozilla/firefox/xxxx.default/
certutil -d sql:. -L | grep hostname-of-problematic-server
[you should see several entries for it]
certutil -d sql:. -D -n hostname-of-problematic-server

It will solve the error for a while, until you need to do it again.

Also, an example of a problematic self-signed certificate can be created with:

openssl req -new -x509 -newkey rsa:2048 -subj "/C=FR/ST=France/L=Paris/CN=server.rudder.local/emailAddress=root@server.rudder.local/" -keyout /opt/.../xxx.key -out /.../xxx.crt -days 1460 -nodes -sha256

Create it on a dozen different test boxes, go to each of them, and you will see the problem.
Tested with Firefox 61, this is *not* fixed. It makes the corresponding site unusable, with each request (including all the small AJAX requests) taking tens of seconds.

And it happens with a low number of certs:

% cd ~/.mozilla/firefox/xxxxxx.default
% certutil -d sql:. -L | grep server.xxx.local
server.xxx.local                                          ,,
server.xxx.local                                          ,,
server.xxx.local                                          ,,
server.xxx.local                                          ,,
server.xxx.local                                          ,,
server.xxx.local                                          ,,
server.xxx.local


Here, access to the corresponding site is almost impossible - extremely slow and unusable. Doing this 7 times:

% certutil -d sql:. -D -n server.xxx.local

And now, everything is fast again.

Should I create a new ticket to track that? Or should this one be reopened?
Yes, please open a new bug.
