Closed Bug 1530429 Opened 5 years ago Closed 5 years ago

SSL validation sometimes fails when it should pass for page loaded on startup (race?)

Categories

(Core :: Security: PSM, defect)

64 Branch
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: adamw, Unassigned)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0

Steps to reproduce:

An automated test I maintain enrols a client system as a member of a FreeIPA domain. This happens via the Cockpit web UI in a Firefox instance. Part of FreeIPA domain enrolment involves the addition of a new CA cert - the FreeIPA domain's own CA - to the system trust store, which is done during the enrolment, using the Fedora 'shared system certificate' mechanism (see https://fedoraproject.org/wiki/Features/SharedSystemCertificates - basically the CA cert file is dropped in a special location and 'update-ca-trust' is run, which among other things, updates the NSS database of trusted CA certs). That Firefox instance then exits. In a subsequent step, the test runs a new Firefox instance from a console like this:

startx /usr/bin/firefox -width 1024 -height 768 https://ipa001.domain.local

which should immediately load the FreeIPA webUI. That server uses an SSL cert issued by the FreeIPA domain CA, so it should be trusted.

Actual results:

Usually, this works fine, and the page loads correctly. However, about 30% of the time, we instead see a certificate error instead. The error is SEC_ERROR_UNKNOWN_ISSUER: "Peer's Certificate issuer is not recognized". So, I looked into this further: I had the test show the details of the error (the actual certificate chain it received from the server), and then go into the Firefox Certificate Manager and page through the entries. This showed that the expected certificate for the 'DOMAIN.LOCAL' CA is actually present in the list. I had the test export this certificate so we could look at it; manually comparing that certificate to the one shown as 'not recognized' in the SEC_ERROR_UNKNOWN_ISSUER error shows they are identical. Finally, I had the test try simply refreshing the page when it hits the certificate error...and lo and behold, on refresh the page loads fine.

I suspect there is a race here between load/refresh of the system trusted certificate store, and load of a page specified on the Firefox command line. If the latter happens fast, the certificate can be erroneously rejected in this way.

Here's a sample of the test in the version where it provides additional data on the certificate failure:

https://openqa.stg.fedoraproject.org/tests/484084
https://openqa.stg.fedoraproject.org/tests/484084#downloads

On the "Logs & Assets" page (the second link), 'freeipa_webui-test.crt' is the CA cert for the FreeIPA domain CA, as exported from the 'Certificate Manager' within Firefox. The screenshots https://openqa.stg.fedoraproject.org/tests/484084#step/freeipa_webui/14 and https://openqa.stg.fedoraproject.org/tests/484084#step/freeipa_webui/15 show the certificate chain that Firefox rejected; you can manually verify that the CA cert in the chain is identical to the one that we find in the Certificate Manager. 'freeipa_webui-certs' is the output of a further check I had the test run, where it uses openssl s_client to contact the same URL, and display and verify the certificate chain; it displays the same chain, and accepts it.

https://openqa.stg.fedoraproject.org/tests/484652 is an example of the test in the form where it retries when it hits the certificate error. You can see the certificate error at the seventh and eighth screenshots of the 'freeipa_webui' test module - https://openqa.stg.fedoraproject.org/tests/484652#step/freeipa_webui/7 and https://openqa.stg.fedoraproject.org/tests/484652#step/freeipa_webui/8 . The next screenshot - https://openqa.stg.fedoraproject.org/tests/484652#step/freeipa_webui/9 - is after the test then clicks the 'refresh' button: as you can see, the same URL loaded just fine. You can also see a video of this happening - https://openqa.stg.fedoraproject.org/tests/484652/file/video.ogv (skip to about 1:28 - note the video is faster than real time and skips some frames, it is constructed from screenshots).

Expected results:

The SSL cert should always be considered valid when I run the test 100 times, rather than failing to validate ~30% of the time.

To be clear, when I first started looking into this my assumption was that when the test failed there actually was a real certificate issue for some reason - I was figuring that maybe there was an issue with the validity period of the certificate, or somehow when we have multiple instances of this test running at once, wires are getting crossed and the client ends up loading the FreeIPA webUI from a different server to the one it enrolled against, or import of the CA cert was sometimes failing, or something. But all the checks I wound up doing seem to point the same way: there isn't a real certificate validity issue. The chain the server sends is one the client should consider valid. The openssl check confirms this, and so does the check where I actually have the test find and export the CA cert from the Firefox certificate manager. I've run through this many times now and haven't found a single instance where the certs don't actually all match up fine. The refresh test was the last thing I tried, to confirm my new theory that the chain really is always valid, and this is just some sort of timing / race issue in Firefox - and it does indeed seem to confirm it.

Component: Untriaged → Security
Product: Firefox → Core

I hope this is the correct component for nss/cert bugs. Please move if I'm wrong.

Component: Security → Security: PSM

Maybe running with the environment variable P11_KIT_DEBUG=all set would give some illustrative output? Also, are you certain the second instance of Firefox runs after the first one has shut down? (using -on-remote might help to ensure this)

Flags: needinfo?(adamw)

"Maybe running with the environment variable P11_KIT_DEBUG=all set would give some illustrative output?"

I'll try that, thanks.

"Also, are you certain the second instance of Firefox runs after the first one has shut down? (using -on-remote might help to ensure this)"

I'm pretty sure, yes. We run the browser the same way both times - as the sole app in an X session, using 'startx /usr/bin/firefox' - and explicitly quit with 'ctrl-q' at the end of what we're doing, and expect the test to return to the console after that (if it doesn't, the test will fail). I don't think the Firefox process could survive this, right?

I'm actually having some trouble reproducing the 'refresh fixes it' result ATM, note. On that first try, the refresh would have happened after about 30 seconds. I tried a test where the refresh would've happened almost immediately (as soon as the test recognized the 'certificate error' screen) and that failed (still showed certificate error). I tried with a 5 second delay and it still failed. Currently I'm trying to test a 15 second delay, don't have the result yet.

Flags: needinfo?(adamw)

Ah, just as I type that, we get a result where refresh after 15 seconds worked:

https://openqa.stg.fedoraproject.org/tests/484856#

I'll keep trying to see if that stays consistent, and I'll try the P11_KIT_DEBUG=all thing.

OK, here's a case with a 'fail, wait 15 seconds, refresh, success' pattern:

https://openqa.stg.fedoraproject.org/tests/485008

and I got the Firefox console output wiht P11_KIT_DEBUG=all set logged: https://openqa.stg.fedoraproject.org/tests/485008/file/freeipa_webui-ff.log

Unfortunately it isn't timestamped, though...

OK, now I tweaked the test to timestamp the log messages by piping them through awk:

https://openqa.stg.fedoraproject.org/tests/485364
https://openqa.stg.fedoraproject.org/tests/485364/file/freeipa_webui-ff.log

From the test execution log, Firefox is launched at 20:45:14.713 UTC , the certificate error screen is identified at 20:45:18.013 UTC , and the successful refresh happens at 20:45:33.104 UTC . Note the Firefox log timestamps are local time not UTC, that explains the 5 hour difference. This 16 second gap between things happening in the Firefox log is noticeable:

2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsInit: in: 18, (2) [ { CKA_VALUE = (958) NOT-PRINTED }, { CKA_CLASS = CKO_CERTIFICATE } ]
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsInit: out: 0x0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjects: in: 18, 1
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjects: out: 0x12, 0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsFinal: in
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsFinal: out: 0x0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_GetAttributeValue: in: 29, 174
2019-02-26 15:45:16: (p11-kit:3001) sys_C_GetAttributeValue: out: 0x0 (1) [ { CKA_NSS_MOZILLA_CA_POLICY = (1) NOT-PRINTED } ]
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsInit: in: 18, (2) [ { CKA_VALUE = (958) NOT-PRINTED }, { CKA_CLASS = CKO_CERTIFICATE } ]
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsInit: out: 0x0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjects: in: 18, 1
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjects: out: 0x12, 0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsFinal: in
2019-02-26 15:45:16: (p11-kit:3001) sys_C_FindObjectsFinal: out: 0x0
2019-02-26 15:45:16: (p11-kit:3001) sys_C_GetAttributeValue: in: 29, 174
2019-02-26 15:45:16: (p11-kit:3001) sys_C_GetAttributeValue: out: 0x0 (1) [ { CKA_NSS_MOZILLA_CA_POLICY = (1) NOT-PRINTED } ]
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjectsInit: in: 18, (4) [ { CKA_TOKEN = (1) "\x01" }, { CKA_CLASS = CKO_CERTIFICATE }, { CKA_ISSUER = (57) "071\x150\x13\x06\x03U\x04\n\x0c\x0cDOMAIN.LOCAL1\x1e0\x1c\x06\x03U\x04\x03\x0c\x15Certificate Authority" }, { CKA_SERIAL_NUMBER = (3) "\x02\x01\t" } ]
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjectsInit: out: 0x0
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjects: in: 18, 1
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjects: out: 0x12, 0
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjectsFinal: in
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjectsFinal: out: 0x0
2019-02-26 15:45:32: (p11-kit:3001) sys_C_FindObjectsInit: in: 18, (4) [ { CKA_TOKEN = (1) "\x01" }, { CKA_CLASS = CKO_CERTIFICATE }, { CKA_ISSUER = (57) "071\x150\x13\x06\x03U\x04\n\x0c\x0cDOMAIN.LOCAL1\x1e0\x1c\x06\x03U\x04\x03\x0c\x15Certificate Authority" }, { CKA_SERIAL_NUMBER = (1) "\t" } ]

and the thing that shows up at :45:32 does indeed look like the DOMAIN.LOCAL CA cert we're concerned with here. It seems like this activity at :32 happens just before the page refresh - i.e. it's not happening in response to the page refresh, more like the refresh has to happen after this mysterious delay, and our 15 second sleep is just long enough when combined with an extra second or two of time spent by openQA in matching screenshots and sending actions and stuff.

Can I have a look at the exact setup when this happens (e.g., which package versions are used for p11-kit and nss)? I actually see a similar issue locally, but it happens on very specific occasions.

I've seen it happen when testing F28 updates, F29 updates and Rawhide nightly composes. The package set for update tests is 'current stable packages from the release, plus whatever's in the update', and of course Rawhide changes regularly.

It looks like the 485364 test, for instance, had nss-3.41.0-3.fc28 and p11-kit-0.23.15-1.fc28.

p11-kit-0.23.15-1.fc28 has some problem so it would be nice if you could check if it still happens with p11-kit-0.23.15-2.fc28:
https://bodhi.fedoraproject.org/updates/FEDORA-2019-04bcb2f1b9

Other than that, do you have any PKCS#11 module installed (check out /usr/share/p11-kit/modules)?

So here's a fun thing: I put in a change that I thought would include that update in the test, ran 27 iterations of the test, and it didn't fail once.

...then I double checked, and realized that in fact the change was a no-op with the way I was restarting the tests. So somehow I just got 27 runs of the test without any intentional attempt to fix the bug...none of which hit the bug. Which is odd, I never saw that before.

I've got 9 more instances running right now, we'll see if any of those hit the bug. Then I'll try some runs with the different p11-kit really included (hopefully), though of course this level of variability will make it hard to be sure of a fix :/

There seem to be two p11-kit modules present, btw: opensc.module and p11-kit-trust.module.

Would a copy of the firefox p11-kit debug output from a case where the bug did not happen be useful?

ah, so the 29th iteration of the test finally did hit the bug :) Trying some more now.

In my experiments, it happens more if I drop the VM cache (echo 3 > /proc/sys/vm/drop_cache) before starting Firefox, while it happens less if I remove the opensc module.

OK, fortunately for my sanity, one of the first nine runs with 0.23.15-2.fc28 failed:

https://openqa.stg.fedoraproject.org/tests/485698#

I dumped the output of rpm -q p11-kit to verify the version, and p11-kit-0.23.15-2.fc28.x86_64 was definitely installed right when the test went to run Firefox. So, that build definitely doesn't fix this.

(In reply to Adam Williamson from comment #14)

So, that build definitely doesn't fix this.

That is good for isolating the possible causes :-) At this point I still can't tell whether the p11-kit-trust.so is malfunctioning or the caller is trying to access other tokens and failing. Would you also like to check if removing opensc package affects the results?

Sorry for the delay, just got back to this. Trying with a removal of opensc now.

So the very first run with opensc removed hit the bug:

https://openqa.stg.fedoraproject.org/tests/488207

log tarballs etc at:

https://openqa.stg.fedoraproject.org/tests/488207#downloads

don't worry about how the test subsequently fails, that's just because I didn't think through all the consequences of the way I implemented the opensc removal :)

You might get more informative output if you also set the environment variable MOZ_LOG="pipnss:4,certverifier:5"

Flags: needinfo?(adamw)

Thanks Dana, will try it that way and report back.

Flags: needinfo?(adamw)

Here's the Firefox output from a test that hit the bug with those vars added. Note, opensc plugin is present this time - since removing it didn't seem to help, I took the code for removing it out of the test again.

Any progress on this?

Flags: needinfo?(adamw)

I was waiting on you, since I posted the requested output :)

Flags: needinfo?(adamw)

Fair enough. Unfortunately I'm not seeing the cause of the issue from those logs. Can you try again with this build: https://queue.taskcluster.net/v1/task/NwNHtASbQZa1ljUQ8IR9Vg/runs/1/artifacts/public/build/target.tar.bz2
Thanks!

Flags: needinfo?(adamw)
Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE

Sorry I didn't test with that build - we got busy with the F30 release rush, then I went on holiday.

For the record, it seems this may have got fixed: dueno just asked me to check if it has happened recently, and it seems like the answer is "no". The most recent occurrences I can find were on 2019-03-20 and 2019-03-22; the issue does not seem to have happened one time since then.

Flags: needinfo?(adamw)

Ok - feel free to reopen if you see this again.

Resolution: INCOMPLETE → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: