Closed Bug 1273678 Opened 4 years ago Closed 3 years ago

AddressSanitizer: heap-use-after-free on address in nssPKIObject_AddRef

Categories

(NSS :: Libraries, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(firefox49 wontfix, firefox-esr45- wontfix, firefox50+ wontfix, firefox51+ wontfix, firefox52+ wontfix, firefox-esr52 wontfix, firefox53+ wontfix, firefox54+ wontfix, firefox55+ fixed)

RESOLVED FIXED
Tracking Status
firefox49 --- wontfix
firefox-esr45 - wontfix
firefox50 + wontfix
firefox51 + wontfix
firefox52 + wontfix
firefox-esr52 --- wontfix
firefox53 + wontfix
firefox54 + wontfix
firefox55 + fixed

People

(Reporter: bc, Assigned: ttaubert)

References

(Blocks 1 open bug, )

Details

(Keywords: csectype-uaf, sec-moderate, Whiteboard: [adv-main55+][post-critsmash-triage])

Attachments

(4 files, 2 obsolete files)

Attached file log + asan messages
1. http://www.musica-classica.it/forum/index.php?/topic/3736-il-caso-luchesi/&page=5 with an asan linux x86_64 build.

2. ==9341==ERROR: AddressSanitizer: heap-use-after-free on address 0x61d0010a9110 at pc 0x7f02b3b3f2a8 bp 0x7ffc8fcfa3a0 sp 0x7ffc8fcfa398
WRITE of size 4 at 0x61d0010a9110 thread T0
    #0 0x7f02b3b3f2a7 in nssPKIObject_AddRef /builds/slave/m-cen-l64-asan-ntly-0000000000/build/src/security/nss/lib/pki/pkibase.c:156

on rhel6 x86_64.

I have been unable to reproduce this issue on rhel6 or fedora24 but considered it important enough to file anyway. This may have been due to an advertisement. I'll keep trying to reproduce as time permits.
I'm not sure if this error is really in NSS. It looks like an NSS shutdown issue to me. David could you have a quick look?
Flags: needinfo?(dkeeler)
I wrote a small test client that simulates what those stack traces seem to indicate is going on (a TLS connection is happening while the platform is unloading the pkcs11 module containing the default root certificates), and it appears that NSS is not robust against this. I'm not sure it's intended to be (sometimes NSS acquires locks when working with these objects, but sometimes it doesn't appear to).

If this is a use-case that NSS wants to support, the lack of locking should be fixed. In the meantime, Firefox probably shouldn't unload modules while any networking is happening.
Flags: needinfo?(dkeeler)
I'll mark this sec-high because it sounds like a shutdown ordering kind of issue, which will be more difficult to exploit.
Keywords: sec-high
Thanks for your assessment David. Can you attach your test code?
It sounds to me like NSS should lock here unless Wan-Teh has a reason why we don't.
Flags: needinfo?(wtc)
Flags: needinfo?(dkeeler)
Attached file nsslclient.tar.gz (obsolete) —
Here's what I came up with. Let me know if the steps to reproduce aren't clear enough (see README).
Flags: needinfo?(dkeeler)
David, can you look at this again? This sec-high is just kind of hanging out, getting old but no less bad.
Flags: needinfo?(dkeeler)
This would be best suited to someone who understands how NSS is supposed to handle this situation (see comment 2). Maybe Tim?
Flags: needinfo?(dkeeler) → needinfo?(ttaubert)
I'm trying to understand what's happening here but Franziskus and I have trouble reproducing this with the script you provided. Does that still work for you?
Flags: needinfo?(ttaubert) → needinfo?(dkeeler)
Attached file nsslclient.c (obsolete) —
It took a bit of fiddling around, but yes, this still reproduces for me (this version of the testcase spins off two different threads that connect to the server, which seems to make it reproduce faster).

Things to try:

* nss/nspr compiled with USE_ASAN? (I eventually just grabbed a build from https://treeherder.mozilla.org/#/jobs?repo=nss )
* is a TLS server running on port 8000?
* is LD_LIBRARY_PATH correct when running ./nsslclient?

With all that, it still takes about 1-3 minutes to reproduce. Unfortunately, the error message isn't very elucidating since asan appears to be encountering an error of its own:

==24541==ERROR: AddressSanitizer: heap-use-after-free on address 0x61d000323cc0 at pc 0x7f688e8df733 bp 0x7f68884374f0 sp 0x7f68884374e8
WRITE of size 4 at 0x61d000323cc0 thread T2
    #0 0x7f688e8df732 in nssToken_AddRef /home/worker/nss/lib/dev/devtoken.c:62:5
    #1 0x7f688e8df462 in nssSlot_GetToken /home/worker/nss/lib/dev/devslot.c:221:16
    #2 0x7f688e8c4f5d in nssTrustDomain_FindTrustForCertificate /home/worker/nss/lib/pki/trustdomain.c:1075:20
    #3 0x7f688e8dd92b in fill_CERTCertificateFields /home/worker/nss/lib/pki/pki3hack.c:798:17
    #4 0x7f688e8d9ad2 in stan_GetCERTCertificate /home/worker/nss/lib/pki/pki3hack.c:896:9
    #5 0x7f688e8d9c56 in STAN_GetCERTCertificate /home/worker/nss/lib/pki/pki3hack.c:929:12
    #6 0x7f688e8b1a7d in CERT_NewTempCertificate /home/worker/nss/lib/certdb/stanpcertdb.c:393:10
    #7 0x7f688de550e6 in ssl3_CompleteHandleCertificate /home/worker/nss/lib/ssl/ssl3con.c:10594:24
    #8 0x7f688de7dfe3 in ssl3_HandleCertificate /home/worker/nss/lib/ssl/ssl3con.c:10507:12
    #9 0x7f688de61ba0 in ssl3_HandlePostHelloHandshakeMessage /home/worker/nss/lib/ssl/ssl3con.c:11741:18
    #10 0x7f688de5cc7a in ssl3_HandleHandshakeMessage /home/worker/nss/lib/ssl/ssl3con.c:11688:22
    #11 0x7f688de698fa in ssl3_HandleHandshake /home/worker/nss/lib/ssl/ssl3con.c:11869:18
    #12 0x7f688de64d47 in ssl3_HandleRecord /home/worker/nss/lib/ssl/ssl3con.c:12632:22
    #13 0x7f688de8ddcb in ssl3_GatherCompleteHandshake /home/worker/nss/lib/ssl/ssl3gthr.c:474:22
    #14 0x7f688de93567 in ssl_GatherRecord1stHandshake /home/worker/nss/lib/ssl/sslcon.c:78:10
    #15 0x7f688deb1e04 in ssl_Do1stHandshake /home/worker/nss/lib/ssl/sslsecur.c:65:14
    #16 0x7f688deb5c4c in SSL_ForceHandshake /home/worker/nss/lib/ssl/sslsecur.c:413:14
    #17 0x4e577a in doConnect /home/keeler/src/nsslclient/nsslclient.c:90:7
    #18 0x4e5850 in connectThreadMain /home/keeler/src/nsslclient/nsslclient.c:105:5
    #19 0x7f688e4bca2f in _pt_root /home/worker/nspr/Linux3.13_x86_64_clang-3.9_glibc_PTH_ASAN_64_DBG.OBJ/pr/src/pthreads/../../../../pr/src/pthreads/ptthread.c:216:5
    #20 0x7f688d9105c9 in start_thread (/lib64/libpthread.so.0+0x75c9)
    #21 0x7f688cd1df6c in __clone (/lib64/libc.so.6+0x102f6c)

ASAN:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
Flags: needinfo?(dkeeler)
Haven't found the time to try again thus far, but it's on my list of things to fix rather sooner than later.
Flags: needinfo?(ttaubert)
Tracking 53+ for this sec high issue.
Track 51+ as sec-high.
Track 52+.  Tim, I don't suppose you've had more luck reproducing this since November?
Flags: needinfo?(ttaubert)
I tried to a few times since David's last message. I can't reproduce it on Docker/Linux, on a Linux desktop machine, and also Franziskus can't reproduce it on his Linux desktop machine. Doesn't reproduce on macOS either :(
Flags: needinfo?(ttaubert)
Tim can't reproduce. David, can you take a look again? Would be too bad if this remains unfixed much longer.
Flags: needinfo?(dkeeler)
I don't know why I didn't think to do this earlier, but I ran my testcase with rr, which enabled me to get a better idea of what's going on. I'm still getting a handle on this, but it looks like the presence of a Fedora-specific token (the "System Trust" token) is causing the uaf, which might indicate why this doesn't reproduce on non-Fedora systems (which is what I'm assuming Tim is using?). I'll keep investigating.
Flags: needinfo?(dkeeler)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #17)
> might indicate why this doesn't reproduce on non-Fedora systems (which is
> what I'm assuming Tim is using?).

Yay, a lead! I've been indeed using a virtual and a real Ubuntu only. Franziskus uses Arch.
Attached file Dockerfile
I figured out how to reproduce this (well, not exactly the same uaf as in comment 0, but a similar root-unloading race condition) in an ubuntu-based docker image, so I don't think this is fedora-specific. If you have docker set up, all you should need to do is:

(download this Dockerfile to cwd)
sudo docker build -t nssthreading .
sudo docker run nssthreading
Attachment #8759742 - Attachment is obsolete: true
Attachment #8799045 - Attachment is obsolete: true
Flags: needinfo?(ttaubert)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #19)
> I figured out how to reproduce this (well, not exactly the same uaf as in
> comment 0, but a similar root-unloading race condition) in an ubuntu-based
> docker image, so I don't think this is fedora-specific. If you have docker
> set up, all you should need to do is:

Thanks David, this is great! I was able to reproduce this on Docker (with macOS as the host) as soon as I set the number of CPUs > 1. Franziskus was able to reproduce on his Arch desktop too.

Now we "only" have to find out what's happening... definitely looks like a race condition due to improper locking.
Flags: needinfo?(ttaubert)
Mark 51 won't fix as 51 was released.
Tim, did you have a chance to find what is happening there?
Flags: needinfo?(ttaubert)
All attempts to reproduce this outside of the docker image have failed for me, unfortunately. I can reproduce this if I use David's image, as mentioned in comment #20. But that doesn't build anymore because the artifacts have expired. We'd also have to adjust the Makefile a little now that we're using GYP for our ASan builds.

I have not managed to reproduce this when building NSS myself, with e.g. ./build.sh --asan --ubsan.

David, I hate to steal more of your time but can you reproduce this when building NSS yourself?

Another thing I spotted is that the docker image installs `libnss3-tools` for certutil. That install a bunch of libraries, including NSS and NSPR. I can't do "make nssthreading" with that removed. I wonder if there's some weird interaction between system libs and the ones taken from the Taskcluster build. OTOH, the UAF looks legit.
Flags: needinfo?(ttaubert) → needinfo?(dkeeler)
Attached patch delay.patchSplinter Review
I can still reproduce this locally. Perhaps try with this patch (maybe without asan?) - it poisons the PK11SlotInfo memory when freeing it and introduces some delay in nssTrustDomain_FindTrustForCertificate, which is where I'm currently seeing this reproduce (the original stack has nssTrustDomain_RemoveTokenCertsFromCache, but I think this is from the same underlying cause).

To run locally, you'll probably have to change the "dist/Linux3.13_x86_64_clang-3.9_glibc_PTH_ASAN_64_DBG.OBJ/" paths in the Makefile and source file to wherever your local build put the dist folder (also make sure that LD_LIBRARY_PATH is set appropriately - "dist/Debug/lib/" is what's working for me, with dist as a symlink to the nss dist dir).

The dockerfile installs nss utils because the Makefile uses certutil to do some setup, but I imagine that might not even be necessary (yeah - if you replace the NSS_Init with NSS_InitReadWrite you don't need to call certutil).
Flags: needinfo?(dkeeler)
After setting up my new Linux desktop (running Arch) I tried again and still can't reproduce it :( Franziskus can't either. I also couldn't reproduce it after building on the Docker image and linking against the self-built library. I wouldn't know how to debug and fix this otherwise...

The probably only thing left to try is to set up a Fedora virtual machine and see if it's reproducible in there? I'm a little out of ideas. I hope we don't have to spend the second week of May in front of David's machine and fix it ;)
Duplicate of this bug: 1346803
Tim really, really can't reproduce :)
Can you pick this up again, keeler?
Flags: needinfo?(dkeeler)
Attached patch possible patchSplinter Review
Looking at this again, it seems incorrect that the structures NSSSlot and NSSToken could each have a pointer to a PK11SlotInfo and yet not cause the PK11SlotInfo's refcount to increase (in my testcase, what sometimes happens is a PK11SlotInfo is destroyed while there are still NSSSlot/NSSTokens that reference it). Once I fixed that, I found that while a comment in nssToken_Destroy claims that an NSSToken will always have the last reference to an NSSSlot, this is demonstrably not true, so I fixed it to null out the NSSSlot->NSSToken pointer, since it would no longer be valid once the NSSToken is destroyed. Then I had to add a null check in nssSlot_GetToken since it could be null. Tim, what do you think? (with this patch, I can't reproduce the asan failures I was seeing)
Flags: needinfo?(dkeeler)
Attachment #8850230 - Flags: feedback?(ttaubert)
David, thanks for the patch! I've been looking at it and it does look very promising indeed. I'm currently still trying to reproduce based on what's in your patch. Will get back to you soon about it.
I figure 54 and 55 are likely to be affected too. 
We are a week away from shipping 53 release, but I'll keep tracking this for now, in case it's something we end up wanting to fix in mid-cycle.
This missed 53, but we could still take a patch for 54. Tim or David, do you think 54 is affected?
Flags: needinfo?(ttaubert)
Comment on attachment 8850230 [details] [diff] [review]
possible patch

Review of attachment 8850230 [details] [diff] [review]:
-----------------------------------------------------------------

I still can't reproduce locally after trying a few more things but this looks very promising, definitely something we need to fix.

::: lib/dev/devslot.c
@@ +66,5 @@
>  nssSlot_AddRef(
>      NSSSlot *slot)
>  {
>      PR_ATOMIC_INCREMENT(&slot->base.refCount);
> +    PK11_ReferenceSlot(slot->pk11slot);

(See below.)

::: lib/dev/devtoken.c
@@ +36,5 @@
>               * When the token is actually destroyed, that ref must go too.
>               */
> +            nssSlot_EnterMonitor(tok->slot);
> +            tok->slot->token = NULL;
> +            nssSlot_ExitMonitor(tok->slot);

How did you find an instance where the token doesn't have the last reference to the slot? Instead of the three lines here and the null-check in `nssSlot_GetToken()` I added a `PORT_Assert(tok->slot->base.refCount == 1)`. Running a few tests locally didn't hit that. It would be good if we could properly assert that...

@@ +63,5 @@
>  nssToken_AddRef(
>      NSSToken *tok)
>  {
>      PR_ATOMIC_INCREMENT(&tok->base.refCount);
> +    PK11_ReferenceSlot(tok->pk11slot);

Referencing the slot once on creation is fine, but do we need to increase the refCount every time someone references the token? I think we can remove this and, above, move the `PK11_FreeSlot()` call into the case where refCount==0 after the atomic decrement.
Attachment #8850230 - Flags: feedback?(ttaubert) → feedback+
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #31)
> This missed 53, but we could still take a patch for 54. Tim or David, do you
> think 54 is affected?

Should be affected, but I'm not sure this is something we need to uplift. It seems quite difficult to exploit. David what's your take on this?
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #32)
> How did you find an instance where the token doesn't have the last reference
> to the slot? Instead of the three lines here and the null-check in
> `nssSlot_GetToken()` I added a `PORT_Assert(tok->slot->base.refCount == 1)`.
> Running a few tests locally didn't hit that. It would be good if we could
> properly assert that...

Pushed to try with that assertion and the core parts of your patch and we don't hit it there either:

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=416422a1fdf9bd9fa27a3f29598dd34f612195dd
There are no more scheduled releases for esr45, so marking wontfix.
(In reply to Tim Taubert [:ttaubert] from comment #32)
> ::: lib/dev/devtoken.c
> @@ +36,5 @@
> >               * When the token is actually destroyed, that ref must go too.
> >               */
> > +            nssSlot_EnterMonitor(tok->slot);
> > +            tok->slot->token = NULL;
> > +            nssSlot_ExitMonitor(tok->slot);
> 
> How did you find an instance where the token doesn't have the last reference
> to the slot? Instead of the three lines here and the null-check in
> `nssSlot_GetToken()` I added a `PORT_Assert(tok->slot->base.refCount == 1)`.
> Running a few tests locally didn't hit that. It would be good if we could
> properly assert that...

IIRC, I encountered a situation where if one thread removes the module the token and slot are on while another thread is operating on the slot, the token will not have the last reference to the slot.

> @@ +63,5 @@
> >  nssToken_AddRef(
> >      NSSToken *tok)
> >  {
> >      PR_ATOMIC_INCREMENT(&tok->base.refCount);
> > +    PK11_ReferenceSlot(tok->pk11slot);
> 
> Referencing the slot once on creation is fine, but do we need to increase
> the refCount every time someone references the token? I think we can remove
> this and, above, move the `PK11_FreeSlot()` call into the case where
> refCount==0 after the atomic decrement.

Ah, I think I see. Unless I'm forgetting some issue I ran into, I agree that should work.
(In reply to Tim Taubert [:ttaubert] from comment #33)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #31)
> > This missed 53, but we could still take a patch for 54. Tim or David, do you
> > think 54 is affected?
> 
> Should be affected, but I'm not sure this is something we need to uplift. It
> seems quite difficult to exploit. David what's your take on this?

I believe 54 is still affected, but I think this would be very difficult to exploit, particularly without user intervention (I think this is only possible when unloading PKCS#11 modules or at shutdown).
try looked good. whats next, keeler?
Flags: needinfo?(dkeeler)
I promised David that I'll take it from here.
Assignee: nobody → ttaubert
Status: NEW → ASSIGNED
Flags: needinfo?(dkeeler)
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #4)
> Thanks for your assessment David. Can you attach your test code?
> It sounds to me like NSS should lock here unless Wan-Teh has a reason why we
> don't.

I just realized Franziskus asked me a question in comment 4.
I am sorry I missed his question. I assume Tim and David already
figured out the root cause of this bug.
Flags: needinfo?(wtc)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #36)
> (In reply to Tim Taubert [:ttaubert] from comment #32)
> > ::: lib/dev/devtoken.c
> > @@ +36,5 @@
> > >               * When the token is actually destroyed, that ref must go too.
> > >               */
> > > +            nssSlot_EnterMonitor(tok->slot);
> > > +            tok->slot->token = NULL;
> > > +            nssSlot_ExitMonitor(tok->slot);
> > 
> > How did you find an instance where the token doesn't have the last reference
> > to the slot? Instead of the three lines here and the null-check in
> > `nssSlot_GetToken()` I added a `PORT_Assert(tok->slot->base.refCount == 1)`.
> > Running a few tests locally didn't hit that. It would be good if we could
> > properly assert that...
> 
> IIRC, I encountered a situation where if one thread removes the module the
> token and slot are on while another thread is operating on the slot, the
> token will not have the last reference to the slot.

Yeah, that seems like it could happen. Not sure if that's a common occurrence but we should guard against as you suggested.
This is basically David's patch, reduced a little as suggested by me above.

https://nss-review.dev.mozaws.net/D332
https://hg.mozilla.org/projects/nss/rev/859a4f82207b
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.31
If we intend to fix this for Fx54 and ESR52.2, we'll need NSS 3.30 and 3.28 point releases ASAP since the RC builds are happening on Monday.
Flags: needinfo?(ttaubert)
I don't think this really should be sec-high, maybe rather sec-moderate. It seems really hard to exploit, only when shutting down the browser. I'd let it ride the trains and hit beta in ~2 weeks.
Flags: needinfo?(ttaubert)
Group: crypto-core-security → core-security-release
Track 54+/55+ and mark 54 won't fix per comment #45.
See Also: → 1370356
Sounds like this is wontfix for ESR52, though maybe we should keep it on the radar for ride-along inclusion if/when we find ourselves shipping another 3.28.x point release anyway.
Whiteboard: [adv-main55+]
For Fedora, I have recently packaged NSS 3.31, which includes this fix.  It is still under testing but I received a report from a user saying that the package is causing a deadlock in nssSlot_IsTokenPresent(), with a stacktrace similar to comment 9:
https://bugzilla.redhat.com/show_bug.cgi?id=1470352#c8

It seems that nssSlot_GetToken() obtains a lock before calling nssSlot_IsTokenPresent(), while nssSlot_IsTokenPresent() also tries to lock the slot for C_GetSlotInfo:
https://hg.mozilla.org/projects/nss/rev/859a4f82207b#l1.33
https://hg.mozilla.org/projects/nss/file/tip/lib/dev/devslot.c#l134

Tim, David, do you have any idea on this?
Flags: needinfo?(ttaubert)
It would be fairly straightforward to re-work nssSlot_GetToken to not hold the "monitor" (lock) while calling nssSlot_IsTokenPresent.
See Also: → 1381784
Re comment 49 and 50, Daiki fixed it with bug 1381784.
Flags: needinfo?(ttaubert)
Flags: qe-verify?
Whiteboard: [adv-main55+] → [adv-main55+][post-critsmash-triage]
Depends on: 1381784
Depends on: 1388370
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.