Closed Bug 1605109 Opened 5 years ago Closed 5 years ago

Shutdown failed after loading https://geekflare.com

Categories

(Core :: Networking: HTTP, defect, P2)

RESOLVED FIXED

People

(Reporter: kershaw, Assigned: dragana)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

Attached file http log

I hit MOZ_CRASH(NSS_Shutdown failed) at shutdown after visiting https://geekflare.com.
I think this is related to HTTP/3: it only happens when network.http.http3.enabled is turned on, and https://geekflare.com supports HTTP/3.

Could you take a look, dragana?

Flags: needinfo?(dd.mozilla)

I know about this one. I need to investigate. Thanks

Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Flags: needinfo?(dd.mozilla)

Feel free to change the priority

Priority: -- → P2
Whiteboard: [necko-triaged]
Blocks: QUIC

I tried to track this down, and there is a slot that is still referenced during NSS shutdown. The slot is referenced by a certificate that was created during a call to SSLExp_RecordLayerData:

#0 0x00007f48b97a2601 in nssPKIObject_AddRef (object=0x7f488f7050d8) at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/pkibase.c:152
#1 0x00007f48b979ad20 in nssCertificate_AddRef (c=0x7f488f7050d8) at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/certificate.c:81
#2 0x00007f48b97a7d44 in nssTrustDomain_GetCertForIssuerAndSNFromCache (td=0x7f4883deb030, issuer=0x7f4892df7228, serial=0x7f4892df7218)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/tdcache.c:1027
#3 0x00007f48b97a9bfe in nssTrustDomain_FindCertificateByIssuerAndSerialNumber (td=0x7f4883deb030, issuer=0x7f4892df7228, serial=0x7f4892df7218)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/trustdomain.c:755
#4 0x00007f48b97a9ee6 in nssTrustDomain_FindCertificateByEncodedCertificate (td=0x7f4883deb030, ber=0x7f4892df7348)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/trustdomain.c:839
#5 0x00007f48b97a9f5d in NSSTrustDomain_FindCertificateByEncodedCertificate (td=0x7f4883deb030, ber=0x7f4892df7348)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/pki/trustdomain.c:852
#6 0x00007f48b9766875 in CERT_NewTempCertificate (handle=0x7f4883deb030, derCert=0x7f4892df73f0, nickname=0x0, isperm=0, copyDER=1)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/certdb/stanpcertdb.c:361
#7 0x00007f48b94fed63 in tls13_HandleCertificateEntry (ss=0x7f48817f6000, data=0x7f4892df7488, first=0, certp=0x7f4892df7480)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/tls13con.c:3104
#8 0x00007f48b94f05f8 in tls13_HandleCertificate (ss=0x7f48817f6000, b=0x7f488f75cc93 '344' <repeats 199 times>, <incomplete sequence 344>..., length=0)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/tls13con.c:3224
#9 0x00007f48b94efe12 in tls13_HandlePostHelloHandshakeMessage (ss=0x7f48817f6000, b=0x7f488f75c000 "", length=3219)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/tls13con.c:954
#10 0x00007f48b94af17c in ssl3_HandleHandshakeMessage (ss=0x7f48817f6000, b=0x7f488f75c000 "", length=3219, endOfRecord=0)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/ssl3con.c:12099
#11 0x00007f48b94b22cc in ssl3_HandleHandshake (ss=0x7f48817f6000, origBuf=0x7f48817f62c0) at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/ssl3con.c:12304
#12 0x00007f48b94b1433 in ssl3_HandleNonApplicationData (ss=0x7f48817f6000, rType=ssl_ct_handshake, epoch=0, seqNum=0, databuf=0x7f48817f62c0)
at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/ssl3con.c:12790
#13 0x00007f48b94ca858 in SSLExp_RecordLayerData
(fd=0x7f48817d9d30, epoch=2, contentType=ssl_ct_handshake, data=0x7f488f70e000 "232Et224J0a321I334o353347-ɉ317036j|354205316060%Y272201p4270064177347001321342313R", len=1108) at /mnt/work/opt/moz/hg-central-2/security/nss/lib/ssl/ssl3gthr.c:680
#14 0x00007f48b3a36ef9 in neqo_crypto::ssl::SSL_RecordLayerData
(fd=0x7f48817d9d30, epoch=2, ct=22, data=0x7f488f70e000 "232Et224J0a321I334o353347-ɉ317036j|354205316060%Y272201p4270064177347001321342313R", len=1108)
at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-crypto/src/exp.rs:19
#15 0x00007f48b3a2a38f in neqo_crypto::agentio::Record::write (self=..., fd=0x7f48817d9d30) at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-crypto/src/agentio.rs:59
#16 0x00007f48b3a260f9 in neqo_crypto::agent::SecretAgent::handshake_raw (self=0x7f48802f3178, now=..., input=...)
at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-crypto/src/agent.rs:631
#17 0x00007f48b3996c49 in neqo_transport::connection::Connection::handshake (self=0x7f48802f3000, now=..., epoch=2, data=...)
at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-transport/src/connection.rs:1369
#18 0x00007f48b3998b5f in neqo_transport::connection::Connection::input_frame (self=0x7f48802f3000, epoch=2, frame=..., now=...)
at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-transport/src/connection.rs:1475
#19 0x00007f48b39922f9 in neqo_transport::connection::Connection::process_packet (self=0x7f48802f3000, hdr=0x7f4892dfb4b0, body=..., now=...)
at /mnt/work/opt/moz/hg-central-2/third_party/rust/neqo-transport/src/connection.rs:970

The certificate is stored at https://searchfox.org/mozilla-central/rev/c7b673f443407a359cc0766fb5a4ac323a1d2628/security/nss/lib/ssl/tls13con.c#3251 and never freed. Is this a bug in NSS code or is necko supposed to initiate freeing of the certificate?

Flags: needinfo?(dkeeler)

It could be a bug in NSS. It could also be the case that neqo isn't releasing something, and that's causing NSS to hang on to that certificate. Does this also happen with TLS 1.3 when HTTP3 is not in use? If not, I would make sure neqo isn't leaking something (or hanging on to it for too long).

Flags: needinfo?(dkeeler)

I tested it by loading just an image, https://geekflare.com/wp-content/uploads/2019/02/kinsta-logo2.png, to keep the log and rr session as small as possible. With http3 disabled the leak is not present; with http3 enabled the cert isn't freed.

Needinfo mt in case he has a quick idea of what is holding the ref.

Flags: needinfo?(mt)

This will probably be fixed by https://github.com/mozilla/neqo/pull/400.

(I wrote this a while ago, but failed to hit send.)

I can't see a specific problem here, but maybe I can explain how this is all supposed to work.

The TLS component of NSS, when it connects, makes and holds a copy of all of the certificates that it gets from its peer. These certificates are held in a CERT_CertificateList and discarded when the socket is closed. This is all well-tested code that is used in Firefox all the time. I took a quick look and I can't see anything that might be a problem here. Dropping the CERT_CertificateList should be sufficient to release the certificates it holds and the certificate is always dropped.

The way that neqo uses this is fairly simple: SecretAgent::peer_certificate() calls the accessor on the NSS object and gets in return a copy of the list object. This includes "copies" of each of the certificates, but really these are refcounted. Looking at the code in gecko, this object is used, the contents are cloned, and then discarded all within the same function. So we have to concern ourselves with the potential for leaks in the objects that are used in neqo.
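To illustrate the refcounting described above, here is a minimal sketch that uses `Rc` as a stand-in for NSS's reference counting. The `Cert` and `CertList` types and the `refcounts_around_copy` function are hypothetical and are not the NSS or neqo types; the sketch only shows that copying a list takes extra references, and that dropping the copy gives them back.

```rust
use std::rc::Rc;

/// Stand-in for a refcounted certificate (hypothetical, not the NSS type).
struct Cert {
    der: Vec<u8>,
}

/// Stand-in for a certificate list: it owns one reference per certificate.
struct CertList {
    certs: Vec<Rc<Cert>>,
}

impl CertList {
    /// "Copying" the list only takes another reference on each certificate.
    fn copy(&self) -> CertList {
        CertList {
            certs: self.certs.iter().map(Rc::clone).collect(),
        }
    }
}

/// Returns the refcount of the first certificate before the copy is made,
/// while the copy is alive, and after the copy has been dropped.
fn refcounts_around_copy() -> (usize, usize, usize) {
    // The list that the socket would hold for the lifetime of the connection.
    let socket_list = CertList {
        certs: vec![Rc::new(Cert { der: vec![0x30, 0x82] })],
    };
    let before = Rc::strong_count(&socket_list.certs[0]);

    // The caller gets a copy, clones the contents it needs, then discards it.
    let copy = socket_list.copy();
    let during = Rc::strong_count(&socket_list.certs[0]);
    let _contents: Vec<Vec<u8>> = copy.certs.iter().map(|c| c.der.clone()).collect();
    drop(copy);

    let after = Rc::strong_count(&socket_list.certs[0]);
    (before, during, after)
}
```

In this model the count only returns to zero once the socket's own list is dropped as well, which is why a socket that is never closed keeps its certificates alive.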

The main certificate object used in neqo is a wrapper around CERT_CertificateList called CertList. This is generated by a macro that implements Drop by routing to the correct C destructor function. That object is held by the CertificateInfo object that is returned, so I can't see that being the source of the problem.
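A rough sketch of that pattern follows: a macro that generates an owning wrapper whose Drop calls a destructor function. Everything here is an assumption for illustration. `RawList`, `raw_list_new`, `raw_list_destroy` and the macro body are stand-ins written in plain Rust, not the real NSS functions or the actual neqo macro.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts destructor calls so the effect of Drop is observable.
static DESTROY_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a C-allocated object (hypothetical).
struct RawList {
    _len: usize,
}

/// Stand-in for the C constructor.
fn raw_list_new(len: usize) -> *mut RawList {
    Box::into_raw(Box::new(RawList { _len: len }))
}

/// Stand-in for the C destructor: frees the object and records the call.
unsafe fn raw_list_destroy(ptr: *mut RawList) {
    unsafe { drop(Box::from_raw(ptr)) };
    DESTROY_CALLS.fetch_add(1, Ordering::SeqCst);
}

/// Generates an owning wrapper whose Drop calls the given destructor.
macro_rules! scoped_ptr {
    ($name:ident, $raw:ty, $dtor:path) => {
        struct $name {
            ptr: *mut $raw,
        }

        impl $name {
            /// Takes ownership of a raw pointer; rejects null.
            fn new(ptr: *mut $raw) -> Option<Self> {
                if ptr.is_null() {
                    None
                } else {
                    Some(Self { ptr })
                }
            }
        }

        impl Drop for $name {
            fn drop(&mut self) {
                // Route to the destructor exactly once, when the wrapper dies.
                unsafe { $dtor(self.ptr) };
            }
        }
    };
}

scoped_ptr!(CertList, RawList, raw_list_destroy);

/// Returns how many destructor calls happened while a wrapper went out of scope.
fn destroy_calls_after_scope() -> usize {
    let before = DESTROY_CALLS.load(Ordering::SeqCst);
    {
        let _list = CertList::new(raw_list_new(3)).expect("non-null pointer");
    }
    DESTROY_CALLS.load(Ordering::SeqCst) - before
}
```

Because the destructor is tied to the wrapper's lifetime, whoever owns the wrapper cannot forget to release the underlying object.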

So the question becomes: what about the socket? Is there any way in which the socket could still be hanging around? That is harder to answer directly. SecretAgent, the main wrapper object, doesn't implement Drop. And that's it. We are relying on the NSPR IO functions being called in order to release the socket in agent_close(), but those never get called. I managed to build NSS with ASAN and run our test suite, and everything exploded. The whole socket appears to leak.

This is a huge oversight on my part, for which I apologize. The PR above should fix this.
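The failure mode can be sketched as follows, under stated assumptions: `FakeSocket`, `fake_close`, `LeakyAgent` and `ClosingAgent` are hypothetical stand-ins, not the NSPR or neqo types, and this is not the actual change in the PR. It only shows that a wrapper holding a raw pointer releases nothing unless it implements Drop.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for a socket; `open` counts sockets that are still alive.
struct FakeSocket {
    open: Rc<Cell<usize>>,
}

impl FakeSocket {
    /// Allocates a socket, bumps the live count, and hands back a raw pointer.
    fn new(open: Rc<Cell<usize>>) -> *mut FakeSocket {
        open.set(open.get() + 1);
        Box::into_raw(Box::new(FakeSocket { open }))
    }
}

/// Stand-in for closing the descriptor, which releases the socket.
unsafe fn fake_close(fd: *mut FakeSocket) {
    let sock = unsafe { Box::from_raw(fd) };
    sock.open.set(sock.open.get() - 1);
}

/// Wrapper with no Drop: the raw pointer is simply forgotten.
struct LeakyAgent {
    _fd: *mut FakeSocket,
}

/// Wrapper whose Drop closes the descriptor.
struct ClosingAgent {
    fd: *mut FakeSocket,
}

impl Drop for ClosingAgent {
    fn drop(&mut self) {
        if !self.fd.is_null() {
            unsafe { fake_close(self.fd) };
            self.fd = std::ptr::null_mut();
        }
    }
}

/// Returns how many sockets remain open after each kind of agent is dropped.
fn open_sockets_after_drop() -> (usize, usize) {
    let leaky_count = Rc::new(Cell::new(0));
    // Deliberately leaks: nothing ever closes this socket.
    drop(LeakyAgent { _fd: FakeSocket::new(Rc::clone(&leaky_count)) });

    let closing_count = Rc::new(Cell::new(0));
    drop(ClosingAgent { fd: FakeSocket::new(Rc::clone(&closing_count)) });

    (leaky_count.get(), closing_count.get())
}
```

The leaky variant leaves its socket open after the wrapper is gone, which matches the symptom in this bug: the socket, and the certificates it references, outlive the connection and are still around at NSS shutdown.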

Flags: needinfo?(mt)

This has been fixed in the new neqo update (bug 1614711).

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED