Closed Bug 834383 Opened 7 years ago Closed 7 years ago

WebRTC doesn't necessarily shutdown properly, crashes when GC isn't run at certain point: Shutdown | application crashed [@ std::_Tree<std::_Tmap_traits<std::basic_string …

Categories

(Core :: WebRTC, defect, P2, critical)

defect

Tracking

()

RESOLVED FIXED
mozilla21

People

(Reporter: smaug, Assigned: abr)

References

Details

(Keywords: crash, intermittent-failure, Whiteboard: [webrtc][blocking-webrtc+][qa-])

Crash Data

Attachments

(1 file, 5 obsolete files)

No description provided.
See
Summary: WebRTC doesn't necessarily shutdown properly, crashes when GC isn → WebRTC doesn't necessarily shutdown properly, crashes when GC isn't run at certain point
No longer blocks: 833143
Blocks: 833143
Assigning to Adam for now
Assignee: nobody → adam
Severity: normal → critical
Keywords: crash
Priority: -- → P2
Whiteboard: [webrtc][blocking-webrtc+]
Crash Signature: [@ std::_Tree<std::_Tmap_traits<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,sipcc::PeerConnectionImpl *,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const > std::allocator<std::pair<std::b …
Attached patch null check (obsolete) — Splinter Review
Perhaps this is enough?
https://tbpl.mozilla.org/?tree=Try&rev=834156e84dca
Summary: WebRTC doesn't necessarily shutdown properly, crashes when GC isn't run at certain point → WebRTC doesn't necessarily shutdown properly, crashes when GC isn't run at certain point: Shutdown | application crashed [@ std::_Tree<std::_Tmap_traits<std::basic_string …
ekr, adam - you should look at the signature on the crash after Olli tried adding a null check (comment 6).

This is likely fallout from the weakreference changes, and if GC doesn't happen early enough we're not cleaning up the peerconnections until after NSS is shut down (in addition to the NULL problem Olli tried to work around).
I maybe having the same issue with my push to try:
https://tbpl.mozilla.org/?tree=Try&rev=c9dffd8d59c2
Yeah, https://tbpl.mozilla.org/php/getParsedLog.php?id=19142658&tree=Try&full=1 looks like the same
problem, which is no wonder since you have https://hg.mozilla.org/try/rev/689690a17de3 in your tree.
Bug 833143 was backed out because of this bug.
(In reply to Randell Jesup [:jesup] from comment #7)
> ekr, adam - you should look at the signature on the crash after Olli tried
> adding a null check (comment 6).

Both crash signatures are pretty clearly an attempt to call the PCImpl destructor with either this == nullptr or some subset of the members == nullptr. 

> Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
> Crash address: 0x8

...

> Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
> Crash address: 0x14

In any case, this is likely a CC problem, not a GC problem. See Bug 824919 comment 25 for a diagram that shows the cycle involved.
this is WebRTC problem. It doesn't shutdown stuff early enough.
(In reply to Olli Pettay [:smaug] from comment #11)
> this is WebRTC problem.

Sure, I wasn't disputing that.

I was pointing out why your simple GC patch was very unlikely to address the issue. The stuff that needs collection is a cycle of objects, which the GC won't be able to find.
My GC patch makes this problem to happen all the time.
In other words, I'd like to land the patch for bug 833143 in order to speed up
shutdown (in some cases significantly) but if we don't do that GC call, WebRTC crashes.
Okay, I think the check-for-null patch actually was half of the solution we needed here. The problem is that the whole PeerConnection lifetime is anchored in the DOM window (since that's where they're instantiated from). This means that the PeerConnectionCtx can't go through and deallocate every existing PeerConnectionImpl on shutdown.

When XPCOM shuts down, it first calls all its observers, which will tear down PeerConnectionCtx. Afterwards, it shuts down the cycle collector, which goes out and collects any objects that are still hanging around. This appears to include any DOM windows that are hanging around, including any objects anchored in those windows (such as PeerConnection and its buddies).

By this time, NSS has already been shut down, which zeros out the g_default_trust_domain used in pki3hack.c. Since PeerConnectionImpl has a DtlsIdentity, which in turn has a CERTCertificate, destroying the PeerConnectionImpl ends up ultimately calling nssCertificate_Destroy in its destructor. The code in nssCertificate_Destroy assumes that the NSS hasn't yet been torn down, so it doesn't check that the default trust domain is valid.

This can be fixed by checking for a null default trust domain in nssCertificate_Destroy.

Arguably, the correct solution to this whole mess is to ensure that the PeerConnections are collected prior to XPCOM and NSS shutdown, but their model of lifecycle management (being created from javascript applications) doesn't seem to be amenable to that approach. For the time being, I think our best bet is simply to ensure that the PeerConnectionImpl destructor can operate smoothly when various services are already shut down.

Patch impending.
Attachment #706339 - Attachment is obsolete: true
(In reply to Adam Roach [:abr] from comment #14)
> DtlsIdentity, which in turn has a CERTCertificate, destroying the
> PeerConnectionImpl ends up ultimately calling nssCertificate_Destroy in its
> destructor. The code in nssCertificate_Destroy assumes that the NSS hasn't
> yet been torn down, so it doesn't check that the default trust domain is
> valid.

The NSS library requires you to not call any NSS functions after NSS shutdown. It is analogous to a pointer-use-after-free. 

> Arguably, the correct solution to this whole mess is to ensure that the
> PeerConnections are collected prior to XPCOM and NSS shutdown, but their
> model of lifecycle management (being created from javascript applications)
> doesn't seem to be amenable to that approach. For the time being, I think
> our best bet is simply to ensure that the PeerConnectionImpl destructor can
> operate smoothly when various services are already shut down.

We shouldn't be calling CERT_DestroyCertificate after NSS shutdown in the first place.

PSM can notify any part of code to let it know that NSS is being shut down and that it needs to free all of its NSS objects (i.e. call CERT_DestroyCertificate on certificates and PR_Close() on DTLS sockets). The object that owns the CERTCertificate object should implement the nsNSSShutDownObject interface to do this. The thing to be aware of is that the code that uses that object has to handle the case where those objects are no longer available.

Note that there is an existing XPCOM wrapper around CERTCertificate called nsIX509Cert that implements nsNSSShutDownObject.

Also, see the thread I started yesterday on mozilla.dev.platform.
Attachment #707737 - Attachment is obsolete: true
(In reply to Brian Smith (:bsmith) from comment #16)

> PSM can notify any part of code to let it know that NSS is being shut down
> and that it needs to free all of its NSS objects (i.e. call
> CERT_DestroyCertificate on certificates and PR_Close() on DTLS sockets). The
> object that owns the CERTCertificate object should implement the
> nsNSSShutDownObject interface to do this. The thing to be aware of is that
> the code that uses that object has to handle the case where those objects
> are no longer available.

Aha! Thanks for the pointer. That seems like a much more elegant solution. I'll go try to hook that together.
(In reply to Adam Roach [:abr] from comment #18)
> Aha! Thanks for the pointer. That seems like a much more elegant solution.
> I'll go try to hook that together.

One thing to look for: AFAICT, a DtlsIdentity becomes useless when you don't have the certificate. So, perhaps the thing that owns the DtlsIdentity should free the DtlsIdentity during NSS shutdown, instead of DtlsIdentity freeing its cert. Or, maybe things are better off being done even higher up on the object ownership ladder. Generally, it is better to things as high on the ladder as possible.

In fact, rjesup and ekr fixed some other shutdown-related bugs recently, trying to destroy things at the highest possible layer. This bug may be an indication that their patch is not working according to the way they expected.
Attachment #707742 - Attachment is obsolete: true
(In reply to Brian Smith (:bsmith) from comment #19)
> AFAICT, a DtlsIdentity becomes useless when you don't
> have the certificate. So, perhaps the thing that owns the DtlsIdentity
> should free the DtlsIdentity during NSS shutdown, instead of DtlsIdentity
> freeing its cert.

Yes, I like that solution. I've revised the patch to make PeerConnectionImpl an nsNSSShutDownObject, and to free the DtlsIdentity when NSS shuts down. This seems to solve the problem I was seeing before. Thanks again for pointing me in the right direction.
Comment on attachment 707771 [details] [diff] [review]
Ensure PeerConnectionImpl destructor doesn't use globals after they're gone

Brian -- I'd like you to just give a quick glance at my usage of nsNSSShutDownObject. I think I've fulfilled the contract outlined in nsNSSShutDown.h, but would appreciate a sanity check.
Attachment #707771 - Flags: review?(rjesup)
Attachment #707771 - Flags: review?(bsmith)
OS: Windows XP → All
Hardware: x86_64 → All
Comment on attachment 707771 [details] [diff] [review]
Ensure PeerConnectionImpl destructor doesn't use globals after they're gone

Review of attachment 707771 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +251,2 @@
>    CloseInt(false);
>  

What thread does this happen on? If it happens off the main thread then you may need to use nsNSSShutDownPreventionLock and call virtualDestroyNSSReference while protected by the lock, instead of relying on implicit destruction of mIdentity (which would not be protected by the lock).

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
@@ +251,5 @@
>      return true;
>    }
>  
> +#ifdef MOZILLA_INTERNAL_API
> +  void virtualDestroyNSSReference();

I suggest you add an explicit MOZ_FINAL here even though the class is MOZ_FINAL, if you end up calling this method in the destructor.
Attachment #707771 - Flags: review?(bsmith) → review+
By the way, if you are able to clear mIdentity this easily, without having to add any (mIdentity != nullptr) checks, then does that mean that the other code in this class already handled the null mIdentity case correctly? Or, does it mean that mIdentity isn't really being used for anything? I only checked the part of the patch that implements the nsNSSShutDownObject protocol.
(In reply to Brian Smith (:bsmith) from comment #24)
> Comment on attachment 707771 [details] [diff] [review]
> Ensure PeerConnectionImpl destructor doesn't use globals after they're gone
> 
> Review of attachment 707771 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
> @@ +251,2 @@
> >    CloseInt(false);
> >  
> 
> What thread does this happen on? If it happens off the main thread then you
> may need to use nsNSSShutDownPreventionLock and call
> virtualDestroyNSSReference while protected by the lock, instead of relying
> on implicit destruction of mIdentity (which would not be protected by the
> lock).

The destructor must be called on the main thread for other reasons (and the PC_AUTO_ENTER_API_CALL_NO_CHECK macro does enforce this constraint in a roundabout sort of way), so I think we're safe here.
 
> ::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
> @@ +251,5 @@
> >      return true;
> >    }
> >  
> > +#ifdef MOZILLA_INTERNAL_API
> > +  void virtualDestroyNSSReference();
> 
> I suggest you add an explicit MOZ_FINAL here even though the class is
> MOZ_FINAL, if you end up calling this method in the destructor.

Sounds good to me. As you note, we can't derive anything from this class, but I'm all for making these things more obvious.

Thanks for taking a look.
(In reply to Brian Smith (:bsmith) from comment #25)
> By the way, if you are able to clear mIdentity this easily, without having
> to add any (mIdentity != nullptr) checks, then does that mean that the other
> code in this class already handled the null mIdentity case correctly? Or,
> does it mean that mIdentity isn't really being used for anything? I only
> checked the part of the patch that implements the nsNSSShutDownObject
> protocol.

My reasoning here is that NSS is shut down only when the world is being torn down, so anything that would be making use of the mIdentity *should* be done with it already. Basically, we had a sequencing problem before that would result in NSS functions being called after NSS was gone, and we just want to ensure that the sequence of events is rearranged.

That said, you do shine a light on something that I haven't explored completely, so I'm going to look closely at the consumers of these identity objects to ensure that they aren't going to choke in the case that they get a nullptr back from GetIdentity()
Comment on attachment 707771 [details] [diff] [review]
Ensure PeerConnectionImpl destructor doesn't use globals after they're gone

Review of attachment 707771 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +251,2 @@
>    CloseInt(false);
>  

"What thread does this happen on? If it happens off the main thread then you may need to use nsNSSShutDownPreventionLock ..."

PC_AUTO_ENTER_API_CALL_NO_CHECK verifies it's on MainThread (except in the unit tests, which I believe is ok - ekr? - and this is released on MainThread.)
Attachment #707771 - Flags: review?(rjesup) → review+
Attachment #707771 - Attachment is obsolete: true
Attachment #708299 - Attachment is obsolete: true
Comment on attachment 708300 [details] [diff] [review]
Ensure PeerConnectionImpl destructor doesn't use globals after they're gone

Carrying forward r+ from jesup and bsmith
Attachment #708300 - Flags: review+
Let's make sure this fixed Olli's problem on Windows:

https://tbpl.mozilla.org/?tree=Try&rev=2d0f0d53eada
Comment on attachment 708300 [details] [diff] [review]
Ensure PeerConnectionImpl destructor doesn't use globals after they're gone

This looks like it's solved the problem.
Attachment #708300 - Flags: checkin?(rjesup)
https://hg.mozilla.org/integration/mozilla-inbound/rev/70872c020944
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][webrtc-uplift]
Target Milestone: --- → mozilla21
Attachment #708300 - Flags: checkin?(rjesup) → checkin+
This was backed out and re-landed while investigating Mochitest oranges.
https://hg.mozilla.org/integration/mozilla-inbound/rev/67a6eb23e586
https://hg.mozilla.org/mozilla-central/rev/67a6eb23e586
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift] → [webrtc][blocking-webrtc+][webrtc-uplift][qa-]
Covered by the fact our peer connection tests are running in CI already.
Flags: in-testsuite+
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift][qa-] → [webrtc][blocking-webrtc+][qa-]
You need to log in before you can comment on or make changes to this bug.