Hang in CC_SIPCCService::mLock on shutdown if Ctrl/Cmd+Q hit during long crashtests/822197.html

VERIFIED FIXED in Firefox 22

Status

()

P2
critical
VERIFIED FIXED
6 years ago
5 years ago

People

(Reporter: jib, Assigned: jib)

Tracking

({hang})

Trunk
mozilla23
x86
Mac OS X
Points:
---
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox22 fixed)

Details

(Whiteboard: [webrtc][blocking-webrtc+], crash signature, URL)

Attachments

(4 attachments, 2 obsolete attachments)

Created attachment 730380 [details]
Log output + callstack

Reproduce steps:
1. Open URL
2. Quit the browser using Ctrl+Q (Cmd+Q on Mac) within 10 seconds or so.

These are the same reproduce steps as Bug 842749, except that now that the JS re-entrancy has been fixed, the browser is ironically much less responsive during crashtests/822197.html.

This is presumably "more correct" since this crashtest contains JS creating 70 PeerConnections in a for-loop. E.g. you have to work hard on a Mac now to right-click the Nightly icon in the taskbar while this is running and see anything other than "Force Quit".

If I wait it out (takes about 30 seconds), e.g. until right-click on Nightly produces the "Quit" option, or until the beachball-of-death on the mac stops, then Firefox exits cleanly.

However, if I manage to get the Quit in early, either by right-clicking on Nightly just as the page loaded, or hit Cmd+Q sometime during the run, then it hangs every time for me in the same place:

#0 __psynch_mutexwait ()
#1 pthread_mutex_lock ()
#2 in PR_Lock ()
#3 mozilla::Mutex::Lock()@BlockingResourceBase.cpp:228
#4 mozilla::BaseAutoLock<mozilla::Mutex>::BaseAutoLock() @Mutex.h:153
#5 mozilla::BaseAutoLock<mozilla::Mutex>::BaseAutoLock() @Mutex.h:154
#6 CSF::CC_SIPCCService::removeCCObserver() @CC_SIPCCService.cpp:742
#7 CSF::CallControlManagerImpl::disconnect() @CallControlManagerImpl.cpp:234
#8 CSF::CallControlManagerImpl::destroy() @CallControlManagerImpl.cpp:57
#9 sipcc::PeerConnectionCtx::Cleanup() @PeerConnectionCtx.cpp:221
Depends on: 842749
Whiteboard: [webrtc][blocking-webrtc+]
Ekr, can you take a look at this?  It looks like this is fallout from your DISPATCH patch landing, and it's currently marked a blocker to release.  Thanks.
Assignee: nobody → ekr
Duplicate of this bug: 835851
If you prefer a non-interactive reproduce case, see Bug 835851 (download attachments and run crashtest).

Comment 4

6 years ago
Can you provide a backtrace of all the threads? We may be able to guess who is holding this mutex.
Flags: needinfo?(jib)
Here's a backtrace of all threads from Bug 835851 (easiest steps): attachment 730791 [details].
Flags: needinfo?(jib)
Created attachment 730854 [details]
This output shows the deadlock.

The SIPCC thread does a sync dispatch to main in CC_SIPCCService::onCallEvent() while holding its m_lock.

The Main thread is busy shutting down in CC_SIPCCService::removeCCObserver where it waits on the m_lock.
Assignee: ekr → jib

Updated

5 years ago
Severity: normal → critical
Crash Signature: [@ mozilla::Mutex::Lock()]
Keywords: hang
Status: NEW → ASSIGNED
Created attachment 732972 [details] [diff] [review]
Async onEvent dispatch to PC on main. WIP.

Update. Work in progress.

As discussed last week on IRC, our solution to the hang is:
(1) Fix PeerConnectionCtx::onCallEvent() to dispatch NORMAL (async) to main.
(2) remove unused aCall;
(3) make onCallEvent_m static; and
(4) pass in strdup(aCall->getPeerConnection()), effectively a weak ref to
    the pc to onCallEvent_m, to handle running late in shutdown (pc gone).

Open issues:
- CC_*Ptr is not thread-safe so we must not manipulate the ref count on
  multiple threads at once, the way it does here. Previously, SyncRunnable
  enforced this. Now that we are async, this must be addressed. We can
  either change C_*Ptr to threadsafe, or...

  Since it seems like pc->onEvent just gets aInfo->callStateToString(event)
  anyway, I can just duplicate the string and pass that in to avoid the
  reference altogether.

- I think I can make Bug 835851 into a crashtest.

Comment 8

5 years ago
(In reply to Jan-Ivar Bruaroey [:jib] from comment #7)
> - CC_*Ptr is not thread-safe

Just as a minor correction here: as of Bug 848423, these pointers are thread-safe.
> Just as a minor correction here: as of Bug 848423, these pointers are thread-safe.

Good news.
Created attachment 733052 [details]
Progress, but now callstacks showing new hang in cprJoinThread()

The patch seems to work, but exposes another hang in cprJoinThread().
I may as well address that at the same time.

Current thinking on IRC is to have the GSM thread et al. use a new BreakableSyncRunnable derivative that takes a PRInt32 argument, so that ccUnload() can break through on shutdown.
Created attachment 734162 [details] [diff] [review]
Async onEvent to PC on main + safe join of SyncRunnable-using SIPCC threads.

Now fixes SIPCC join-hang on shutdown, in addition to earlier fix.

To recap, this patch changes two things:
1) PeerConnectionCtx::onCallEvent() is dispatched NORMAL (async) to main.
2) Safe joining of SyncRunnable-using SIPCC threads is now a feature of
   thread_monitor.c

Try - https://tbpl.mozilla.org/?tree=Try&rev=220d2c4d600e
Attachment #732972 - Attachment is obsolete: true
Attachment #734162 - Flags: review?(rjesup)
Comment on attachment 734162 [details] [diff] [review]
Async onEvent to PC on main + safe join of SyncRunnable-using SIPCC threads.

Review of attachment 734162 [details] [diff] [review]:
-----------------------------------------------------------------

r+, but please add some comments about why it's done this way and the implications -- especially about spinning the event loop if we were called from JS code.
Attachment #734162 - Flags: review?(rjesup) → review+
Created attachment 734237 [details] [diff] [review]
Async onEvent to PC on main + safe join of SyncRunnable-using SIPCC threads. r=jesup

Comment-only update. Carrying forward r+ from Randell.
Attachment #734162 - Attachment is obsolete: true
Attachment #734237 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/332061725181
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][webrtc-uplift]
Target Milestone: --- → mozilla23

Updated

5 years ago
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/332061725181
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

5 years ago
Flags: in-testsuite-

Updated

5 years ago
Keywords: verifyme
Verified on 4/8 with STR in comment 0 - not seeing a hang.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Comment on attachment 734237 [details] [diff] [review]
Async onEvent to PC on main + safe join of SyncRunnable-using SIPCC threads. r=jesup

[Approval Request Comment]
Bug caused by (feature/regressing bug #): n/a

User impact if declined: Possible deadlocks during shutdown, especially when webrtc is overloaded

Testing completed (on m-c, etc.): on m-c for a while.  Manual testing of reproducible hang.

Risk to taking this patch (and alternatives if risky): Low risk.  SIPCC/signaling shutdown is currently only done at browser-quit time.  It's unlikely even in a bug in this code would do more than leak objects or deadlock shutdown.

String or IDL/UUID changes made by this patch: none
Attachment #734237 - Flags: approval-mozilla-aurora?
Comment on attachment 734237 [details] [diff] [review]
Async onEvent to PC on main + safe join of SyncRunnable-using SIPCC threads. r=jesup

low risk, well tested patch . Been on m-c for a few days and the hang is explicitly verified by QA .
Attachment #734237 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/2534a95ecee7
status-firefox22: --- → fixed
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift] → [webrtc][blocking-webrtc+]
Target Milestone: mozilla23 → mozilla22

Comment 20

5 years ago
The m-c version where it landed is 23.
Target Milestone: mozilla22 → mozilla23
You need to log in before you can comment on or make changes to this bug.