Closed Bug 904784 Opened 11 years ago Closed 11 years ago

Local shared active media does not stop recording on tab close

Categories

(Firefox for Android Graveyard :: General, defect)

26 Branch
ARM
Android
defect
Not set
major

Tracking

(firefox23 disabled, firefox24+ fixed, firefox25+ verified, firefox26 verified, firefox-esr24 wontfix, fennec24+)

VERIFIED FIXED
Firefox 26
Tracking Status
firefox23 --- disabled
firefox24 + fixed
firefox25 + verified
firefox26 --- verified
firefox-esr24 --- wontfix
fennec 24+ ---

People

(Reporter: aaronmt, Assigned: jesup)

References

Details

(Keywords: reproducible, Whiteboard: [getUserMedia][android-gum?])

Attachments

(9 files, 1 obsolete file)

Akin to bug 885783 which was a followup from bug 88136, I can reproduce a situation where our local shared active media notification is stuck and will not dismiss on tab close. Unlike last time, the media lock is not held. Jason was also able to reproduce this.

This is likely a race condition, close the tab quickly after sharing.

Steps:
 
i) http://mozilla.github.io/webrtc-landing/gum_test.html
ii) Share 'audio and video', close tab (quickly)

--
Nightly (08/13)
Samsung Galaxy SIV (Android 4.3)
Not blocking only because this requires timing to reproduce this. A user can definitely hit this, but it's not a common use case.
Whiteboard: [getUserMedia][android-gum-]
tracking-fennec: ? → +
Please renom if the product team feels this is a webRTC ship blocker.
New video URL; thus far I've hit this only via a race. I will renom if I can this any other method.

http://www.youtube.com/watch?v=8_VVeT15lhY
Able to reproduce this in normal conditions

Requires 2 phones:

i) Have both phones join a room at http://talky.io/<room>
ii) Accept and share a camera and microphone
iii) After a short while (~20 seconds), close the active tab on one device

See the notification on the device with the device closed still remaining.
tracking-fennec: + → ?
Summary: Local shared active media notification does not dismiss on tab close (race) → Local shared active media notification does not dismiss on tab close
Whiteboard: [getUserMedia][android-gum-] → [getUserMedia][android-gum?]
It looks like the larger problem here is that the media is still active after tab close.

I captured output into a log with appropriate logging and on tab close, after my logging of audio shutdown in this case I can continue to see preview frame length logging while idle with no tabs open.
(In reply to Aaron Train [:aaronmt] from comment #6)
> Created attachment 792256 [details]
> Raw log (Nightly 08/19)
> 
> It looks like the larger problem here is that the media is still active
> after tab close.
> 
> I captured output into a log with appropriate logging and on tab close,
> after my logging of audio shutdown in this case I can continue to see
> preview frame length logging while idle with no tabs open.

Now that sounds actually quite worse - we're persisting a getUserMedia session after shutdown. Could there be a leak here potentially?
Another log, actions shown

Visited my home-screen

> I/ActivityManager(  517): START u0 {act=android.intent.action.MAIN cat=[android.intent.category.HOME] flg=0x10200000 cmp=com.android.launcher/com.android.launcher2.Launcher} from pid 517

Launched my camera successfully

> I/ActivityManager(  517): START u0 {act=android.intent.action.MAIN cat=[android.intent.category.LAUNCHER] flg=0x10200000 cmp=com.google.android.gallery3d/com.android.camera.CameraLauncher bnds=[40,785][200,985]} from pid 874

Re-launched Fennec, idle on about:home, no tabs open

> I/ActivityManager(  517): START u0 {act=android.intent.action.MAIN cat=[android.intent.category.HOME] flg=0x10200000 cmp=com.android.launcher/com.android.launcher2.Launcher} from pid 517

Immediately my console continues to flood with WebRTC preview frame length messages (again, zero tabs open; idle on about:home)

> V/WEBRTC-JC(19252): preview frame length 497664 context-2126770176
Tapping the notification, does not dismiss the notification (no tabs open, sitting on about:home)
On further attempt to initiate a new gUM session in a brand new tab with this corrupt situation, no hardware drop-down is displayed on attempt to share, nor is any preview frame is displayed

Demo page used to resolve the situation: http://mozilla.github.io/webrtc-landing/gum_test.html

This is in-line with what this users also reported: https://groups.google.com/forum/#!topic/mozilla.dev.media/SxDpgWm8zao
The current work-around is to force-stop (in Android Application Settings) to wipe the lock the browser has on local media.
Summary: Local shared active media notification does not dismiss on tab close → Local shared active media notification does not dismiss on tab close; possible gUM leak
Alternative STR that works for me on my Nexus 4 (Android 4.3)

i) http://mozilla.github.io/webrtc-landing/gum_test.html
ii) Share Audio & Video
iii) Hit the device home-button
iv) Restore the browser, close the active tab
(In reply to Aaron Train [:aaronmt] from comment #12)
> Alternative STR that works for me on my Nexus 4 (Android 4.3)
> 
> i) http://mozilla.github.io/webrtc-landing/gum_test.html
> ii) Share Audio & Video
> iii) Hit the device home-button
> iv) Restore the browser, close the active tab

That doesn't reproduce the bug for me on a Galaxy Nexus on Android 4.2.
Another log, right after closing the active tab and seeing the notification remain
Assignee: nobody → bnicholson
tracking-fennec: ? → 24+
Tracking since this came up in the product meeting as a possible 24 fix, not sure if it's worth taking at this point (edge casey).
Using the STR from comment 12, I traced the remove messages, and I found that "Notification:Hide" is never sent from WebrtcUI.js (notify() itself isn't called when this bug happens). I'm not too familiar with this code; what should trigger these notify() events?
Flags: needinfo?(wjohnston)
Should fire whenever we get a recording-device-events message with the data set to "shutdown". i.e. here's one: http://mxr.mozilla.org/mozilla-central/source/dom/camera/DOMCameraControl.cpp#322.

Or this call in MediaManager should also be firing:
http://mxr.mozilla.org/mozilla-central/source/dom/camera/DOMCameraControl.cpp#322

I'm not sure which path we should be on for closing a tab with an active session though.

From reading this, I think(?) there are two bugs here. Maybe we should spilt them up to avoid confusion.

1.) The race where you quickly close the tab before the notification is closed
2.) The bug in comments 5, 6, and 7 where the lock on media isn't being closed ever?
Flags: needinfo?(wjohnston)
As Aaron said in comment 6, I'm seeing these capture events in logcat even after the tab is closed. In my situation, it seems that these events are flooding Gecko and heavily delaying the callbacks from executing. During this time, Gecko isn't responsive at all; opening a new tab does nothing, the about:home favicon doesn't load, etc. After following the STR from comment 12, if I wait a long time (around 5-10 minutes), I do eventually see the media events stop. At that point, the notification disappears and Gecko becomes responsive again.

It looks like this is the path taken for a tab close:

OnNavigation calls Invalidate: http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#1255
Invalidate dispatches a MEDIA_STOP Runnable: http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#1574
This dispatches a STOPPING media event: http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.h#320
Which finally triggers the "recording-device-events" notification: http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#1634

After the STR from comment 12, I found that even MediaManager#OnNavigation isn't called while Gecko is hosed.
Tracking for Fx24 and would take a low risk patch if available for Fx24 but definitely not a show stopper for WebRTC ship in Fx24.
Brian's path for shutdown in MediaManager is correct: if we're not getting OnNavigation, we won't ever start the process of killing the capture.

Note that STOPPING is only sent once we've actually stopped the media (and this is done on the MediaThread).

Normally, OnNavigation is called from Navigator::OnNavigation; it's inside #ifdef MOZ_MEDIA_NAVIGATOR
Correction: I'm actually using Aaron's steps from comment 5, not comment 12. The STR from comment 12 haven't worked for me using my Droid RAZR.

Upon closing the tab for talky.io, I'm seeing three different possible outcomes:
1) The tab closes and the notification disappears, as expected.
2) The tab closes, the notification remains, and Gecko is frozen for several minutes as described in comment 18. I haven't actually hit this since yesterday.
3) The tab closes, the notification remains, and Gecko is still responsive.

I've been running into #3 a lot today, and what I've found is that OnNavigation is called, we get to Invalidate, and a MEDIA_STOP runnable is dispatched here: http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#1578. But the runnable's Run() callback (http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.h#257) is never executed.

I'm not really sure how to proceed from here -- I don't think I'm familiar enough with the code (or GDB) to debug why the queue is blocked.
After reproducing this somewhat reliably with minimal steps (just share audio+video and wait), I think this bug is a major problem. As has already been pointed out, this isn't a notification issue; the media itself never stops streaming after the tab is closed.

That means:
1) The notification hangs around after the tab is closed (this is actually the smallest issue IMO)
2) The media continues to stream, draining the user's battery and raising the privacy concerns of recording when it's not expected or wanted
3) To top it all off, Firefox can't even be closed using the normal method (swiping from the recent apps list) since we're using a foreground notification, keeping Firefox alive forever. The only way out is to force close Firefox as Aaron mentioned in comment 11, which many users won't know how to do.
Summary: Local shared active media notification does not dismiss on tab close; possible gUM leak → Local shared active media does not stop recording on tab close
Yikes. That's a whole lot worse than we initially expected. I'm going to followup in regards to this bug and webrtc crasher over email to discuss if we should still ship on 24 with these issues present.
This is also trivial to reproduce with Jason's script @ https://bugzilla.mozilla.org/attachment.cgi?id=799865
Severity: normal → major
Monica - Can you comment here to indicate if from a privacy perspective that you are okay or not okay with shipping with this webrtc bug? See comment 22 for context on the privacy concern.
Flags: needinfo?(mmc)
I agree with Brian. That the media continues to stream after the tab is closed violates the privacy principles of no surprises and user control (http://www.mozilla.org/en-US/privacy/).
Flags: needinfo?(mmc)
(In reply to Brian Nicholson (:bnicholson) from comment #22)
> After reproducing this somewhat reliably with minimal steps (just share
> audio+video and wait), I think this bug is a major problem. As has already
> been pointed out, this isn't a notification issue; the media itself never
> stops streaming after the tab is closed.
> 
> That means:
> 1) The notification hangs around after the tab is closed (this is actually
> the smallest issue IMO)
> 2) The media continues to stream, draining the user's battery and raising
> the privacy concerns of recording when it's not expected or wanted
> 3) To top it all off, Firefox can't even be closed using the normal method
> (swiping from the recent apps list) since we're using a foreground
> notification, keeping Firefox alive forever. The only way out is to force
> close Firefox as Aaron mentioned in comment 11, which many users won't know
> how to do.

I've been able to reproduce this as well on a Galaxy S4 - please call/txt (phonebook) if you need my help.
(In reply to Monica Chew [:mmc] (please use needinfo) from comment #26)
> I agree with Brian. That the media continues to stream after the tab is
> closed violates the privacy principles of no surprises and user control

The camera remains on, however, the application that was getting the data is gone, and origin restrictions mean no one else should be able to get at it (as normal).  The only issue for the user will be that the camera is indicated as on (and it may drain their battery) - neither of which are good, but there's no actual privacy leak so far as I know.  

I'm actively trying to narrow down exactly what's happening.  it sounds like something has hung the MediaThread so it never processes the STOP.
bnicholson has been able to repro it and get me logs from MediaThread with some debugs forced on.   It seems highly timing dependent; he needs to clear all the app data/profile before running the STR to get it to repro on the N4.  I'm still totally unable to repro it on a Galaxy S4 or a Nexus 10 (using either STR).

Audio is hanging in shutdown on his N4.  It's hanging in mVoeBase->StopSend() somewhere (webrtc.org code).  This hangs the thread and it never gets to shut down audio or video capture.  (Also note per above that capture continues, but nothing will be consuming the audio/video streams.)

If I had to guess, it's a deadlock between _callbackCritSect (grabbed in Channel::StopSend() - which does very little under lock - and _shared->crit_sect() (grabbed in VoEBaseImpl::StopSend(), which calls Channel::StopSend()).  but that's just a guess right now, and may well be wrong until I get to either debug it locally or get deeper into the callstack.  I'll have bnicholson see if he can repro it on an N4 Debug build, which may let us verify if we're deadlocked and if so where.


One contemplated idea is to add a deadman switch to MediaThread, that calls MOZ_CRASH() if MediaThread takes too long to process a command (to stop).  This would guarantee we don't end up hanging around no matter what causes the thread to hang (and it's a much safer/easier/guaranteed solution), and much easier to feel confident about.
A watchdog here seems like a useful debugging aid but I think we should actually
fix the underlying problem before ship.
Try: https://tbpl.mozilla.org/?tree=Try&rev=d6da23f71804   This is a deadman switch to avoid any potential issues if anything locks up starting or stopping media.  Tested on Linux and Android (though I can't repro this bug, so if others who can would try this I'd appreciate it).  I'll post an easier-to-read diff (I had to move a class, so the patch looks bigger than it is)
Attached patch deadman.patchSplinter Review
easier-to-read patch that assumes the class was already moved
FYI, bnicholson verified it deadmans correctly when this problem is hit on his N4.  He's going to try to get me a post-failure thread dump so I can identify the likely deadlock.
Attachment #800796 - Flags: review?(benjamin)
bnicholson has 
a) gotten me stacks that show MediaManager (MediaThread) is in ~AudioRecord(), and it's waiting on the delivery thread to exit.  That thread may be deadlocked somewhere in the webrtc.org code against the shutdown code on some critical section locks. 
b) he verified it does happen in Beta, and that the task is hung in the same place (~AudioRecord()).  This also means it can't be an regression from bug 884365.

We're trying to get better stacks to identify the mutexes involved
Slightly ugly root-cause fix attempt that adds a second lock for callback<->recordthread communication, so the callback never touches crit_sect_, and StopRecording() doesn't hold callback_crit_sect while destroying the openlsles objects.  Working on a cleaner alternative.
alternate patch that drops the crit_sect lock while releasing the opensles objects
Assignee: bnicholson → rjesup
Comment on attachment 800968 [details] [diff] [review]
use a separate critical section for the recording callback

Review of attachment 800968 [details] [diff] [review]:
-----------------------------------------------------------------

Do we need to hold the new lock in StartRecording?

Makes sense overall, though messy due to adding a new lock.
Using this try-build [1], this works for me. Unable to reproduce with the second patch posted. Great stuff!

[1] https://tbpl.mozilla.org/?tree=Try&rev=d713d5895ebe
Add a few more pedantic callback_lock grabs to ensure is_recording_ is accessed under lock always
Attachment #800968 - Attachment is obsolete: true
Attachment #801071 - Attachment description: use a separate critical section for the recording callback → Version 1: use a separate critical section for the recording callback
Attachment #801071 - Flags: review?(mwu)
Attachment #800980 - Attachment description: Avoid audio shutdown deadlock in opensles by dropping lock while releasing opensles → Version 2: Avoid audio shutdown deadlock in opensles by dropping lock while releasing opensles
Attachment #800980 - Flags: review?(mwu)
Attachment #800980 - Flags: review?(mwu) → review+
Attachment #801071 - Flags: review?(mwu) → review+
I prefer the 2nd approach of dropping the lock while releasing opensles, but the 1st approach is fine too. Pick whichever one seems less risky to you.
https://hg.mozilla.org/integration/mozilla-inbound/rev/69e78915fc92

Landed this one because while both were r+'d, and mwu liked #2 and blassey liked #1 (though both were ok with both patches), I think #1 is less prone to errors or unexpected results (though the cases that would cause that are almost certainly not hittable in our codebase, like calling StopRecording() from a second thread while one thread was already in the middle (unlocked) part of StopRecording().
Target Milestone: --- → Firefox 26
Comment on attachment 801071 [details] [diff] [review]
Version 1: use a separate critical section for the recording callback

pre-emptive uplift request per release-coordination meeting - patch is on inbound

[Approval Request Comment] (for Aurora and Beta)
Bug caused by (feature/regressing bug #): N/A

User impact if declined: Certain devices or OS versions may experience random failures to shut down media capture, leading to the camera/mic being locked and burning battery.

Testing completed (on m-c, etc.): Tested by bnicholson on a local build.  aaronmt tested the alternative patch with no problems - different patch, but shows that we correctly identified the cause of the lockups.

Risk to taking this patch (and alternatives if risky): Risk is low, and most likely failure mode would be a mutex deadlock (which is what we had before).

String or IDL/UUID changes made by this patch: none
Attachment #801071 - Flags: approval-mozilla-beta?
Attachment #801071 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/69e78915fc92
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Comment on attachment 801071 [details] [diff] [review]
Version 1: use a separate critical section for the recording callback

Approving the low-risk patch for uplift as this issue is a shipping blocker for webRTC on Android.

Please make sure to land asap to get this into tomorrow's beta.

Request to QA to verify this on a couple of devices post mozilla-beta landing.
Attachment #801071 - Flags: approval-mozilla-beta?
Attachment #801071 - Flags: approval-mozilla-beta+
Attachment #801071 - Flags: approval-mozilla-aurora?
Attachment #801071 - Flags: approval-mozilla-aurora+
I am still *on occasion* hitting this bug albeit not as frequent as prior only on mozilla-beta (24.0b10) and mozilla-aurora (09/10) thus far. I am testing on mozilla-central (09/10) currently and forcefully trying to reproduce this but no luck yet.

I am using my LG Nexus 4 (Android 4.3) and visiting http://talky.io/123, after sharing my camera and microphone, I close the tab and I see the original issue as reported and as seen in my console output with the continuation of preview frame logging. I am hitting this not as frequent as before. 

See video: http://www.youtube.com/watch?v=jurvzQTD9Pc

This leads me to believe that it might be possible to question wether there is there any possible code variance in the associated files between mozilla-central and mozilla-beta here that could cause contention or any potential mishap in the uplift?
Comment on attachment 800796 [details] [diff] [review]
Implement MediaThread start/stop capture deadman switch

I'm finding it very hard to review this because the class decl moved, and I'm not sure what is existing code and what's new. Is it possible to split it up so that moving the class decl is in one rubberstamp patch and the actual function change is separate?
Attachment #800796 - Flags: review?(benjamin)
Comment on attachment 800797 [details] [diff] [review]
deadman.patch

this is the patch that assumes the class has been moved first
Attachment #800797 - Flags: review?(benjamin)
Comment on attachment 800797 [details] [diff] [review]
deadman.patch

The major problem with this patch is the lack of guards for things like process sleep. It at least requires that we reset or cancel the timer if the computer goes to sleep.

I'm also a little concerned about what happens if the computer clock changes. I'm not sure what clock nsITimer uses to schedule its timeouts, but if it's a system clock, this will not survive clock resets and it needs to be double-checked against the results of mozilla::TimeStamp.
Attachment #800797 - Flags: review?(benjamin) → review-
(In reply to Aaron Train [:aaronmt] from comment #46)
> I am still *on occasion* hitting this bug albeit not as frequent as prior

can you help confirm the frequency here ? Is it 1/10, 1/15..etc ?

> only on mozilla-beta (24.0b10) and mozilla-aurora (09/10) thus far. I am
> testing on mozilla-central (09/10) currently and forcefully trying to
> reproduce this but no luck yet.
> 
> I am using my LG Nexus 4 (Android 4.3) and visiting http://talky.io/123,
> after sharing my camera and microphone, I close the tab and I see the
> original issue as reported and as seen in my console output with the
> continuation of preview frame logging. I am hitting this not as frequent as
> before. 
> 
> See video: http://www.youtube.com/watch?v=jurvzQTD9Pc
> 
> This leads me to believe that it might be possible to question wether there
> is there any possible code variance in the associated files between
> mozilla-central and mozilla-beta here that could cause contention or any
> potential mishap in the uplift?

Jesup, any ideas here ?
(In reply to bhavana bajaj [:bajaj] from comment #50)
> (In reply to Aaron Train [:aaronmt] from comment #46)
> > I am still *on occasion* hitting this bug albeit not as frequent as prior
> 
> can you help confirm the frequency here ? Is it 1/10, 1/15..etc ?

Well, I hate hard numbers but I just did a 1/8 attempt on Firefox 24 Beta 10.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> Comment on attachment 800797 [details] [diff] [review]
> deadman.patch
> 
> The major problem with this patch is the lack of guards for things like
> process sleep. It at least requires that we reset or cancel the timer if the
> computer goes to sleep.

TimerThread recomputes all timeouts after a sleep/suspend: See TimerThread::DoAfterSleep() and it doesn't fire them during a sleep/suspend.

> 
> I'm also a little concerned about what happens if the computer clock
> changes. I'm not sure what clock nsITimer uses to schedule its timeouts, but
> if it's a system clock, this will not survive clock resets and it needs to
> be double-checked against the results of mozilla::TimeStamp.

It uses PR_WaitCondVar(), which devolves down to pthread_cond_timedwait() at the core.  From the manpage:
  For cases when the system clock is advanced discontinuously by an operator,
  it is expected that implementations process any timed wait expiring at an 
  intervening time as if that time had actually occurred.

This would imply a clock change could cause all our timer to fire early (or very late, though that may be self-correcting by the insertion of new events).  However, when TimerThread wakes up, it compares TimeStamp::Now() to the calculated TimeStamp for the timer to fire, and if we're not past that time, it will sleep again.   So a clock change forward would cause TimerThread to wake up early, but not fire the timer early, assuming TimeStamp::Now() is monotonic.
From #mobile testing out Randell's Try builds for the same issue here on m-b 

* https://tbpl.mozilla.org/?tree=Try&rev=686e87689507
** Reproducible. Fail-safe crashes as expected https://crash-stats.mozilla.com/report/index/1a240b13-f689-4828-a9a1-84cfa2130910

* https://tbpl.mozilla.org/?tree=Try&rev=36f3fdb22427
** This works for me. I'm unable to reproduce the issues reported with this Try (m-b) build. This looks to be the silver-bullet.
upstream patch from webrtc.org
Comment on attachment 802453 [details] [diff] [review]
Upstream deadlock fix for issue 2201 (rev 4589)

r+ from me and upstream (fishman)
Attachment #802453 - Flags: review+
FYI, see comment 52 which I believe answers your comments.

That said, I don't think we need the deadman if we have a verified root-cause fix
Flags: needinfo?(benjamin)
hrm, ok. In that case I don't think we should use the deadman. Thanks for the investigation.
Flags: needinfo?(benjamin)
(In reply to Randell Jesup [:jesup] from comment #56)
> FYI, see comment 52 which I believe answers your comments.
> 
> That said, I don't think we need the deadman if we have a verified
> root-cause fix


Can you please help with the risk-evaluation of this upstream root-cause fix that is planning to be applied in a new beta that may be spun for this ? Also has this had any baketime/testing yet on other channel ? How confident are we with this not introducing new regressions at this point ?
verified fixed on android 24.0 beta 10 build-candidate #2
Backed out from esr24 because it's mobile-only and shouldn't have been merged there.
https://hg.mozilla.org/releases/mozilla-esr24/rev/c9b5ea714522
Verified fixed on 
Build: Firefox for Android 25 Beta 2 
Device: Samsung Galaxy Nexus
OS: Android 4.1.1
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: