Repeated WebRTC sessions exhaust the JNI LocalRef table

RESOLVED FIXED in Firefox 26

Status

()

Core
WebRTC: Audio/Video
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: gcp, Assigned: gcp)

Tracking

({verifyme})

unspecified
mozilla27
ARM
Android
verifyme
Points:
---

Firefox Tracking Flags

(firefox26 fixed, firefox27 fixed)

Details

Attachments

(4 attachments, 5 obsolete attachments)

2.25 KB, patch
blassey
: review+
Details | Diff | Splinter Review
10.56 KB, patch
blassey
: review+
Details | Diff | Splinter Review
5.74 KB, patch
blassey
: review+
Details | Diff | Splinter Review
29.63 KB, patch
blassey
: review+
Details | Diff | Splinter Review
(Assignee)

Description

4 years ago
Filing this based on the issue discovered here:
https://bugzilla.mozilla.org/show_bug.cgi?id=902431#c27

It may or may not be the cause of that bug, so filing separately.
(Assignee)

Updated

4 years ago
Blocks: 902431
(Assignee)

Comment 1

4 years ago
The problem is this code, which is in any WebRTC entry/setup point:
http://dxr.mozilla.org/mozilla-central/source/content/media/webrtc/MediaEngineWebRTC.cpp#l234

We attach the JNI there, from what I can tell only because we want to be able to delete the GlobalRef, but then don't detach. The upstream webrtc.org code all assumes it can Attach and Detach to the JVM and doesn't need to clean up LocalRefs because Detaching cleans them up anyway. It's also smart enough to check if it's already attached, and sets a flag and avoids the detach in that case.

The above code doesn't detach, which means we're always attached, never clean up LocalRefs, and eventually die after enough WebRTC usage.
(Assignee)

Comment 2

4 years ago
There's several possible things to do here:

1) Change the GetGlobalContext thing to make a single global ref and keep a hold of it for subsequent callers, and forget about cleaning it up. This makes the attach/detach unnecessary.

2) There's some argument if we want to keep the webrtc.org behavior of doing constant attach/detatch, or clean it up to do single attach at the start and keep it attached afterwards, much like AndroidBridge is doing. This does mean we have to clean up all local refs, using PushLocalFrame, possibly coupled with a check if we're attached. I guess we can change the existing AttachAndUseAndroidDeviceInfoObjects and Detach equivalent to use Push/PopLocal if we're already attached.
(Assignee)

Comment 3

4 years ago
Created attachment 807386 [details] [diff] [review]
Patch 1. Add some debug assertions
Assignee: nobody → gpascutto
Attachment #807386 - Flags: review?(blassey.bugs)
(Assignee)

Comment 4

4 years ago
Created attachment 807388 [details] [diff] [review]
Patch 2. Keep a single global Context reference around
Attachment #807388 - Flags: review?(blassey.bugs)
(Assignee)

Comment 5

4 years ago
Created attachment 807390 [details] [diff] [review]
Patch 3. Add more logging for WebRTC startup
Attachment #807390 - Flags: review?(blassey.bugs)
(Assignee)

Comment 6

4 years ago
Created attachment 807391 [details] [diff] [review]
Patch 4. Use JNI frames when not attaching/detaching
Attachment #807391 - Flags: review?(blassey.bugs)
Comment on attachment 807391 [details] [diff] [review]
Patch 4. Use JNI frames when not attaching/detaching

Review of attachment 807391 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/video_capture/android/video_capture_android.cc
@@ +255,5 @@
> +    int res = env->PushLocalFrame(64);
> +    if (res != 0) {
> +        WEBRTC_TRACE(webrtc::kTraceError, webrtc::kTraceVideoCapture, -1,
> +                     "%s: Could not push local stack frame");
> +    }

you shouldn't be pushing and popping local frames across functions. What references are you trying to hold onto with this frame? Really they should just have a global reference.
Attachment #807391 - Flags: review?(blassey.bugs) → review-
Attachment #807386 - Flags: review?(blassey.bugs) → review+
Attachment #807390 - Flags: review?(blassey.bugs) → review+
Comment on attachment 807388 [details] [diff] [review]
Patch 2. Keep a single global Context reference around

Review of attachment 807388 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/AndroidBridge.cpp
@@ +1436,5 @@
>  jobject
>  AndroidBridge::GetGlobalContextRef() {
> +    if (sGlobalContext) {
> +        return sGlobalContext;
> +    } else {

this needs to be done per thread. I think you may be only calling this on one thread, but if that is the case record the thread id and assert that (I'd be fine with that being debug only)
Attachment #807388 - Flags: review?(blassey.bugs) → review-
(Assignee)

Comment 9

4 years ago
>I think you may be only calling this on one thread, but if that is the case record the thread id and assert that (I'd be fine with that being debug only)

I would've presumed that too, but I added the assert and BOOM:
http://dxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#l1336
(Assignee)

Comment 10

4 years ago
Comment on attachment 807388 [details] [diff] [review]
Patch 2. Keep a single global Context reference around

Review of attachment 807388 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC:
http://android-developers.blogspot.ca/2011/11/jni-local-reference-changes-in-ics.html
" A global is usable from any thread, using that thread’s JNIEnv*, and is valid until an explicit call to DeleteGlobalRef()."

This means it's okay to pass the GlobalRef'ed GlobalContext across threads. The other path in the code is using GetJNIForThread so it should use the correct JNIEnv.

<blassey> then you're good
<blassey> comment in the bug telling me I'm wrong
<blassey> and r+
Attachment #807388 - Flags: review- → review+
(Assignee)

Comment 11

4 years ago
Created attachment 807902 [details] [diff] [review]
Patch 4. v2 Use JNI frames when not attaching/detaching

We don't have any global references we want to keep, but the upstream code assumes it can Attach/Detach to free up LocalRefs. In quite a few (but not all - thanks to the GlobalRef caching of the previous patch) of our calling sequences we will actually attach the JVM already, which makes this code not do the attach/detach cycle.

This patch adds an AutoLocalJNIFrame that does the right thing in both circumstances, by noting when the Detach will work and being a NOP in that case, or making a frame if it won't.
Attachment #807391 - Attachment is obsolete: true
Attachment #807902 - Flags: review?(blassey.bugs)
Comment on attachment 807902 [details] [diff] [review]
Patch 4. v2 Use JNI frames when not attaching/detaching

Review of attachment 807902 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed, let's use a RAII pattern for attaching and detaching the current thread. You can call GetEnv() to get the JNIEnv, if it returns JNI_EDETACHED, then call AttachCurrentThread() to get the JNIEnv. If you have to attach, then set a member variable so you can detach in the destructor.

Whether we attach the thread or not, we should push a local frame (it allows us to catch and handle exceptions as well as clean up local refs). Just need to make sure we push after attaching and pop before detaching.

Finally, this patch should be upstreamed
Attachment #807902 - Flags: review?(blassey.bugs) → review-
Comment on attachment 807388 [details] [diff] [review]
Patch 2. Keep a single global Context reference around

Review of attachment 807388 [details] [diff] [review]:
-----------------------------------------------------------------

realized that there is another problem with this. Our context can change (and will change if the phone rotates for instance). Keeping a global reference to that context will keep a stale GeckoActivity in memory along with anything it has references to as well as leave us calling functions on the wrong context.

The best option may be to have GeckoApp call a native method to set the context whenever it changes and store that statically with a global reference. If the static context is non null when we set it, that reference needs to be deleted.

Setting the static context from the main ui thread will have the added benefit of not calling GetJNIForThread and attaching the thread, so we can run as the upstream code expects.
Attachment #807388 - Flags: review+ → review-
(Assignee)

Comment 14

4 years ago
Created attachment 808748 [details] [diff] [review]
Patch 4. v3 Use RAII for all attach/detach/localframes in Android Video
Attachment #807902 - Attachment is obsolete: true
Attachment #808748 - Flags: review?(blassey.bugs)
(Assignee)

Comment 15

4 years ago
>Our context can change (and will change if the phone rotates for instance)

This is not true because we deal with the rotation event ourselves and don't reinitialize. However I'm less certain that applies to GeckoView users.

>The best option may be to have GeckoApp call a native method to set the context whenever it changes and store that statically with a global reference

Alternatively we should use the Application context for GetGlobalContextRef. For your solution it's not so obvious to me what "whenever it changes is", whereas the Application Context is exactly there for things that need a long-lived reference.

>Setting the static context from the main ui thread will have the added benefit of not calling GetJNIForThread and attaching the thread, so we can run as the upstream code expects.

At the cost of adding locking everywhere, as the webrtc.org code is not expecting you to change the context it's operating on under it's feet...
Comment on attachment 808748 [details] [diff] [review]
Patch 4. v3 Use RAII for all attach/detach/localframes in Android Video

Review of attachment 808748 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/video_capture/android/video_capture_android.h
@@ +80,5 @@
> +        }
> +
> +        if (VideoCaptureAndroid::g_jvm->GetEnv((void**) &mJNIEnv, JNI_VERSION_1_4) != JNI_OK) {
> +            // Try to attach the thread and get the env.
> +            // Attach this thread to JVM.

should probably only attach if the return value is JNI_EDETACHED, not sure what the beviour will be if another error is returned

@@ +108,5 @@
> +    }
> +
> +    int Init(JNIEnv*& env,
> +             jclass& javaCmDevInfoClass,
> +             jobject& javaCmDevInfoObject)

I'm not a fan of these public Init() functions. You should be able to pass a JNIEnv into the constructor (though I don't think there will be any callers in the WebRTC code). But please get rid of the out params on the Init/constructors.

Add public GetJNIEnv(), GetCmDevInfoClass() and GetCmDevInfoObject() getters.
Attachment #808748 - Flags: review?(blassey.bugs) → review+
(Assignee)

Comment 17

4 years ago
Created attachment 808851 [details]
Patch 2. v2 Keep a single global Context reference around

Switch to getApplicationContext().
Attachment #807388 - Attachment is obsolete: true
Attachment #808851 - Flags: review?(blassey.bugs)
(Assignee)

Comment 18

4 years ago
Created attachment 808858 [details] [diff] [review]
Patch 2. v3 Keep a single global Context reference around

Add some error detection/logging.
Attachment #808851 - Attachment is obsolete: true
Attachment #808851 - Flags: review?(blassey.bugs)
Attachment #808858 - Flags: review?(blassey.bugs)
(Assignee)

Comment 19

4 years ago
Created attachment 808882 [details] [diff] [review]
Patch 4. v4 Use RAII for all attach/detach/localframes in Android Video

The review comments wrt. passing env to the constructor vs. removing the out params was a bit mutually exclusive, so passing final version for review. We use env as the out parameter and to indicate an error from the constructor.

GetCmDevInfo* are a bit strange to have in AutoLocalFrame now, though I guess for device_info_android.* them being in AutoLocalFrame does mean that you can't actually use them without a valid env+being attached.
Attachment #808748 - Attachment is obsolete: true
Attachment #808882 - Flags: review?(blassey.bugs)
(Assignee)

Comment 20

4 years ago
Comment on attachment 808882 [details] [diff] [review]
Patch 4. v4 Use RAII for all attach/detach/localframes in Android Video

Review of attachment 808882 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/video_capture/android/video_capture_android.h
@@ +160,5 @@
> +                         __FUNCTION__);
> +        }
> +
> +        // Scope will detach JVM when finished, no need to clean up.
> +        if (!mAttached)

This line should go, see previous comment:
"Whether we attach the thread or not, we should push a local frame (it allows us to catch and handle exceptions as well as clean up local refs)."
Comment on attachment 808882 [details] [diff] [review]
Patch 4. v4 Use RAII for all attach/detach/localframes in Android Video

Review of attachment 808882 [details] [diff] [review]:
-----------------------------------------------------------------

r=blassey with nits addressed

::: media/webrtc/trunk/webrtc/modules/video_capture/android/device_info_android.cc
@@ +56,5 @@
>  uint32_t DeviceInfoAndroid::NumberOfDevices() {
>    JNIEnv *env;
> +  AutoLocalJNIFrame JNIFrame(env);
> +  if (!env)
> +      return 0;

I'd really rather this be:

AutoLocalJNIFrame jniFrame();
JNIEnv* env = jniFrame.GetJNIEnv();

if (!env)
    return 0;

reasoning being that out params in constructors are really weird. It is also the same signature as the implementation in AndroidBridge but with a different meaning. Please make that change through out the patch.

Also note the lower case jni, having it upper case makes it look like a class

::: media/webrtc/trunk/webrtc/modules/video_capture/android/video_capture_android.h
@@ +137,5 @@
> +        if (!VideoCaptureAndroid::g_jvm) {
> +            WEBRTC_TRACE(webrtc::kTraceError, webrtc::kTraceVideoCapture, -1,
> +                         "%s: SetAndroidObjects not called with a valid JVM.",
> +                         __FUNCTION__);
> +            return -1;

null out mJNIEnv on errors, or better yet start out by nulling it out.
Attachment #808882 - Flags: review?(blassey.bugs) → review+
Attachment #808858 - Flags: review?(blassey.bugs) → review+
(Assignee)

Comment 22

4 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/cd7c48f51a2e
https://hg.mozilla.org/integration/mozilla-inbound/rev/f1184809ec39
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca438424df3a
https://hg.mozilla.org/integration/mozilla-inbound/rev/c57b52c33aab
https://hg.mozilla.org/mozilla-central/rev/cd7c48f51a2e
https://hg.mozilla.org/mozilla-central/rev/f1184809ec39
https://hg.mozilla.org/mozilla-central/rev/ca438424df3a
https://hg.mozilla.org/mozilla-central/rev/c57b52c33aab
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
(Assignee)

Comment 24

4 years ago
STR for QA:

On a Nexus 4:

1) Go to http://www.simpl.info/getusermedia/
2) Approve WebRTC permission dialog
3) Observe video starting
4) Reload site

Unpatched Firefox will crash after about 10 iterations. Patched Firefox shouldn't crash before you get bored.
Keywords: verifyme
Target Milestone: mozilla27 → ---
(Assignee)

Updated

4 years ago
Target Milestone: --- → mozilla27
(Assignee)

Comment 25

4 years ago
Comment on attachment 807386 [details] [diff] [review]
Patch 1. Add some debug assertions

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Android WebRTC
User impact if declined: WebRTC crash after many WebRTC sessions
Testing completed (on m-c, etc.): Landed on m-c
Risk to taking this patch (and alternatives if risky): The whole patchstack is a reasonable amount of changes, all of which remove technical debt/clean things up. At worst it will crash more.
Attachment #807386 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 26

4 years ago
Comment on attachment 807390 [details] [diff] [review]
Patch 3. Add more logging for WebRTC startup

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Android WebRTC
User impact if declined: WebRTC crash after many WebRTC sessions
Testing completed (on m-c, etc.): Landed on m-c
Risk to taking this patch (and alternatives if risky): The whole patchstack is a reasonable amount of changes, all of which remove technical debt/clean things up. At worst it will crash more.
Attachment #807390 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 27

4 years ago
Comment on attachment 808858 [details] [diff] [review]
Patch 2. v3 Keep a single global Context reference around

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Android WebRTC
User impact if declined: WebRTC crash after many WebRTC sessions
Testing completed (on m-c, etc.): Landed on m-c
Risk to taking this patch (and alternatives if risky): The whole patchstack is a reasonable amount of changes, all of which remove technical debt/clean things up. At worst it will crash more.
Attachment #808858 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 28

4 years ago
Comment on attachment 808882 [details] [diff] [review]
Patch 4. v4 Use RAII for all attach/detach/localframes in Android Video

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Android WebRTC
User impact if declined: WebRTC crash after many WebRTC sessions
Testing completed (on m-c, etc.): Landed on m-c
Risk to taking this patch (and alternatives if risky): The whole patchstack is a reasonable amount of changes, all of which remove technical debt/clean things up. At worst it will crash more.
Attachment #808882 - Flags: approval-mozilla-aurora?
(In reply to Gian-Carlo Pascutto (:gcp) from comment #28)
> Comment on attachment 808882 [details] [diff] [review]
> Patch 4. v4 Use RAII for all attach/detach/localframes in Android Video
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): Android WebRTC
> User impact if declined: WebRTC crash after many WebRTC sessions
> Testing completed (on m-c, etc.): Landed on m-c
> Risk to taking this patch (and alternatives if risky): The whole patchstack
> is a reasonable amount of changes, all of which remove technical debt/clean
> things up. At worst it will crash more.

If it risks crashing more, then why would this be safe enough for an aurora uplift?
(Assignee)

Comment 30

4 years ago
Because the expected behavior is for it crash far less. Worst case is exactly what it says.
Comment on attachment 807386 [details] [diff] [review]
Patch 1. Add some debug assertions

I understand what you meant, let's get this on Aurora and watch the results - we have time to back out if there's higher crashing (which is not expected).
Attachment #807386 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #807390 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #808858 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #808882 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Assignee)

Comment 32

4 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/af8bc8ef89c4
https://hg.mozilla.org/releases/mozilla-aurora/rev/c8abbbef2a67
https://hg.mozilla.org/releases/mozilla-aurora/rev/bc7b5e1c60a2
https://hg.mozilla.org/releases/mozilla-aurora/rev/891eaf3a213e
status-firefox26: --- → fixed
status-firefox27: --- → fixed
You need to log in before you can comment on or make changes to this bug.