Hang/deadlock with Flash on 4.4 on Tegra3

RESOLVED FIXED in Firefox 27



Firefox for Android
3 years ago
3 years ago


(Reporter: snorp, Assigned: snorp)


Firefox 29
Dependency tree / graph

Firefox Tracking Flags

(firefox27 fixed, firefox28 fixed, firefox29 fixed)



(3 attachments, 1 obsolete attachment)

With the patches from bug 935676 applied, I get a pretty frequent deadlock in Flash.
Depends on: 935676
Created attachment 8357255 [details]
ANR from the hang
This is also one of the most frequent ANRs in the wild.
(In reply to Jim Chen [:jchen :nchen] from comment #2)
> This is also one of the most frequent ANRs in the wild.

Oh good then it's probably not a 4.4 problem.
Summary: Hang/deadlock with Flash on 4.4 → Hang/deadlock with Flash
Actually I can't seem to reproduce this so far on 4.2...
Summary: Hang/deadlock with Flash → Hang/deadlock with Flash on 4.4

Comment 5

3 years ago
Anyone have any idea when this whole 4.4/Flash issue might be solved? Thank You all for the work you are doing alsol
(In reply to toquinto253 from comment #5)
> Anyone have any idea when this whole 4.4/Flash issue might be solved? Thank
> You all for the work you are doing alsol

Bug 935676 is where most of that work is happening
I figured out why this hang is occuring.

The plugin is invalidating at appropriate times (for the most part), by calling nsPluginInstanceOwner::RedrawPlugin(). Sometimes this does not actually trigger a paint, so changes to the layer are not made, etc. Pre-4.4, this resulted in the Flash content not showing until something else triggered a flush. On 4.4, the buffer queue fills up and the producer (Flash) blocks until the consumer (our compositor) makes one available.

nsPluginInstanceOwner::RedrawPlugin() simply calls nsFrame::InvalidateLayer() for the PLUGIN layer. This then calls nsFrame::InvalidateFrame[Internal], which AFAICT only schedules paints for popups or descendants of popups.

Matt/Roc, should nsFrame::InvalidateLayer() also SchedulePaint() for the weird plugin case there?
Flags: needinfo?(roc)
Flags: needinfo?(matt.woodrow)
Created attachment 8358471 [details] [diff] [review]
Always schedule a paint when plugins invalidate without a layer

This does fix the bug, but not sure if it's really correct.
Attachment #8358471 - Flags: review?(roc)
Attachment #8358471 - Flags: review?(matt.woodrow)
tn helped me debug this some earlier today. We discovered that Flash causes invalidations during paint after the view manager flush has been revoked, but before the repaint state bits have been cleared. The code in nsFrame::InvalidateFrameInternal assumes that if there are frames with the state bits, then a view manager flush is already scheduled. In this case that is not true.

It's not clear to me what the correct fix really is, but maybe we should just verify that the refresh driver has a pending flush in InvalidateFrameInternal. If it doesn't, then we can schedule a paint then.
It turns out that Flash was invalidating in response to nsNPAPIPluginInstance::NotifySize, which we call in nsPluginInstanceContainer::GeImageContainer, which is of course called during paint. This explains the somewhat strange behavior.

The fix for now seems to be to move the NotifySize call to the top of GetImageContainer, that way the invalidation isn't even necessary (at least initially).
Created attachment 8359272 [details] [diff] [review]
Fix invalidation issue with Flash on Android
Attachment #8358471 - Attachment is obsolete: true
Attachment #8358471 - Flags: review?(roc)
Attachment #8358471 - Flags: review?(matt.woodrow)
Attachment #8359272 - Flags: review?(tnikkel)
Flags: needinfo?(roc)
Flags: needinfo?(matt.woodrow)
Comment on attachment 8359272 [details] [diff] [review]
Fix invalidation issue with Flash on Android

I don't quite understand why just moving the call earlier means we don't get an invalidate, can you expand on that?
(In reply to Timothy Nikkel (:tn) from comment #12)
> Comment on attachment 8359272 [details] [diff] [review]
> Fix invalidation issue with Flash on Android
> I don't quite understand why just moving the call earlier means we don't get
> an invalidate, can you expand on that?

Right now we do still attempt to do one, but it's not strictly necessary in this circumstance because the stuff that uses the newly-invalidated values is right below the NotifySize call.
Alright, here's a complete explanation of what's happening currently:

After a Flash plugin is instantiated, we of course try to paint it. This starts in nsObjectFrame::BuildLayer, which then calls nsPluginInstanceOwner::GetImageContainer. On Android, GetImageContainer needs a few things from the nsNPAPIPluginInstance. The actual image content is in a SharedTextureHandle, created by nsNPAPIPluginInstance::CreateSharedHandle. CreateSharedHandle returns a wrapper for the native android surface that the plugin is drawing into.

On modern Android (such as KitKat), we do not pre-create any surfaces for the plugin. The plugin asks for surfaces as it needs them through the functions in ANPNativeWindow.cpp. If it doesn't request one before we get to nsNPAPIPluginInstance::CreateSharedHandle (which is typical on the first paint), then we return null (or zero actually).

It turns out that the nsNPAPIPluginInstance::NotifySize() is what prompts Flash to start requesting surfaces and such. If we do that at the end of nsPluginInstance::GetImageContainer(), the resulting invalidations caused by NotifySize() are basically ignored because of comment 9, but we already have a null surface in the current paint, so we get into a bad state here. Moving the NotifySize ensures that things are in a good state from the beginning. We could optionally preallocate the content surface, but I like this better.
Comment on attachment 8359272 [details] [diff] [review]
Fix invalidation issue with Flash on Android

Okay, thank you.

Did you get anywhere in being able to stop or detect invalidations happening during painting like this?
Attachment #8359272 - Flags: review?(tnikkel) → review+
Assignee: nobody → snorp
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 29
Comment on attachment 8359272 [details] [diff] [review]
Fix invalidation issue with Flash on Android

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 721741
User impact if declined: blurry Flash content
Testing completed (on m-c, etc.): m-c for a day
Risk to taking this patch (and alternatives if risky): Low, only touches plugin code for Android
String or IDL/UUID changes made by this patch: None
Attachment #8359272 - Flags: approval-mozilla-beta?
Attachment #8359272 - Flags: approval-mozilla-aurora?
Oops I was confusing this with another bug. The "user impact" should be "hangs the browser".


3 years ago
Attachment #8359272 - Flags: approval-mozilla-beta?
Attachment #8359272 - Flags: approval-mozilla-beta+
Attachment #8359272 - Flags: approval-mozilla-aurora?
Attachment #8359272 - Flags: approval-mozilla-aurora+
status-firefox27: --- → fixed
status-firefox28: --- → fixed
status-firefox29: --- → fixed
Fix a stupid rebase typo:
Looks like this isn't totally fixed. The applet at http://people.mozilla.org/~mwargers/tests/flash/ hangs fairly quickly with the same stack.
Resolution: FIXED → ---
The patch here does help the problem, but we can still deadlock. Let's take another look, shall we?

Here's the Gecko thread:

"Gecko" prio=5 tid=15 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x41ec2d40 self=0x68749688
  | sysTid=16343 nice=0 sched=0/0 cgrp=apps handle=1752472288
  | state=S schedstat=( 13035856000 3091649000 7368 ) utm=1219 stm=84 core=0
  #00  pc 000009bc  /system/lib/libc.so (__futex_syscall3+4294832136)
  #01  pc 0000ef7c  /system/lib/libc.so (__pthread_cond_timedwait_relative+48)
  #02  pc 0000efdc  /system/lib/libc.so (__pthread_cond_timedwait+64)
  #03  pc 000255ed  /system/lib/libgui.so (android::BufferQueue::dequeueBuffer(int*, android::sp<android::Fence>*, bool, unsigned int, unsigned int, unsigned int, unsigned int)+508)
  #04  pc 0002e57b  /system/lib/libgui.so (android::Surface::dequeueBuffer(ANativeWindowBuffer**, int*)+98)
  #05  pc 0002ea31  /system/lib/libgui.so (android::Surface::hook_dequeueBuffer_DEPRECATED(ANativeWindow*, ANativeWindowBuffer**)+32)
  #06  pc 0000105d  /system/lib/libnvwsi.so
  #07  pc 000020d9  /system/lib/libnvwsi.so
  #08  pc 0000b58d  /system/lib/egl/libEGL_tegra.so
  #09  pc 0000b7e9  /system/lib/egl/libEGL_tegra.so
  #10  pc 00006fb5  /system/lib/egl/libEGL_tegra.so
  #11  pc 000077db  /system/lib/egl/libEGL_tegra.so (eglMakeCurrent+34)
  #12  pc 0000d3e5  /system/lib/libEGL.so (android::egl_display_t::makeCurrent(android::egl_context_t*, android::egl_context_t*, void*, void*, void*, void*, void*, void*)+84)
  #13  pc 0000f9c5  /system/lib/libEGL.so (eglMakeCurrent+240)
  #14  pc 0055eb15  /data/app-lib/com.adobe.flashplayer-1/libflashplayer.so
  at org.mozilla.gecko.mozglue.GeckoLoader.nativeRun(Native Method)

Apparently (on Nexus 7, at least) Flash draws into the surface with GL. Here it's just doing a MakeCurrent(), which for some reason is causing the Tegra driver to try to dequeue a new buffer for drawing (maybe it's the first MakeCurrent on that surface). We end up blocking on a semaphore, the likely cause being that there are no buffers available and we are in "can block" mode (http://androidxref.com/4.4.2_r1/xref/frameworks/native/libs/gui/BufferQueue.cpp#371). So we are blocked until a buffer is freed up. Because we are within android::egl_display_t::makeCurrent, we also hold the global EGL lock (!) (http://androidxref.com/4.4.2_r1/xref/frameworks/native/opengl/libs/EGL/egl_display.cpp#328).

Here's the compositor thread (stack is a little messed up towards the end, but doesn't matter)

  #00  pc 000009bc  /system/lib/libc.so (__futex_syscall3+4294832136)
  #01  pc 0000e724  /system/lib/libc.so
  #02  pc 0000c4d7  /system/lib/libEGL.so
  #03  pc 0000ce69  /system/lib/libEGL.so (android::egl_display_t::getDisplay(void*)+16)
  #04  pc 00030597  /system/lib/libgui.so (android::SyncFeatures::SyncFeatures()+22)
  #05  pc 00028535  /system/lib/libgui.so (android::Singleton<android::SyncFeatures>::getInstance()+32)
  #06  pc 000287ed  /system/lib/libgui.so (android::GLConsumer::syncForReleaseLocked(void*)+20)
  #07  pc 00028b5b  /system/lib/libgui.so (android::GLConsumer::updateAndReleaseLocked(android::IGraphicBufferConsumer::BufferItem const&)+146)
  #08  pc 00028ceb  /system/lib/libgui.so (android::GLConsumer::updateTexImage()+142)
  #09  pc 0007d045  /system/lib/libandroid_runtime.so
  #10  pc 0001dbcc  /system/lib/libdvm.so (dvmPlatformInvoke+112)
  #11  pc 0004e123  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
  #12  pc 00026fe0  /system/lib/libdvm.so
  #13  pc 0002dfa0  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
  #14  pc 0002b638  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184)
  #15  pc 00060581  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+336)
  #16  pc 0004cf87  /system/lib/libdvm.so
  #17  pc 0007c275  /dev/ashmem/libxul.so (deleted)
  at android.graphics.SurfaceTexture.nativeUpdateTexImage(Native Method)
  at android.graphics.SurfaceTexture.updateTexImage(SurfaceTexture.java:169)
  at dalvik.system.NativeStart.run(Native Method)
  at dalvik.system.NativeStart.run(Native Method)

Here we're trying to consume the SurfaceTexture via updateTexImage(). ST calls GLConsumer::updateAndReleaseLocked() which acquires a new buffer and releases the old one. There is apparently some EGL setup code needed before actually releasing the buffer, and it's here where we deadlock. android::egl_display_t::getDisplay needs to acquire the same global lock we already acquired above, so we're hosed. The Gecko thread is waiting for us to release a buffer, and the Compositor can't do that because it's waiting on a lock that Gecko (Flash) holds.

Possible solutions, in no particular order:

1) Try to get BufferQueue in "cannot block" mode. Not sure how this works really, need to investigate more. Also not sure how EGL/Flash will like getting null surfaces.

2) Maybe try to get Flash to stop drawing with GL. IIRC, it only tries to do this on NVIDIA devices. My test device is a 2012 Nexus 7 (Tegra3). I'm going to put KitKat on my Nexus 10 to see if I can repro there. If this works, the worst case is that we block NVIDIA devices from running Flash.

3) We can try calling SurfaceTexture::releaseTexImage(), which will release a buffer back earlier than we would otherwise. I don't know if actually have a good place to call this. I think it would need to be done after the eglSwapBuffers() call for the frame. I don't believe this would eliminate the possibility of deadlock entirely, however.
Alright, the hang does not occur on the Nexus 10. We blacklisted Flash on NVIDIA once before (bug 703056), so maybe it's time to do so again. Alternatively, maybe we can find a way to disable the GL rendering on those devices.
Created attachment 8363266 [details] [diff] [review]
Disable Flash on Tegra devices running KitKat
Attachment #8363266 - Flags: review?(blassey.bugs)
It looks like the only way to disable the GL rendering is to install a mms.cfg (http://www.developria.com/2010/11/flash-player-101-mobile-optimi.html), which we can't do without root. The attached patch just disables Flash on Tegra devices running KitKat and higher.
Android bug filed here: https://code.google.com/p/android/issues/detail?id=65086
Summary: Hang/deadlock with Flash on 4.4 → Hang/deadlock with Flash on 4.4 on Tegra3
Attachment #8363266 - Flags: review?(blassey.bugs) → review+
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Blocks: 786996
Depends on: 1025090
You need to log in before you can comment on or make changes to this bug.