Last Comment Bug 826829 - Crash when sending camera app to background: mozalloc_abort | pthread_mutex_unlock | ?? (mozalloc_abort() -- or -- abort() under libutils.so!android::RefBase::weakref_type::decWeak)
: Crash when sending camera app to background: mozalloc_abort | pthread_mutex_u...
Status: VERIFIED FIXED
[b2g-crash] [cr 437569], QARegressExc...
: crash
Product: Core
Classification: Components
Component: DOM: Device Interfaces (show other bugs)
: unspecified
: ARM Gonk (Firefox OS)
: -- critical (vote)
: mozilla21
Assigned To: Sotaro Ikeda [:sotaro]
:
Mentors:
: 828289 (view as bug list)
Depends on:
Blocks: 827833 832119
  Show dependency treegraph
 
Reported: 2013-01-04 13:18 PST by Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
Modified: 2013-03-27 12:50 PDT (History)
25 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
tef+
-
wontfix
wontfix
fixed
+
verified
fixed


Attachments
Minidump + dumplookup (49.38 KB, text/plain)
2013-01-04 13:18 PST, Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
no flags Details
logout when abort gets called (manually added logouts) (23.60 KB, text/plain)
2013-01-08 03:02 PST, Sotaro Ikeda [:sotaro]
no flags Details
Part 1 - change ImageContainerChild::SendFlush() to sync (919 bytes, patch)
2013-01-08 08:09 PST, Sotaro Ikeda [:sotaro]
kchen: review+
Details | Diff | Splinter Review
Deadlock log after applying attachment 699225 (4.05 KB, text/plain)
2013-01-08 09:36 PST, Sotaro Ikeda [:sotaro]
no flags Details
Part 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread (6.02 KB, patch)
2013-01-09 09:06 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread (7.31 KB, patch)
2013-01-10 07:42 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread (7.29 KB, patch)
2013-01-10 07:52 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part 2 rev 3 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread (7.39 KB, patch)
2013-01-10 09:20 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part 1 rev 2 - change ImageContainerChild::SendFlush() to sync (1001 bytes, patch)
2013-01-10 09:56 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part 2 rev 4 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread (7.48 KB, patch)
2013-01-10 09:57 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review
Part2 - change DispatchSetIdle() to synchronous SetIdle() (3.83 KB, patch)
2013-01-11 02:30 PST, Sotaro Ikeda [:sotaro]
kchen: review+
Details | Diff | Splinter Review
Part2 rev2 - change DispatchSetIdle() to synchronous SetIdle() (3.84 KB, patch)
2013-01-11 04:09 PST, Sotaro Ikeda [:sotaro]
no flags Details | Diff | Splinter Review

Description Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 13:18:02 PST
Created attachment 698059 [details]
Minidump + dumplookup

+++ This bug was initially created as a clone of Bug #824224 +++

Seen during MW1 [1].

What happened was, after step 8, when pressing HOME to send the camera to the background, it crashed.

[1] https://wiki.mozilla.org/B2G/Memory_acceptance_criteria#MW1:_Music_stays_alive
Comment 1 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 13:22:59 PST
dumplookup is nice and juicy and useful.  Trimming what appears to be noise, I see

0x43761b14: libutils.so!android::RefBase::weakref_type::decWeak [RefBase.cpp : 412 + 0x1]
0x43761b24: libutils.so!android::RefBase::decStrong [RefBase.cpp : 363 + 0x1]
0x43761b44: libmemalloc.so!gralloc::PmemKernelAlloc::map_buffer [pmemalloc.cpp : 370 + 0x1]
0x43761b54: libutils.so!android::RefBase::weakref_type::decWeak [RefBase.cpp : 412 + 0x1]
0x43761bb4: libxul.so!mozilla::layers::PImageContainerChild::Read [PImageContainerChild.cpp : 1125 + 0x9]
0x43761bcc: libxul.so!mozilla::layers::PImageContainerChild::Read [PImageContainerChild.cpp : 1033 + 0x5]
0x43761be4: libxul.so!mozilla::layers::PImageContainerChild::Read [PImageContainerChild.cpp : 640 + 0xd]
0x43761c0c: libxul.so!mozilla::layers::SurfaceDescriptor::~SurfaceDescriptor [LayersSurfaces.cpp : 487 + 0x1]
0x43761c14: libxul.so!mozilla::layers::PImageContainerChild::Read [PImageContainerChild.cpp : 833 + 0xd]
0x43761c64: libxul.so!mozilla::layers::PImageContainerChild::FatalError [PImageContainerChild.cpp : 528 + 0x1]
0x43761c74: libxul.so!mozilla::layers::PImageContainerChild::OnMessageReceived [PImageContainerChild.cpp : 383 + 0x9]
0x43761cdc: libxul.so!mozilla::layers::PCompositorChild::OnMessageReceived [PCompositorChild.cpp : 637 + 0x9]
0x43761cec: libxul.so!mozilla::layers::PCompositorChild::OnMessageReceived [PCompositorChild.cpp : 630 + 0x1]
0x43761cfc: libxul.so!mozilla::ipc::AsyncChannel::OnDispatchMessage [AsyncChannel.cpp : 473 + 0xb]

This is in code that's trying to look up a PGrallocBufferChild.  I didn't have the patches from bug 824224 applied and I didn't look at the logcat, so I wasn't able to see the abort message.

If the decWeak() can be believed, then this could be a crash at

    const int32_t c = android_atomic_dec(&impl->mWeak);
    LOG_ASSERT(c >= 1, "decWeak called on %p too many times", this);

Alternatively, we could be crashing trying to look up a gralloc buffer actor.
Comment 2 Sotaro Ikeda [:sotaro] 2013-01-07 07:49:48 PST
I could easily reproduce the Camera app crash by repating following [1][2] on Unagi phone.

1. start Camera application. Wait until a camera preview becomes active. 
2. push home button

It happens when I repeat [1][2] about 4-5 times. It is a same crash, I think.
Comment 3 Sotaro Ikeda [:sotaro] 2013-01-07 07:51:25 PST
When camera app crashed, I always see following log.

I/Gecko   (  482): [Child 482] ###!!! ABORT: [PImageContainerChild] abort()ing as a result: file /home/sotaro/build-b2g/B2G/objdir-gecko/ipc/ipdl/PImageContainerChild.cpp, line 527
F/libc    (  482): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
Comment 4 Sotaro Ikeda [:sotaro] 2013-01-07 08:24:59 PST
I trid to use GDB to debug this. But it takes very long time to connect GDB in Camera app's case and failed to debug soon. Therefore, I debug this by adding a lot of log-prints.

From logouts, it seems that the abort happens like following sequence.
-[1] ImageContainerChild calls ImageContainerChild::SendPublishImage()
-[2] camera app stops preview
-[3] GonkNativeWindow free all Gralloc buffers
-[4] GrallocBufferActors are freed by ImageBridgeChild
-[5] ImageContainerParent receive ImageContainerParent::RecvPublishImage()
-[6] ImageContainerParent sends back a SharedImage that refers alredy deleted GrallocBufferActor
-[7] ImageContainerChild received the message from parent, but failed to get SurfaceDescriptorGralloc because it is alredy deleted
Comment 5 Sotaro Ikeda [:sotaro] 2013-01-07 08:27:28 PST
Following causes to abort

> -[6] ImageContainerParent sends back a SharedImage that refers alredy deleted GrallocBufferActor
Comment 6 Sotaro Ikeda [:sotaro] 2013-01-07 08:49:52 PST
There is a function"ImageContainerChild::SendFlush()" to remove SharedImage of parent side. But ImageContainerParent::RecvFlush() was receved after child side aborted when abort happened.
Comment 7 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-07 09:13:26 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #2)
> I could easily reproduce the Camera app crash by repating following [1][2]
> on Unagi phone.
> 
> 1. start Camera application. Wait until a camera preview becomes active. 
> 2. push home button
> 
> It happens when I repeat [1][2] about 4-5 times. It is a same crash, I think.

Yep, that's exactly this bug.
Comment 8 Sotaro Ikeda [:sotaro] 2013-01-07 09:45:19 PST
I got following information from Kan-Ru Chen
 - Rendering and pending buffers are kept by GonkNativeWindow

So, the problem should be prevented by it... I need to investigate more.
Comment 9 Tim Taubert [:ttaubert] 2013-01-08 02:35:27 PST
While recording videos and switching between Camera and Gallery I see the PImageContainerChild abort mentioned here but also:

[Child 2325] ###!!! ABORT: [PBlobStreamChild] abort()ing as a result: file /home/tim/workspace/b2g-desktop/objdir-gecko-unagi/ipc/ipdl/PBlobStreamChild.cpp, line 364

Is this related or should I file a new bug? I do have a backtrace which also contains the first frame from above (mozilla::ipc::AsyncChannel::OnDispatchMessage).
Comment 10 Sotaro Ikeda [:sotaro] 2013-01-08 02:55:06 PST
(In reply to Tim Taubert [:ttaubert] from comment #9)
> While recording videos and switching between Camera and Gallery I see the
> PImageContainerChild abort mentioned here but also:
> 
> [Child 2325] ###!!! ABORT: [PBlobStreamChild] abort()ing as a result: file
> /home/tim/workspace/b2g-desktop/objdir-gecko-unagi/ipc/ipdl/PBlobStreamChild.
> cpp, line 364
> 
> Is this related or should I file a new bug? I do have a backtrace which also
> contains the first frame from above
> (mozilla::ipc::AsyncChannel::OnDispatchMessage).

It seems like that it is a different bug. Can you file a new bug?
Comment 11 Tim Taubert [:ttaubert] 2013-01-08 02:57:10 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #10)
> It seems like that it is a different bug. Can you file a new bug?

Yeah, the Gallery just crashed again in PBlobStreamChild.cpp while in the background right after I started recording. Doesn't seem related to me, will file a new bug.
Comment 12 Sotaro Ikeda [:sotaro] 2013-01-08 03:02:41 PST
Created attachment 699083 [details]
logout when abort gets called (manually added logouts)

I added more printout to confirm following.

> I got following information from Kan-Ru Chen
> - Rendering and pending buffers are kept by GonkNativeWindow

"BufferSlot::RENDERING" just prevents to delete a buffer in GonkNativeWindow::freeBufferLocked(int i).
The buffer is deleted after when CameraGraphicBuffer loses reference from others.

this log shows that.
Comment 13 Sotaro Ikeda [:sotaro] 2013-01-08 08:09:43 PST
Created attachment 699225 [details] [diff] [review]
Part 1 - change ImageContainerChild::SendFlush() to sync

one cause of abort was because of async ImageContainerChild::SendFlush(). This patch change ImageContainerChild::SendFlush() to synchronous.
Comment 14 Sotaro Ikeda [:sotaro] 2013-01-08 08:51:46 PST
By applying attachment 699225 [details] [diff] [review] , I face following problem.
- Sometimes, GonkNativeWindow::freeBufferLocked() falls into deadlock state

It seems that attachment 699225 [details] [diff] [review] hit another defect.
Comment 15 Sotaro Ikeda [:sotaro] 2013-01-08 09:36:18 PST
Created attachment 699293 [details]
Deadlock log after applying attachment 699225 [details] [diff] [review]

get log by following command
> adb shell logcat -v threadtime > log.txt

dead lock happens like following sequence.
- [1] camera application stops preview
- [2] GonkNativeWindow frees all gralloc buffers by calling GonkNativeWindow::abandon() //camera thread
- [3] GonkNativeWindow require mutex lock. //camera thread
- [4] GonkNativeWindow::freeAllBuffersLocked() tries to free all gralloc buffers except rendering state buffers.  //camera thread
- [5] ImageBridgeChild::DeallocSurfaceDescriptorGralloc() is called to deallocate gralloc buffer.
- [6] DeallocSurfaceDescriptorGralloc() post a task of deallocate gralloc buffer to ImageBridge's thread
            and wait until the buffer is deleted
- [7] ImageBridge's thread processes some tasks on the thread.
          one of task is ImageContainerChild::SetIdleNow()
- [8] ImageContainerChild::SetIdleNow() delete gralloc buffer synchronously by change of attachment 699225 [details] [diff] [review]
       and frees some GonkIOSurfaceImage   // ImageBridge's thread
- [9] In GonkIOSurfaceImage::~GonkIOSurfaceImage(), CameraGraphicBuffer::Unlock() is called    // ImageBridge's thread
- [10] In CameraGraphicBuffer::Unlock(), GonkNativeWindow::returnBuffer() is called    // ImageBridge's thread
- [11] GonkNativeWindow::returnBuffer() tries to hold the mutex lock. But it is alrealy hold by [3] in different thread !!!!!!!!
Comment 16 Sotaro Ikeda [:sotaro] 2013-01-08 09:48:42 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #14)
> By applying attachment 699225 [details] [diff] [review] , I face following
> problem.
> - Sometimes, GonkNativeWindow::freeBufferLocked() falls into deadlock state
> 
> It seems that attachment 699225 [details] [diff] [review] hit another defect.

The deadlock seems similar to Bug 826765
Comment 17 Sotaro Ikeda [:sotaro] 2013-01-09 01:10:06 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #14)
> By applying attachment 699225 [details] [diff] [review] , I face following
> problem.
> - Sometimes, GonkNativeWindow::freeBufferLocked() falls into deadlock state
> 
> It seems that attachment 699225 [details] [diff] [review] hit another defect.

I locally confirmed that the deadlock could be prevented by following change.

- dispatch GonkNativeWindow::returnBuffer() to a different thread than ImageBridge's thread

Right now, I use main thread for it. It is better to assign dedicated thread for it, because performance is very important.
Comment 18 Sotaro Ikeda [:sotaro] 2013-01-09 02:16:03 PST
I found 2 patterns to the aborts. Attachment 699225 [details] [diff] tried to address [1]. It is better to address [2] in different bug. 

[1] abort during stopping preview
After calling ImageContainerChild::SendPublishImage(), ImageContainerChild calls ImageContainerChild::SendFlush() and frees child side Graloc buffers.
Then on parent side, ImageContainerParent::RecvPublishImage() is called and send back previous image. But it is already deallocated on child side.

[2] abort on re-starting preview
camera app stop preview correctly and try to start preview again. 
There is a Image(GonkIOSurfaceImage) from last preview. And the Image references to a CameraGraphicBuffer.
When prevew starts, ImageContainer::SetCurrentImage() is called.
And within it, ImageContainerChild::SendImageAsync() is called and mActiveImage is replaced to new Image.
When mActiveImage is replaced, there is no refence to the GonkIOSurfaceImage from previous preview and the the GonkIOSurfaceImage is deleted.
During the destruction of the GonkIOSurfaceImage, CameraGraphicBuffer::Unlock() is called and Gralloc buffer is deleted.
After that on parent side, ImageContainerParent::RecvPublishImage() is called and send back previous image. But it is already deallocated on child side.
Comment 19 Sotaro Ikeda [:sotaro] 2013-01-09 09:06:01 PST
Created attachment 699888 [details] [diff] [review]
Part 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

this patch prevent deadlock
Comment 20 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-09 09:20:17 PST
I'll try to review this tonight.
Comment 21 Sotaro Ikeda [:sotaro] 2013-01-09 22:26:48 PST
 
For [2], I created Bug 828289.

> [2] abort on re-starting preview
> camera app stop preview correctly and try to start preview again. 
> There is a Image(GonkIOSurfaceImage) from last preview. And the Image
> references to a CameraGraphicBuffer.
> When prevew starts, ImageContainer::SetCurrentImage() is called.
> And within it, ImageContainerChild::SendImageAsync() is called and
> mActiveImage is replaced to new Image.
> When mActiveImage is replaced, there is no refence to the GonkIOSurfaceImage
> from previous preview and the the GonkIOSurfaceImage is deleted.
> During the destruction of the GonkIOSurfaceImage,
> CameraGraphicBuffer::Unlock() is called and Gralloc buffer is deleted.
> After that on parent side, ImageContainerParent::RecvPublishImage() is
> called and send back previous image. But it is already deallocated on child
> side.
Comment 22 Lucas Adamski [:ladamski] 2013-01-10 01:46:06 PST
We'll definitely take a fix for these 3 related bugs so please don't stop working on them, but we won't hold the release for a camera app crasher that the user can easily work around.
Comment 23 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 02:47:51 PST
Comment on attachment 699888 [details] [diff] [review]
Part 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

Review of attachment 699888 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by review!

I'm concerned about the thread-safety of 'sCameraGraphicBufferThreadRefCount'.  It looks like Start*() is called from the CameraGraphicBuffer (CGB) ctor, which is called from GonkNativeWindow::getCurrentBuffer() running on the camera driver's frame thread; but Release*() is called from sCameraGraphicBufferThread.  In this case, you should be using PR_ATOMIC_INCREMENT/_DECREMENT[1] on your reference counter.

Alternatively, you could add an nsRefPtr<nsIThread> to each CameraGraphicBuffer, and let the existing framework handle reference counting for you.  I just looked at [2] and nsThread uses NS_IMPL_THREADSAFE*().  I think this would simplify your code considerably.

1. See http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/pratom.h#66
2. http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#150

::: ./src_org/dom/camera/GonkNativeWindow.cpp
@@ +48,5 @@
> +    sCameraGraphicBufferThreadRefCount = 1;
> +
> +    nsresult rv = NS_NewNamedThread("CamGraphicBuf",
> +                                    getter_AddRefs(sCameraGraphicBufferThread));
> +    NS_ENSURE_SUCCESS(rv, );

Just use NS_ENSURE_SUCCESS_VOID(rv).

@@ +54,5 @@
> +
> +
> +static void ReleaseCameraGraphicBufferThread()
> +{
> +  if(--sCameraGraphicBufferThreadRefCount == 0) {

Put a check in here to make sure sCameraGraphicBufferThreadRefCount >= 0.

e.g.
if (NS_UNLIKELY(sCameraGraphicBufferThreadRefCount < 0)) {
  NS_ERROR("sCameraGraphicBufferThreadRefCount is < 0");
  MOZ_CRASH();
}

@@ +55,5 @@
> +
> +static void ReleaseCameraGraphicBufferThread()
> +{
> +  if(--sCameraGraphicBufferThreadRefCount == 0) {
> +    nsresult rv = sCameraGraphicBufferThread->Shutdown();

Based on your current implementation of StartCameraGraphicBufferThread(), --sCameraGraphicBufferThreadRefCount == 0 is not sufficient to guarantee that sCameraGraphicBufferThread != nullptr.  (i.e. if rv != NS_OK in the ctor.)

@@ +56,5 @@
> +static void ReleaseCameraGraphicBufferThread()
> +{
> +  if(--sCameraGraphicBufferThreadRefCount == 0) {
> +    nsresult rv = sCameraGraphicBufferThread->Shutdown();
> +    NS_ENSURE_SUCCESS(rv, );

NS_ENSURE_SUCCESS_VOID(rv)

@@ +649,5 @@
> +void CameraGraphicBuffer::Unlock() MOZ_OVERRIDE
> +{
> +    nsCOMPtr<nsIRunnable> returnBufferTask = new ReturnBufferTask(this);
> +    nsresult rv = sCameraGraphicBufferThread->Dispatch(returnBufferTask, NS_DISPATCH_NORMAL);
> +    NS_ENSURE_SUCCESS(rv, );

NS_ENSURE_SUCCESS_VOID(rv)

::: ./src_org/dom/camera/GonkNativeWindow.h
@@ +240,5 @@
>      typedef mozilla::layers::SurfaceDescriptor SurfaceDescriptor;
>      typedef mozilla::layers::ImageBridgeChild ImageBridgeChild;
>  
>  public:
> +public:

Remove duplicate 'public:' clause.

@@ +269,5 @@
> +        : mCameraGraphicBuffer(aCameraGraphicBuffer)
> +    {
> +    }
> +
> +    virtual ~ReturnBufferTask()

Make this protected.
Comment 24 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-10 02:56:32 PST
Comment on attachment 699888 [details] [diff] [review]
Part 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

Review of attachment 699888 [details] [diff] [review]:
-----------------------------------------------------------------

One more question..

::: ./src_org/dom/camera/GonkNativeWindow.cpp
@@ +634,5 @@
> +  , mGeneration(aGeneration)
> +  , mLocked(true)
> +{
> +    DOM_CAMERA_LOGT("%s:%d : this=%p\n", __func__, __LINE__, this);
> +    StartCameraGraphicBufferThread();

Since we are creating and releasing the camera graphic buffer thread on the creation and destruction of camera graphic buffer, could we just hold a reference to the thread here and let nsCOMPtr to handle the reference count?

Like

ctor:
  mThread = GetCameraGraphicBufferThread();

dtor:
  mThread = nullptr;

::: ./src_org/dom/camera/GonkNativeWindow.h
@@ +240,5 @@
>      typedef mozilla::layers::SurfaceDescriptor SurfaceDescriptor;
>      typedef mozilla::layers::ImageBridgeChild ImageBridgeChild;
>  
>  public:
> +public:

Remove this.

@@ +271,5 @@
> +    }
> +
> +    virtual ~ReturnBufferTask()
> +    {
> +    }

Remove this.
Comment 25 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-10 03:08:27 PST
> @@ +271,5 @@
> > +    }
> > +
> > +    virtual ~ReturnBufferTask()
> > +    {
> > +    }
> 
> Remove this.

What mikeh said.
Comment 26 Sotaro Ikeda [:sotaro] 2013-01-10 07:42:41 PST
Created attachment 700429 [details] [diff] [review]
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

change
- remove thread ref counter
- GonkNativeWindow creates and hold a thread
Comment 27 Sotaro Ikeda [:sotaro] 2013-01-10 07:52:44 PST
Created attachment 700434 [details] [diff] [review]
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

change
- remove thread ref counter
- GonkNativeWindow creates and hold a thread
Comment 28 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-10 08:07:37 PST
Comment on attachment 700434 [details] [diff] [review]
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

Review of attachment 700434 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by!

::: ./src_org/dom/camera/GonkNativeWindow.cpp
@@ +606,5 @@
> +                                    getter_AddRefs(mThread));
> +    NS_ENSURE_SUCCESS_VOID(rv);
> +}
> +
> +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()

nsCOMPtr<nsIThread>

@@ +608,5 @@
> +}
> +
> +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()
> +{
> +    return mThread;

I think we could create the thread here lazily.

::: ./src_org/dom/camera/GonkNativeWindow.h
@@ +82,5 @@
>      // Release all internal buffers
>      void abandon();
>  
>      SurfaceDescriptor *getSurfaceDescriptorFromBuffer(ANativeWindowBuffer* buffer);
> +    nsIThread* getGonkNativeWindowThread();

nsCOMPtr<nsIThread>

@@ +284,5 @@
> +    virtual ~ReturnBufferTask()
> +    {
> +    }
> +
> +protected:

Remove this one.
Comment 29 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 08:22:52 PST
Comment on attachment 700434 [details] [diff] [review]
Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

Review of attachment 700434 [details] [diff] [review]:
-----------------------------------------------------------------

Good work.  This looks much simpler.  Please fix up those issues and this will probably be ready to land after the next review.

::: ./src_org/dom/camera/GonkNativeWindow.cpp
@@ +603,5 @@
> +void GonkNativeWindow::startGonkNativeWindowThread()
> +{
> +    nsresult rv = NS_NewNamedThread("CamGraphicBuf",
> +                                    getter_AddRefs(mThread));
> +    NS_ENSURE_SUCCESS_VOID(rv);

Since there's not much we can do if this fails, it's probably better to assert here.  Something like:

if (NS_UNLIKELY(NS_FAILED(rv))) {
  NS_ERROR("Failed to start camera graphic buffer disposal thread");
  MOZ_CRASH();
}

@@ +606,5 @@
> +                                    getter_AddRefs(mThread));
> +    NS_ENSURE_SUCCESS_VOID(rv);
> +}
> +
> +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()

already_AddRefed<...> ... ?

@@ +608,5 @@
> +}
> +
> +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()
> +{
> +    return mThread;

I think it's better to create it when we init the GonkNativeWindow, so that we don't have to keep checking for a thread on every new CGB.

::: ./src_org/dom/camera/GonkNativeWindow.h
@@ +82,5 @@
>      // Release all internal buffers
>      void abandon();
>  
>      SurfaceDescriptor *getSurfaceDescriptorFromBuffer(ANativeWindowBuffer* buffer);
> +    nsIThread* getGonkNativeWindowThread();

kanru: shouldn't this be
already_AddRefed<nsIThread> getGonkNativeWindowThread(); ?
Comment 30 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-10 08:29:15 PST
(In reply to Mike Habicher [:mikeh] from comment #29)
> > +    nsIThread* getGonkNativeWindowThread();
> 
> kanru: shouldn't this be
> already_AddRefed<nsIThread> getGonkNativeWindowThread(); ?

We are not giving up the ownership so it's nsCOMPtr<nsIThread>
Comment 31 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 08:30:42 PST
(In reply to Mike Habicher [:mikeh] from comment #29)
> Comment on attachment 700434 [details] [diff] [review]
> Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's
> thread
> 
> Review of attachment 700434 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> > +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()
> 
> already_AddRefed<...> ... ?

Actually, never mind--we're both wrong.  You want the returned object to get a new reference when it's stuffed into an nsCOMPtr<>, so in that case, a raw nsIThread* is correct.

See: https://developer.mozilla.org/en-US/docs/Using_nsCOMPtr/Getting_Started_Guide#Using_nsCOMPtr

(Specfically, look under "nsCOMPtr<T> f() don't return an nsCOMPtr".)
Comment 32 Sotaro Ikeda [:sotaro] 2013-01-10 08:49:18 PST
(In reply to Kan-Ru Chen [:kanru] from comment #28)
> Comment on attachment 700434 [details] [diff] [review]
> Part 2 rev 2 - move GonkNativeWindow::returnBuffer() out of ImageBridge's
> thread
> 
> Review of attachment 700434 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +608,5 @@
> > +}
> > +
> > +nsIThread* GonkNativeWindow::getGonkNativeWindowThread()
> > +{
> > +    return mThread;
> 
> I think we could create the thread here lazily.

I am going to keep startGonkNativeWindowThread(), because Sometimes, creating a thread needs a longer time. It is not good for realtime performance.
Comment 33 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 08:53:51 PST
There's actually no need for getGonkNativeWindowThread() -- you're already in the class, just use mThread directly!
Comment 34 Sotaro Ikeda [:sotaro] 2013-01-10 09:15:17 PST
(In reply to Mike Habicher [:mikeh] from comment #33)
> There's actually no need for getGonkNativeWindowThread() -- you're already
> in the class, just use mThread directly!

As I talked to you, getGonkNativeWindowThread() is necessary to pass a thread pointer from GonkNativeWindow to CameraGraphicBuffer.
Comment 35 Sotaro Ikeda [:sotaro] 2013-01-10 09:20:36 PST
Created attachment 700489 [details] [diff] [review]
Part 2 rev 3 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

reflect comment
 - assert when failed to create a thread
 - remove redundant "protected:" in GonkNativeWindow.h
Comment 36 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 09:31:51 PST
Comment on attachment 700489 [details] [diff] [review]
Part 2 rev 3 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread

Review of attachment 700489 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!  r+ with that one nit fixed.

::: ./src_org/dom/camera/GonkNativeWindow.h
@@ +279,5 @@
> +    {
> +        mCameraGraphicBuffer->UnlockNow();
> +        return NS_OK;
> +    }
> +protected:

Nit: please insert a blank line before the 'protected' clause.
Comment 37 Sotaro Ikeda [:sotaro] 2013-01-10 09:56:00 PST
Created attachment 700514 [details] [diff] [review]
Part 1 rev 2 - change ImageContainerChild::SendFlush() to sync
Comment 38 Sotaro Ikeda [:sotaro] 2013-01-10 09:57:04 PST
Created attachment 700516 [details] [diff] [review]
Part 2 rev 4 - move GonkNativeWindow::returnBuffer() out of ImageBridge's thread
Comment 39 Sotaro Ikeda [:sotaro] 2013-01-10 11:23:19 PST
mikeh, I start to re-think about Part2. attachment 700516 [details] [diff] [review] [diff] [review] affect very badly to H.264 video rendering. OMXCodec becomes to abort more often than before. I am going to think about to separate thread for ImageBridgeChild and ImageContainerChild.
 
When OMXCodec::freeBuffersOnPort() is called, all gralloc buffers needed to be within OMXCodec or GonkNativeWindow. attachment 700516 [details] [diff] [review] [diff] [review] is going to keep gralloc buffer longer time and is going to fail a buffer status check in OMXCodec::freeBuffersOnPort().
Comment 40 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-10 16:05:55 PST
sotaro, could the OMXCodec instabilities be due to the sync flush change?
Comment 41 Sotaro Ikeda [:sotaro] 2013-01-10 20:46:04 PST
(In reply to Mike Habicher [:mikeh] from comment #40)
> sotaro, could the OMXCodec instabilities be due to the sync flush change?

Yes, I confirmed that.
Comment 42 Sotaro Ikeda [:sotaro] 2013-01-11 00:31:52 PST
(In reply to Mike Habicher [:mikeh] from comment #40)
> sotaro, could the OMXCodec instabilities be due to the sync flush change?

I checked about the problem. It hits another defect just because of function calls timing change.

- In ImageContainer::SetCurrentImage(), it the argument(aImage) is null pointer, ImageContainerChild::DispatchSetIdle() is called.
- ImageContainerChild::DispatchSetIdle() dispatched ImageContainerChild::SetIdleNow() to ImageBridge thread. It is asynchronous.
- In ImageContainerChild::SetIdleNow(), ImageContainerChild::SendFlush().
  SendFlush() is asynchronous compared to SetCurrentImage(). 

When the problem happens, SendFlush() is called after OMXCodec::stop().
After the SendFlush(), a buffer is returned to OMXCodec. But OMXCodec is already in shutdown state.
Comment 43 Sotaro Ikeda [:sotaro] 2013-01-11 00:33:37 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #42)
> (In reply to Mike Habicher [:mikeh] from comment #40)
> > sotaro, could the OMXCodec instabilities be due to the sync flush change?
> 
> I checked about the problem. It hits another defect just because of function
> calls timing change.

To solve the problem, it is necessary to change ImageContainerChild::DispatchSetIdle() to synchronous,
Comment 44 Sotaro Ikeda [:sotaro] 2013-01-11 01:18:03 PST
(In reply to Mike Habicher [:mikeh] from comment #40)
> sotaro, could the OMXCodec instabilities be due to the sync flush change?

It seems that this problems is Bug 823324.
Comment 45 Sotaro Ikeda [:sotaro] 2013-01-11 02:30:47 PST
Created attachment 700950 [details] [diff] [review]
Part2 - change DispatchSetIdle() to synchronous SetIdle()

fix OMXCodec instabilities be due to the sync flush change
Comment 46 Sotaro Ikeda [:sotaro] 2013-01-11 03:17:01 PST
After apply in attachment 700950 [details] [diff] [review]. I could not re-produce the abort and deadlock.
There is still a risk of deadlock but it becomes rarely happens, I assume because shutdown happens sequentially as expected.
Comment 47 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-11 03:30:44 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #46)
>
> There is still a risk of deadlock but it becomes rarely happens, I assume
> because shutdown happens sequentially as expected.

What's involved in eliminating this risk?
Comment 48 Sotaro Ikeda [:sotaro] 2013-01-11 03:45:03 PST
Before applying the patch, ImageContainerChild::SetIdleNow() and GonkNativeWindow::abandon() could happen at the same time. Because SetIdleNow() get called asynchronously. And deadlock happened when both functions called at the same time.

After applying the patch, SetIdleNow() called synchronously. Then both functions do not called at the same time.
Comment 49 Sotaro Ikeda [:sotaro] 2013-01-11 04:09:20 PST
Created attachment 700993 [details] [diff] [review]
Part2 rev2 - change DispatchSetIdle() to synchronous SetIdle()

landable patch
carry kchen: review+ from previous patch
Comment 50 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-11 06:06:59 PST
Comment on attachment 700993 [details] [diff] [review]
Part2 rev2 - change DispatchSetIdle() to synchronous SetIdle()

Review of attachment 700993 [details] [diff] [review]:
-----------------------------------------------------------------

Note that this patch doesn't apply cleanly on m-c.

::: gfx/layers/ipc/ImageContainerChild.h
@@ +104,5 @@
>     */
>    void DispatchDestroy();
>  
>    /**
> +   * flush and deallocate the shared images in the pool.

Fix the comment.
Comment 51 Kan-Ru Chen [:kanru] (UTC+8) 2013-01-11 06:08:00 PST
http://tbpl.mozilla.org/?tree=Try&rev=56d80d160f26
Comment 52 Sotaro Ikeda [:sotaro] 2013-01-13 02:51:48 PST
kanru, are there something I need to do more?
Comment 53 Sotaro Ikeda [:sotaro] 2013-01-14 08:41:35 PST
All green in try.
https://tbpl.mozilla.org/?tree=Try&rev=56d80d160f26
Comment 55 Sotaro Ikeda [:sotaro] 2013-01-15 06:58:45 PST
nominate to "blocking-b2g: tef"
Camera crashes lots without this fix.
Comment 56 Mike Habicher [:mikeh] (high bugzilla latency) 2013-01-15 07:00:03 PST
Will this fix the issue in bug 827833 as well?
Comment 57 Sotaro Ikeda [:sotaro] 2013-01-15 07:08:56 PST
(In reply to Mike Habicher [:mikeh] from comment #56)
> Will this fix the issue in bug 827833 as well?

Yes, I think.
Comment 59 Andrew Overholt [:overholt] (back Aug 31) 2013-01-15 15:13:22 PST
Can we uplift this to b2g18?
Comment 60 Sotaro Ikeda [:sotaro] 2013-01-15 17:53:59 PST
Yes, it is necessary to uplift patches to b2g18.
But I do not have permission to do it. Does someone could uplift them?
Comment 61 Sotaro Ikeda [:sotaro] 2013-01-15 17:55:31 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #60)
> Yes, it is necessary to uplift patches to b2g18.
> But I do not have permission to do it. Does someone could uplift them?

I've heard that if the bug has "blocking-b2g:tef+", it is automatically uplifted.
Comment 63 Ryan VanderMeulen [:RyanVM] 2013-01-15 18:10:36 PST
(In reply to Sotaro Ikeda [:sotaro] from comment #61)
> I've heard that if the bug has "blocking-b2g:tef+", it is automatically
> uplifted.

Yes, the checkin fairies take care of it ;)
Comment 64 Sotaro Ikeda [:sotaro] 2013-01-21 07:46:05 PST
*** Bug 828289 has been marked as a duplicate of this bug. ***
Comment 65 Lukas Blakk [:lsblakk] use ?needinfo 2013-01-30 15:08:34 PST
Landed on mozilla-b2g18/gaia master prior to the 1/25 branching to mozilla-b2g18_v1_0_0/v1.0.0, updating status-b2g-v1.0.0 to fixed.
Comment 66 Tracy Walker [:tracy] 2013-03-27 12:50:06 PDT
verified with 1.1 build of 20130322070202

Note You need to log in before you can comment on or make changes to this bug.