Closed Bug 1101826 Opened 10 years ago Closed 6 years ago

OMX_CommandFlush does not return all input buffers before command complete.

Categories

(Firefox OS Graveyard :: Vendcom, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sotaro, Unassigned)

References

Details

(Whiteboard: [POVB])

Crash Data

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #1099356 +++

On flame device, sometimes OMX_CommandFlush does not return all input buffers before command complete. This causes an application's crash with the following log.

> CHECK_EQ(countBuffersWeOwn(mPortBuffers[portIndex]),mPortBuffers[portIndex].size()) failed: 3 vs. 4
When OMX_CommandFlush is called, omx_vdec::execute_input_flush() handle input port's buffer flush. But it does not return all buffer's to OMX component's client. It handles only buffers that is in m_etb_q. It does not handle buffers that are already handled by omx_vdec::empty_this_buffer_proxy(), but does not yet handled by omx_vdec::empty_buffer_done().

https://www.codeaurora.org/cgit/quic/la/platform/hardware/qcom/media/tree/mm-video-v4l2/vidc/vdec/src/omx_vdec_msm8974.cpp?h=b2g_kk_3.5#n2460
Get logcat with attachment 8525604 [details] [diff] [review] and enabling OMXCodec's log.
Test environment also applied a patch of Bug 1101742 to suppress unrelated crash.
Attachment #8525605 - Attachment is patch: false
(In reply to Sotaro Ikeda [:sotaro] from comment #3)
> Created attachment 8525605 [details]

From the log, it is clear that omx_vdec::empty_buffer_done() is called after input port's OMX_CommandFlush completion.
This bug is POVB problem.
Whiteboard: [POVB]
[Blocking Requested - why for this release]: This bug affect to all kitkat's  msm8610 hardware. It also include b2g v2.1. When it happens, application process will crash.
blocking-b2g: --- → 2.0?
caf's OMX IL's OMX_CommandFlush seems to break not only OMXCodec implementation, but also OpenMAX IL's specification. Therefore, caf's implementation need to be addressed, it it is correct.
------------------------------------------------------------
3.2.2.4 OMX_CommandFlush
This IL client calls this command to flush one or more component ports. nParam specifies the index of the port to flush. If the value of nParam is OMX_ALL, the component shall flush all ports.
When the IL client flushes a non-tunnelling port, that port shall return all buffers it is holding to the IL client using EmptyBufferDone and FillBufferDone (appropriate for an input port or an output port, respectively) to return the buffers. When tunnelling, the flushed input port uses EmptyThisBuffer or FillThisBuffer (appropriate for an input port or an output port, respectively) to return the buffers to the output port.
For each port that the component successfully flushes, the component shall send an OMX_EventCmdComplete event, indicating OMX_CommandFlush for nData1 and the individual port index for nData2, even if the flush resulted from using a value of OMX_ALL for nParam. If a flush fails, the component shall notify the IL client of the error via an OMX_EventError event.
Diego, how do you think about this bug? Can caf fix this bug?
Flags: needinfo?(dwilson)
blocking-b2g: 2.0? → 2.0+
How do I reproduce this issue on device?
Flags: needinfo?(dwilson) → needinfo?(sotaro.ikeda.g)
(In reply to Diego Wilson [:diego] from comment #9)
> How do I reproduce this issue on device?

I've reproduced with the following:

1. Open camera app. Record a large number of short videos (5-30 seconds). Depending on how lucky you get, this can be anywhere from 10-30 videos. Close camera app.
2. Open video app. While it is generating the thumbnails, it will eventually encounter this issue.
3. If video app did not crash, reflash or clear cache/userdata (should keep the videos, but requires regenerating the thumbnails), and go back to step 1 or 2.

Bug 1100459 contains a log which shows the issue (with verbose logging for OMXCodec in libstagefright).
(In reply to Diego Wilson [:diego] from comment #9)
> How do I reproduce this issue on device?

I reproduced the crash by Bug 1099356 Comment 12. In my case, without doing pause. 

n.b. with current gecko, a crash because of Bug 1101742 could also happen. It is checking-in to m-c.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
> (In reply to Diego Wilson [:diego] from comment #9)
> > How do I reproduce this issue on device?
> 
> I reproduced the crash by Bug 1099356 Comment 12. In my case, without doing
> pause. 
> 
> n.b. with current gecko, a crash because of Bug 1101742 could also happen.
> It is checking-in to m-c.

It could take long time(more than 20 minutes) until the crash happens by this STR when log out is enabled.
diego, is there a progress?
Flags: needinfo?(dwilson)
Component: Video/Audio → Vendcom
Product: Core → Firefox OS
Hi! Diego,

Any update from your side? Thanks

--
Keven
Hmm, no update nor response from codeaurora... Then, I am going to investigate possible fix.
Assignee: nobody → sotaro.ikeda.g
I understand how the problem happens.
- OMXCodec flush input and output port individually. 2 flushes in total.
- But driver/hardware does not support flushing of individual ports.
- For each flush, driver/hardware flush both input and output port.
- V4L2_EVENT_MSM_VIDC_FLUSH_DONE is "flush done" message from driver.
  If it is received by omx hal, it tried OMX_CommandFlush command complete both input and output port.
- If input port flush and output port flush is sent very quickly.
   Both OMX_CommandFlush commands completes are sent under one V4L2_EVENT_MSM_VIDC_FLUSH_DONE message.
   Next V4L2_EVENT_MSM_VIDC_FLUSH_DONE message is just redundant. It do nothing.
- Before the above redundant V4L2_EVENT_MSM_VIDC_FLUSH_DONE message is received from driver,
  if consecutive flush is sent from OMXCodec, the redundant message is incorrectly recognized
  as the consecutive flush's complete. It causes early OMX_CommandFlush commands completes.
  The OMX_CommandFlush commands completes is sent before all input buffers return.
  Then OMXCodec's buffers state becomes inconsistent.
As in Comment 20, "previous flush's complete message" causes next flush command complete too early. It causes input buffers or output buffers not returned to OMXCodec when flush is complete at OMXCodec. OMXCodec detect buffer's status is inconsistent. Then OMXCodec abort because of the error status.
OMXCodec::read()
->OMXCodec::flushPortAsync() portIndex 0 // input port flush
 ->omx_vdec::execute_omx_flush() // flush both input and output ports
  ->ioctl(drv_ctx.video_driver_fd, VIDIOC_DECODER_CMD, &dec)
-OMXCodec::flushPortAsync()_E portIndex 1 // output port flush
 ->omx_vdec::execute_omx_flush() // flush both input and output ports
  ->ioctl(drv_ctx.video_driver_fd, VIDIOC_DECODER_CMD, &dec)
// on async_message_thread
async_message_thread()
->// handle V4L2_EVENT_MSM_VIDC_FLUSH_DONE message
 ->// Create and handle VDEC_MSG_RESP_FLUSH_INPUT_DONE message
  ->// post OMX_COMPONENT_GENERATE_EVENT_INPUT_FLUSH to message thread
 ->// Create and handle VDEC_MSG_RESP_FLUSH_OUTPUT_DONE message
  ->// post OMX_COMPONENT_GENERATE_EVENT_OUTPUT_FLUSH to message thread
// on message_thread
->omx_vdec::process_event_cb()
 ->case OMX_COMPONENT_GENERATE_EVENT_INPUT_FLUSH:
  ->pThis->execute_input_flush();
  ->pThis->m_cb.EventHandler() // input port flush complete
// on message_thread
->omx_vdec::process_event_cb()
 ->case OMX_COMPONENT_GENERATE_EVENT_OUTPUT_FLUSH:
  ->pThis->execute_output_flush()// out port flush complete
->OMXCodec::emptyBuffer() // send new input buffer to omx
           ************** input buffer that will cause state inconsistency
 ->omx_vdec::empty_this_buffer()
  ->omx_vdec::empty_this_buffer_proxy()

OMXCodec::read() // next read.
// on async_message_thread
async_message_thread()
->// handle V4L2_EVENT_MSM_VIDC_FLUSH_DONE message
      ************************ This is previous flush's done message
 ->// Create and handle VDEC_MSG_RESP_FLUSH_INPUT_DONE message
  ->// post OMX_COMPONENT_GENERATE_EVENT_INPUT_FLUSH to message thread
 ->// Create and handle VDEC_MSG_RESP_FLUSH_OUTPUT_DONE message
  ->// post OMX_COMPONENT_GENERATE_EVENT_OUTPUT_FLUSH to message thread
->OMXCodec::flushPortAsync() portIndex 0 // input port flush
 ->omx_vdec::execute_omx_flush() // flush both input and output ports
  ->ioctl(drv_ctx.video_driver_fd, VIDIOC_DECODER_CMD, &dec)
-OMXCodec::flushPortAsync()_E portIndex 1 // output port flush
 ->omx_vdec::execute_omx_flush() // flush both input and output ports
  ->ioctl(drv_ctx.video_driver_fd, VIDIOC_DECODER_CMD, &dec)
// on message_thread
->omx_vdec::process_event_cb()
 ->case OMX_COMPONENT_GENERATE_EVENT_INPUT_FLUSH:
  ->pThis->execute_input_flush();
  ->pThis->m_cb.EventHandler() // input port flush complete
       ***************************** early flush complete
// on message_thread
->omx_vdec::process_event_cb()
 ->case OMX_COMPONENT_GENERATE_EVENT_OUTPUT_FLUSH:
  ->pThis->execute_output_flush()// out port flush complete
       ***************************** early flush complete
//on async_message_thread
->// reveice VDEC_MSG_RESP_INPUT_BUFFER_DONE *************** this input buffer return is too late
->// handle V4L2_EVENT_MSM_VIDC_FLUSH_DONE message
->// handle V4L2_EVENT_MSM_VIDC_FLUSH_DONE message
omx_vdec's implementation does not handle additional port flush during previous port flush.
By applying attachment 8557434 [details] [diff] [review] to master flame, the problem was fixed.
Flags: needinfo?(dwilson)
Diego, can you give a comment to attachment 8557434 [details] [diff] [review]?
Flags: needinfo?(dwilson)
Attachment #8557434 - Flags: feedback?(dwilson)
m1, can we have a response from diego?
Flags: needinfo?(mvines)
This doesn't seem relevant to v2.2 efforts at the moment
Flags: needinfo?(mvines)
Actually in 2.2, the dependent bug 1099356 reproduces with about 50% frequency (in the video app), since the seek operation change implemented in 2.2.  I think we include this bug within the scope of 2.2 if it fixes bug 1099356, but will wait for feedback from Bhavana in that bug.
I searched the crash database here and we have observed this crash occur on v2.0, v2.1 and v2.2.  However only 8x10 and 8x26.  If/when this is observed on active chipsets for v2.2 then this becomes very interesting for us to look at.
triage:
This is edge case and we will revisit 2.2 later per comment 30.
blocking-b2g: 2.0+ → ---
(In reply to Michael Vines [:m1] [:evilmachines] from comment #30)
> I searched the crash database here and we have observed this crash occur on
> v2.0, v2.1 and v2.2.  However only 8x10 and 8x26.  If/when this is observed
> on active chipsets for v2.2 then this becomes very interesting for us to
> look at.

mvines/Inder, this seems to be frequently reproducible in 2.1 due to the new seek mechanism (https://bugzilla.mozilla.org/show_bug.cgi?id=1138671#c5). As you are aware we do not have the chipsets for 2.2 and 8x10 and 8x26 are are reference. Could you get QA on your end as the STR in https://bugzilla.mozilla.org/show_bug.cgi?id=1099356#c37 says we encounter this 50% of the time.
Flags: needinfo?(mvines)
Flags: needinfo?(ikumar)
(In reply to bhavana bajaj [:bajaj] from comment #32)
> (In reply to Michael Vines [:m1] [:evilmachines] from comment #30)
> > I searched the crash database here and we have observed this crash occur on
> > v2.0, v2.1 and v2.2.  However only 8x10 and 8x26.  If/when this is observed
> > on active chipsets for v2.2 then this becomes very interesting for us to
> > look at.
> 
> mvines/Inder, this seems to be frequently reproducible in 2.1 due to the new
sry, meant 2.2 here!
> seek mechanism (https://bugzilla.mozilla.org/show_bug.cgi?id=1138671#c5). As
> you are aware we do not have the chipsets for 2.2 and 8x10 and 8x26 are are
> reference. Could you get QA on your end as the STR in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1099356#c37 says we encounter
> this 50% of the time.
Flags: needinfo?(mvines)
Flags: needinfo?(mvines)
Flags: needinfo?(dwilson)
Attachment #8557434 - Flags: feedback?(dwilson)
There is nothing I can do more.
Assignee: sotaro.ikeda.g → nobody
(In reply to bhavana bajaj [:bajaj] from comment #33)
> > Could you get QA on your end as the STR in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1099356#c37 says we encounter
> > this 50% of the time.

I'm sad to report that I've been unable to reproduce here on v2.2 devices after many attempts.
Flags: needinfo?(mvines)
Flags: needinfo?(ikumar)
Crash Signature: [@ mozalloc_abort(char const*) | __android_log_assert | android::OMXCodec::onCmdComplete(OMX_COMMANDTYPE, unsigned long) ]
Crash Signature: [@ mozalloc_abort(char const*) | __android_log_assert | android::OMXCodec::onCmdComplete(OMX_COMMANDTYPE, unsigned long) ] → [@ mozalloc_abort(char const*) | __android_log_assert | android::OMXCodec::onCmdComplete(OMX_COMMANDTYPE, unsigned long) ] [@ mozalloc_abort | __android_log_assert | android::OMXCodec::onCmdComplete ]
See Also: → 1231257
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: