Closed Bug 1448863 Opened 6 years ago Closed 6 years ago

Stop sync dispatching in mozilla::WebrtcGmpVideoDecoder::Decode

Categories

(Core :: WebRTC: Audio/Video, defect, P2)

60 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox60 --- fixed
firefox61 --- fixed

People

(Reporter: bwc, Assigned: bwc)

References

Details

Attachments

(1 file)

This is causing frequent deadlocks over in bug 1375540. Also, fixing bug 1407653 makes the deadlocks even more frequent, because frames are actually being decoded at the rate that they are supposed to.
Rank: 15
Priority: -- → P2
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

Sorry, GMP is not my turf at all. I asked jesup on IRC, but no response yet. If not him, I'm not sure who can review this.
Attachment #8962371 - Flags: review?(apehrson)
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

https://reviewboard.mozilla.org/r/231228/#review238078

r+ assuming there's no frame-copy required, and that we're certain this doesn't introduce shutdown issues.

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp:806
(Diff revision 3)
> -  // Would be really nice to avoid this sync dispatch, but it would require a
> -  // copy of the frame, since it doesn't appear to actually have a refcount.
> -  // Passing 'this' is safe since this is synchronous.
> -  mozilla::SyncRunnable::DispatchToThread(mGMPThread,
> -                WrapRunnableRet(&ret, this,
> +  if (!aInputImage._length) {
> +    return WEBRTC_VIDEO_CODEC_ERROR;
> +  }
> +
> +  mGMPThread->Dispatch(WrapRunnableNM(&WebrtcGmpVideoDecoder::Decode_g,

How was the previous item blocking making this async resolved?  Doesn't need a comment in the code, but a comment in the bug about why that's no longer a blocker (or what overhead we're adding; which might need a comment) should be done.  If we are adding copying of frames, I'd like to see analysis of the impact of this.

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp:810
(Diff revision 3)
>    MOZ_ASSERT(!NS_IsMainThread());
> -  // Would be really nice to avoid this sync dispatch, but it would require a
> -  // copy of the frame, since it doesn't appear to actually have a refcount.
> -  // Passing 'this' is safe since this is synchronous.
> -  mozilla::SyncRunnable::DispatchToThread(mGMPThread,
> -                WrapRunnableRet(&ret, this,
> +  if (!aInputImage._length) {
> +    return WEBRTC_VIDEO_CODEC_ERROR;
> +  }
> +
> +  mGMPThread->Dispatch(WrapRunnableNM(&WebrtcGmpVideoDecoder::Decode_g,

Question: does this introduce any shutdown-timing issues?

::: media/webrtc/webrtc.mozbuild:25
(Diff revision 3)
> +    if CONFIG['CC_TYPE'] in ('msvc', 'clang-cl'):
> +        DEFINES['__PRETTY_FUNCTION__'] = '__FUNCSIG__'
> +
>      if CONFIG['CC_TYPE'] in ('clang', 'gcc'):

Perhaps this should be nominated for the master build config?
Attachment #8962371 - Flags: review?(rjesup) → review+
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

https://reviewboard.mozilla.org/r/231228/#review238862

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp:806
(Diff revision 3)
> -  // Would be really nice to avoid this sync dispatch, but it would require a
> -  // copy of the frame, since it doesn't appear to actually have a refcount.
> -  // Passing 'this' is safe since this is synchronous.
> -  mozilla::SyncRunnable::DispatchToThread(mGMPThread,
> -                WrapRunnableRet(&ret, this,
> +  if (!aInputImage._length) {
> +    return WEBRTC_VIDEO_CODEC_ERROR;
> +  }
> +
> +  mGMPThread->Dispatch(WrapRunnableNM(&WebrtcGmpVideoDecoder::Decode_g,

Deadlock bugs take precedence over optimizations, in my view, especially when they are causing a lot of pain on treeherder. Optimizing out the buffer copy would best be left to a separate bug in my opinion.
I have profiled this change, and the CPU cycles spent in the new version of Decode while running the h264 test on webrtc-landing are negligible (less than 0.01% of the samples taken across the content process). This isn't the delta, this is the total. Whatever the cost of these copies is, it is not significant.
Flags: needinfo?(rjesup)
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

https://reviewboard.mozilla.org/r/231228/#review238078

> Question: does this introduce any shutdown-timing issues?

I don't see any new issues on try, and I've done lots of retriggers while shaking out other problems.

> Perhaps this should be nominated for the master build config?

Possibly. I have opened bug 1451497 for this.
(In reply to Byron Campen [:bwc] from comment #9)
> Comment on attachment 8962371 [details]
> Bug 1448863: Stop sync dispatching in Decode.
> 
> https://reviewboard.mozilla.org/r/231228/#review238078
> 
> > Question: does this introduce any shutdown-timing issues?
> 
> I don't see any new issues on try, and I've done lots of retriggers while
> shaking out other problems.

Good, though I don't believe that the tests will necessarily trigger such problems; I was looking for analysis of the lifetime of these runnables vs the GMP instance and the calling code, and any checks that they've been resolved and that if there are any races.  I realize they hold a ref, of course.

> > Perhaps this should be nominated for the master build config?
> 
> Possibly. I have opened bug 1451497 for this.

thanks

> I have profiled this change, and the CPU cycles spent in the new version of Decode while running the h264 test on webrtc-landing are negligible (less than 0.01% of the samples taken across the content process). This isn't the delta, this is the total. Whatever the cost of these copies is, it is not significant.

Good, though that uses 640x480, but even 10x that shouldn't be large.  I presume it's actually copying, not copying a surface reference?  My actual concern was on low-end machines.  (IIRC Android uses platform H264, not OpenH264, but worth double-checking.

Thanks
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #10)
> (In reply to Byron Campen [:bwc] from comment #9)
> > Comment on attachment 8962371 [details]
> > Bug 1448863: Stop sync dispatching in Decode.
> > 
> > https://reviewboard.mozilla.org/r/231228/#review238078
> > 
> > > Question: does this introduce any shutdown-timing issues?
> > 
> > I don't see any new issues on try, and I've done lots of retriggers while
> > shaking out other problems.
> 
> Good, though I don't believe that the tests will necessarily trigger such
> problems; I was looking for analysis of the lifetime of these runnables vs
> the GMP instance and the calling code, and any checks that they've been
> resolved and that if there are any races.  I realize they hold a ref, of
> course.

   From what I can tell, the way this handles shutdown is by unsetting mGMP on the GMP thread, and we check for that in Decode_g. 

> > I have profiled this change, and the CPU cycles spent in the new version of Decode while running the h264 test on webrtc-landing are negligible (less than 0.01% of the samples taken across the content process). This isn't the delta, this is the total. Whatever the cost of these copies is, it is not significant.
> 
> Good, though that uses 640x480, but even 10x that shouldn't be large.  I
> presume it's actually copying, not copying a surface reference?  My actual
> concern was on low-end machines.  (IIRC Android uses platform H264, not
> OpenH264, but worth double-checking.
> 
> Thanks

   I'll look into it.
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

https://reviewboard.mozilla.org/r/231228/#review238078

> How was the previous item blocking making this async resolved?  Doesn't need a comment in the code, but a comment in the bug about why that's no longer a blocker (or what overhead we're adding; which might need a comment) should be done.  If we are adding copying of frames, I'd like to see analysis of the impact of this.

So I've profiled the latest patch (where I've confirmed that we are actually copying the frame), and we're still talking only about 0.01% of samples in Decode. Makes sense, since these are (small) encoded frames with a pretty low frequency (in the big picture).
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/549f0b8075d5
Stop sync dispatching in Decode. r=jesup
https://hg.mozilla.org/mozilla-central/rev/549f0b8075d5
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Is this low enough risk that we could consider backporting it to Beta? Would be nice to resolve bug 1375540 there too before we're stuck with it on ESR for the next year :)
Flags: needinfo?(docfaraday)
I think this would be a good thing to uplift.
Flags: needinfo?(docfaraday)
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

Approval Request Comment
[Feature/Bug causing the regression]:

   Bug 1040345.

[User impact if declined]:

   Intermittent deadlocks in webrtc h264 calls.

[Is this code covered by automated tests?]:

   Yes.

[Has the fix been verified in Nightly?]:

   Yes.

[Needs manual test from QE? If yes, steps to reproduce]: 

   No.

[List of other uplifts needed for the feature/fix]:

   None, I think.

[Is the change risky?]:

   Not particularly.

[Why is the change risky/not risky?]:

   We're just avoiding a sync dispatch, and we have all of the refcount stuff sorted.

[String changes made/needed]:

   None.
Attachment #8962371 - Flags: approval-mozilla-beta?
Comment on attachment 8962371 [details]
Bug 1448863: Stop sync dispatching in Decode.

Low-risk fix for a bunch of oranges and deadlocks during h264 webrtc calls. Approved for 60.0b13.
Attachment #8962371 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This patch causes an issue whereby firefox will not be able to recover the video stream in a timely manner when packet loss is encountered (i.e. many minutes can pass without a recovery and the video will remain frozen in the meantime). I confirmed this by performing a repro testcase both before and after backing out this changeset -- PL handling works as expected when it's backed out, but works poorly with the changeset in place.

jesup has the details on how to repro the issue, and has already forwarded those details via email over to bwc and others. jesup suspected that perhaps it has something to do with the loss of return value from Decode_g() -- prior to the patch, it would return int32_t and after the patch is void. Indeed, it's not clear to me how the caller of Decode_g can know if a decode issue has occurred and handle it (e.g. request an IDR)... is that supposed to happen via some other means?
Depends on: 1489757
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: