Overly aggressive locking in MediaEngineRemoteVideoSource::DeliverFrame

VERIFIED FIXED in Firefox 59

Status

()

defect
P1
normal
Rank:
5
VERIFIED FIXED
2 years ago
Last year

People

(Reporter: pehrsons, Assigned: pehrsons)

Tracking

({regression})

58 Branch
mozilla59
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox58 unaffected, firefox59+ verified, firefox60+ verified)

Details

Attachments

(3 attachments)

This was thought at first to be the reason behind the stutter in bug 1434353 due to numbers we saw in an Instruments profile on Mac by jib. It included waiting threads and showed that 2.9s out of 4s spent on the audio callback thread was spent waiting for the monitor in MediaEngineRemoteVideoSource::DeliverFrame.

The monitor is held while we copy images. That could cover allocations and deallocations of large buffers. Some of them could be allocations synchronously over IPC to the parent process (shmems).
This can lead to audio distortions when using a microphone.

Getting fixed in 60 soon by bug 1299515.
Regression by bug 1388219.
Blocks: 1388219
Keywords: regression
Increasing this to rank 5 on the assumption that this is indeed the root cause for bug 1434353.
Rank: 9 → 5
Are the issues here (and elsewhere) bad enough to warrant backing out bug 1388219's patches? Or are these things that are either fixable and upliftable for 59 beta or not bad enough to block the feature?
Flags: needinfo?(apehrson)
No need to back out, the attached patch is low risk enough to uplift to 59.

This was fixed on 60 by bug 1299515.
Flags: needinfo?(apehrson)
See Also: 1434353
Duplicate of this bug: 1434353
Comment on attachment 8947020 [details]
Bug 1434538 - Don't lock while rescaling or copying video frame buffers.

https://reviewboard.mozilla.org/r/216840/#review223934

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:489
(Diff revision 1)
> +    MonitorAutoUnlock open(mMonitor);
> +    {

Did you mean to lock mMonitor inside this scope block? The scope block seems redundant otherwise.
Attachment #8947020 - Flags: review?(jib) → review-
Sorry I meant unlock.
Comment on attachment 8947020 [details]
Bug 1434538 - Don't lock while rescaling or copying video frame buffers.

https://reviewboard.mozilla.org/r/216840/#review223934

> Did you mean to lock mMonitor inside this scope block? The scope block seems redundant otherwise.

Yes! Thanks for catching that.

Note s/lock/unlock/ in your comment.
Comment on attachment 8947020 [details]
Bug 1434538 - Don't lock while rescaling or copying video frame buffers.

https://reviewboard.mozilla.org/r/216840/#review223946

Lgtm.
Attachment #8947020 - Flags: review?(jib) → review+
Comment on attachment 8947020 [details]
Bug 1434538 - Don't lock while rescaling or copying video frame buffers.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1388219
[User impact if declined]: Possibly audio artifacts when using microphone and video (either of camera, screensharing, etc.). Whether you hear artifacts depends a lot on the performance of your machine.
[Is this code covered by automated tests?]: The code that's critical to cross-thread access is covered by all our automated tests that involves video on linux. This code is *not* platform specific.
[Has the fix been verified in Nightly?]: No, Nightly already had this issue fixed by bug 1299515.
[Needs manual test from QE? If yes, steps to reproduce]: Manual verification would be good, but reproducing is tricky. :achronop has it figured out in bug 1431056 comment 22.
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: I'd say fairly low risk
[Why is the change risky/not risky?]: The patch touches a fair bit of lines and we haven't been able to land the same patch on Nightly, but it's very mechanical in nature. We change a big chunk of code to operate without a lock since it's slow, and all member usage during and after the unlock has been changed to operate on copies instead. These changes are pretty straight-forward.
[String changes made/needed]: None
Attachment #8947020 - Flags: approval-mozilla-beta?
Comment on attachment 8947020 [details]
Bug 1434538 - Don't lock while rescaling or copying video frame buffers.

Approved for 59b8. Let's try to get QE verification too.
Attachment #8947020 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Backed out for permafailing mda's dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStreams.html, at least in debug-ish builds on Linux:

https://hg.mozilla.org/releases/mozilla-beta/rev/73ef186ad51ac2c4fc27b1f149fcc94355fe7dc6

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=73ef186ad51ac2c4fc27b1f149fcc94355fe7dc6&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=161141707&repo=mozilla-beta

[task 2018-02-08T19:21:11.044Z] 19:21:11     INFO - TEST-FAIL | dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStreams.html | The author of the test has indicated that flaky timeouts are expected.  Reason: WebRTC inherently depends on timeouts 
[task 2018-02-08T19:21:11.046Z] 19:21:11     INFO - Checking for stats in {"outbound_rtcp_video_0":{"id":"outbound_rtcp_video_0","timestamp":5178814,"type":"inboundrtp","isRemote":true,"mediaType":"video","remoteId":"outbound_rtp_video_0","ssrc":"2364249609","bytesReceived":0,"jitter":0,"packetsLost":0,"packetsReceived":0,"roundTripTime":3},"outbound_rtp_video_0":{"id":"outbound_rtp_video_0","timestamp":1518117669782.12,"type":"outboundrtp","bitrateMean":0,"bitrateStdDev":0,"firCount":0,"framerateMean":0,"framerateStdDev":0,"isRemote":false,"mediaType":"video","nackCount":0,"pliCount":0,"remoteId":"outbound_rtcp_video_0","ssrc":"2364249609","bytesSent":0,"droppedFrames":0,"framesEncoded":0,"packetsSent":0},"GvhD":{"id":"GvhD","timestamp":1518117669782.12,"type":"candidatepair","bytesReceived":521793,"bytesSent":521855,"lastPacketReceivedTimestamp":1518117669771,"lastPacketSentTimestamp":1518117669766,"localCandidateId":"aLzM","nominated":true,"priority":9115005270282338000,"readable":true,"remoteCandidateId":"PzwX","selected":true,"state":"succeeded","transportId":"0-1518117636185424 (id=2147484098 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStr aLevel=0","writable":true},"aLzM":{"id":"aLzM","timestamp":1518117669782.12,"type":"localcandidate","candidateType":"host","componentId":"0-1518117636185424 (id=2147484098 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStr aLevel=0","ipAddress":"172.17.0.4","mozLocalTransport":"udp","portNumber":41753,"transport":"udp"},"aBkm":{"id":"aBkm","timestamp":1518117669782.12,"type":"localcandidate","candidateType":"host","componentId":"0-1518117636185424 (id=2147484098 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStr aLevel=0","ipAddress":"172.17.0.4","mozLocalTransport":"tcp","portNumber":62794,"transport":"tcp"},"PzwX":{"id":"PzwX","timestamp":1518117669782.12,"type":"remotecandidate","candidateType":"host","componentId":"0-1518117636185424 (id=2147484098 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStr aLevel=0","ipAddress":"172.17.0.4","portNumber":38566,"transport":"udp"}} for video track {97a99531-d5ce-4c72-bbdf-2dd182c2c92b}, retry number 60
[task 2018-02-08T19:21:11.047Z] 19:21:11     INFO - Should have RTP stats for track {97a99531-d5ce-4c72-bbdf-2dd182c2c92b}
[task 2018-02-08T19:21:11.049Z] 19:21:11     INFO - RTP stats: {"id":"outbound_rtp_video_0","timestamp":1518117669782.12,"type":"outbound-rtp","bitrateMean":0,"bitrateStdDev":0,"firCount":0,"framerateMean":0,"framerateStdDev":0,"isRemote":false,"mediaType":"video","nackCount":0,"pliCount":0,"remoteId":"outbound_rtcp_video_0","ssrc":"2364249609","bytesSent":0,"droppedFrames":0,"framesEncoded":0,"packetsSent":0}
[task 2018-02-08T19:21:11.050Z] 19:21:11     INFO - Track {97a99531-d5ce-4c72-bbdf-2dd182c2c92b} has 0 outbound-rtp RTP packets.
[task 2018-02-08T19:21:11.051Z] 19:21:11     INFO - TEST-FAIL | dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStreams.html | The author of the test has indicated that flaky timeouts are expected.  Reason: WebRTC inherently depends on timeouts 
[task 2018-02-08T19:21:11.052Z] 19:21:11     INFO - Buffered messages finished
[task 2018-02-08T19:21:11.053Z] 19:21:11     INFO - TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_twoAudioVideoStreams.html | Error in test execution: Error: Timeout checking for stats for track {a2d5eafa-d9f0-48cf-b22e-7b8857f41a3b} after at least30000ms waitForRtpFlow@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1548:11 ...  
[task 2018-02-08T19:21:11.054Z] 19:21:11     INFO -     execute/<@dom/media/tests/mochitest/head.js:831:14
Status: RESOLVED → REOPENED
Flags: needinfo?(apehrson)
Resolution: FIXED → ---
Looks like it was Linux-only.
Linux is the only platform where we have real virtual devices. On other platforms we have internal ones that shortcut the code in question. At least if it was permafailing I can hopefully reproduce.
I can repro very deterministically. That's something.

After starting the 2nd capture (1 MediaEngineRemoteVideoSource, 2 allocations) mHandleIds is [0, 0].

Obviously those two ids should be distinct and the fact that they aren't is causing this problem.


We have some funky assumptions in MediaEngineRemoteVideoSource where Start() is supposed to always be the next call after Allocate() for the same AllocationHandle. Those assumptions are broken here. I'm impressed this worked at all. I guess we just didn't have tests covering all the corners here, because just like handle ids can end up being dupes, target capabilities can too. So a request for 640x480 and another request for 1280x720 can both end up being 640x480, or 1280x720 depending on their order.

In this case we have the following order of events:
- Allocate() allocates a handle with id 0 [1]
  - UpdateSingleSource sets mHandleId to 0 (the cache thing based on false assumptions) [2]
- Allocate() allocates a handle with id 1 [1]
  - NO UpdateSingleSource happens since the constraints for the device didn't change. mHandleId is not changed.
    (not only is the assumption broken here, we're also not setting mHandleId! Even if the assumption held we'd still be in a broken state)
- Start() happens, but this doesn't pass the allocation handle or id. Appends mHandleId (0) to mHandleIds. [3]
- Start() happens, but this doesn't pass the allocation handle or id. Appends mHandleId (0) to mHandleIds. [3]

- The first DeliverFrame() will update mImages[0] _twice_ (because that's the first index where the allocation handle ids matched, see the patch).



[1] https://dxr.mozilla.org/mozilla-beta/rev/0bb3b9214b658c1bf5f0cac2264d34fde665bff0/dom/media/webrtc/MediaEngineRemoteVideoSource.cpp#120
[2] https://dxr.mozilla.org/mozilla-beta/rev/0bb3b9214b658c1bf5f0cac2264d34fde665bff0/dom/media/webrtc/MediaEngineRemoteVideoSource.cpp#300
[3] https://dxr.mozilla.org/mozilla-beta/rev/0bb3b9214b658c1bf5f0cac2264d34fde665bff0/dom/media/webrtc/MediaEngineRemoteVideoSource.cpp#192
Flags: needinfo?(apehrson)
For the second Allocate() there is an UpdateSingleSource() happening, but it doesn't handle mState being kAllocated, which it is because of the first Allocate() and no Start() having happened yet.

Again an assumption about Allocate() and Start() always being coupled together. Weird dependency on external code being a good citizen. I wonder when it was being that good last.
Comment on attachment 8949675 [details]
Bug 1434538 - Handle kAllocated in UpdateSingleSource.

https://reviewboard.mozilla.org/r/219012/#review224786

Lgtm. Nice find! This presumably is broken with microphone as well!

You said you can repro, can we turn that into a test?
Attachment #8949675 - Flags: review?(jib) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=30fcc16c0e6114add3ae4134d353e42d44a5aad6

(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #24)
> Comment on attachment 8949675 [details]
> Bug 1434538 - Handle kAllocated in UpdateSingleSource.
> 
> https://reviewboard.mozilla.org/r/219012/#review224786
> 
> Lgtm. Nice find! This presumably is broken with microphone as well!
> 
> You said you can repro, can we turn that into a test?

So the repro is with the same test case that got us backed out of beta. I can repro with a simpler test but we still have coverage with that test.

For audio we don't allow two captures of the same device in one process right? Then it shouldn't pose a problem as two devices would use different MediaEngineWebRTCMicrophoneSources.
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

https://reviewboard.mozilla.org/r/219014/#review224796

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:149
(Diff revision 1)
>    }
> +
> +  // Pre-allocate per-allocation members.
> +  MonitorAutoLock lock(mMonitor);
> +  MOZ_DIAGNOSTIC_ASSERT(*aOutHandle);
> +  mSources.AppendElement();

This is pushing a RefPtr containing nullptr onto mSources. Just to talk through it: I see you handle this special value in Deallocate(), and in the only other instance it's handled implicitly, which I think deserves comments.

Specifically, Shutdown() calls Stop(mSources[0], ...) which expects a raw pointer. Stop() happens to handle it because 

    mSources.IndexOf(nullptr) == mSources.NoIndex

So this seems fine. If this weren't throwaway code for 59 I'd insist Stop() and methods like it take aSource as a RefPtr<>& in case future modification attempted to dereference it (which I believe RefPtr would yell about sooner), but as it is I think this looks good.

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:152
(Diff revision 1)
> +  MonitorAutoLock lock(mMonitor);
> +  MOZ_DIAGNOSTIC_ASSERT(*aOutHandle);
> +  mSources.AppendElement();
> +  mPrincipalHandles.AppendElement(PRINCIPAL_HANDLE_NONE);
> +  mTargetCapabilities.AppendElement(mTargetCapability);
> +  mHandleIds.AppendElement((**aOutHandle).mId); // Key

Nit: or

    (*aOutHandle)->mId

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:223
(Diff revision 1)
> +  // This is an assumption that this Start() belongs to the first source that
> +  // has not yet been assigned.

If there are two threads, there's no way one could interrupt after Allocate() and complete its own Allocate() and Start()?

I agree this is an existing assumption fwiw, so probably shouldn't hold up landing. Feel free to clear, just marking it as a question I'd like to hear your thoughts on.
Attachment #8949676 - Flags: review?(jib) → review+
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

https://reviewboard.mozilla.org/r/219014/#review224830

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:292
(Diff revision 1)
>      mPrincipalHandles.RemoveElementAt(i);
>      mTargetCapabilities.RemoveElementAt(i);
>      mHandleIds.RemoveElementAt(i);
>      mImages.RemoveElementAt(i);
>  
>      aSource->EndTrack(aID);

This line is not ok now I believe. From Shutdown() it looks like we can get here with

    nullptr->EndTrack(aID);
Attachment #8949676 - Flags: review+ → review-
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

https://reviewboard.mozilla.org/r/219014/#review224796

> If there are two threads, there's no way one could interrupt after Allocate() and complete its own Allocate() and Start()?
> 
> I agree this is an existing assumption fwiw, so probably shouldn't hold up landing. Feel free to clear, just marking it as a question I'd like to hear your thoughts on.

For completion I'll answer on the threads thing.

Allocate() happens on the media manager thread.
From there there's a dispatch to main thread (all runnables from media manager to main are in order).
From main thread there's a dispatch to the media manager thread (all runnables from main to media manager are in order).
Start() happens on the media manager thread.

And the only Allocate() calls that don't lead to Start() are during shutdown.

So while two requests can compete with each other, the order of the requests to call Allocate() is always the same as the order of the requests to call Start(). Makes sense?
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

https://reviewboard.mozilla.org/r/219014/#review224830

> This line is not ok now I believe. From Shutdown() it looks like we can get here with
> 
>     nullptr->EndTrack(aID);

Yep, good catch. It looks like the only place where we use an mSources item too.
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

https://reviewboard.mozilla.org/r/219014/#review224844
Attachment #8949676 - Flags: review?(jib) → review+
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

This approval request applies to the latest two patches.

Approval Request Comment
For other fields see comment 14.
[Is the change risky?]: I'd still say fairly low risk.
[Why is the change risky/not risky?]: In addition to comment 14 these two patches add even more line changes and this time does shift some logic around.

I did however on purpose put as many diagnostic asserts as I could in to catch issues while still in beta.

The change to logic is quite simple but there are many lines touched: We add items to an array *earlier* in this object's lifecycle. The big difference is that we add nullptrs where it otherwise would have been a real object off the bat. We've vetted the use of the array items (found one use) with nullptrs and it should be sufficiently guarded now. There are asserts in place that should catch issues in a beta opt build if there are any.
Attachment #8949676 - Flags: approval-mozilla-beta?
Comment on attachment 8949676 [details]
Bug 1434538 - Allocate per-allocation members in Allocate().

OK, let's try this in beta 10. If there are problems, we still have time to fix them in beta 59, with 2.5 weeks still to go until the first merge and RC week.
Attachment #8949676 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Resolution: --- → FIXED
(In reply to Andreas Pehrson [:pehrsons] from comment #14)
> [Needs manual test from QE? If yes, steps to reproduce]: Manual verification
> would be good, but reproducing is tricky. :achronop has it figured out in
> bug 1431056 comment 22.

Hi Alex,
I've tried to find out the STR this issue but I haven't found them in the ticket mentioned by Andreas. Could you maybe have a look and verify if the issue is fixed or provide the necessary information (environment, steps and result) in order for us to test this? All the information I have found is a bit too technical and I can't make out what should be tested and how.
Flags: needinfo?(achronop)
Hi Cristian, reproducing the issue is tricky since there are other performance issues that affect the scenario. What it looks like STR is the description in comment 13 of bug 1431056 which, indeed, is too technical. However I have done the testing, the patch is good and improves the situation dramatically. You can go on and verify the issue based on my testing. You can also redirect to me any question you might get about it.
Flags: needinfo?(achronop)
Closing the ticket as verified based on the comment from Alex Chronopoulos.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.