Closed Bug 1273314 Opened 3 years ago Closed 3 years ago

Intermittent test_peerConnection_captureStream_canvas_2d.html | Test timed out.

Categories

(Core :: WebRTC, defect, P3)

49 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox49 --- wontfix
firefox-esr45 --- unaffected
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: KWierso, Assigned: pehrsons)

References

Details

(Keywords: intermittent-failure)

Attachments

(4 files, 1 obsolete file)

Rank: 35
Priority: -- → P3
We're bailing at [1], [2], [3] or [4] if we're using skia, or it may always be [5].


Currently we use skia on linux but this bug also happened when we were still on cairo. It also happens on Windows. This points to [5], but for the sake of it I went through 1-4 as well:

It shouldn't be [1] (EME) since this frame is from a captured canvas.

It shouldn't be [2] (video principal) since we haven't drawn any cross-origin resources to the canvas.

It shouldn't be [3] (GetImageContainer()) since it only fails if we're in shutdown or we're asking an audio element.

It could be [4] (lockImage.GetImage()). It's gfx so that would take an extra analysis.



Looking closer at [5] we know it happens only when there's no source surface (mDrawInfo is not used in the video path) and when res.mIsStillLoading and res.mHasSize are both false. These are all the default values that get returned for a number of reasons (similar to the above):

* EME (no)
* Video Principal (shouldn't be, see above)
* No ImageContainer (shouldn't be, see above)
* mLayersImage = lockImage.GetImage() (could be, gfx, [6])
* mLayersImage.GetAsSourceSurface() (could be, gfx, [7])


[1] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/dom/canvas/CanvasRenderingContext2D.cpp#4790
[2] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/dom/canvas/CanvasRenderingContext2D.cpp#4805
[3] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/dom/canvas/CanvasRenderingContext2D.cpp#4811
[4] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/dom/canvas/CanvasRenderingContext2D.cpp#4818
[5] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/dom/canvas/CanvasRenderingContext2D.cpp#4890
[6] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/layout/base/nsLayoutUtils.cpp#7531
[7] http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/layout/base/nsLayoutUtils.cpp#7538
lockImage.GetImage() just gets the first image in the ImageContainer, so that could still be us having cleared it for some odd reason. I'll need some logging to figure this out. Local repro didn't lead anywhere.
Assignee: nobody → pehrson
We have a SourceSurfaceImage containing a DataSourceSurface (what the canvas produces). Doing GetAsSourceSurface() on that is really trivial, so it's unlikely to be that as well.

I think this leaves us with an issue in either

1) The video principal
2) Something setting a null surface in the image we're trying to draw
And since we have a visible image in the screenshot there would have to be an image with a null surface *first* and then a real one after, I imagine. Hopefully we assert that this is not null somewhere though.
Note: big jump in frequency around 9/23

Also note that GetAsSourceSurface() has thread-safety aspects to it (which may depend on HW/OS/drivers); to fix an android problem we had to dispatch a runnable to do that.
(In reply to Randell Jesup [:jesup] from comment #18)
> Note: big jump in frequency around 9/23
> 
> Also note that GetAsSourceSurface() has thread-safety aspects to it (which
> may depend on HW/OS/drivers); to fix an android problem we had to dispatch a
> runnable to do that.

1) This is main thread
2) This is only for canvas capture frames (SourceSurfaceImage backed by DataSourceSurface, which is extremely simple).
See Also: → 1301286, 1304535
I seem to have been completely wrong. I have logs in gecko showing the drawImage call succeeding.. Hmmm.
Status: NEW → ASSIGNED
See Also: → 1279817
I got it reproduced with more logging (which I think we should land): https://public-artifacts.taskcluster.net/No4lZWQoQ_6Gao8k73tPAA/0/public/logs/live_backing.log

It's a principal thing. And once you draw a cross-origin resource to a canvas it gets tainted so reading the data out from that canvas will fail forever.

I'll go through the code to see if there could be a hole here. We might also need a strategy for finding and recovering from this path in the CaptureStreamTestHelper.
Oh right, this is for the remote side. Then we're probably hitting a known bug of having connected ICE but not yet DTLS (DTLS connecting will update the principal). :(
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #23)
> Oh right, this is for the remote side. Then we're probably hitting a known
> bug of having connected ICE but not yet DTLS (DTLS connecting will update
> the principal). :(

I doubt that, because sender and receiver both claim that one RTP got send and received. That should not happen if DTLS has not finished yet.
But two things are strange to me in the log:
1) for some unknown reason the counters appear to be stuck at that exactly one packet only. It's hard to believe that all of the RTP stats check are done so quickly (even though according to the log that all happens within the same second) that we never send a second packet.
2) something which I have observed in other mochitest failure logs as well: PC_LOCAL_VERIFY_SDP_AFTER_END_OF_TRICKLE inspects the final SDP after gathering has finished. But PC_REMOTE_VERIFY_SDP_AFTER_END_OF_TRICKLE is empty, which means the promise for the final SDP on the remote side has not been fulfilled. But I think that is probably just a bug somewhere in the test code.
2) Here we go: bug 1307300

1) Andreas how about we add for debugging some getStats() calls while we are waiting for the color to appear? I'm curious if ever anything else then the first RTP packet gets send.
Flags: needinfo?(pehrson)
Sure, I'll put something on try.
Flags: needinfo?(pehrson)
In the meantime, here's a run with a complete js log: https://archive.mozilla.org/pub/firefox/try-builds/pehrsons@gmail.com-cb5ca894b881a66d3669ace757bab5745f919c93/try-linux64/try_ubuntu64_vm_test-mochitest-media-bm130-tests1-linux64-build657.txt.gz

On treeherder for my debug patches: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cb5ca894b881&selectedJob=28464647

This shows WriteOnly is true. What I don't understand is why my logging of SetWriteOnly (the only way to make it true) doesn't show up.
There we go: SetWriteOnly is set but NS_ERROR doesn't bite because of the opt build.

I also log how video principals are set in the media element that we draw and they look fine. For the log I looked at they are set during step 38 and we fail in step 56.
See Also: → 1279095
Here is a screen shot of pcap taken from a successful local run. I find it really surprising that we see exactly two RTP packets. Which I guess could make sense if the first is just one color and the second is just the update for the color change.
The RTCP in the mochitest log shows the same receivedPackets: 1 as in the failure logs. So the receivedPackets: 1 is probably not an actual failure indicator then.
You'd need to wait until an RTCP is sent after the second packet is received, and the connection might close before that occurs (avg 2.5 seconds, up to 5+)
Ok, this is interesting.

In a working case I see (log at [1]):
> 03:20:27     INFO -  PCImpl::SetDtlsConnected mPrivacyRequested=0, aPrivacyRequested=0, mDtlsConnected=0
> 03:20:27     INFO -  PCImpl::SetDtlsConnected mPrivacyRequested=0, aPrivacyRequested=0, mDtlsConnected=0
> 03:20:27     INFO -  Media element 0x7ff8f16ca000 mSrcStreamVideoPrincipal combined to 0x7ff93c0ca200.
> 03:20:27     INFO -  null=0, codebase=0, system=1
then a ton of these (in PC_LOCAL_WAIT_FOR_MEDIA_FLOW), but at last it succeeds
> 03:20:27     INFO -  2351 INFO Track {17b0c8aa-7d03-4bea-a0f4-a9ed29844a94} has 0 outboundrtp RTP packets.
then
> 03:20:29     INFO -  2790 INFO Run step 43: PC_REMOTE_WAIT_FOR_MEDIA_FLOW
> 03:20:29     INFO -  2791 INFO Checking data flow to element: pcRemote_remote_{e66ce872-ebdf-4433-a53e-e89fce970d72}
> 03:20:29     INFO -  2792 INFO Checking RTP packet flow for track {17b0c8aa-7d03-4bea-a0f4-a9ed29844a94}
> 03:20:29     INFO -  2793 INFO Element pcRemote_remote_{e66ce872-ebdf-4433-a53e-e89fce970d72} saw 'timeupdate', currentTime=0.3831292517006803s, readyState=0
> 03:20:29     INFO -  Media element 0x7ff8f16ca000 mSrcStreamVideoPrincipal combined to 0x7ff93c0ca200.
> 03:20:29     INFO -  null=0, codebase=0, system=1
> 03:20:29     INFO -  Media element 0x7ff8f16ca000 mSrcStreamVideoPrincipal updated to 0x7ff91060b700.
> 03:20:29     INFO -  null=0, codebase=1, system=0
> 03:20:29     INFO -  2794 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_captureStream_canvas_2d.html | Should have RTP stats for track {17b0c8aa-7d03-4bea-a0f4-a9ed29844a94}
> 03:20:29     INFO -  2795 INFO Track {17b0c8aa-7d03-4bea-a0f4-a9ed29844a94} has 1 outboundrtp RTP packets.
> 03:20:29     INFO -  2796 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_captureStream_canvas_2d.html | RTP flowing for track {17b0c8aa-7d03-4bea-a0f4-a9ed29844a94}
> 03:20:29     INFO -  Media element 0x7ff8f16ca000 mSrcStreamVideoPrincipal updated to 0x7ff91060b700.
> 03:20:29     INFO -  null=0, codebase=1, system=0
That's a good principal. system should be false.
> 03:20:29     INFO -  Canvas Element 0x7ff90e050040 drawing video element 0x7ff8f16ca000.
> 03:20:29     INFO -  ##### DrawImage SUCCESS
> 03:20:29     INFO -  2901 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_captureStream_canvas_2d.html | pcRemote_remote_{e66ce872-ebdf-4433-a53e-e89fce970d72} pcRemote's remote should become green


In the failing case instead I see (log at [2]):
> 05:31:43     INFO -  PCImpl::SetDtlsConnected mPrivacyRequested=0, aPrivacyRequested=0, mDtlsConnected=0
> 05:31:43     INFO -  PCImpl::SetDtlsConnected mPrivacyRequested=0, aPrivacyRequested=0, mDtlsConnected=0
> 05:31:43     INFO -  Media element 0x7f1286c94800 mSrcStreamVideoPrincipal combined to 0x7f12d1cca200.
> 05:31:43     INFO -  null=0, codebase=0, system=1
So far the same... but now, again in PC_LOCAL_WAIT_FOR_MEDIA_FLOW
> 05:31:43     INFO -  2343 INFO Track {4e17e48b-9dfd-425b-8d7a-f84cebe0c2ca} has 0 outboundrtp RTP packets.
...
> 05:31:43     INFO -  2347 INFO Track {4e17e48b-9dfd-425b-8d7a-f84cebe0c2ca} has 1 outboundrtp RTP packets.
That worked on the second check. Big timing difference. And below this the principal doesn't get updated in the media element, so we get:
> 05:31:44     INFO -  2468 INFO Run step 56: PC_REMOTE_WAIT_FOR_REMOTE_GREEN
> 05:31:44     INFO -  2469 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_captureStream_canvas_2d.html | Should have remote video element for pcRemote
> 05:31:44     INFO -  2470 INFO Waiting for video pcRemote_remote_{2499e9d7-c62b-4aa7-9f52-1d0854ad72bd} to match [0,255,0,255] - green (pcRemote's remote should become green)
> 05:31:44     INFO -  Canvas Element 0x7f12932dc6b0 drawing video element 0x7f1286c94800.
> 05:31:44     INFO -  ##### DrawImage SUCCESS
> 05:31:44     INFO -  Canvas DrawImageSecurityCheck failed.
> 05:31:44     INFO -  Canvas NodePrincipal: 0x7f12af528700, Image principal: 0x7f12d1cca200
> 05:31:44     INFO -  Canvas NodePrincipal: null=0, codebase=1, system=0
> 05:31:44     INFO -  Image principal: null=0, codebase=0, system=1
> 05:31:44     INFO -  Assertion failure: false (HTMLCanvasElement::SetWriteOnly), at /builds/slave/try-l64-0000000000000000000000/build/src/dom/html/HTMLCanvasElement.cpp:967


[1] https://archive.mozilla.org/pub/firefox/try-builds/pehrsons@gmail.com-cd9470c45c4f3bf2ae65a65cbf82f887e98abbdb/try-linux64/try_ubuntu64_vm_test-mochitest-media-bm120-tests1-linux64-build714.txt.gz
[2] https://archive.mozilla.org/pub/firefox/try-builds/pehrsons@gmail.com-cd9470c45c4f3bf2ae65a65cbf82f887e98abbdb/try-linux64/try_ubuntu64_vm_test-mochitest-media-bm120-tests1-linux64-build717.txt.gz
See Also: → 1020516
If interpret the log and code correctly in the first case we waited 147 seconds for PC_LOCAL_WAIT_FOR_MEDIA_FLOW to succeed (147 occurrences of "{17b0c8aa-7d03-4bea-a0f4-a9ed29844a94} has 0 outboundrtp..." which each trigger a wait(1000))?
That is pretty bad I would say.

So basically the behavior on the PC_REMOTE side depends on how quickly PC_LOCAL proceeded, right?
(In reply to Nils Ohlmeier [:drno] from comment #35)
> If interpret the log and code correctly in the first case we waited 147
> seconds for PC_LOCAL_WAIT_FOR_MEDIA_FLOW to succeed (147 occurrences of
> "{17b0c8aa-7d03-4bea-a0f4-a9ed29844a94} has 0 outboundrtp..." which each
> trigger a wait(1000))?
> That is pretty bad I would say.


That can't be. The whole test only took 4 seconds.
03:20:30     INFO -  2928 INFO TEST-OK | dom/media/tests/mochitest/test_peerConnection_captureStream_canvas_2d.html | took 4051ms


> So basically the behavior on the PC_REMOTE side depends on how quickly
> PC_LOCAL proceeded, right?

Not entirely. I see in the case where we fail that the principal never gets properly updated (there's only one unique frame so no future frame to carry the right principal). No timing on the local side could fix that.
Two new observations I've made:

1) When DTLS connects we get notified on the STS thread. Then bounce it to main thread to get the document principal. Then bounce it to the MediaStreamGraph to update the PipelineListeners on the right thread. Meanwhile, the first frame comes under locks from webrtc.org. No thread switching needed. This has clear race potential, but is not something new. We should still propagate that principal change when it comes in.

2) VideoFrameContainer (which is what's holding our real principal back in this case according to logs) guards off passing new frames to the compositor (and running through all the logic around them, like how we notify HTMLMediaElement of principal changes) when any new VideoFrame is the same as the last. Note, not when the Image is the same but the whole VideoFrame. This includes things like flags for intrinsic size and whether we should force it black. The VideoFrame also holds the PrincipalHandle, but we don't check it in operator==. This could be the whole culprit. On try now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3082d8e35e8833af94c658db07e216ab21ed1932
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #38)
> This could be the whole culprit.

Indeed!
Ah, here's even proof that it works. The log at [1] shows events coming in an unexpected order (first frame comes from webrtc.org before we update the principal on MSG thread).

With a bit of filtering it looks like this (the interesting bits marked ###):
> 04:45:00     INFO -  PCImpl::SetDtlsConnected mPrivacyRequested=0, aPrivacyRequested=0, mDtlsConnected=0
> 04:45:00     INFO -  Media element 0x7f12d84ed000 mSrcStreamVideoPrincipal combined to 0x7f131b1ca200.
> 04:45:00     INFO -  null=0, codebase=0, system=1
> 04:45:00     INFO -  Media element 0x7f12d84ed000 UpdateSrcStreamVideoPrincipal considering principal (nil).
> 04:45:00     INFO -  Media element 0x7f12d84ed000 comparing to video track with principal 0x7f131b1ca200
> 04:45:00     INFO -  MediaPipelineReceive 0x7f12f88f1290 got principal 0x7f12f716dde0 on main thread
> 04:45:00     INFO -  MediaPipelineReceiveVideo 0x7f12f88f1290 RenderVideoFrame           <- ###
> 04:45:01     INFO -  MediaPipelineReceive 0x7f12f88f1290 updated principal on msg thread <- ### The below is thanks to the patch.
> 04:45:01     INFO -  Media element 0x7f12d84ed000 mSrcStreamVideoPrincipal combined to 0x7f131b1ca200.
> 04:45:01     INFO -  null=0, codebase=0, system=1
> 04:45:01     INFO -  Media element 0x7f12d84ed000 UpdateSrcStreamVideoPrincipal considering principal 0x7f12f716dde0.
> 04:45:01     INFO -  null=0, codebase=1, system=0
> 04:45:01     INFO -  Media element 0x7f12d84ed000 comparing to video track with principal 0x7f12f716dde0
> 04:45:01     INFO -  Media element 0x7f12d84ed000 mSrcStreamVideoPrincipal updated to 0x7f12f716dde0.
> 04:45:01     INFO -  null=0, codebase=1, system=0
> 04:45:01     INFO -  Media element 0x7f12d84ed000 UpdateSrcStreamVideoPrincipal considering principal 0x7f12f716dde0.
> 04:45:01     INFO -  null=0, codebase=1, system=0
> 04:45:01     INFO -  Media element 0x7f12d84ed000 comparing to video track with principal 0x7f12f716dde0
> 04:45:01     INFO -  Media element 0x7f12d84ed000 mSrcStreamVideoPrincipal updated to 0x7f12f716dde0.
> 04:45:01     INFO -  null=0, codebase=1, system=0

[1] https://archive.mozilla.org/pub/firefox/try-builds/pehrsons@gmail.com-3082d8e35e8833af94c658db07e216ab21ed1932/try-linux64/try_ubuntu64_vm_test-mochitest-media-bm114-tests1-linux64-build715.txt.gz
Comment on attachment 8798390 [details]
Bug 1273314 - Fix VideoFrameContainer logging.

https://reviewboard.mozilla.org/r/83896/#review82482
Attachment #8798390 - Flags: review?(rjesup) → review+
Comment on attachment 8798391 [details]
Bug 1273314 - Let mPrincipalHandle decide uniqueness of a VideoFrame.

https://reviewboard.mozilla.org/r/83898/#review82484
Attachment #8798391 - Flags: review?(rjesup) → review+
Comment on attachment 8798392 [details]
Bug 1273314 - Fix waitForRtpFlow retry delay.

https://reviewboard.mozilla.org/r/83900/#review82508
Attachment #8798392 - Flags: review?(drno) → review+
This is crazy. Is this patch causing android to no longer have any rtp flow on a bunch of test? https://treeherder.mozilla.org/#/jobs?repo=try&revision=14033d9d78b5

Nils, advice on what to do?

For now, I'm running this on try again without the retry delay fix. If that works I'm pushing that patch over to bug 1212460 and landing anyway.
Flags: needinfo?(drno)
Comment on attachment 8798389 [details]
Bug 1273314 - Log the caught error properly in CaptureStreamHelper.

https://reviewboard.mozilla.org/r/83894/#review82796
Attachment #8798389 - Flags: review?(jib) → review+
It was the retry delay fix that caused it. So when we waste time checking getStats every millisecond the tests all pass. Wat. I'm moving that patch to bug 1212460.
Flags: needinfo?(drno)
Attachment #8798392 - Attachment is obsolete: true
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #51)
> It was the retry delay fix that caused it. So when we waste time checking
> getStats every millisecond the tests all pass. Wat. I'm moving that patch to
> bug 1212460.

Oh, and the proof: https://treeherder.mozilla.org/#/jobs?repo=try&revision=11de71a3933a98f4d3a20180288d718e4efc5281
Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/6c23fee0a201
Log the caught error properly in CaptureStreamHelper. r=jib
https://hg.mozilla.org/integration/autoland/rev/c027260968ae
Fix VideoFrameContainer logging. r=jesup
https://hg.mozilla.org/integration/autoland/rev/2f462ac4c8e4
Let mPrincipalHandle decide uniqueness of a VideoFrame. r=jesup
https://hg.mozilla.org/mozilla-central/rev/6c23fee0a201
https://hg.mozilla.org/mozilla-central/rev/c027260968ae
https://hg.mozilla.org/mozilla-central/rev/2f462ac4c8e4
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment on attachment 8798391 [details]
Bug 1273314 - Let mPrincipalHandle decide uniqueness of a VideoFrame.

Approval Request Comment
[Feature/regressing bug #]: bug 1208371
[User impact if declined]: A video element may not be properly updated when displaying a received WebRTC stream that only contains very few (typically 1) frames. This is not common in the wild, but happens frequently in automation. Hence this uplift request is mostly to avoid intermittents in aurora and beta.
[Describe test coverage new/current, TreeHerder]: Covered in multiple mochitests.
[Risks and why]: Low risk. Tiny change in how we compare VideoFrames.
[String/UUID change made/needed]: None

The request only applies to this patch.
Attachment #8798391 - Flags: approval-mozilla-beta?
Attachment #8798391 - Flags: approval-mozilla-aurora?
Comment on attachment 8798391 [details]
Bug 1273314 - Let mPrincipalHandle decide uniqueness of a VideoFrame.

Fixes an intermittent, Aurora51+, Beta50+
Attachment #8798391 - Flags: approval-mozilla-beta?
Attachment #8798391 - Flags: approval-mozilla-beta+
Attachment #8798391 - Flags: approval-mozilla-aurora?
Attachment #8798391 - Flags: approval-mozilla-aurora+
Duplicate of this bug: 1279095
Duplicate of this bug: 1279817
oops, wrong bug (nico as well)
You need to log in before you can comment on or make changes to this bug.