Closed Bug 1351531 Opened 7 years ago Closed 7 years ago

Intermittent dom/media/tests/mochitest/test_peerConnection_simulcastAnswer.html | Error in test execution: Error: Timeout for element pcLocal_remote_{5c32567b-e920-4183-9554-7fe01c3b9670} waitForMediaElementFlow@http://mochi.test:8888/tests/dom/media/test

Categories: Core :: WebRTC, defect, P2
Status: RESOLVED FIXED
Target Milestone: mozilla55
Tracking Status: firefox55 --- fixed
People: Reporter: intermittent-bug-filer; Assignee: mjf
Keywords: intermittent-failure
Whiteboard: [stockwell disabled]
Attachments: 2 files

Seems like we recently started having issues with getting video frames on inbound tracks.
Rank: 25
Priority: -- → P2
See Also: → 1351590
Byron, could you take a look? I think this is caused by bug 1344556. The first occurrence of this intermittent (together with bug 1351590) was on autoland roughly 2 hours after bug 1344556 was pushed there.
Blocks: 1344556
Flags: needinfo?(docfaraday)
:mreavy, can you help find someone to look into this? It has been a week with no action on this bug, and bug 1351590 seems to be related as well.
Flags: needinfo?(mreavy)
Whiteboard: [stockwell needswork]
Thanks, Joel.

Nils -- As we discussed on IRC, we need an owner for this.
Flags: needinfo?(mreavy)
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
Michael agreed to take a look at this.
Assignee: nobody → mfroman
Flags: needinfo?(drno)
Michael, do you have an update here? This is a very frequent failure, and combined with the failures in bug 1351590 it has been one of the most frequent issues in our tests over the last week.
Flags: needinfo?(mfroman)
I've been digging into it for a week.  I can reproduce the intermittent locally, but I have not yet figured out the root of the issue.  I'm sorry I don't have a fix yet, but I'm still working on it.  It is my top priority.
Flags: needinfo?(mfroman)
Attached file log.txt.gz
When this intermittent fails, the first SSRC received is an unknown one. You can see this more easily with the command:
egrep 'new ssrc|unknown SSRC' log.txt | less
Flags: needinfo?(rjesup)
I'll look on Tuesday (PTO until then). Note we recently added code to VideoConduit (in FF53 and later) to better adapt to unknown/un-signaled SSRCs. This fixed Spark, for example.
From looking at the log file in attachment 8858356 [details], it looks to me like there is some confusion with the SSRCs being used on the two PeerConnections in the test.
The test creates a second fake offer on PcLocal here: http://searchfox.org/mozilla-central/source/dom/media/tests/mochitest/test_peerConnection_simulcastAnswer.html#60 which generates the SSRCs 3724330823 and 2816312869.
But when it comes to generating the answer, it shows exactly the same SSRCs. I'm not sure why. I guess the offerer theoretically should never use the two SSRCs from the fake offer.

I would try to remove that call which generates the second offer and replace it with some code which simply inserts the needed attributes into the SDP offer from hard coded values instead.
(In reply to Nils Ohlmeier [:drno] from comment #14)
> I would try to remove that call which generates the second offer and replace
> it with some code which simply inserts the needed attributes into the SDP
> offer from hard coded values instead.

Never mind. That second offer call is made on the PcRemote side, so it makes sense that the SSRCs match the ones from the answer later on.
My suspicion is that these lines, starting at line 1050 of the log:

GECKO(15334) | (vie_receiver.cc:321): Packet received on SSRC: 2816312869 with payload type: 120, timestamp: 2460978607, sequence number: 1954, arrival time: 67340983, toffset: 450, abs send time: 3400532
GECKO(15334) | (congestion_controller.cc:104): WrappingBitrateEstimator: Switching to absolute send time RBE.
GECKO(15334) | (remote_bitrate_estimator_abs_send_time.cc:119): RemoteBitrateEstimatorAbsSendTime: Instantiating.
GECKO(15334) | (vie_channel.cc:1302): OnInitializeDecoder 120 VP8
GECKO(15334) | (call.cc:727): DeliverRtp: found unknown SSRC: 3724330823

might be pointing to the root cause. It looks to me like our filter let the initial packets for both SSRCs (2816312869 and 3724330823) through to the decoder. That makes sense, as the answerer in this test case can probably start sending the two RTP streams before the offerer (the receiver in this case) has received the SSRCs via signaling.

I'm starting to wonder if filtering by RID would be a better and safer thing to do here. But that would be a lot of effort just to fix an intermittent.
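To make the suspected race concrete, here is a minimal sketch (plain JavaScript, not Gecko code; signaledSsrcs, onSdpProcessed, and onRtpPacket are invented names for illustration) of a receiver that only learns which SSRCs to expect once the SDP has been processed:

// Minimal sketch, not Gecko code: all names are invented. It models the
// race described above, where RTP packets for the simulcast SSRCs arrive
// before the SDP that signals those SSRCs has been processed.
const signaledSsrcs = new Set(); // populated once the answer is processed

function onSdpProcessed(ssrcsFromSdp) {
  for (const ssrc of ssrcsFromSdp) {
    signaledSsrcs.add(ssrc);
  }
}

function onRtpPacket(packet) {
  if (!signaledSsrcs.has(packet.ssrc)) {
    // Analogous to the "DeliverRtp: found unknown SSRC" log line: the
    // packet raced ahead of signaling, so the receiver cannot yet tell
    // which simulcast stream it belongs to.
    console.log(`unknown SSRC ${packet.ssrc}`);
    return;
  }
  console.log(`delivering SSRC ${packet.ssrc} to the decoder`);
}

// Packets for either SSRC arriving before onSdpProcessed() runs take the
// "unknown SSRC" path above, which is the failure mode seen in the log.
onRtpPacket({ ssrc: 2816312869 });
onSdpProcessed([2816312869, 3724330823]);
onRtpPacket({ ssrc: 3724330823 });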
Thanks for the comments and analysis, :drno! It sounds like there might be something we can do here to fix this.

:drno, is this something you could work on, or could you help find someone to look at it? We have had pretty high failure rates on this bug and bug 1351590 for the last 3 weeks. I would like to have something in the works this week, or to temporarily disable the test (on windows && !debug, so we would still have coverage on windows debug, plus osx* and linux*) in case we need another week or two to schedule the work.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #16)
> It looks to me like our filter let the initial packets for both SSRCs
> (2816312869 and 3724330823) through to the decoder. [...]
>
> I'm starting to wonder if filtering by RID would be a better and safer thing
> to do here. But that would be a lot of effort just to fix an intermittent.

If "our filter" refers to the filter that we're setting up just for this simulcast mochitest (PeerConnectionImpl::SelectSsrc), we're not getting that far in the mochitest.  If you're referring to some other filtering, you may be on to something.
(In reply to Joel Maher ( :jmaher) from comment #17)
> :drno, is this something you could work on, or could you help find someone
> to look at it? [...]

I think Michael is on a good track hunting down the root cause right now (we have been chatting about this particular bug for hours). Unfortunately the logic in this area is pretty complex, and there is still a chance that it turns out we are chasing the wrong thing again.
Since Michael is able to reproduce the problem locally (unfortunately only at low frequency), I don't have a problem with disabling the tests temporarily until we have a fix, if we need to. I suggest that if we don't have a fix by Friday morning, we land a temporary test disable on Friday.
Flags: needinfo?(drno)
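For reference, a temporary disable like the one proposed above is typically done with a skip-if annotation in the directory's mochitest.ini manifest. The entry below is only an illustrative sketch based on the platform split suggested in comment #17; it may not match the annotation in the patch that actually landed:

[test_peerConnection_simulcastAnswer.html]
# Bug 1351531 - illustrative condition only; see the attached patch for the real annotation
skip-if = (os == 'win' && !debug) || os == 'mac' || os == 'linux'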
Comment on attachment 8860132 [details]
Bug 1351531 - disable test_peerConnection_simulcastAnswer.html.

https://reviewboard.mozilla.org/r/132164/#review134990
Attachment #8860132 - Flags: review?(drno) → review+
Keywords: checkin-needed
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/9bfeb0378415
disable test_peerConnection_simulcastAnswer.html. r=drno
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/9bfeb0378415
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
reopening until we get a real fix here
Status: RESOLVED → REOPENED
Flags: needinfo?(rjesup)
Keywords: leave-open
Resolution: FIXED → ---
Whiteboard: [stockwell needswork] → [stockwell disabled]
(In reply to Joel Maher ( :jmaher) from comment #24)
> reopening until we get a real fix here

We decided to develop the real fix over in bug 1358224 (because the fix is not going to be trivial and it affects multiple intermittent bugs). So I think we can close this one, alright?
Blocks: 1358224
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Keywords: leave-open
Resolution: --- → FIXED
sounds great