Closed Bug 977944 Opened 9 years ago Closed 8 years ago

Intermittent test_ipc.html | test_peerConnection_offerRequiresReceiveVideo.html | pc_local: ICE failed to switch to connected | Test timed out.

Categories

(Core :: WebRTC: Networking, defect, P2)

29 Branch
x86_64
Linux
defect

Tracking


RESOLVED INCOMPLETE
mozilla35

People

(Reporter: KWierso, Assigned: drno)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 2 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=35383521&tree=Mozilla-Inbound
slave: tst-linux64-spot-359



15:58:56     INFO -  831 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  signalingState after local setRemoteDescription is 'stable'
15:58:56     INFO -  832 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  iceConnectionState should not be undefined
15:58:56     INFO -  833 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  signalingState after remote setLocalDescription is 'stable'
15:58:56     INFO -  834 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  iceConnectionState should not be undefined
15:58:56     INFO -  835 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  iceConnectionState should not be undefined
15:58:56     INFO -  836 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  pc_local: ICE failed to switch to connected
16:03:52     INFO -  Xlib:  extension "RANDR" missing on display ":0".
16:03:53     INFO -  837 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_peerConnection_offerRequiresReceiveVideo.html |  Test timed out.
16:03:53     INFO -  138680768[7fe8fb115140]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:1638: CloseInt: Closing PeerConnectionImpl 7e05bd119948ba82; ending call
16:03:53     INFO -  -131930368[7fe8fb119dc0]: [CCAPP Task|def] ccapi.c:1445: SIPCC-CC_API: 1/39, cc_int_onhook: UI -> GSM: ONHOOK
16:03:53     INFO -  -131930368[7fe8fb119dc0]: [CCAPP Task|def] ccapi.c:1447: (1/39) On-hook called from /builds/slave/m-in-l64-000000000000000000000/build/media/webrtc/signaling/src/sipcc/core/ccapp/ccprovider.c:657
16:03:53     INFO -  -195864832[7fe8fb119700]: [GSM Task|def] dcsm.c:532: SIPCC-DCSM: dcsm_process_event: DCSM 11  :(DCSM_READY:ONHOOK )
16:03:53     INFO -  -195864832[7fe8fb119700]: [GSM Task|fsm_sm] sm.c:46: SIPCC-FSM: sm_process_event: DEF 39  : 0x7fe9049b79edx: sm entry: (STABLE:ONHOOK)
16:03:53     INFO -  -195864832[7fe8fb119700]: [GSM Task|fsm_sm] fsmdef.c:6711: SIPCC-FSM: fsmdef_ev_onhook: Entered.
Assignee: nobody → drno
Test timeouts are probably caused by Bug 976182
Progress: as expected, the fix from Bug 976182 now results in connection failures from both ends and no more test timeouts.

Let's hope for now that Bug 977933 will give us more clues in the logging once this hits again.
From logging:

11:54:16     INFO -  (ice/WARNING) ICE-PEER(PC:1393962854814631 (id=58 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.:default): no pairs for 1393962854814631 (id=58 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.: stream2/video
11:54:16     INFO -  (ice/WARNING) ICE-PEER(PC:1393962855025412 (id=58 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.:default): no pairs for 1393962855025412 (id=58 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.: stream2/video


That looks pretty much fatal.
(In reply to Byron Campen [:bwc] from comment #9)
> That looks pretty much fatal.

Indeed. But the questions are
- why
- why now
- why only for one of the streams

As this always seems to happen inside test_ipc.html, I'm starting to wonder if we are running into resource problems (e.g. running out of sockets or something like that).
It would be extremely useful if the test cases dumped the local and remote description when ICE fails.
I suspect we've hit some sort of problem in the interaction between NrSocketIpc and candidate creation.
(In reply to Byron Campen [:bwc] from comment #11)
> It would be extremely useful if the test cases dumped the local and remote
> description when ICE fails.

Is there a way to dump that information from JS land?

I'm thinking about increasing the ICE log levels so that we can see what's going on. But very likely that would blow the logs beyond their limits on the build servers.
(In reply to Nils Ohlmeier [:drno] from comment #13)
> (In reply to Byron Campen [:bwc] from comment #11)
> > It would be extremely useful if the test cases dumped the local and remote
> > description when ICE fails.
> 
> Is there a way to dump that information from JS land?
> 
> I thinking about increasing the ICE log levels so that we see whats going
> on. But very likely that would blow the logs beyond there limits on the
> build servers.

    The test cases should be able to save a copy of both the local and remote description, but I'm not sure there is a way to ask for them again.
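The saving-a-copy approach could look roughly like this in the test harness. This is a minimal sketch; the names `SdpArchive`, `record`, and `dumpAll` are illustrative, not the actual mochitest helpers:

```javascript
// Sketch: cache each side's SDP as it is set, so it can be dumped
// post-mortem when ICE fails. All names here are hypothetical.
function SdpArchive() {
  this.sdps = {};
}

SdpArchive.prototype.record = function (side, type, sdp) {
  // e.g. side = "pc_local", type = "local" or "remote"
  this.sdps[side + "." + type] = sdp;
};

SdpArchive.prototype.dumpAll = function (log) {
  for (var key in this.sdps) {
    log("SDP " + key + ":\n" + this.sdps[key]);
  }
};

// Usage in a test would be something like:
//   archive.record("pc_local", "local", pc.localDescription.sdp);
// and on ICE failure:
//   archive.dumpAll(dump);
```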
Attached patch dump_sdp_after_ice_failure.patch (obsolete) — Splinter Review
Thanks for the idea. This patch writes out the SDPs from both sides when the ICE connection fails.
Obviously we are still missing trickle ICE and timing information. But hopefully this helps to debug the issues.
Attachment #8390799 - Flags: review?(docfaraday)
Comment on attachment 8390799 [details] [diff] [review]
dump_sdp_after_ice_failure.patch

Review of attachment 8390799 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like it will do the right thing, but it seems likely that we'll need to add more post-mortem logging, so putting it all in a function call is probably a good idea.
Attachment #8390799 - Flags: review?(docfaraday) → review+
Try run (although it is doubtful it will hit the code) https://tbpl.mozilla.org/?tree=Try&rev=82cd9660c237

I would request landing of this for now, and re-factor when I put in more post-mortem debugging info.
(In reply to Nils Ohlmeier [:drno] from comment #18)
> I would request landing of this for now, and re-factor when I put in more
> post-mortem debugging info.

I'm cool with that.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/add3f8ae5443
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Sorry, this was probably confusing, but the check-in was just instrumentation for future debugging of this intermittent test failure.
Is there a better way for me to mark this properly here in the future?

We are waiting for another test failure with the instrumentation in place...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch dump_log_on_ice_error.patch (obsolete) — Splinter Review
Attachment #8396045 - Flags: review?(docfaraday)
Comment on attachment 8396045 [details] [diff] [review]
dump_log_on_ice_error.patch

Review of attachment 8396045 [details] [diff] [review]:
-----------------------------------------------------------------

Looking at the implementation of requestCompleteLog(), all it seems to do is flip a bit that causes all subsequent logging to be output instead of accumulated in the circular buffer. This circular buffer _should_ be getting dumped as soon as a failure is observed, but I'm guessing it is not? Maybe the following chunk of code is causing some confusion somehow in the ipc case?

dxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/SimpleTest/SimpleTest.js#412
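The behavior described above — accumulate messages in a fixed-size ring buffer until `requestCompleteLog()` flips a bit — can be sketched like this. This is illustrative only, not the actual SimpleTest implementation:

```javascript
// Sketch of a circular log buffer: messages accumulate in a ring of
// fixed capacity and only stream out directly once the "complete log"
// flag is flipped. Hypothetical shape, not the real SimpleTest code.
function RingLogger(capacity, output) {
  this.capacity = capacity;
  this.buffer = [];
  this.passThrough = false;
  this.output = output;
}

RingLogger.prototype.log = function (msg) {
  if (this.passThrough) {
    this.output(msg);
    return;
  }
  this.buffer.push(msg);
  if (this.buffer.length > this.capacity) {
    this.buffer.shift();   // drop the oldest entry
  }
};

// Equivalent of requestCompleteLog(): flush what we still have and
// stream everything from now on.
RingLogger.prototype.requestCompleteLog = function () {
  var out = this.output;
  this.buffer.forEach(function (msg) { out(msg); });
  this.buffer = [];
  this.passThrough = true;
};
```

The failure mode being guessed at above would then be: the flag gets flipped, but the already-evicted entries are gone for good.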
Good point Byron. That whole logging seems pretty useless to me.
Attachment #8396045 - Attachment is obsolete: true
Attachment #8396045 - Flags: review?(docfaraday)
This should hopefully be able to bypass the mochitest logging for dumping the big SDPs.
Attachment #8396818 - Flags: review?(docfaraday)
Comment on attachment 8396818 [details] [diff] [review]
dump_log_on_ice_error.patch

Review of attachment 8396818 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me, provided |dump| actually does what it says.
Attachment #8396818 - Flags: review?(docfaraday) → review+
(In reply to Byron Campen [:bwc] from comment #45)
> Looks good to me, provided |dump| actually does what it says.

I verified that it works locally, and it bypasses the mochitest logging logic...
Attachment #8390799 - Attachment is obsolete: true
Blocks: 978415
Now we finally have the SDP in the logs. But I don't see anything bad or wrong in there.
Byron: anything in the SDP which looks suspicious to you?
Flags: needinfo?(docfaraday)
The port that seems to never receive any traffic seems to show up twice in the same SDP, once for each component. That's highly suspicious...

19:24:23     INFO -  a=candidate:0 1 UDP 2122252543 10.134.57.120 42238 typ host
19:24:23     INFO -  a=candidate:0 2 UDP 2122252542 10.134.57.120 42238 typ host
Flags: needinfo?(docfaraday)
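A quick way to flag this pattern mechanically is to scan the a=candidate lines for an address/port pair that shows up under more than one component. A sketch (the parsing is simplified and purely illustrative; field positions follow the standard candidate-attribute syntax):

```javascript
// Sketch: find host/port pairs that appear under more than one
// component in a set of a=candidate lines, as in the log above.
// Normally RTP (component 1) and RTCP (component 2) get distinct
// ports, so a shared port is the suspicious case.
function findDuplicatePorts(sdpLines) {
  var seen = {};   // "addr:port" -> set of component ids
  var dupes = [];
  sdpLines.forEach(function (line) {
    var m = /^a=candidate:\S+ (\d+) UDP \d+ (\S+) (\d+) typ host/.exec(line);
    if (!m) return;
    var comp = m[1];
    var key = m[2] + ":" + m[3];
    seen[key] = seen[key] || {};
    seen[key][comp] = true;
    if (Object.keys(seen[key]).length > 1 && dupes.indexOf(key) < 0) {
      dupes.push(key);
    }
  });
  return dupes;
}
```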
Adding jduell for visibility.
Jason: as per my separate email can you please help us debug why the network library over IPC hands out the same port twice (or point us to the best person)?
Flags: needinfo?(jduell.mcbugs)
I might be able to help on debugging the UDP IPC part. Can you point out which log file you guys found the duplicate port?
(In reply to Shih-Chiang Chien [:schien] (UTC+8) from comment #62)
> I might be able to help on debugging the UDP IPC part. Can you point out
> which log file you guys found the duplicate port?

I found it here:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37039733&full=1&branch=mozilla-inbound

However, this does not seem to happen always. There are probably other errors lurking.
Seeing logging like this a lot:

02:43:15     INFO -  (ice/WARNING) Peer offered candidates for disabled local component

I would not expect this to happen in this test. Maybe we're hitting a race where the offer comes in before we've enabled the component somehow? I'm going to try improving the logging here, and look at the code in the meantime.
It looks like we get that kind of logging in normal operation. I'm trying to get this to reproduce on try with slightly better logging.
FYI: as we have limited the number of tests executed within test_ipc.html, the signature of this problem has changed a little. New bug 992902 got filed, which still shows the same ICE connection problem.
Sorry about my latency--was on vacation last week.  Randell is a better person to ask about anything ICE-related.
Flags: needinfo?(jduell.mcbugs) → needinfo?(rjesup)
Sigh.  ekr, can you look at this?  Or Byron, if you think you have the info you need to look at the duplicate port, please cancel the needinfo.  If you think it's related to IPC specifically, please needinfo jduell again.
Flags: needinfo?(rjesup) → needinfo?(ekr)
(In reply to Jason Duell (:jduell) from comment #70)
> Sorry about my latency--was on vacation last week.  Randell is a better
> person to ask about anything ICE-related.

This isn't actually ICE-related; we caught this code binding to the same port twice when passed 0, and were wondering whether we were somehow misusing the service or whether there was a bug:

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/nr_socket_prsock.cpp#1005
Flags: needinfo?(jduell.mcbugs)
Byron looked at it.
Trying to clear the needinfo
Flags: needinfo?(ekr)
Byron,
  I am looking through the logs, trying to piece together how you came to the conclusion that the binding was happening on the same port. Could you provide a breakdown of what you are seeing?
Attached patch Bug977944.patchSplinter Review
https://tbpl.mozilla.org/?tree=Try&rev=6bbc6e6fde84

An approach to fixing this problem by staggering the calls to getUserMedia.
The current scenario is unlikely to happen in real life, and as such it seems reasonable to fix the test and then revisit the IPDL pieces if we get reports from the field of this happening.
Attachment #8411282 - Flags: review?(drno)
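For reference, the staggering idea amounts to serializing the getUserMedia requests instead of issuing them all at once. A hypothetical sketch (`staggeredGetUserMedia` is not the name used in the patch; the real change is in the mochitest head files):

```javascript
// Sketch: run getUserMedia requests one after another instead of
// firing them concurrently, so each socket/port allocation finishes
// before the next one starts. Illustrative callback-style helper.
function staggeredGetUserMedia(constraintsList, getUserMedia, done) {
  var streams = [];
  function next(i) {
    if (i >= constraintsList.length) {
      done(streams);
      return;
    }
    getUserMedia(constraintsList[i], function (stream) {
      streams.push(stream);
      next(i + 1);   // only start the next request once this one finished
    });
  }
  next(0);
}
```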
Comment on attachment 8411282 [details] [diff] [review]
Bug977944.patch

Review of attachment 8411282 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not convinced that this is the root cause. But on the other hand it is a general cleanup of our test cases which does not harm. So worth a try.

Looks good to me from the coding and functional side.
Attachment #8411282 - Flags: review?(drno) → review+
Yes agreed, kind of a shot in the dark.
An update on my efforts: 

The ICE candidates are determined when you initialize a mozRTCPeerConnection.
At 19:24:17 in the logs you can see the initialization occur, with the error.

So I created a test case which initializes many mozRTCPeerConnections, and then ran

 grep "trickling candidate" | cut -d' ' -f14 | sort | uniq -c

over the stdout from the mochitest, but I wasn't able to reproduce duplicate ports being generated.

I did notice that netwerk/base/src/nsUDPSocket.cpp has reuse_addr = true, which I believe could allow for UDP port reuse.

But since I couldn't verify the failure case, I have no idea whether that solution would or would not work.



Possibly a platform issue?
Blocks: 992902
Current intermittent failures all seem to happen on Linux and Mac OS X. I haven't seen any failures on Windows.
The most recent log looks like bug 963524 (specifically, the bug 1006809 part of it).
I should also note that in the IPC case, there is some packet filtering logic that exacerbates the problem detailed here: https://bugzilla.mozilla.org/show_bug.cgi?id=963524#c52

Here's the code that makes this mess worse, and makes this more frequent in the IPC case, AFAICT:

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/stun_udp_socket_filter.cpp#130

This drops incoming check requests until we've sent our own check request in the other direction, which causes the first sender to have their first check request dropped, increasing the likelihood that the first sender will be pushed to the max retransmit count by triggered checks.
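The drop-until-we-have-sent behavior described above can be sketched as a tiny state machine. This is illustrative only; the real filter is the C++ code in stun_udp_socket_filter.cpp linked above:

```javascript
// Sketch: incoming STUN binding requests from a peer are dropped until
// we have sent our own request toward that peer. Hypothetical shape of
// the filtering logic, not the actual Firefox implementation.
function StunFilter() {
  this.sentTo = {};   // remote "addr:port" we have sent a request to
}

StunFilter.prototype.noteOutgoingRequest = function (remote) {
  this.sentTo[remote] = true;
};

// Returns true if an incoming request from |remote| may pass the filter.
// Before our first outgoing request, everything from that peer is dropped,
// which is exactly why the first sender loses its first check request.
StunFilter.prototype.allowIncomingRequest = function (remote) {
  return this.sentTo[remote] === true;
};
```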
So it sounds like we're not actually sure whether we saw bind(port=0) hand out the same port twice?

> I did notice netwerk/base/src/nsUDPSocket.cpp has reuse_addr = true,

Not sure why you'd want to have reuse_addr = true.  Generally I'd avoid that?
Flags: needinfo?(jduell.mcbugs)
(In reply to Jason Duell (:jduell) from comment #117)
> So it sounds like we're not actually sure whether we saw bind(port=0) hand
> out the same port twice?
> 

   Oh, I did see it in the logs. I'm just not convinced it is the cause of the bulk of these failures. Once we land a fix or two, we should get a better idea of what effect it actually has.
Depends on: 1013822
Targeting finishing this by Fx 33.
Priority: -- → P2
Target Milestone: mozilla30 → mozilla33
Nils -- Where does this stand?  How close are we to solving this?
Flags: needinfo?(drno)
Not much new information here.
I analyzed the log from comment #153. The ICE connections for audio and video are successful. Then the GC kicks in, and when the GC is finished the third ICE connection, for the data channel, has timed out.
Makes me wish we could better control the GC from within a test. But it also makes you wonder how the GC affects real-world usage of timing-sensitive ICE operations.

In general the majority of the test failures here fall into the ICE connection problem bucket, which reinforces my wish for better/more ICE connection debugging possibilities from JS land.
Flags: needinfo?(drno)
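To see why a pause of a second or two matters, consider an illustrative doubling retransmit schedule. The numbers below are made up for illustration and are not Firefox's actual ICE timer values:

```javascript
// Illustration only: with a doubling retransmit timer, a connectivity
// check that starts with a 100 ms interval and allows 7 transmissions
// sends its last packet about 6.3 s in. A multi-second GC stall
// therefore overlaps several of the later retransmit windows.
function retransmitDeadlines(rtoMs, maxTransmits) {
  var deadlines = [];
  var t = 0;
  var rto = rtoMs;
  for (var i = 0; i < maxTransmits; i++) {
    deadlines.push(t);   // time at which transmission i goes out
    t += rto;
    rto *= 2;            // back off: double the interval each time
  }
  // Give up one final (doubled) interval after the last transmission.
  return { deadlines: deadlines, giveUpAt: t };
}
```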
(In reply to Nils Ohlmeier [:drno] from comment #156)
> Not much new information here.
> I analyzed the log from comment #153. The ICE connections for audio and
> video are successful. Then the GC kicks in. And when the GC is finished the
> third ICE connection for the data channel is timed out.
> Makes me wish we could better control the GC from within a test. But also
> makes you wonder how the GC affect real world usage of timing sensitive ICE
> operations.

How long did the GC take?  Our tests tend to accumulate garbage more than normal use, and so the GC's when they happen can take a while (especially on b2g emulator!)  Running gc inbetween each test might be a good (though not new) idea.  Or better yet, try our darnedest to remove mainthread touches, though for trickle ice it's unavoidable.  (To bad we can't run in a worker.....)

> In general the majority of the test failures in here fall into the ICE
> connection problem bucket. Which raises my wish for better/more ICE
> connection debug possibilities from JS land.

hear hear
(In reply to Randell Jesup [:jesup] from comment #157)
> (In reply to Nils Ohlmeier [:drno] from comment #156)
> > I analyzed the log from comment #153. The ICE connections for audio and
> > video are successful. Then the GC kicks in. And when the GC is finished the
> > third ICE connection for the data channel is timed out.
> > Makes me wish we could better control the GC from within a test. But also
> > makes you wonder how the GC affect real world usage of timing sensitive ICE
> > operations.
> 
> How long did the GC take?  Our tests tend to accumulate garbage more than
> normal use, and so the GC's when they happen can take a while (especially on
> b2g emulator!)  Running gc inbetween each test might be a good (though not
> new) idea.  Or better yet, try our darnedest to remove mainthread touches,
> though for trickle ice it's unavoidable.  (To bad we can't run in a
> worker.....)

In this case it looks like it took 1-2 seconds. BTW this is not the B2G emulator, but e10s on Linux. It shows issues similar to B2G, just not as bad or as often.

Byron: from looking at the logs it looks like the three ICE connections get tried one after another. Does that sound right? If that is the case, does each of them have independent timeouts which start on the first check, or is the last ICE connection left with whatever time remains after establishing all previous connections?
Flags: needinfo?(docfaraday)
From what I could tell, every candidate that should have been tried was, but it is hard to discern whether there were any network IO problems, or the exact timing. Each candidate runs its own timers, and there is also a per-session timer that prevents ICE from failing until it pops, to allow more candidates to be gathered and new pairs to be formed (this is the "grace period" timer we added recently).
Flags: needinfo?(docfaraday)
Target Milestone: mozilla33 → mozilla35
backlog: --- → webRTC+
Rank: 25
Closing as not seen in 7 months
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → INCOMPLETE
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open