Closed Bug 1037618 Opened 6 years ago Closed 2 years ago

Intermittent ice_unittest | test failed with return code 1

Categories

(Core :: WebRTC, defect, P4)

x86_64
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox33 --- disabled
firefox34 --- disabled
firefox35 --- disabled
firefox40 --- wontfix
firefox41 --- affected
firefox42 --- affected
firefox43 --- affected
firefox-esr31 --- disabled
firefox-esr38 --- affected
b2g-v1.4 --- disabled
b2g-v2.0 --- disabled
b2g-v2.1 --- disabled
b2g-v2.2 --- disabled
b2g-master --- affected
Blocking Flags:

People

(Reporter: RyanVM, Assigned: drno)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(4 files, 1 obsolete file)

Attached file full ice_unittest log
https://tbpl.mozilla.org/php/getParsedLog.php?id=43622829&tree=Fx-Team

Rev4 MacOSX Snow Leopard 10.6 fx-team opt test cppunit on 2014-07-11 10:05:37 PDT for push ed9ec8fed9a4
slave: t-snow-r4-0025

10:08:29     INFO -  CANDIDATES:
10:08:29     INFO -  Stream P1
10:08:29     INFO -  candidate:0 1 UDP 2128609535 10.26.56.189 59147 typ host
10:08:29     INFO -  candidate:1 1 UDP 1692467199 63.245.214.82 48826 typ srflx raddr 10.26.56.189 rport 59147
10:08:30     INFO -  [       OK ] IceGatherTest.TestGatherDNSStunServerIpAddress (1088 ms)
10:08:30     INFO -  [ RUN      ] IceGatherTest.TestGatherDNSStunServerHostname
10:08:30     INFO -  Candidate initialized: candidate:0 1 UDP 2128609535 10.26.56.189 50841 typ host
10:08:37     INFO -  /builds/slave/fx-team-osx64-0000000000000000/build/media/mtransport/test/ice_unittest.cpp:844: Failure
10:08:37     INFO -  Value of: res
10:08:37     INFO -    Actual: false
10:08:37     INFO -  Expected: true
10:08:37     INFO -  ICE(P1): No STUN servers specified
10:08:37     INFO -  ICE(P1): No TURN servers specified
10:08:38     INFO -  [  FAILED  ] IceGatherTest.TestGatherDNSStunServerHostname (8008 ms)
10:08:38     INFO -  [ RUN      ] IceGatherTest.TestGatherDNSStunBogusHostname
These seem to be network-related; for some reason the STUN server that these tests are hitting is being unreliable.
Depends on: 1045337
Looks like our stun server is being flaky again.
I think having these test being disabled on Mac OSX is pretty bad. The solution is not pretty, but hopefully we at least get the tests back this way.
Attachment #8499814 - Flags: review?(docfaraday)
Comment on attachment 8499814 [details] [diff] [review]
bug_1037618_fix_ice_unitests_for_osx.patch

Review of attachment 8499814 [details] [diff] [review]:
-----------------------------------------------------------------

*sadface*

Maybe we should instead just use the test STUN server for all of this?
Hmm, looking at some of the more recent failures, I see evidence that EXPECT_TRUE_WAIT is not waiting for the full duration. We should be waiting 7 seconds for gathering to complete, but we do not seem to be waiting even 1. Maybe there's some gtest bug on OS X?
Ok, that EXPECT_TRUE_WAIT macro was something we wrote, based on PRIntervalTime. I do not see any obvious flaws in the implementation of the macro, have not evaluated the timer implementation stuff.
Tried Nils' patch, still seeing problems on 10.6, even with the test stun server. Something is rotten on those 10.6 boxes.
As far as I can tell, those 10.6 boxes have no network connectivity to the outside world at all.
Comment on attachment 8499814 [details] [diff] [review]
bug_1037618_fix_ice_unitests_for_osx.patch

Review of attachment 8499814 [details] [diff] [review]:
-----------------------------------------------------------------

Since this doesn't seem to help us any.
Attachment #8499814 - Flags: review?(docfaraday)
Rank: 35
Flags: firefox-backlog+
Priority: -- → P3
backlog: --- → webRTC+
Flags: firefox-backlog+
Nils,

All of the recent tests here seem to be in the TCP tests.
Flags: needinfo?(drno)
So for the test failure where TestSendReceiveTcp failed the problem seems to be that in these cases the TCP active connection gets advertised in the ICE candidate as coming from port 9. As we are obviously not sending from port 9 but some ephemeral number the receiving side marks that candidate as peer reflexive. Which then makes our tests unhappy, because they expect a host-to-host connection happening here.
Flags: needinfo?(drno)
I see two options for fixing the issue described in comment #103:
1) Open up the tests to accept peer reflexive, at least when TCP is used/expected
2) Teach nICEr that port 9 is actually never a port to expect traffic from, so don't change the candidate type if we notice the port mis-match.

I'm not sure if option 2) is better or could cause collateral damage. Ekr, bwc any thought/opinions on that?
Flags: needinfo?(ekr)
Flags: needinfo?(docfaraday)
Further more I see failures in TestGatherDNSStunServerHostnameTcp and TestGatherDNSStunServerHostnameTcp which I don't have answers for yet why they are failing.
In both cases the 'tcptype so' seem to be missing, but no log message why. I'm going to read some code and see if I can come up with an explanation for that.
I think for now, let's do #1
Flags: needinfo?(ekr)
(In reply to Eric Rescorla (:ekr) from comment #106)
> I think for now, let's do #1

Agreed. That candidate is definitely a peer reflexive according to the definition, and is also the candidate we expect to see in that test.
Flags: needinfo?(docfaraday)
Bug 1037618: relax candidate verification for TCP. r?bwc
Attachment #8646176 - Flags: review?(docfaraday)
Comment on attachment 8646176 [details]
MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc

https://reviewboard.mozilla.org/r/15653/#review13997

::: media/mtransport/test/ice_unittest.cpp:705
(Diff revision 1)
> +          if (expected_local_transport_==kNrIceTransportTcp &&
> +              expected_remote_type_==NrIceCandidate::ICE_HOST) {

Spaces around ==

::: media/mtransport/test/ice_unittest.cpp:704
(Diff revision 1)
> +          /* Special case for ICE TCP using port 9 for its active candidates */

This comment is probably too cryptic. Maybe something like "Active remote ICE TCP candidates will be discovered as peer reflexive, since they are trickled with port 9 as a placeholder. So, allow peer reflexive here."
Attachment #8646176 - Flags: review?(docfaraday) → review+
Comment on attachment 8646176 [details]
MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc

Bug 1037618: relax candidate verification for TCP. r=bwc
Attachment #8646176 - Attachment description: MozReview Request: Bug 1037618: relax candidate verification for TCP. r?bwc → MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc
Assignee: nobody → drno
Try run is green for attachment 8646176 [details]
(In reply to Treeherder Robot from comment #119)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=4257884
> repository: fx-team
> start_time: 2015-08-14T11:54:10
> who: rvandermeulen[at]mozilla[dot]com
> machine: t-yosemite-r5-0017
> buildname: Rev5 MacOSX Yosemite 10.10 fx-team opt test cppunit
> revision: a303d8119dc8
> 
> TEST-UNEXPECTED-FAIL | ice_unittest | test failed with return code 1
> Return code: 1

Ohoh looks like we are "loosing" candidate on the fly :-0

12:02:30     INFO -  Candidate initialized: candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host
12:02:30     INFO -  Candidate initialized: candidate:1 1 TCP 2105442559 10.26.56.83 56425 typ host tcptype passive
12:02:30     INFO -  Candidate initialized: candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active
12:02:30     INFO -  Candidate initialized: candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so
12:02:30     INFO -  Gathering complete for P1
12:02:30     INFO -  CANDIDATES:
12:02:30     INFO -  Stream P1
12:02:30     INFO -  candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host
12:02:30     INFO -  candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active
12:02:30     INFO -  candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so
12:02:30     INFO -  Returning candidate: candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host
12:02:30     INFO -  Returning candidate: candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active
12:02:30     INFO -  Returning candidate: candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so
12:02:30     INFO -  /builds/slave/fx-team-m64-000000000000000000/build/src/media/mtransport/test/ice_unittest.cpp:1761: Failure
12:02:30     INFO -  Value of: StreamHasMatchingCandidate(0, "tcptype passive")
12:02:30     INFO -    Actual: false
12:02:30     INFO -  Expected: true
Theory of what might be going on with missing or loosing candidates: for the tcpype active candidate we currently allocate a port number, but do not actually create a socket with/for that port number (sockets for active connection get allocated once a connection to a destination gets requested). Internally the port number is a random ephemeral number, and the number 9 we see in the test output only result from over-writing the port number before handing it off. What could be a problem here is that deduping might kill new candidates, e.g. tcptype so, or even removed previously created candidates like the missing tcptype passive.

I'll look into improving our code base in that area, probably in a separate bug.
(In reply to Nils Ohlmeier [:drno] from comment #123)
> I'll look into improving our code base in that area, probably in a separate
> bug.

Created bug 1195396 to follow up on this.
Depends on: 1195396
Whiteboard: [test disabled on OSX][leave open]
It looks like a DNS resolution is occasionally taking a long enough time to cause the test to time out. We could try extending the timeout and see if that helps. It could be argued that this test should not run in CI, since a DNS lookup relies on the external network.
(In reply to Byron Campen [:bwc] from comment #151)
> It looks like a DNS resolution is occasionally taking a long enough time to
> cause the test to time out. We could try extending the timeout and see if
> that helps. It could be argued that this test should not run in CI, since a
> DNS lookup relies on the external network.

Why would it be looking up an external network address?  The test machines have no access to an external network, so any access to DNS should be local.  If it's timing out, that implies the entire system stalled for a while, it was chewed up by higher-priority threads, or something is wrong with DNS (local server).
Flags: needinfo?(docfaraday)
Comment on attachment 8709504 [details]
MozReview Request: Bug 1037618: Extend DNS timeout for bogus STUN server tests.

https://reviewboard.mozilla.org/r/31459/#review28213

On a machine with working DNS resolution trying to resolve anything at .invalid resolve really fast. So I guess that means on machines were this takes long DNS resolution isn't working at all. So I'm wondering if a check for empty |g_stun_server_address| to disable these tests is more appropriate?

I'll leave the decision up to you.
Attachment #8709504 - Flags: review+
(In reply to Randell Jesup [:jesup] from comment #153)
> (In reply to Byron Campen [:bwc] from comment #151)
> > It looks like a DNS resolution is occasionally taking a long enough time to
> > cause the test to time out. We could try extending the timeout and see if
> > that helps. It could be argued that this test should not run in CI, since a
> > DNS lookup relies on the external network.
> 
> Why would it be looking up an external network address?  The test machines
> have no access to an external network, so any access to DNS should be local.
> If it's timing out, that implies the entire system stalled for a while, it
> was chewed up by higher-priority threads, or something is wrong with DNS
> (local server).

We're using a bogus FQDN (a .invalid address) in a test to make sure the code-path works. But if that doesn't try to hit the wire, it may be some other problem...
Flags: needinfo?(docfaraday)
Six of these seem to be due to failure to form a TCP connection in a reasonable amount of time; a similar problem as seen in bug 1201209 recently.

http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456350708/fx-team_yosemite_r7_test-cppunit-bm106-tests1-macosx-build452.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456440145/fx-team_yosemite_r7_test-cppunit-bm107-tests1-macosx-build366.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1456590806/mozilla-inbound_snowleopard_test-cppunit-bm108-tests1-macosx-build2266.txt.gz
https://public-artifacts.taskcluster.net/FVDnFk4_Sf-DqktJ1MojIA/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/Zj2ME_8UT06G3LsKjWvHsg/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/ImdA3S28SoSL_hrZ-NLKOg/0/public/logs/live_backing.log

Two seem to be caused by TestStunServer::SetDropInitialPackets not working as expected.

http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-linux64-asan/1456393484/mozilla-inbound_ubuntu64-asan_vm_test-cppunit-bm120-tests1-linux64-build59.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64-debug/1456242790/mozilla-central_ubuntu64_vm-debug_test-cppunit-bm67-tests1-linux64-build1.txt.gz

Two have two test-cases that look up an invalid FQDN, which both fail.

http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/1456520126/fx-team_ubuntu64_vm_test-cppunit-bm120-tests1-linux64-build5.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/1456525679/fx-team_ubuntu64_vm_test-cppunit-bm116-tests1-linux64-build11.txt.gz

Two are due to a bug in the test code where duplicate candidates will cause some checking code to be sad.

http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456325925/fx-team_snowleopard_test-cppunit-bm107-tests1-macosx-build1085.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1456241709/mozilla-inbound_snowleopard_test-cppunit-bm108-tests1-macosx-build2133.txt.gz
Depends on: 1248637
Depends on: 1252516
(In reply to Byron Campen [:bwc] from comment #159)
> 
> Two have two test-cases that look up an invalid FQDN, which both fail.
> 
> http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/
> 1456520126/fx-team_ubuntu64_vm_test-cppunit-bm120-tests1-linux64-build5.txt.
> gz
> http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/
> 1456525679/fx-team_ubuntu64_vm_test-cppunit-bm116-tests1-linux64-build11.txt.
> gz
> 

   Both of these failures occurred on linux64, less than a minute apart, in exactly the same way (DNS resolution for .invalid seems to hang). I suspect this is infra-related.
Attachment #8709504 - Attachment is obsolete: true
(In reply to Byron Campen [:bwc] from comment #159)
> 
> Two seem to be caused by TestStunServer::SetDropInitialPackets not working
> as expected.
> 
> http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-
> linux64-asan/1456393484/mozilla-inbound_ubuntu64-asan_vm_test-cppunit-bm120-
> tests1-linux64-build59.txt.gz
> http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-
> linux64-debug/1456242790/mozilla-central_ubuntu64_vm-debug_test-cppunit-bm67-
> tests1-linux64-build1.txt.gz

   This is happening because a UDP port is being reused from an earlier test, and the test STUN server still has a packet count for that address which is large enough to exceed the packet drop count. We need to clear out these counts before running these tests.
Depends on: 1252585
Comment on attachment 8725365 [details]
MozReview Request: Bug 1037618 - Part 1: Add some logging to highlight TCP connection failures. r?drno

https://reviewboard.mozilla.org/r/37445/#review34215

::: media/mtransport/third_party/nICEr/src/net/nr_socket_multi_tcp.c:591
(Diff revision 1)
> +      r_log(LOG_ICE,LOG_WARNING,"%s:%d %s failed to accept new TCP connection: %d",__FILE__,__LINE__,__FUNCTION__,_status);

Might be helpful to include |remote_addr.as_string| here as well if it is set already.
Attachment #8725365 - Flags: review?(drno) → review+
Byron FYI on the TCP connection failures: I found the other day that not surprisingly all of the nICEr is not taking into account that establishing connections can take time. So in case of HTTP proxies connect() gets called, pretty soon followed by a call to write(). The write() returns R_WOULDBLOCK, which then results in the socket getting armed for write status. The OS then immediately returns that the socket in fact is writeable. Which then causes another call to write(). And then we go into something like a busy loop, until the proxy code either connects or gets failure.
I'm not sure if we have similar problems in general with TCP connections. But as we don't have a callback for when connections get established it seems plausible that also our TCP code in general might run into corner cases like this, where a busy loop on a small try machine might stall the test indefinitely.
https://reviewboard.mozilla.org/r/37445/#review34215

> Might be helpful to include |remote_addr.as_string| here as well if it is set already.

I thought about that, we'd either need to zero out |remote_addr| up top, or have another variable that tracks whether it has been filled in. As it stands, if this fails I'm primarily interested in the error, not so much who is the lucky winner.
Mass change P3->P4 to align with new Mozilla triage process.
Priority: P3 → P4
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.