Closed Bug 1037618 Opened 11 years ago Closed 8 years ago

Intermittent ice_unittest | test failed with return code 1

Categories

(Core :: WebRTC, defect, P4)

Product:

Component:

Platform:

x86_64

macOS

Type:

defect

Priority:

P4

Severity:

normal

Rank:

35

Tracking

()

Status:

RESOLVED WORKSFORME

Tracking Flags:

Tracking

Status

firefox33

---

disabled

firefox34

---

disabled

firefox35

---

disabled

firefox40

---

wontfix

firefox41

---

affected

firefox42

---

affected

firefox43

---

affected

firefox-esr31

---

disabled

firefox-esr38

---

affected

b2g-v1.4

---

disabled

b2g-v2.0

---

disabled

b2g-v2.1

---

disabled

b2g-v2.2

---

disabled

b2g-master

---

affected

Tracking

Status

firefox33

firefox34

firefox35

firefox40

firefox41

firefox42

firefox43

firefox-esr31

firefox-esr38

b2g-v1.4

b2g-v2.0

b2g-v2.1

b2g-v2.2

relnote-firefox

b2g-master

thunderbird_esr115

thunderbird_esr128

thunderbird_esr140

firefox-esr115

firefox-esr128

firefox-esr140

firefox141

firefox142

firefox143

Blocking Flags:

backlog

webrtc/webaudio+

People

(Reporter: RyanVM, Assigned: drno)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(4 files, 1 obsolete file)

full ice_unittest log 11 years ago Ryan VanderMeulen [:RyanVM] 82.60 KB, text/plain		Details
bug_1037618_fix_ice_unitests_for_osx.patch 11 years ago Nils Ohlmeier [:drno] 2.51 KB, patch		Details \| Diff \| Splinter Review
MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc 10 years ago Nils Ohlmeier [:drno] 40 bytes, text/x-review-board-request	bwc : review+	Details
MozReview Request: Bug 1037618: Extend DNS timeout for bogus STUN server tests. 10 years ago Byron Campen [:bwc] 58 bytes, text/x-review-board-request	drno : review+	Details
MozReview Request: Bug 1037618 - Part 1: Add some logging to highlight TCP connection failures. r?drno 9 years ago Byron Campen [:bwc] 58 bytes, text/x-review-board-request	drno : review+	Details

Ryan VanderMeulen [:RyanVM]

Reporter

Description

•

11 years ago

Attached file full ice_unittest log — Details

https://tbpl.mozilla.org/php/getParsedLog.php?id=43622829&tree=Fx-Team Rev4 MacOSX Snow Leopard 10.6 fx-team opt test cppunit on 2014-07-11 10:05:37 PDT for push ed9ec8fed9a4 slave: t-snow-r4-0025 10:08:29 INFO - CANDIDATES: 10:08:29 INFO - Stream P1 10:08:29 INFO - candidate:0 1 UDP 2128609535 10.26.56.189 59147 typ host 10:08:29 INFO - candidate:1 1 UDP 1692467199 63.245.214.82 48826 typ srflx raddr 10.26.56.189 rport 59147 10:08:30 INFO - [ OK ] IceGatherTest.TestGatherDNSStunServerIpAddress (1088 ms) 10:08:30 INFO - [ RUN ] IceGatherTest.TestGatherDNSStunServerHostname 10:08:30 INFO - Candidate initialized: candidate:0 1 UDP 2128609535 10.26.56.189 50841 typ host 10:08:37 INFO - /builds/slave/fx-team-osx64-0000000000000000/build/media/mtransport/test/ice_unittest.cpp:844: Failure 10:08:37 INFO - Value of: res 10:08:37 INFO - Actual: false 10:08:37 INFO - Expected: true 10:08:37 INFO - ICE(P1): No STUN servers specified 10:08:37 INFO - ICE(P1): No TURN servers specified 10:08:38 INFO - [ FAILED ] IceGatherTest.TestGatherDNSStunServerHostname (8008 ms) 10:08:38 INFO - [ RUN ] IceGatherTest.TestGatherDNSStunBogusHostname

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Byron Campen [:bwc]

Comment 11

•

11 years ago

These seem to be network-related; for some reason the STUN server that these tests are hitting is being unreliable.

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Byron Campen [:bwc]

Updated

•

11 years ago

Depends on: 1045337

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Byron Campen [:bwc]

Comment 60

•

11 years ago

Looks like our stun server is being flaky again.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 61

•

11 years ago

This appears to have gone off the rails across all trees today. Disabled until someone can figure out what's wrong with this test on OSX and ensure that it's written in a way that meets visibility standards. https://hg.mozilla.org/mozilla-central/rev/d08c31df0c1a https://hg.mozilla.org/releases/mozilla-aurora/rev/9199063d2e16 https://hg.mozilla.org/releases/mozilla-beta/rev/0225b61c4f71 https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/f62cff5d1744 https://hg.mozilla.org/releases/mozilla-esr31/rev/1a78f56a046d https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/767991019ae3

status-b2g-v1.4: --- → disabled

status-b2g-v2.0: --- → disabled

status-b2g-v2.1: --- → disabled

status-b2g-v2.2: --- → disabled

status-firefox33: --- → disabled

status-firefox34: --- → disabled

status-firefox35: --- → disabled

status-firefox-esr31: --- → disabled

Whiteboard: [test disabled on OSX][leave open]

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Nils Ohlmeier [:drno]

Assignee

Comment 71

•

11 years ago

Attached patch bug_1037618_fix_ice_unitests_for_osx.patch — Details — Splinter Review

I think having these test being disabled on Mac OSX is pretty bad. The solution is not pretty, but hopefully we at least get the tests back this way.

Attachment #8499814 - Flags: review?(docfaraday)

Byron Campen [:bwc]

Comment 72

•

11 years ago

Comment on attachment 8499814 [details] [diff] [review] bug_1037618_fix_ice_unitests_for_osx.patch Review of attachment 8499814 [details] [diff] [review]: ----------------------------------------------------------------- *sadface* Maybe we should instead just use the test STUN server for all of this?

Byron Campen [:bwc]

Comment 73

•

11 years ago

Hmm, looking at some of the more recent failures, I see evidence that EXPECT_TRUE_WAIT is not waiting for the full duration. We should be waiting 7 seconds for gathering to complete, but we do not seem to be waiting even 1. Maybe there's some gtest bug on OS X?

Byron Campen [:bwc]

Comment 74

•

11 years ago

Ok, that EXPECT_TRUE_WAIT macro was something we wrote, based on PRIntervalTime. I do not see any obvious flaws in the implementation of the macro, have not evaluated the timer implementation stuff.

Byron Campen [:bwc]

Comment 75

•

11 years ago

Tried Nils' patch, still seeing problems on 10.6, even with the test stun server. Something is rotten on those 10.6 boxes.

Byron Campen [:bwc]

Comment 76

•

11 years ago

As far as I can tell, those 10.6 boxes have no network connectivity to the outside world at all.

Byron Campen [:bwc]

Comment 77

•

11 years ago

https://tbpl.mozilla.org/?tree=Try&rev=6b7edeb6baa4

Byron Campen [:bwc]

Comment 78

•

11 years ago

Comment on attachment 8499814 [details] [diff] [review] bug_1037618_fix_ice_unitests_for_osx.patch Review of attachment 8499814 [details] [diff] [review]: ----------------------------------------------------------------- Since this doesn't seem to help us any.

Attachment #8499814 - Flags: review?(docfaraday)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Maire Reavy [:mreavy]

Updated

•

10 years ago

Rank: 35

Flags: firefox-backlog+

Priority: -- → P3

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

:shell escalante

Updated

•

10 years ago

backlog: --- → webRTC+

Flags: firefox-backlog+

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Eric Rescorla (:ekr)

Comment 99

•

10 years ago

Nils, All of the recent tests here seem to be in the TCP tests.

Flags: needinfo?(drno)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Nils Ohlmeier [:drno]

Assignee

Comment 103

•

10 years ago

So for the test failure where TestSendReceiveTcp failed the problem seems to be that in these cases the TCP active connection gets advertised in the ICE candidate as coming from port 9. As we are obviously not sending from port 9 but some ephemeral number the receiving side marks that candidate as peer reflexive. Which then makes our tests unhappy, because they expect a host-to-host connection happening here.

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Assignee

Comment 104

•

10 years ago

I see two options for fixing the issue described in comment #103: 1) Open up the tests to accept peer reflexive, at least when TCP is used/expected 2) Teach nICEr that port 9 is actually never a port to expect traffic from, so don't change the candidate type if we notice the port mis-match. I'm not sure if option 2) is better or could cause collateral damage. Ekr, bwc any thought/opinions on that?

Flags: needinfo?(ekr)

Flags: needinfo?(docfaraday)

Nils Ohlmeier [:drno]

Assignee

Comment 105

•

10 years ago

Further more I see failures in TestGatherDNSStunServerHostnameTcp and TestGatherDNSStunServerHostnameTcp which I don't have answers for yet why they are failing. In both cases the 'tcptype so' seem to be missing, but no log message why. I'm going to read some code and see if I can come up with an explanation for that.

Eric Rescorla (:ekr)

Comment 106

•

10 years ago

I think for now, let's do #1

Flags: needinfo?(ekr)

Byron Campen [:bwc]

Comment 107

•

10 years ago

(In reply to Eric Rescorla (:ekr) from comment #106) > I think for now, let's do #1 Agreed. That candidate is definitely a peer reflexive according to the definition, and is also the candidate we expect to see in that test.

Flags: needinfo?(docfaraday)

Comment hidden (Legacy TBPL/Treeherder Robot)

Nils Ohlmeier [:drno]

Assignee

Comment 109

•

10 years ago

Attached file MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc — Details

Bug 1037618: relax candidate verification for TCP. r?bwc

Attachment #8646176 - Flags: review?(docfaraday)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Byron Campen [:bwc]

Comment 112

•

10 years ago

Comment on attachment 8646176 [details] MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc https://reviewboard.mozilla.org/r/15653/#review13997 ::: media/mtransport/test/ice_unittest.cpp:705 (Diff revision 1) > + if (expected_local_transport_==kNrIceTransportTcp && > + expected_remote_type_==NrIceCandidate::ICE_HOST) { Spaces around == ::: media/mtransport/test/ice_unittest.cpp:704 (Diff revision 1) > + /* Special case for ICE TCP using port 9 for its active candidates */ This comment is probably too cryptic. Maybe something like "Active remote ICE TCP candidates will be discovered as peer reflexive, since they are trickled with port 9 as a placeholder. So, allow peer reflexive here."

Attachment #8646176 - Flags: review?(docfaraday) → review+

Nils Ohlmeier [:drno]

Assignee

Comment 113

•

10 years ago

Comment on attachment 8646176 [details] MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc Bug 1037618: relax candidate verification for TCP. r=bwc

Attachment #8646176 - Attachment description: MozReview Request: Bug 1037618: relax candidate verification for TCP. r?bwc → MozReview Request: Bug 1037618: relax candidate verification for TCP. r=bwc

Nils Ohlmeier [:drno]

Assignee

Updated

•

10 years ago

Assignee: nobody → drno

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Nils Ohlmeier [:drno]

Assignee

Comment 117

•

10 years ago

Try run is green for attachment 8646176 [details]

Keywords: checkin-needed, leave-open

Comment 118

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/2bf4aa070e86

Keywords: checkin-needed

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 120

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/2bf4aa070e86

Nils Ohlmeier [:drno]

Assignee

Comment 121

•

10 years ago

(In reply to Treeherder Robot from comment #119) > log: > https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=4257884 > repository: fx-team > start_time: 2015-08-14T11:54:10 > who: rvandermeulen[at]mozilla[dot]com > machine: t-yosemite-r5-0017 > buildname: Rev5 MacOSX Yosemite 10.10 fx-team opt test cppunit > revision: a303d8119dc8 > > TEST-UNEXPECTED-FAIL | ice_unittest | test failed with return code 1 > Return code: 1 Ohoh looks like we are "loosing" candidate on the fly :-0 12:02:30 INFO - Candidate initialized: candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host 12:02:30 INFO - Candidate initialized: candidate:1 1 TCP 2105442559 10.26.56.83 56425 typ host tcptype passive 12:02:30 INFO - Candidate initialized: candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active 12:02:30 INFO - Candidate initialized: candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so 12:02:30 INFO - Gathering complete for P1 12:02:30 INFO - CANDIDATES: 12:02:30 INFO - Stream P1 12:02:30 INFO - candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host 12:02:30 INFO - candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active 12:02:30 INFO - candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so 12:02:30 INFO - Returning candidate: candidate:0 1 UDP 2130444543 10.26.56.83 61619 typ host 12:02:30 INFO - Returning candidate: candidate:1 1 TCP 2105458943 10.26.56.83 9 typ host tcptype active 12:02:30 INFO - Returning candidate: candidate:1 1 TCP 2105426175 10.26.56.83 64468 typ host tcptype so 12:02:30 INFO - /builds/slave/fx-team-m64-000000000000000000/build/src/media/mtransport/test/ice_unittest.cpp:1761: Failure 12:02:30 INFO - Value of: StreamHasMatchingCandidate(0, "tcptype passive") 12:02:30 INFO - Actual: false 12:02:30 INFO - Expected: true

Nils Ohlmeier [:drno]

Assignee

Comment 122

•

10 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=465dbae11b48

Nils Ohlmeier [:drno]

Assignee

Comment 123

•

10 years ago

Theory of what might be going on with missing or loosing candidates: for the tcpype active candidate we currently allocate a port number, but do not actually create a socket with/for that port number (sockets for active connection get allocated once a connection to a destination gets requested). Internally the port number is a random ephemeral number, and the number 9 we see in the test output only result from over-writing the port number before handing it off. What could be a problem here is that deduping might kill new candidates, e.g. tcptype so, or even removed previously created candidates like the missing tcptype passive. I'll look into improving our code base in that area, probably in a separate bug.

Nils Ohlmeier [:drno]

Assignee

Comment 124

•

10 years ago

(In reply to Nils Ohlmeier [:drno] from comment #123) > I'll look into improving our code base in that area, probably in a separate > bug. Created bug 1195396 to follow up on this.

Depends on: 1195396

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

10 years ago

status-b2g-master: --- → affected

status-firefox40: --- → wontfix

status-firefox41: --- → affected

status-firefox42: --- → affected

status-firefox43: --- → affected

status-firefox-esr38: --- → affected

Whiteboard: [test disabled on OSX][leave open]

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Intermittent Failures Robot)

Comment hidden (Intermittent Failures Robot)

Comment hidden (Intermittent Failures Robot)

Comment hidden (Intermittent Failures Robot)

Byron Campen [:bwc]

Comment 151

•

10 years ago

It looks like a DNS resolution is occasionally taking a long enough time to cause the test to time out. We could try extending the timeout and see if that helps. It could be argued that this test should not run in CI, since a DNS lookup relies on the external network.

Byron Campen [:bwc]

Comment 152

•

10 years ago

Attached file MozReview Request: Bug 1037618: Extend DNS timeout for bogus STUN server tests. (obsolete) — Details

Review commit: https://reviewboard.mozilla.org/r/31459/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/31459/

Randell Jesup [:jesup] (needinfo me)

Comment 153

•

10 years ago

(In reply to Byron Campen [:bwc] from comment #151) > It looks like a DNS resolution is occasionally taking a long enough time to > cause the test to time out. We could try extending the timeout and see if > that helps. It could be argued that this test should not run in CI, since a > DNS lookup relies on the external network. Why would it be looking up an external network address? The test machines have no access to an external network, so any access to DNS should be local. If it's timing out, that implies the entire system stalled for a while, it was chewed up by higher-priority threads, or something is wrong with DNS (local server).

Flags: needinfo?(docfaraday)

Nils Ohlmeier [:drno]

Assignee

Comment 154

•

10 years ago

Comment on attachment 8709504 [details] MozReview Request: Bug 1037618: Extend DNS timeout for bogus STUN server tests. https://reviewboard.mozilla.org/r/31459/#review28213 On a machine with working DNS resolution trying to resolve anything at .invalid resolve really fast. So I guess that means on machines were this takes long DNS resolution isn't working at all. So I'm wondering if a check for empty |g_stun_server_address| to disable these tests is more appropriate? I'll leave the decision up to you.

Attachment #8709504 - Flags: review+

Byron Campen [:bwc]

Comment 155

•

10 years ago

(In reply to Randell Jesup [:jesup] from comment #153) > (In reply to Byron Campen [:bwc] from comment #151) > > It looks like a DNS resolution is occasionally taking a long enough time to > > cause the test to time out. We could try extending the timeout and see if > > that helps. It could be argued that this test should not run in CI, since a > > DNS lookup relies on the external network. > > Why would it be looking up an external network address? The test machines > have no access to an external network, so any access to DNS should be local. > If it's timing out, that implies the entire system stalled for a while, it > was chewed up by higher-priority threads, or something is wrong with DNS > (local server). We're using a bogus FQDN (a .invalid address) in a test to make sure the code-path works. But if that doesn't try to hit the wire, it may be some other problem...

Flags: needinfo?(docfaraday)

Comment hidden (Intermittent Failures Robot)

Comment hidden (Intermittent Failures Robot)

Comment hidden (Intermittent Failures Robot)

Byron Campen [:bwc]

Comment 159

•

9 years ago

Six of these seem to be due to failure to form a TCP connection in a reasonable amount of time; a similar problem as seen in bug 1201209 recently. http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456350708/fx-team_yosemite_r7_test-cppunit-bm106-tests1-macosx-build452.txt.gz http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456440145/fx-team_yosemite_r7_test-cppunit-bm107-tests1-macosx-build366.txt.gz http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1456590806/mozilla-inbound_snowleopard_test-cppunit-bm108-tests1-macosx-build2266.txt.gz https://public-artifacts.taskcluster.net/FVDnFk4_Sf-DqktJ1MojIA/0/public/logs/live_backing.log https://public-artifacts.taskcluster.net/Zj2ME_8UT06G3LsKjWvHsg/0/public/logs/live_backing.log https://public-artifacts.taskcluster.net/ImdA3S28SoSL_hrZ-NLKOg/0/public/logs/live_backing.log Two seem to be caused by TestStunServer::SetDropInitialPackets not working as expected. http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-linux64-asan/1456393484/mozilla-inbound_ubuntu64-asan_vm_test-cppunit-bm120-tests1-linux64-build59.txt.gz http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64-debug/1456242790/mozilla-central_ubuntu64_vm-debug_test-cppunit-bm67-tests1-linux64-build1.txt.gz Two have two test-cases that look up an invalid FQDN, which both fail. http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/1456520126/fx-team_ubuntu64_vm_test-cppunit-bm120-tests1-linux64-build5.txt.gz http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/1456525679/fx-team_ubuntu64_vm_test-cppunit-bm116-tests1-linux64-build11.txt.gz Two are due to a bug in the test code where duplicate candidates will cause some checking code to be sad. http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-macosx64/1456325925/fx-team_snowleopard_test-cppunit-bm107-tests1-macosx-build1085.txt.gz http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1456241709/mozilla-inbound_snowleopard_test-cppunit-bm108-tests1-macosx-build2133.txt.gz

Byron Campen [:bwc]

Updated

•

9 years ago

Depends on: 1248637

Byron Campen [:bwc]

Updated

•

9 years ago

Depends on: 1252516

Byron Campen [:bwc]

Comment 160

•

9 years ago

(In reply to Byron Campen [:bwc] from comment #159) > > Two have two test-cases that look up an invalid FQDN, which both fail. > > http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/ > 1456520126/fx-team_ubuntu64_vm_test-cppunit-bm120-tests1-linux64-build5.txt. > gz > http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-linux64/ > 1456525679/fx-team_ubuntu64_vm_test-cppunit-bm116-tests1-linux64-build11.txt. > gz > Both of these failures occurred on linux64, less than a minute apart, in exactly the same way (DNS resolution for .invalid seems to hang). I suspect this is infra-related.

Byron Campen [:bwc]

Comment 161

•

9 years ago

Attached file MozReview Request: Bug 1037618 - Part 1: Add some logging to highlight TCP connection failures. r?drno — Details

Review commit: https://reviewboard.mozilla.org/r/37445/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/37445/

Attachment #8725365 - Flags: review?(drno)

Byron Campen [:bwc]

Updated

•

9 years ago

Attachment #8709504 - Attachment is obsolete: true

Byron Campen [:bwc]

Comment 162

•

9 years ago

(In reply to Byron Campen [:bwc] from comment #159) > > Two seem to be caused by TestStunServer::SetDropInitialPackets not working > as expected. > > http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound- > linux64-asan/1456393484/mozilla-inbound_ubuntu64-asan_vm_test-cppunit-bm120- > tests1-linux64-build59.txt.gz > http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central- > linux64-debug/1456242790/mozilla-central_ubuntu64_vm-debug_test-cppunit-bm67- > tests1-linux64-build1.txt.gz This is happening because a UDP port is being reused from an earlier test, and the test STUN server still has a packet count for that address which is large enough to exceed the packet drop count. We need to clear out these counts before running these tests.

Byron Campen [:bwc]

Updated

•

9 years ago

Depends on: 1252585

Nils Ohlmeier [:drno]

Assignee

Comment 163

•

9 years ago

Comment on attachment 8725365 [details] MozReview Request: Bug 1037618 - Part 1: Add some logging to highlight TCP connection failures. r?drno https://reviewboard.mozilla.org/r/37445/#review34215 ::: media/mtransport/third_party/nICEr/src/net/nr_socket_multi_tcp.c:591 (Diff revision 1) > + r_log(LOG_ICE,LOG_WARNING,"%s:%d %s failed to accept new TCP connection: %d",__FILE__,__LINE__,__FUNCTION__,_status); Might be helpful to include |remote_addr.as_string| here as well if it is set already.

Attachment #8725365 - Flags: review?(drno) → review+

Nils Ohlmeier [:drno]

Assignee

Comment 164

•

9 years ago

Byron FYI on the TCP connection failures: I found the other day that not surprisingly all of the nICEr is not taking into account that establishing connections can take time. So in case of HTTP proxies connect() gets called, pretty soon followed by a call to write(). The write() returns R_WOULDBLOCK, which then results in the socket getting armed for write status. The OS then immediately returns that the socket in fact is writeable. Which then causes another call to write(). And then we go into something like a busy loop, until the proxy code either connects or gets failure. I'm not sure if we have similar problems in general with TCP connections. But as we don't have a callback for when connections get established it seems plausible that also our TCP code in general might run into corner cases like this, where a busy loop on a small try machine might stall the test indefinitely.

Byron Campen [:bwc]

Comment 165

•

9 years ago

https://reviewboard.mozilla.org/r/37445/#review34215 > Might be helpful to include |remote_addr.as_string| here as well if it is set already. I thought about that, we'd either need to zero out |remote_addr| up top, or have another variable that tracks whether it has been filled in. As it stands, if this fails I'm primarily interested in the error, not so much who is the lucky winner.

Comment 166

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/73f8e7317425

Carsten Book [:Tomcat]

Comment 167

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/73f8e7317425

Bulk Bug Changes for mreavy's org

Comment 168

•

8 years ago

Mass change P3->P4 to align with new Mozilla triage process.

Priority: P3 → P4

Byron Campen [:bwc]

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME

Firefox Bug Husbandry Bot

Comment 169

•

7 years ago

Removing leave-open keyword from resolved bugs, per :sylvestre.

Keywords: leave-open

You need to log in before you can comment on or make changes to this bug.