mochitest-media chunk 2 fails consistently on taskcluster windows 10

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
RESOLVED FIXED
a year ago
2 months ago

People

(Reporter: grenade, Assigned: mjf)

Tracking

(Depends on: 1 bug)

unspecified
x86_64
Windows 10
Dependency tree / graph

Firefox Tracking Flags

(firefox58 fixed)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(3 attachments)

(Reporter)

Description

a year ago
mochitest-media fails on windows 10 with:

dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | Test timed out.

for chunk 2 on both opt and debug

This is tracked as an intermittent on buildbot tests under bug 1318119

see: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c7c0c515f27da8b910435363cb8e9ad54286ce4&group_state=expanded&exclusion_profile=false&filter-searchStr=mda
(Reporter)

Updated

a year ago
Summary: mochitest-devtools-chrome (e10s) fails on taskcluster windows 10 → mochitest-media chunk 2 fails consistently on taskcluster windows 10
I think the failure here is different than bug 1318119.  in bug 1318119 there are crashes, here we just timeout, and I see this in the log[0]:
3:24:12     INFO -  403 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | pcRemote: SDP contains non-zero IP c line
23:24:12     INFO -  404 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | SDP MLine Index needs to exist
23:24:12     INFO -  Buffered messages logged at 23:18:48
23:24:12     INFO -  405 INFO pcRemote: iceCandidate = {"candidate":"candidate:9 1 UDP 8331263 172.31.35.36 64577 typ relay raddr 172.31.35.36 rport 64577","sdpMid":"sdparta_0","sdpMLineIndex":0}
23:24:12     INFO -  406 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | ICE candidate contains candidate
23:24:12     INFO -  407 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | SDP mid not empty
23:24:12     INFO -  408 INFO CLINE-NO-DEFAULT-ADDR-SDP: "audio 64576 UDP/TLS/RTP/SAVPF 109 101\r\nc=IN IP4 172.31.35.36\r\na=candidate:0 1 UDP 2122252543 172.31.35.36 64571 typ host\r\na=candidate:4 1 UDP 2122187007 2001:0:4137:9e76:386d:742:dd30:a52e 64572 typ host\r\na=candidate:8 1 TCP 2105524479 172.31.35.36 9 typ host tcptype active\r\na=candidate:10 1 TCP 2105458943 2001:0:4137:9e76:386d:742:dd30:a52e 9 typ host tcptype active\r\na=candidate:1 1 UDP 1686052863 172.31.35.36 64575 typ srflx raddr 172.31.35.36 rport 64571\r\na=candidate:3 1 UDP 92216831 172.31.35.36 64576 typ relay raddr 172.31.35.36 rport 64576\r\na=candidate:9 1 UDP 8331263 172.31.35.36 64577 typ relay raddr 172.31.35.36 rport 64577\r\na=sendrecv\r\na=fmtp:109 maxplaybackrate=48000;stereo=1;useinbandfec=1\r\na=fmtp:101 0-15\r\na=ice-pwd:7f8d8c7a851c5a2aff2ab2aa99d78a8b\r\na=ice-ufrag:d6e1e6be\r\na=mid:sdparta_0\r\na=msid:{54eea385-dc0a-43c2-ad34-a7cc3541c23c} {a302d28a-e945-49d3-8f16-69ba7682e6e3}\r\na=rtcp-mux\r\na=rtpmap:109 opus/48000/2\r\na=rtpmap:101 telephone-event/8000\r\na=setup:active\r\na=ssrc:3822901440 cname:{c671d3e7-ce32-4a7e-8319-fc8f8a84a1ef}\r\n"
23:24:12     INFO -  409 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | pcRemote: SDP contains non-zero IP c line
23:24:12     INFO -  410 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | SDP MLine Index needs to exist
23:24:12     INFO -  Buffered messages logged at 23:20:30
23:24:12     INFO -  411 INFO ICE Server: Lost connection from IPv4Address(TCP, '172.31.35.36', 52063)
23:24:12     INFO -  412 INFO ICE Server: Closing allocation due to dropped connection: ('172.31.35.36', 52063, 'TCP', '172.31.35.36', 3478)
23:24:12     INFO -  Buffered messages finished
23:24:12    ERROR -  413 INFO TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | Test timed out.
23:24:12     INFO -      reportError@SimpleTest/TestRunner.js:121:7
23:24:12     INFO -      TestRunner._checkForHangs@SimpleTest/TestRunner.js:142:7
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
23:24:12     INFO -      TestRunner.runTests@SimpleTest/TestRunner.js:380:5
23:24:12     INFO -      RunSet.runtests@SimpleTest/setup.js:194:3
23:24:12     INFO -      RunSet.runall@SimpleTest/setup.js:173:5
23:24:12     INFO -      hookupTests@SimpleTest/setup.js:266:5
23:24:12     INFO -  parseTestManifest@http://mochi.test:8888/manifestLibrary.js:36:5
23:24:12     INFO -  getTestManifest/req.onload@http://mochi.test:8888/manifestLibrary.js:49:11
23:24:12     INFO -  EventHandlerNonNull*getTestManifest@http://mochi.test:8888/manifestLibrary.js:45:3
23:24:12     INFO -      hookup@SimpleTest/setup.js:246:5
23:24:12     INFO -  EventHandlerNonNull*@http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&manifestFile=tests.json&dumpOutputDirectory=c%3A%5Cusers%5Cgenericworker%5Cappdata%5Clocal%5Ctemp&cleanupCrashes=true:11:1
23:24:12     INFO -  GECKO(3288) | [Unnamed thread 000002BEAB8FE400]: E/signaling [|WebrtcAudioSessionConduit] AudioConduit.cpp:744: A/V sync: GetAVStats failed
23:24:13     INFO -  GECKO(3288) | MEMORY STAT | vsize 1482MB | vsizeMaxContiguous 131300762MB | residentFast 119MB | heapAllocated 20MB
23:24:13     INFO -  414 INFO TEST-OK | dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html | took 326875ms


in addition, I also see dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html  timing out.  this is the data I am looking at:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c7c0c515f27da8b910435363cb8e9ad54286ce4&filter-searchStr=mochitest-media&exclusion_profile=false

:drno, can you take a look at this- this is getting testing on windows 10, so this is a new platform.  Maybe we need to adjust expectations, or install tools differently?


[0] https://public-artifacts.taskcluster.net/TG7f5GJASSupyXUwr4fBRg/0/public/logs/live_backing.log
Flags: needinfo?(drno)
Hmm I'm guessing this is the most interesting part of the log file:

23:14:23     INFO -  GECKO(3676) | (ice/WARNING) z:/build/build/src/media/mtransport/third_party/nICEr/src/net/nr_socket_multi_tcp.c:591 nr_tcp_multi_lsocket_readable_cb failed to accept new TCP connection: 13
23:14:23     INFO -  GECKO(3676) | (generic/INFO) Invoking writable_cb on connected (98)
23:14:23     INFO -  GECKO(3676) | (generic/INFO) Writable_cb IP4:172.31.38.178:60897/TCP (0 (0000026B36ACFDF4) pending)
23:14:23     INFO -  GECKO(3676) | (generic/EMERG) Error in read
23:14:23     INFO -  GECKO(3676) | (ice/ERR) ICE(PC:1493853262864000 (id=184 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html): Error 13 on reliable socket(0000026B36ACCC00). Abandoning.

Looks to me like something prevents Firefox from accepting new incoming connections. Could this be a firewall issue on the test machines?

@Byron: any additional thoughts from you?
Flags: needinfo?(drno) → needinfo?(docfaraday)
(Reporter)

Updated

a year ago
Blocks: 1359760
(In reply to Rob Thijssen (:grenade - EEST) from comment #3)
> just a note that the instance Windows firewall is disabled completely:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/
> Manifest/gecko-t-win10-64.json#L1090-L1102

Thanks. That helps.

I guess the next best thing I can do at this point is to try to run this on my Win10 laptop and see if I can reproduce it.
(Reporter)

Comment 5

a year ago
no worries Nils, let me know if you need credentials for an ec2 windows 10 box for rdp'ng to one of the test instances.
So locally I can't even get the ICE server to start and the tests fail with clear error message that the ICE server is gone.

FYI all the tests *NATRelay*.html depend on piece of Python server running. My current suspicion is that this is causing the problems here.
I should added that in the test logs the ICE appears to be up and running. So I currently face a different problem.
Finally able to repro locally, which only got me to an error message which is present in the test logs already:

23:17:16     INFO -  GECKO(1956) | (generic/EMERG) Error in recvfrom: -5961

I'll have to investigate further what that error means and why we get it.
Flags: needinfo?(docfaraday)
Haven't found the root cause yet. But one thing is more clear now: the test hangs because it waits for the end-of-trickle event from the PC_REMOTE PeerConnection.
Turning off IPv6 removed the error about failing to bind to a certain port. But I still get these recv errors.

I though the python based iceserver might be terminating, but I ran it standalone and it kept running through the test without errors.
One interesting observation is that when I removed the iceserver startup from test, but did not have the test actually connect to any STUN/TURN server the test got stuck while waiting for ICE to connect, which means the end-of-candidates event popped in that case.

I'm wondering if disabling ICE TCP will change anything. And I'm starting to wonder if we have any Windows specific limit of open file descriptors/socket (for example _setmaxstdio https://msdn.microsoft.com/en-us/library/6e3b887c.aspx )
(Reporter)

Comment 12

a year ago
pmoore: comment 11 looks interesting.
I think I have identified the root cause: apparently Win 10 does not like if we try to send packet between link-local and non-link-local IPv6 IPs.

Coding up a patch right now to confirm...
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Rob, judging by your initial try build it looks like running tests on Win 10 requires patches. Can you verify that my patch fixes the problems you are seeing?
Flags: needinfo?(rthijssen)
(Reporter)

Comment 18

a year ago
gladly Nils. thank you for the patches!

here's a try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a8bea9c99863eb1bfbab625a9c82259f5fe9cc05&exclusion_profile=false
note the `&exclusion_profile=false` above, as most of the failing tests are currently invisible by default

the only change required to enable a test suite *should* be something like this in tests.yml:
https://hg.mozilla.org/try/rev/a8bea9c99863eb1bfbab625a9c82259f5fe9cc05
Flags: needinfo?(rthijssen)
(Reporter)

Comment 19

a year ago
hmm, we got some pretty inconsistent results in that run. i spoke to pmoore and we'd like to test a theory that it's related to bug 1362070, so trying again:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe3d81d3d85003e511cb9f86c8a528e0fc614fdf&exclusion_profile=false
(In reply to Rob Thijssen (:grenade - EEST) from comment #19)
> hmm, we got some pretty inconsistent results in that run. i spoke to pmoore
> and we'd like to test a theory that it's related to bug 1362070, so trying
> again:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=fe3d81d3d85003e511cb9f86c8a528e0fc614fdf&exclusion_pro
> file=false

So this try run https://treeherder.mozilla.org/logviewer.html#?job_id=97652958&repo=try&lineNumber=5330
from that run above encountered an error I have never seen before, which I can only explain by failing to load this file:
http://searchfox.org/mozilla-central/source/dom/media/tests/mochitest/network.js

While looking at the ICE error one more time I was to my surprise not able to repro the bug I have fixed any more, but instead the same test failed with a completely different problem which we have seen in other intermittents as well:

http://searchfox.org/mozilla-central/source/dom/media/test/manifest.js#460

was complaining about some missing files. Our tests don't actually use these files. And the unexpected-failures were actually reported before our webrtc tests started at all.

So to me it looks like maybe the web server used for mochitests is working unreliable on Windows in some cases... or something along these lines?!

Comment 21

a year ago
mozreview-review
Comment on attachment 8865564 [details]
Bug 1361894: ignore STUN/TURN servers with mis-matching link local addr.

https://reviewboard.mozilla.org/r/137188/#review141572

So we are failing to check for IP version mismatch in a bunch of these cases; we probably want to have a single function that sanity-checks a srflx/relay candidate that we can call in these places.
Attachment #8865564 - Flags: review?(docfaraday)

Comment 22

a year ago
mozreview-review
Comment on attachment 8865609 [details]
Bug 1361894: add pypiwin32 to requ to allow local testing.

https://reviewboard.mozilla.org/r/137232/#review141580

Was I supposed to review this, or was it just a one-off thing for testing this bug? We probably cannot put a platform-specific python lib in this requirements file.
Attachment #8865609 - Flags: review?(docfaraday)
(In reply to Byron Campen [:bwc] from comment #22)
> Comment on attachment 8865609 [details]
> Bug 1361894: add pypiwin32 to requ to allow local testing.
> 
> https://reviewboard.mozilla.org/r/137232/#review141580
> 
> Was I supposed to review this, or was it just a one-off thing for testing
> this bug? We probably cannot put a platform-specific python lib in this
> requirements file.

I was not able to start the python ICE server on Windows without this fix here. I assume the Win based test machines pull in pypiwin32 somehow else.
But it is a good point that we can't put platform specific stuff into that file.
the permafailing mochitest-media tests on windows-10-vm are:
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html
dom/media/mediasource/test/test_WaitingToEndedTransition_mp4.html


I am not sure if the work in this bug would help these 3 test files, or if old issues are fixed/new issues.

Ideally we could skip these tests on win10, run these [tests] on hardware (instead of vm), or get a fix in place.  I see that :drno is on pto for the next week, :grenade, do you have further thoughts here?
Flags: needinfo?(rthijssen)
(Reporter)

Comment 25

11 months ago
pypiwin32 is supposed to be available to the tests on windows due to to this bit of config (some suites do make use of this):

https://dxr.mozilla.org/mozilla-central/rev/5322c03f4c8587fe526172d3f87160031faa6d75/testing/mozharness/configs/unittests/win_taskcluster_unittest.py#28

perhaps the test suite in question doesn't implement an install of virtualenv_modules from the config?
Flags: needinfo?(rthijssen)
Given a somewhat recent try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ef8ce37d3038d864928c1ff91f5ba982f2c88b4&filter-searchStr=mochitest-media

I compiled the list of 3 failing tests in comment 24.  I am not sure if the same failures are occurring (my attempt at figuring out if we have a ice server startup issue are not successful), or if these are new errors.

:drno, do you have any tips for me?
Flags: needinfo?(drno)

Comment 27

11 months ago
I'm pretty sure the other bug I mentioned in comment #20 is/was caused by bug 1242051. I fail to find the recent intermittent so I'll leave that for later.

Sorry for not explaining this better. But the problem is not that the ICE server is not starting. You don't need to spend time on investigating that. The problem is that Win 10 prevents Firefox from talking successfully to the ICE server over IPv6.

After Byron's comment regarding my patch in attachment 8865564 [details] I'll have to re-work that. If I remember right that should fix this problem.

Sorry for dropping the ball on this.
Thanks for the clarification here- I will wait until you get some cycles to finish up the patches.  Do you think you can work on this in the next month or so?
Blocks: 1403523
I tried applying this patch and pushing to windows 10 vm- I see see the failures mentioned in bug 1318119:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=806866832a3c170651f88d21c12cedd5f9826748

Updated

10 months ago
Flags: needinfo?(drno)

Comment 30

10 months ago
Michal this is the bug we were just talking about. Would appreciate it if you could refactor my patch, so we can land it.
Flags: needinfo?(mfroman)
(Assignee)

Comment 31

10 months ago
I'll get started on it.
Flags: needinfo?(mfroman)
(Assignee)

Updated

10 months ago
Assignee: nobody → mfroman
Priority: -- → P2
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 34

10 months ago
mozreview-review
Comment on attachment 8913816 [details]
Bug 1361894 - ignore STUN/TURN servers with mis-matched link local addr.

https://reviewboard.mozilla.org/r/185192/#review190232

This looks fine, although maybe we should name the function something like nr_transport_addr_check_compatibility, and also check transport protocol (we would still want to check transport proto separately in nr_ice_component_initialize_tcp/udp since we don't necessarily have an address yet). Maybe in the future we could also compare things like localhost vs non-localhost. Just make sure _all_ of the bases are covered, and if we come up with new compat checks in the future we'll have an easy place to add them.
Attachment #8913816 - Flags: review?(docfaraday) → review+
(Assignee)

Comment 35

10 months ago
Here is a new try run on the taskcluster win10 instances:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dbeb4c147fe9fa2634555956b9452593aa33ae0
Comment hidden (mozreview-request)
(Assignee)

Comment 37

10 months ago
mozreview-review-reply
Comment on attachment 8913816 [details]
Bug 1361894 - ignore STUN/TURN servers with mis-matched link local addr.

https://reviewboard.mozilla.org/r/185192/#review190232

When rebasing to land, I went ahead and made the name change to the function and added the protocol check.
(Assignee)

Comment 38

10 months ago
Here is an updated try run with the modified function name and transport checks:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=125616989413a520ec2ed284a487232d783d3e5d

Comment 39

10 months ago
Pushed by mfroman@nostrum.com:
https://hg.mozilla.org/integration/autoland/rev/a6ffff955547
ignore STUN/TURN servers with mis-matched link local addr. r=bwc

Comment 40

10 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a6ffff955547
Status: NEW → RESOLVED
Last Resolved: 10 months ago
status-firefox58: --- → fixed
Resolution: --- → FIXED
Depends on: 1434531
Component: General Automation → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.