Open Bug 1180388 Opened 9 years ago Updated 2 years ago

test_peerConnection_iceFailure.html is disabled on all platforms because of test timeouts

Categories

(Core :: WebRTC, defect, P4)

42 Branch
defect

Tracking

()

People

(Reporter: kaustabh93, Unassigned)

References

Details

Attachments

(1 file)

We would like to have --run-by-dir become more reliable, i.e. fewer oranges and more greens, and test_peerConnection_iceFailure.html is producing oranges on all platforms. Here are the try results: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1711e6339ee8
An excerpt from the raw log:
11:41:04 INFO - 4613 INFO TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_iceFailure.html | Test timed out. - expected PASS
11:41:04 INFO - -1929893056[9c5e68c0]: [|WebrtcAudioSessionConduit] AudioConduit.cpp:666: A/V sync: GetAVStats failed
11:41:04 INFO - -1929893056[9c5e68c0]: [|WebrtcAudioSessionConduit] AudioConduit.cpp:666: A/V sync: GetAVStats failed
11:41:04 INFO - MEMORY STAT | vsize 1147MB | residentFast 163MB | heapAllocated 92MB
11:41:04 INFO - 4614 INFO TEST-OK | dom/media/tests/mochitest/test_peerConnection_iceFailure.html | took 308454ms
11:41:04 INFO - -1220557056[b715f180]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2353: CloseInt: Closing PeerConnectionImpl b23f42fb5681bf3e; ending call
11:41:04 INFO - -1220557056[b715f180]: [1435948556927325 (id=232 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_peerConnection_iceFailure.html)]: stable -> closed
Blocks: 1162003
Flags: needinfo?(mshal)
:juanb, I see you wrote this test not too long ago. Can you help us figure out why it times out when run as a standalone directory instead of as part of a full run or a larger chunk of tests? Most likely it depends on something which isn't there, isn't initialized, or a pref that isn't set.
Flags: needinfo?(mshal) → needinfo?(jbecerra)
I'm looking into this. I will chat with Nils and Byron to see how to fix this.
Flags: needinfo?(jbecerra)
Not sure what is going on here, but I think it would help if we added this line: SimpleTest.requestCompleteLog(); to the test_peerConnection_iceFailure.html test case and re-ran. This should give us full mochitest logs, and the mochitest log lines should get more or less aligned with the ICE log messages.
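For illustration, the change would be a single extra line near the top of the test's script, roughly like this (a sketch only, not the literal contents of the file):

    // Ask the mochitest harness to keep the complete log for this run
    // instead of the usual truncated output.
    SimpleTest.requestCompleteLog();
    // ... existing test body unchanged ...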
Status: UNCONFIRMED → NEW
Ever confirmed: true
Juan, I'm able to reproduce the problem by executing:
./mach mochitest --run-by-dir dom/media/
But it works fine if I execute:
./mach mochitest dom/media/tests/mochitest/test_peerConnection_iceFailure.html
Glad to hear this reproduces - it sounds more like an odd case of a previous test doing something which causes this one to fail.
(In reply to juan becerra [:juanb] from comment #2)
> I'm looking into this. I will chat with Nils and Byron to see how to fix
> this.

Thanks for looking into this. I am checking in to see if we're close to fixing this. Please let me know if I can help in any way.
Hi, any updates on this?
Flags: needinfo?(jbecerra)
Flags: needinfo?(drno)
This had fallen off my radar, but I'll look at it today.
Flags: needinfo?(jbecerra)
(In reply to juan becerra [:juanb] from comment #8)
> This had fallen off my radar, but I'll look at it today.

Thank you. It will be really great to have this fixed.
Hi Juan, any news on this? Here's the recent state of try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aeae0c94a595 I hope this helps.
Flags: needinfo?(jbecerra)
Unfortunately Juan no longer has the capacity to investigate this. I'll try to look into this more deeply sometime this week.
Flags: needinfo?(jbecerra)
Hi Nils, any updates on this? Please let me know if I can help out in any way.
Thanks for the reminder, Kaustabh. Interestingly, I'm no longer able to reproduce this problem locally on my Mac with the latest mozilla-central code. Could it be that the problem solved itself? Do you still get failures when running it on treeherder? I think I remember that when I was able to reproduce the problem locally back in July, a new Firefox would start (and close) for each subdirectory mochitest executed tests for. This no longer seems to be the case. The WebRTC tests are executed in the same browser instance as the preceding /dom/media/test tests. Is that behavior (new, or old from July) intended? Maybe that is/was what is/was causing the problem?
Flags: needinfo?(drno)
:drno, the change is to get --run-by-dir going for all mochitests; in this case it fails with --run-by-dir (a fresh browser session for each directory), but passes when all the directories are run in the same browser session. To fix this just run the test:
./mach mochitest dom/media/tests/mochitest
It could be that a few other things have changed.
(In reply to Joel Maher (:jmaher) from comment #14)
> :drno, the change is to get --run-by-dir going for all mochitests; in this
> case it fails with --run-by-dir (a fresh browser session for each
> directory), but passes when all the directories are run in the same browser
> session.

Yes, I understand what this ticket is about. I just took the latest code from mozilla-central, compiled it after a clobber and executed this command:
./mach mochitest --run-by-dir dom/media/
And I was surprised that I did NOT see the browser closing and re-opening. I did see that happening back in July when I reported in comment #4 that I was able to reproduce this problem. I don't know what has changed, but for some reason that command no longer opens and closes Firefox for each directory for me.

> To fix this just run the test:
> ./mach mochitest dom/media/tests/mochitest

I run this command pretty regularly and it is not causing problems. For me the problem was only reproducible with '--run-by-dir'.
I would have really appreciated it if you had let us know that you apparently disabled our test completely: https://hg.mozilla.org/try/rev/1306c43f4996#l2.13 From reading this bug, I was under the impression that you were still trying to get --run-by-dir working and that this was a blocker for getting it working.
Summary: test_peerConnection_iceFailure.html fails on Windows, Linux, Mac on opt as well as debug with --run-by-dir enabled → test_peerConnection_iceFailure.html is disabled on all platforms because of test timeouts
I still cannot reproduce the problem locally. Could someone please verify locally on their machine that running:
./mach mochitest --run-by-dir dom/media/tests
actually re-opens Firefox three times: one Fx for the identity subdir, one Fx for the ipc subdir, and one for the mochitest directory itself? Because for me, on my Mac and my Linux machine, this is not happening. I only see Fx closing and re-starting when it switches between chrome tests and mochitest-plain for dom/media. Although I agree that the log files from try look like it is actually opening and closing browser instances for each sub-directory. I sincerely hope this is not yet another instance where our test framework behaves differently locally vs. on the try server. I compared the log files from your last try run with log files from a working local run, but that did not really tell me what the problem could be. Therefore I created another try run with an increased log level: https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ad3e88b37b
Assignee: nobody → drno
Apparently my first attempt at increasing the ICE log levels was not successful. Next try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=52433b75d6d6
So after lots of digging and quite a few try runs with more and more debug log messages, I finally found the problem:

The first WebRTC test case which gets executed sets the nICEr registry via the user pref settings in head.js, specifically the values 'media.peerconnection.ice.stun_client_maximum_transmits' and 'media.peerconnection.ice.trickle_grace_period'. Unfortunately that means any subsequent attempts to modify any of these values via SpecialPowers.pushPrefEnv() are useless, as the initial value in the nICEr registry does not get overwritten. And that is exactly what the iceFailure test case here tries to do.

I guess I see three possible solutions right now:
A) ditch this test case altogether
B) move the test case into its own subdirectory, so that a new Fx instance gets started with a fresh nICEr registry (and find a way to have only one call to pushPrefEnv() with the right value for the test)
C) find a way to reset the nICEr registry like we do in the ICE unit test and expose/access that via SpecialPowers

Byron, any thoughts on this?
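To sketch the sequence (the pref names are the real ones; the 14/30000 pair is what head.js pushes per comment 26 below, while the shorter values and the callback names here are only illustrative placeholders):

    // First WebRTC test in the directory: head.js pushes the relaxed ICE
    // prefs, and nICEr copies them into its registry when it starts up.
    SpecialPowers.pushPrefEnv({set: [
      ["media.peerconnection.ice.stun_client_maximum_transmits", 14],
      ["media.peerconnection.ice.trickle_grace_period", 30000],
    ]}, runFirstTest);

    // Later, test_peerConnection_iceFailure.html tries to shorten the timers:
    SpecialPowers.pushPrefEnv({set: [
      ["media.peerconnection.ice.stun_client_maximum_transmits", 3],
      ["media.peerconnection.ice.trickle_grace_period", 3000],
    ]}, runIceFailureTest);
    // The pref values do change, but the nICEr registry in this Firefox
    // instance was already seeded with the first set and is never re-read,
    // so the shorter timers are silently ignored.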
Flags: needinfo?(docfaraday)
It looks like we do SpecialPowers.pushPrefEnv in head.js: https://dxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/head.js#235

I don't understand why this fails to allow us to set the pref. Is it because we always run setup_environment prior to the test starting, so the test case can then not set the prefs it needs?

My confusion is that future test cases shouldn't be having issues with prefs from previous test cases unless we are not pushing via pushPrefEnv, or there is a bug in it or in Firefox. Maybe this is a special circumstance. Either way, good find!
FYI, locally this behaves differently because, at least for me, the identity test cases from the identity subdir get executed first and set stun_client_maximum_transmits to 7, which probably keeps the ICE timeout just short enough for this test case to pass, and then all tests run with 7 as the value.
(In reply to Joel Maher (:jmaher) from comment #21)
> It looks like we do SpecialPowers.pushPrefEnv in head.js:
> https://dxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/head.js#235
>
> I don't understand why this fails to allow us to set the pref. Is it
> because we always run setup_environment prior to the test starting, so the
> test case can then not set the prefs it needs?
>
> My confusion is that future test cases shouldn't be having issues with
> prefs from previous test cases unless we are not pushing via pushPrefEnv,
> or there is a bug in it or in Firefox. Maybe this is a special
> circumstance.

This is kind of a special case: nICEr is a protocol stack, like your HTTP stack. But nICEr was not written to allow changing parameters in the protocol stack while it is running. So basically, user prefs which modify settings for nICEr require a restart of Firefox to be sure that they get applied. Only the two user prefs mentioned above are affected by this. All the other user prefs set in head.js should be fine to modify whenever you want/need.
Thanks for the explanation on this!
(In reply to Nils Ohlmeier [:drno] from comment #20)
> So after lots of digging and quite a few try runs with more and more debug
> log messages, I finally found the problem:
>
> The first WebRTC test case which gets executed sets the nICEr registry via
> the user pref settings in head.js, specifically the values
> 'media.peerconnection.ice.stun_client_maximum_transmits' and
> 'media.peerconnection.ice.trickle_grace_period'. Unfortunately that means
> any subsequent attempts to modify any of these values via
> SpecialPowers.pushPrefEnv() are useless, as the initial value in the nICEr
> registry does not get overwritten. And that is exactly what the iceFailure
> test case here tries to do.
>
> I guess I see three possible solutions right now:
> A) ditch this test case altogether
> B) move the test case into its own subdirectory, so that a new Fx instance
> gets started with a fresh nICEr registry (and find a way to have only one
> call to pushPrefEnv() with the right value for the test)
> C) find a way to reset the nICEr registry like we do in the ICE unit test
> and expose/access that via SpecialPowers
>
> Byron, any thoughts on this?

It kinda sucks that we have a global config store for this, but making the nr_registry local to the ICE ctx would be kinda involved, since we'd need to clean up bunches of one-time-only leaks. I'm also surprised that the test completely times out if the pref isn't set; a normal ICE timeout takes a while, but not the hundreds of seconds I'm seeing here...
Flags: needinfo?(drno)
(In reply to Byron Campen [:bwc] from comment #25)
> I'm also surprised that the test completely times out if the pref isn't
> set; a normal ICE timeout takes a while, but not the hundreds of seconds
> I'm seeing here...

To clarify: in the new (current) world on TBPL with --run-by-dir set, the identity tests run by themselves, then another instance of Fx runs the ipc test, and then another Fx starts up and executes just the tests in our mochitest directory. So the first datachannel test calls pushPrefEnv() from head.js and thus sets stun_client_maximum_transmits to 14 and trickle_grace_period to 30000. And these two values are then used for all of our mochitests (except identity and ipc).

Now 100ms * 2^14 ~= 27min, which is way beyond the 5min timeout limit of the mochitest framework.

If I recall correctly, the insanely high values for nICEr are meant to make our tests more likely to succeed on B2G, right? Maybe option D) is to make the nICEr values platform dependent? I usually dislike platform-dependent code, but losing test coverage because of one lousy test platform stinks even more to me.
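Spelling out that arithmetic (a back-of-the-envelope sketch that assumes a retransmit timer doubling from 100ms, which this bug implies but does not spell out):

    // Worst-case wait before nICEr gives up vs. the mochitest harness limit.
    const initialMs = 100;
    const maxTransmits = 14;      // stun_client_maximum_transmits
    const worstCaseMs = initialMs * Math.pow(2, maxTransmits); // 1,638,400 ms
    console.log(worstCaseMs / 60000); // ~27.3 minutes
    console.log(300000 / 60000);      // mochitest timeout: ~5 minutes
                                      // (the log excerpt above shows the test
                                      // being killed after ~308454 ms)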
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #26)
> (In reply to Byron Campen [:bwc] from comment #25)
> > I'm also surprised that the test completely times out if the pref isn't
> > set; a normal ICE timeout takes a while, but not the hundreds of seconds
> > I'm seeing here...
>
> To clarify: in the new (current) world on TBPL with --run-by-dir set, the
> identity tests run by themselves, then another instance of Fx runs the ipc
> test, and then another Fx starts up and executes just the tests in our
> mochitest directory. So the first datachannel test calls pushPrefEnv() from
> head.js and thus sets stun_client_maximum_transmits to 14 and
> trickle_grace_period to 30000. And these two values are then used for all
> of our mochitests (except identity and ipc).
>
> Now 100ms * 2^14 ~= 27min, which is way beyond the 5min timeout limit of
> the mochitest framework.
>
> If I recall correctly, the insanely high values for nICEr are meant to make
> our tests more likely to succeed on B2G, right? Maybe option D) is to make
> the nICEr values platform dependent? I usually dislike platform-dependent
> code, but losing test coverage because of one lousy test platform stinks
> even more to me.

Ah, right. Ugh.

So, my first choice, if I could just wave a magic wand, would be to have the registry be scoped to a single ICE ctx, but we don't have time to do that right now.

Another option might be to always update the registry for values that come out of the pref system, but that would require moving this registry setup stuff over to STS (not the end of the world). We would also need to be careful that changing the number of retransmits or the grace period length during ICE would not cause terrible things to happen (I think it would be safe, though).

Another option might be to do the above, but only when a SpecialPowers API is hit (that's option C).

I don't really like the idea of moving the test into its own directory, even though it has some elegance to it, because it leaves our directory structure in more of a mess. And of course, disabling the test is a non-starter for me.
Flags: needinfo?(docfaraday)
(In reply to Byron Campen [:bwc] from comment #27)
> So, my first choice, if I could just wave a magic wand, would be to have
> the registry be scoped to a single ICE ctx, but we don't have time to do
> that right now.
>
> Another option might be to always update the registry for values that come
> out of the pref system, but that would require moving this registry setup
> stuff over to STS (not the end of the world). We would also need to be
> careful that changing the number of retransmits or the grace period length
> during ICE would not cause terrible things to happen (I think it would be
> safe, though).

Would it help if we scoped this to happen only for new PeerConnections? It would certainly be nice to be able to configure the ICE stack at the time of instantiating a new PeerConnection, e.g. via constraints. Although that sounds like each PC would then carry its own copy of the registry, and in that case the memory leaks of the registry would hurt a lot.

> Another option might be to do the above, but only when a SpecialPowers API
> is hit (that's option C).

With option C I was more thinking of a secret "reset" call to the nICEr registry. I'm not sure if we can detect whether something got set through SpecialPowers vs. the user changing something via about:config.
It seems we lost some momentum on this bug. As a reminder, we still have test_peerConnection_iceFailure.html disabled: https://dxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/mochitest.ini#113
I am unclear on the next steps here.
(In reply to Joel Maher (:jmaher) from comment #29)
> I am unclear on the next steps here.

What Byron and I discussed in the previous comments are major changes to our underlying code. At some point these changes are probably due, but I don't think we have cycles for that right now. That means the test probably needs to stay disabled.
Rank: 35
Component: DOM → WebRTC
Priority: -- → P3
backlog: --- → webrtc/webaudio+
Blocks: 1321547
Mass change P3->P4 to align with new Mozilla triage process.
Priority: P3 → P4
I no longer have the capacity to work on this any time soon.
Assignee: drno → nobody

Seeing some failures on the --full push, but they do not seem to be in any of the re-enabled tests. Maybe changing the ICE timing is affecting other tests in weird ways, but I don't see any signs of ICE failure. Here's a baseline push for those jobs:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=41dda7f27b0595418be98fba9a554b0aa803a347

Even though tightening the ICE timing looks pretty reliable, we do not have sufficient Windows aarch64 testers to actually run try pushes. There is a possibility that tightening the ICE timing would break tests on aarch64, but there is no way to determine this without landing patches. So maybe a viable approach here is to gradually tighten the ICE timing, instead of jumping straight to the defaults.
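A sketch of what one intermediate step could look like (assumptions: the 14/30000 starting values come from comment 26 above, the halfway numbers are invented, runTests is a placeholder callback, and whether the overrides still live in head.js via pushPrefEnv today is not confirmed by this bug):

    // Step 1: land a partial tightening and let it bake across platforms.
    SpecialPowers.pushPrefEnv({set: [
      ["media.peerconnection.ice.stun_client_maximum_transmits", 10],
      ["media.peerconnection.ice.trickle_grace_period", 15000],
    ]}, runTests);
    // Step 2 (later): drop the overrides entirely so the built-in defaults
    // apply, once aarch64 results confirm nothing regresses.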

Try looks fine, or at least no worse than baseline.

Maybe we cannot totally remove the prefs that relax the ICE timeouts, but we might be able to partially tighten them.

Yeah, I'm seeing some rare tier 2 intermittents when I try to tighten this timing. I'm not seeing ICE timing out in the first failure in the bunch, which is very odd. It's going to take significant time to debug this, I think, and there are other higher-priority things that need attention.

Severity: normal → S3