Closed Bug 1058158 Opened 5 years ago Closed 5 years ago

Marionette WebAPI tests should run in either a test container or a clean environment

Categories

(Testing :: Marionette, defect)

x86_64
Linux
defect
Not set

Tracking

(blocking-b2g:2.1+, firefox33 wontfix, firefox34 fixed, firefox35 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
mozilla35
blocking-b2g 2.1+
Tracking Status
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: qdot, Assigned: mdas)

References

Details

Attachments

(3 files, 9 obsolete files)

22.22 KB, patch
jgriffin
: review+
Details | Diff | Splinter Review
22.53 KB, patch
Details | Diff | Splinter Review
306.43 KB, text/x-log
Details
[Blocking Requested - why for this release]: Blocks settings API (bug 900551)

In bug 968709, a patch was landed to reset the gecko environment to the system app so we can make sure we're running in something similar to B2G, since prior unit tests can put the system into odd states due to navigating away from the system app. The problem is that, while this makes the tests work, this leaves the process in an undefined state.

To fix this, we need to either run webapi tests in a test container, or restart the b2g process before running webapi tests.
WebAPI tests get run in two ways:  with and without OOP.  With OOP, they're run in a gaia test-container app, but without OOP, they're run in place of the system app.

Another way to fix this, then, would be to run the non-OOP tests in a different test-container that doesn't get run OOP.
Try is closed, and I have to head out for the day. I've only tested this locally against my phone and browser, but no chance to test on emulator. qDot, if you have time, can you test this? It's a patch on top of m-c (should work on m-i too).

If you're running the test by calling runtests.py directly, please make sure that 'b2g' is somewhere in your --type list.
Flags: needinfo?(kyle)
Unfortunately, on the emulator this makes test_import_script crash the test container app, at least locally. I'm patching and pushing to try since it's open again, will post results URL here.
Flags: needinfo?(kyle)
blocking-b2g: 2.1? → 2.1+
updated patch.
Attachment #8478584 - Attachment is obsolete: true
Attached patch run in test-container (obsolete) — Splinter Review
testing out the latest patch on try https://tbpl.mozilla.org/?tree=Try&rev=6ca42ba28455, will r? if it passes
Attachment #8479304 - Attachment is obsolete: true
Attached patch run in test-container (obsolete) — Splinter Review
This is more up to date (https://tbpl.mozilla.org/?tree=Try&rev=abc3a219fe29)
Attachment #8480860 - Attachment is obsolete: true
So these try runs are passing up until they all hit this timeout at some point:

command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/marionette.py', '--cfg', 'marionette/automation_emulator_config.py', '--blob-upload-branch', 'try', '--download-symbols', 'ondemand'], attempting to kill

It's unclear to me why this is happening, and I'm heading out for a week's PTO today. Has anyone seen this before?
Basically, the tests have become much slower...and they're hitting the buildbot 2-hr timeout.  Without this change, the tests take ~90 min, for some reason this change makes them take > 2 hr.  :(
A couple of potential solutions:  we could try running them on faster VM's (but that typically boosts speed by only about 10%), or we could chunk the tests.
Attached patch testcon (obsolete) — Splinter Review
Thanks for the response. The original patch would start and kill a test container per test case, this will instead preserve it until it sees a case that doesn't require the test container. It will then look for the test container app and kill it if it exists.

Try:  https://tbpl.mozilla.org/?tree=Try&rev=5dd399f37ce0

If this approach is still too slow, how about we separate the Marionette unit tests from the webapi tests, and enable the test container app using --test-container or using a separate manifest with test_container = true? That way, we keep the system app untouched when running the Mn tests, so we keep the test environment as pristine as possible.

https://bugzilla.mozilla.org/show_bug.cgi?id=1022862 is the separation effort but not much as been done since June.
Attachment #8480867 - Attachment is obsolete: true
Attached patch run in test-container (obsolete) — Splinter Review
"manager" wasn't defined in gaia (let manager = window.wrappedJSObject.AppWindowManager || window.wrappedJSObject.WindowManager;). Maybe from dom bluetooth's test_navigate_to_default_url.py (https://mxr.mozilla.org/mozilla-central/source/dom/bluetooth/tests/marionette/test_navigate_to_default_url.py#16) doing weird stuff to gaia state? In any case, here's a retry that worked well locally:

https://tbpl.mozilla.org/?tree=Try&rev=4947fdcd616f
Attachment #8481613 - Attachment is obsolete: true
Attached patch run in test-container (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=a49620d09800

Finally green with this patch! Now I can stop thinking about this and go enjoy my vacation :P
Attachment #8481798 - Attachment is obsolete: true
Attachment #8482260 - Flags: review?(jgriffin)
Comment on attachment 8482260 [details] [diff] [review]
run in test-container

Review of attachment 8482260 [details] [diff] [review]:
-----------------------------------------------------------------

This patch basically removes the 'oop' feature of the runner, but leaves a bunch of related code in place.  It causes all of our unit tests to be run OOP, regardless of what you specify in the manifest, but allows webapi tests to run in whatever state the phone happens to be in.

I think removing this feature is ok, as I looked and nothing outside of our unit tests currently uses it, but I think we should be more explicit about tearing it out.

We could land this if this patch is urgent and clean it up after mdas returns, or we could just wait until then.
Attachment #8482260 - Flags: review?(jgriffin) → review-
Removed the oop code from the runner.

Ran in try: https://tbpl.mozilla.org/?tree=Try&rev=db38e9ce01c5

The red, failed build is because I aborted it. The orange is an intermittent.
Attachment #8482260 - Attachment is obsolete: true
Attachment #8486493 - Flags: review?(jgriffin)
Comment on attachment 8486493 [details] [diff] [review]
run in test-container, remove oop option

Review of attachment 8486493 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, looks good.
Attachment #8486493 - Flags: review?(jgriffin) → review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/bebc8e9a6da6 - b2g's Mnw hasn't finished yet to know how it'll fare, but Mn on all non-b2g says "| KeyError: 'b2g'" for every test.
Thanks, I forgot to test that patch locally on browser! Retrying with a new patch:

Mn:
 https://tbpl.mozilla.org/?tree=Try&rev=84a07e0dd1f8
Mnw:
https://tbpl.mozilla.org/?tree=Try&rev=1e9544ee90f2
the only change from the previous patch is line 388 of marionette_test.py. This patch ran green on Mn and Mnw
Attachment #8486493 - Attachment is obsolete: true
Attachment #8487489 - Flags: review?(jgriffin)
Comment on attachment 8487489 [details] [diff] [review]
run in test-container, remove oop option

Review of attachment 8487489 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for pointing out the difference with the previous patch, makes it easy to review!
Attachment #8487489 - Flags: review?(jgriffin) → review+
https://hg.mozilla.org/mozilla-central/rev/d96632436134
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
It's looking like this may break the test_switch_frame.py unit tests on Aurora. I'm not sure how the other two patches would affect that functionality, but I'm not ruling that out either. Will run another try with just the marionette patches.

https://tbpl.mozilla.org/?tree=Try&rev=7da5f9669516

Good news is, this doesn't seem to break the bluetooth tests anymore. So we've at least fixed one thing.
Here's with only the marionette patches and navigate removal.

https://tbpl.mozilla.org/?tree=Try&rev=0197d4f2975b
So everything is green on b2g-i/m-i/m-c, so there's gotta be something else we need to backport to aurora to fix this. Since the frame switch stuff is marionette, is there possibly some marionette patch we should backport? It doesn't look like there's many between the aurora branch and now.
Attached patch test container patch for aurora (obsolete) — Splinter Review
So test_switch_frame.py isn't supposed to be run.  Bug 977893 is responsible for adding the 'b2g=false' line to the manifest, which shuts off this test. Adding this patch wholesale is a bit much for this. We can either uplift that patch to m-a, or this modified version of this patch that adds that line.

I'm running the modified version of this on aurora with your navigate patch applied: https://tbpl.mozilla.org/?tree=Try&rev=2b58b78c6fcd
Flags: needinfo?(mdas)
It looks like the patch for test_navigation.py is screwed up between the m-c and aurora patches? We're not doing a skip_if_b2g on test_navigate_frame, which makes the test app crash. That's what's causing the reds, and it's failing locally too.

Here's a new try, with the test skip moved.

https://tbpl.mozilla.org/?tree=Try&rev=609b9f1311fd
So along with the failures in navigation and frame, I was noticing landing this up'd our intermittents in bluetooth. I found that I had forgotten to change some settings functions in the webapi tests to use transaction events when landing bug 900551. There's a good chance that if we landed without these fixes, we'd just get backed right out again due to obviousness in orange factor. I got them fixed and have a new try running with those patches on top. I really, really hope this is it.

https://tbpl.mozilla.org/?tree=Try&rev=fe009c7736f2
Oook, well locally bug 1069115 is causing everything to fail, so maybe I stopped that prior build to soon. Here's a try with just the fixed navigation tests again. >.>

https://tbpl.mozilla.org/?tree=Try&rev=90437d10da50
Ok, back to running 1058158 and 1068077 locally against the emulator. I'm getting intermittent failures in telephony tests and wifi. This is the error that shows up for telephony:

09-18 23:35:52.042 E/GeckoConsole(   44): [JavaScript Error: "uncaught exception: RadioNotAvailable"]

For wifi, I get a lot of promise failures due to missing fields.
ni?'ing a couple of people on the RIL team, got names from blaming telephony ril code.

To get the ni?'ers up to date: in this bug, we're trying to run some tests in a container app to make sure each webapi test run is done in a clean environment. While we've landed this patch to m-c, unfortunately this is causing intermittents on Aurora. What I'm wondering about is how we can get to a state where a Promise is getting back RadioNotAvailable error as shown in this attachment. This is running a reduced test set of a few system tests, then the bluetooth test, then the telephony tests. About 80% of the time I can get the crash_emulator and a lot of the outbound telephony tests to fail on this message. Inbound seems to work ok, oddly enough. Just need to figure out where to start tracking this.
Flags: needinfo?(szchen)
Flags: needinfo?(htsai)
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #35)
> Created attachment 8491934 [details]
> Emulator log showing test_emulator_crash failure
> 
> ni?'ing a couple of people on the RIL team, got names from blaming telephony
> ril code.
> 
> To get the ni?'ers up to date: in this bug, we're trying to run some tests
> in a container app to make sure each webapi test run is done in a clean
> environment. While we've landed this patch to m-c, unfortunately this is
> causing intermittents on Aurora. What I'm wondering about is how we can get
> to a state where a Promise is getting back RadioNotAvailable error as shown
> in this attachment. This is running a reduced test set of a few system
> tests, then the bluetooth test, then the telephony tests. About 80% of the
> time I can get the crash_emulator and a lot of the outbound telephony tests
> to fail on this message. Inbound seems to work ok, oddly enough. Just need
> to figure out where to start tracking this.

Kyle,

How do you run the test? I tried to apply the patch locally on aurora and then execute:
  ./mach marionette-webapi <telephony-test-suit>
The result is good. How could I reproduce your result? Do I miss some options?

The error "RadioNotAvailable" means that the radio on emulator is not yet turned on.
Does system app runs on your test environment? Ideally, when the emulator power up, system app (gaia/apps/system/js/radio.js) will sent a request to gecko to turn on the radio. All the telephony tests assume that the radio is default on.
Flags: needinfo?(szchen)
I've been running the tests using runtests.py, but marionette-webapi does basically the same thing. If you run /just/ the telephony tests, it will usually pass. The problem comes when you try to run the tbpl bank of tests (in testing/marionette/client/marionette/tests/unit-test.ini). 

The way I've been running these is to run 1 or 2 tests out of testing/marionette/client/marionette/tests/unit/unit-tests.ini (just comment most of them out, leave a couple uncommented), just to make sure we run something in the test container. After this, bluetooth and telephony should run, and that's usually how telephony fails. That's the quickest way to repro this, as running all of the tests takes well over an hour.
Flags: needinfo?(szchen)
Ok. Well. Stacked ALL of the patches I need to land to aurora to get settings stablized EXCEPT for bug 1068077 (just because I forgot to move it to before the try in the mq) and ran a try:

https://tbpl.mozilla.org/?tree=Try&rev=e3c5b655b26a

Mnw, as well as everything else, passes.

Ignore the M-e10s, those don't run on m-a. So with this, it looks like we could land everything? I'm retriggering to find out.

The weird part is that this works while still leaving in the navigate test (which is what bug 1068077 removed), which is what this was supposed to remove. I guess we can leave that for later if this works though.
Ok, this landed to aurora with no problems, so I'm betting the issues I was seeing with telephony and other tests were due to bug 1068077, since we didn't re-establish the system app before starting the webapi tests. I guess I'll leave bug 1068077 open for right now, but I'll probably close it as a WONTFIX once we divide the webapi tests out from the unit test manifest in bug 1022862.

:aknow, thanks for the help thus far. :)
Flags: needinfo?(szchen)
Flags: needinfo?(htsai)
You need to log in before you can comment on or make changes to this bug.