Closed Bug 1026259 Opened 10 years ago Closed 6 years ago

wpt tests sometimes lose wifi when restarting

Categories

(Firefox OS Graveyard :: Certification Suite, defect)

defect
Not set
normal

Tracking

(firefox32 fixed, b2g-v2.0 fixed)

RESOLVED WONTFIX
Tracking Status
firefox32 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: jgriffin, Unassigned)

References

Details

Attachments

(2 files)

The wpt tests restart B2G when a test fails.  Sometimes, when this restart occurs, the device doesn't establish a wifi connection, and the restart times out.  It's then retried.

It looks like what's happening is that we reboot the device, and then immediately try to restart B2G with a new profile.  We do this sometimes before the device has completely booted, which can have some bad side effects.

Perhaps we need to add a check to wait for Marionette here?

https://github.com/w3c/wptrunner/blob/jgraham/initial/wptrunner/browsers/b2g.py#L193
It seems like this check should be in mozrunner, perhaps? Putting in a Marionette().wait_for_port() in B2GRunner.start() before _setup_remote_profile() at the end, perhaps?

(I thought this was working well, but as I typed this the phone rebooted a couple of times and failed to connect to WiFi, so I'm not so sure it is)
Yes, we could try putting it in B2GRunner.start().
The line that causes this problem is this:

http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/remote.py?from=remote.py#291

We "wait_for_net" until we see something with the pattern of an IP address that's "UP".  The problem is, this matches 0.0.0.0.  We need to wait until this has a non-zero IP before continuing.  Changing that regex locally to:

                if (re.search(r'UP\s+[1-9]\d{0,2}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)):

solves the problem.

Unfortunately, this code was completely ripped out in the recent mozrunner refactor, and we no longer appear to have a device runner.  We'll have to put some of that code back (i.e., create a generic, non-emulator B2G device runner) before we can solve this problem correctly.  Or, potentially, we could use mozilla-aurora to create a point release to mozrunner 5.37 with this fix.

Thoughts, ahal, jgraham?
Actually, given that mozrunner has changed so significantly, I'm leading toward making a 5.37.1 release off of mozilla-aurora.
Is the code in aurora identical to that in m-c before the refactor?

In any case it seems like we should pin the certsuite to 5.37 for now and ahal should release 6.0. Then I can assess the difficulty of porting to 6.0 and we can decide whether it's worthwhile to make a new 5.x series release based on that.
You should be able to create a non-emulator runner using the mozrunner.devices.Device base class. There isn't currently a B2GDeviceRunner shortcut in the runners.py file, but it would be easy to add one. If you want I can put up a patch for it.
I tried modifying the regexp and I am still having difficulty getting net connections.
Depends on: 1029529
My concern is that the current behavior is non-deterministic and will probably cause problems with OEM's that will come back to haunt us.

I agree that the long-term fix is to add missing abilities to mozrunner 6, and update wpt to work with that.

In the shorter term, if we can improve the situation by releasing a 5.37.1, I think we should do that; anything that reduces confusion about the cert suite results is likely very worthwhile.  I'll investigate this option locally, and post a patch here if I'm successful.
OK, thanks. Maybe the WiFi issue I see is different to the one that you see, but waiting longer didn't seem to help me. When the WiFi connection here fails I see something like 

 1:28.71 LOG: Thread-Log INFO STDOUT: -*- WifiWorker component: Determined that scanning is stuck, turning on background scanning!
 1:28.71 LOG: Thread-Log INFO STDOUT: -*- WifiWorker component: Event coming in: CTRL-EVENT-STATE-CHANGE id=-1 state=0 BSSID=00:00:00:00:00:00
 1:28.71 LOG: Thread-Log INFO STDOUT: -*- WifiWorker component: State change: SCANNING -> DISCONNECTED
 1:28.71 LOG: Thread-Log INFO STDOUT: -*- WifiWorker component: Event coming in: CTRL-EVENT-STATE-CHANGE id=-1 state=2 BSSID=00:00:00:00:00:00
 1:28.71 LOG: Thread-Log INFO STDOUT: -*- WifiWorker component: State change: DISCONNECTED -> INACTIVE

in the logs (with WiFi logging enabled), and WiFi never comes up.

Additionally I have started working on the mozrunner update and I have something that runs a little, but it's not yet at parity with the old code.
I wonder if it depends on the characteristics of the wifi network?   Also, in addition to modifying the regex, I put in an extra call to self._wait_for_net() here:

http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/remote.py?from=remote.py#169

(as you suggested in comment #1)

But, I haven't done enough testing to verify if that addition was necessary or not.  I'll re-run the tests without my patch and see if I get the same wifi error messages.
(In reply to James Graham [:jgraham] from comment #5)
> Is the code in aurora identical to that in m-c before the refactor?
> 
I just checked, and yes it is.
This patch is for mozilla-aurora, and creates a mozrunner 5.37.1 which completely resolves the wifi problems in wpt for me locally.  Eventually, we can incorporate similar logic into mozrunner 6.0 on trunk.
Attachment #8446188 - Flags: review?(ahalberstadt)
Assignee: nobody → jgriffin
Comment on attachment 8446188 [details] [diff] [review]
Improve wait_for_net to reduce the chances of dropping wifi when calling B2GRunner.start(),

Review of attachment 8446188 [details] [diff] [review]:
-----------------------------------------------------------------

Lgtm! James was trying to get his wptrunner working with trunk, but his tests weren't starting.. this might be what he was missing!
Attachment #8446188 - Flags: review?(ahalberstadt) → review+
I don't think it is the (only) problem that I was having with the new mozrunner, although we should certainly forward-port the change.

jgriffin: thanks for this. If you make a release I'll pin the version of wptrunner and mozrunner in the certsuite until the port to trunk is stable.
I've release mozrunner 5.37.1 with the above fix.  James, I'll pass this bug to you for the dependency updates.
Assignee: jgriffin → james
Hmm, so how important is it to call wait_for_net() before rebooting the device? That breaks the assumption in wptrunner that if there is no net connection we can call B2GRunner.start(), which will reboot the device and try again.
(In reply to James Graham [:jgraham] from comment #17)
> Hmm, so how important is it to call wait_for_net() before rebooting the
> device? That breaks the assumption in wptrunner that if there is no net
> connection we can call B2GRunner.start(), which will reboot the device and
> try again.

Currently, if you reboot the device, and then reboot the device again or restart B2G before a net connection has been established, you'll fail to get a connection after B2G or the device restarts.  (Instead, you'll have to reboot/restart a 3rd time....apparently there is some weird state persistence at work here.)

I haven't seen the case locally where we fail to get a connection other than by the above, and rebooting fixes this.

However, to protect against this, we could not fail if the initial wait_for_net fails, and continue with rebooting.  I'll submit a patch.
Assignee: james → jgriffin
Comment on attachment 8448096 [details] [diff] [review]
Don't throw if wait_for_net fails before rebooting,

So I had a thought this morning; maybe the right* thing to do here is check if we have already loaded the homescreen and, if so, skip waiting for net and go straight to setting up the profile.

* Or at least a better.
Attachment #8448096 - Flags: review?(james) → review+
The problem with checking for the homescreen is that the code will become coupled with specific versions of Gaia, and I'm not sure we want to do that, if we can avoid that.  The way we do it for the emulator will likely not work on real devices, since the emulator code relies on the fact that the emulator is so slow.

Sadly, we don't have a great, generic way of determining whether Gaia is loaded.
We already rely on checking for the homescreen in other places, at least. I'm just worried that with this setup when you get no wifi for whatever reason you have to wait a long time for an operation that you know will fail.
Do you know where we check for the homescreen currently?  (The only place I was aware of it is emulator.py.)  If we already have code that is sufficiently robust, that definitely makes sense.
There's code in wptrunner[1] modified from the code in mozrunner. I'm not sure how robust it is really, but apart from the logging-related bug, it isn't one of the parts has suffered from intermittent breakage.

[1] https://github.com/w3c/wptrunner/blob/jgraham/initial/wptrunner/browsers/b2g.py#L286
(In reply to James Graham [:jgraham] from comment #24)
> There's code in wptrunner[1] modified from the code in mozrunner. I'm not
> sure how robust it is really, but apart from the logging-related bug, it
> isn't one of the parts has suffered from intermittent breakage.
> 
> [1]
> https://github.com/w3c/wptrunner/blob/jgraham/initial/wptrunner/browsers/b2g.
> py#L286

I took a stab at integrating this into the version of mozrunner on mozilla-aurora, but no luck.  The wptrunner doesn't pass a marionette object to B2GRunner, so the runner has no way (without big changes to the code) of using marionette to wait for the homescreen.

Have you seen an instance of the wifi not coming up after rebooting with the current patch? I'm pretty tempted to leave that patch as-is...the only bad thing that would happen is that we'd wait 40 seconds unnecessarily.
Yeah, the WiFi in the office here still seems to be really problematic with B2G devices for some reason. So I will likely hit the 40s wait reasonably often. But I agree it's not obvious how to integrate things that require marionette into this code without creating a terrible mess.
I don't know that this is fixed, but I've fixed it enough so that I'm no longer working on it.
Assignee: jgriffin → nobody
I've heard that the problems I see are likely a bad interaction between bugs in 1.4 and bad network settings in the office. Hopefully this situation will improve when we switch to 2.x or fix the office network.
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: