Closed Bug 1003507 Opened 10 years ago Closed 10 years ago

Eideticker devices in Mountain View not able to maintain connection to the network

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: wlach, Assigned: adam)

Attachments

(4 files, 2 obsolete files)

Background: We have a small network of 2 Android devices and PCs in Mountain View running Android performance tests and reporting results here: http://eideticker.mozilla.org/ . Unfortunately this setup has been down for a month because of our inability to set up a wifi network that allows bidirectional communication between the devices and the machines running tests.

With bug 988606 fixed, the latencies between the eideticker phones and machines in Mountain View are now acceptable and they can communicate with each other across all the ports they need to. However, the phones seem to keep on falling off the WIFI network for some reason (I haven't seen them stay on the internal ATeam network we set up for more than a few hours). We were not seeing this with the setup in the old MV office, perhaps because the devices + router were inside a faraday cage.

Bug 988606 has most of the background required on the current setup we have for Eideticker in MV. Please ask me if you want to be CC'ed on that bug. Needinfo me here if you want other information. Raymond Etornam (CC'ed) may have additional information on the physical setup of the devices/pcs and is onsite.
William, can you provide the MACs from the phones?  I'd like to see what's happening on the controller with them.

Thanks,

James
Copying information from: https://bugzilla.mozilla.org/show_bug.cgi?id=988606#c36

    host galaxy-nexus { hardware ethernet a0:0b:ba:da:88:45; fixed-address 10.252.120.77; }
    host lg-phone { hardware ethernet 78:d6:f0:8c:56:80; fixed-address 10.252.120.78; }

Note that these phones haven't been on the network in the last few days.
Assignee: network-operations → adam
Hey Clint, could you give us a quick update on the status of this?
Flags: needinfo?(ctalbert)
(In reply to William Lachance (:wlach) from comment #3)
> Hey Clint, could you give us a quick update on the status of this?

Yes sorry. I keep thinking we have good news to report and then my hopes are dashed.

So, we have the G2X on the network and it is solid and reliable. James is preparing an update on the networking fu that helped that to happen, I think it was removing the "n" wireless band from the AP.

We also fixed the watcher.ini file so that it is pinging the proper server; pinging the wrong one can also cause timeouts, as everyone on the A*Team is painfully aware. So that phone should be up and running via its cron job. From looking at the few cron runs I've seen where the test doesn't complete, it is due to Fennec already running when some test starts, which throws an exception and kills the entire test run. But that's not networking related.

So the Galaxy-Nexus is another story entirely. The GN was getting confused by having other wireless networks available on the AP and was trying to communicate with the "a" band of the network and failing to do so. James removed the "a" band from our AP and Raymond and I did a fair bit of hacking with the wpa_supplicant.conf file so that the phone will only ever attempt to join the ateam wireless network. That is the current state. It joins the network and then immediately disconnects. The reason it gives for disconnecting is that the phone has gone out of range of the AP, which is ludicrous since this happens milliseconds after connection and the phone hasn't moved. After exhausting our options with the GN, we decided that maybe we had a busted phone.

So we got a second GN and wired it all up. And it has the same behavior. Identical, in fact, if you look at the logcats.

Querying google for the disconnection errors, I found bugs filed on them that had been fixed but unhelpfully no indication of what release those fixes went into. I updated the GN from 4.2 to 4.3 and it didn't help at all.

So, we have the G2X running. But the GN seems DOA. James and I will both be in Mountain View again next week and we will have another look at it, but I'm running out of creative options at this point. My last idea is that there seems to be something (on google, referencing linux wireless) where two wpa_supplicant processes can cause these kinds of behaviors. So, I want to do a test where we remove the SUTAgent.ini file that attempts to force the phone to connect to the ateam network and leave just the wpa_supplicant.conf file directing us to that network. I've asked Raymond to do that and get me a logcat from that. I expect to see that at some point today. 

Sorry we've all been collectively terrible at keeping this bug up to date. I'll upload an example logcat of a recent run with the new GN.
Flags: needinfo?(ctalbert)
Attached file wpa_supplicant.conf
The wpa_supplicant.conf file we are using to force the GN to attach to this network. This lives at /data/misc/wifi/ on the phone and you need to make the /system partition rw in order to change it.
Attached file new_nexus.log
Logcat from the new GN run that still shows our connect/disconnect issues.
The positives in this log (compared to earlier ones that I am not going to spam this bug with) are that it is always trying to join the ateam network and only the ateam network. And that it is always joining the ateam network on the proper frequency (the 24xx frequency) which is the only frequency that is ever successful for the phone to join the network on.
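To put a number on the connect/disconnect churn in a logcat like this one, something along these lines could be used. This is only a rough sketch: it assumes the log contains wpa_supplicant's CTRL-EVENT-CONNECTED and CTRL-EVENT-DISCONNECTED control events, and the sample lines are made up for illustration, not copied from new_nexus.log.

```python
import re
from collections import Counter

# wpa_supplicant reports association changes with these control events.
EVENT_RE = re.compile(r"CTRL-EVENT-(CONNECTED|DISCONNECTED)")


def count_wifi_events(log_text):
    """Count connect and disconnect events found in logcat text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical sample lines (the BSSID is a placeholder).
sample = """\
I/wpa_supplicant: wlan0: CTRL-EVENT-CONNECTED - Connection to 00:11:22:33:44:55 completed
I/wpa_supplicant: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:11:22:33:44:55 reason=3
I/wpa_supplicant: wlan0: CTRL-EVENT-CONNECTED - Connection to 00:11:22:33:44:55 completed
"""
print(count_wifi_events(sample))
```

A run with far more disconnects than the number of reboots would match the flapping described above.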
Clint, are you opposed to setting the IP address on the phone rather than using DHCP?  For the purposes of your testing this shouldn't make a difference.  I'd like to see if this masks any of the issues or creates more headaches.
(In reply to James Barnell from comment #7)
> Clint are you opposed to setting the IP address on the phone rather than
> using DHCP.  For the purposes of your testing this shouldn't make a
> difference.  I'd like to see if this masks any of the issues or creates more
> headaches.

No, happy to set up the IP on the phone itself. Let's see if that changes anything. All we require is a static address. We have no concerns at all for how that static address is assigned.
I played with a lot of stuff today. I had wlach fix a couple of problems
I saw with the agent for running on 4.3 which the new galaxy nexus is
running on. The agent is running fine, and I had some interesting successes.

Because I was at the b2g work week, I took the phone out of the lab and
worked on it from the mtn view commons area (first floor).

When the phone was sitting with me out in the commons area, every time I
rebooted it, it came back up and joined the network (yes, the ateam
network from all the way out here). Sometimes it took a little while but
it joined. For instance, once I think it may have taken 5 minutes. But
it consistently came back up. There were plenty of
connected/disconnected messages in the log but it did eventually work.

Encouraged, I put the phone into the lab and plugged it in via USB to
the eideticker machine. Now the phone does not connect to the ateam
network no matter how long I let it sit there.

So, there is some kind of real difference with the phone being in the
lab versus being elsewhere in the building.
So one wrinkle that I mentioned to Clint last Friday is that I've noticed very different behaviour in the Galaxy Nexus when the HDMI dongle is plugged in vs. not. Back when I was trying to make things work here in Montreal, I had a setup at home with two routers: one is a router/dsl modem combo that my ISP gave me  (TP Link something something) and an Apple Time Capsule. When the HDMI dongle was not connected, my Galaxy Nexus could connect to either fine. If I plugged the HDMI dongle in (and connected it to the Decklink card), it would only work with the Apple router. It wouldn't be able to maintain a connection to the TP-Link network at all.

Takeaways from this:

1. The only valid test of WIFI connectivity is one where the MHL-HDMI dongle is attached to the Galaxy Nexus and it is connected to the machine.
2. It's quite possible that the difference was which channel each router was on. I'd definitely try switching between those to see if it helps.
Will, so we played with a nexus 5 today. The agent worked pretty well on it. If we can find a way to root the device I think the agent will work fine for us, which surprises the hell out of me. So, if you can get the nexus 5 to capture using the slimport adapter you ordered we could just upgrade to that. The nexus 5 does seem to get on the wifi ok in that room. Let us know how your testing with HDMI capture pans out with the nexus 5.
I was able to root the nexus 4, 5, 7 using https://github.com/bclary/bootimg

I can provide binaries for linux x86_64 if you have that available, or the boot img I created. Ping me.
(In reply to Clint Talbert ( :ctalbert ) from comment #11)
> Will, so we played with a nexus 5 today. The agent worked pretty well on it.
> If we can find a way to root the device I think the agent will work fine for
> us, which surprises the hell out of me. So, if you can get the nexus 5 to
> capture using the slimport adapter you ordered we could just upgrade to
> that. The nexus 5 does seem to get on the wifi ok in that room. Let us know
> how your testing with HDMI capture pans out with the nexus 5.

Still waiting for my slimport to arrive. It apparently just shipped yesterday.
Thanks bc and Will. Bc, if we wind up moving to the nexus 5 we will ping you for rooting instructions, thanks! Will you did (I hope) order the slim port that works for the LG phone? The older slimports (from Samsung) do not work with the nexus 5 (we tried that earlier this week).

One thing that does mean though is that there may be the possibility that the LG slimport will proxy adb while still pulling HDMI out of the phone. I will cross my fingers.
Clint
(In reply to Clint Talbert ( :ctalbert ) from comment #14)
> Thanks bc and Will. Bc, if we wind up moving to the nexus 5 we will ping you
> for rooting instructions, thanks! Will you did (I hope) order the slim port
> that works for the LG phone? The older slimports (from Samsung) do not work
> with the nexus 5 (we tried that earlier this week).

Yep, it's due to arrive next Tuesday.

> One thing that does mean though is that there may be the possibility that
> the LG slimport will proxy adb while still pulling HDMI out of the phone. I
> will cross my fingers.

We can try using adb over tcp in this case.
Updates:

1. SlimPort adapter has still not arrived. :(
2. LG G2X has not been able to stay on the network for more than a day, nor has it ever been able to finish a testrun due to wifi disconnection issues. I don't think we can really continue using it in this state.

I suspect the problems with the LG G2X might have something to do with bug 1011358, however we can't make progress on that until James is back from vacation.
See Also: → 1011358
Ok, the slimport adaptor for the nexus 5 finally arrived and I had a chance to test it. It does not work, and I tried all the settings. :(
(In reply to William Lachance (:wlach) from comment #17)
> Ok, the slimport adaptor for the nexus 5 finally arrived and I had a chance
> to test it. It does not work, and I tried all the settings. :(


:(
So, I have the little switch configured to use channel 11, broadcasting the ssid "qalab". It can be seen by the phones, and it is proxying DHCP across from the larger network (its own DHCP system is disabled).

Right now, the only phone the router is controlling (the g2x) is getting a 10.252.73.xxx address. The PCs (120.xxx) can communicate with that, but it is slow. If this all works we'll want to fix this up so it's all on the same network, but that can come later.
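A quick way to check whether a given phone address sits on the same network as the PCs is sketched below. The /24 prefix length is an assumption (the thread only gives 10.252.120.xxx), and 10.252.73.10 is a hypothetical address standing in for whatever 10.252.73.xxx lease the phone actually got.

```python
import ipaddress

# Assumption: the controller PCs' 10.252.120.xxx network is a /24.
CONTROLLER_NET = ipaddress.ip_network("10.252.120.0/24")


def on_controller_network(address):
    """Return True if the address falls inside the controller PCs' network."""
    return ipaddress.ip_address(address) in CONTROLLER_NET


# 10.252.120.78 is the lg-phone fixed address; 10.252.73.10 is hypothetical.
for addr in ("10.252.120.78", "10.252.73.10"):
    print(addr, on_controller_network(addr))
```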

The router seems to have two MAC addresses (probably one for wifi and one for its ethernet port, but I don't know which is which because the stupid configuration screens are confusing).
They are:
* 28:c6:8e:99:f0:60 (I think this is the wifi mac)
* 28:c6:8e:99:f0:68 (I think this is the ethernet uplink mac address).

I'm running an eideticker test with the G2X now to see if it can complete an entire test (which the g2x has never done before on the ateam network). Stephend is trying to see if the new switch helps with the b2g disconnection issues too.

If I get another chance, I'll try to set up the galaxy nexus before I leave and see if it will run overnight. More data as I get it....
Will, The g2x is having strange issues - like Eideticker has been updated and isn't working properly - it's not finding files (like run-update-dashboard.sh).  I got that file replaced (using the one from the other eideticker box) and now it's saying "no such product" as org.mozilla.fennec. This is serious strangeness. Fortunately the phone is still connected.
(In reply to Clint Talbert ( :ctalbert ) from comment #20)
> Will, The g2x is having strange issues - like Eideticker has been updated
> and isn't working properly - it's not finding files (like
> run-update-dashboard.sh).  I got that file replaced (using the one from the
> other eideticker box) and now it's saying "no such product" as
> org.mozilla.fennec. This is serious strangeness. Fortunately the phone is
> still connected.

I updated some parts of the eideticker codebase, but didn't bother updating the eideticker nodes, because I assumed they would continue to be down indefinitely. I can fix this now.
Ok, so the new script to run is "update-dashboard-android.sh".

I did a test run and found that it was timing out when attempting to install the Android .apk. So I think there are latency/bandwidth issues with the new network that need to be sorted out before this is a solution.

When testing I'd recommend doing the following:

    cd src/eideticker
    source bin/activate
    ./bin/update-phone.py nightly latest
    ./bin/update-dashboard.py --output-dir /tmp/eideticker-tmp taskjs

This will run through the full set of steps that we need for a test to complete, without possibly uploading bad data to the dashboard.
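If it helps to run those steps unattended and stop at the first failure, a small wrapper along these lines might do. It is only a sketch: run_steps and STEPS are hypothetical names, and it assumes the wrapper is launched from a shell where the virtualenv has already been activated (the `source bin/activate` step cannot be done from a child process).

```python
import subprocess
import sys

# The two test commands from the steps above, as argument lists.
STEPS = [
    ["./bin/update-phone.py", "nightly", "latest"],
    ["./bin/update-dashboard.py", "--output-dir", "/tmp/eideticker-tmp", "taskjs"],
]


def run_steps(steps, cwd=None):
    """Run each command in order; return the first failing command, or None."""
    for step in steps:
        result = subprocess.run(step, cwd=cwd)
        if result.returncode != 0:
            print("step failed:", " ".join(step), file=sys.stderr)
            return step
    return None
```

Calling `run_steps(STEPS, cwd="src/eideticker")` would run both commands from the eideticker checkout and report which one, if any, exited non-zero.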
Thanks Will. I'll use that from now on.
Let's lay out the facts:
* This automation has been down for 3 months at this point.
* Adding our own wifi router inside the rack did seem to help - stephen sent email saying it seemed more stable for b2g than the ateam node ssid had been.
* The throughput of the phones going to the router is very slow - could that be due to the fact that the router is not getting addresses that are on the 10.252.120 vlan? Even things like downloading a build from mozilla's FTP site take a very long time, and that probably isn't related to the vlan issue.

More Demands Coming on this System:
* We need to stand up 30 more Fx OS phone devices in the room. We need to do it very soon (within 3 weeks).
* We need to stand up 7 more android devices in the room. We need to do it soon as well (within 3 weeks).

We have tried:
* Turning off all ssids but the one we wanted
* Changing the channel and frequencies the ssid is broadcasted on
* We looked into the DHCP latency - there is no way to force the phone to not use DHCP so that was never tried
* We tried adding another wifi node to the system - this seems to help. The node is inside the rack with the phones.
* We tried several permutations of speed, and different ways to force phones to join networks but there is nothing we have done that has achieved any kind of lasting solution.

The only thing I can think of that we haven't tried:
* Get a wifi switch that has 24 ports on it (or something like that), and have it broadcast the ssid for the phones, have that switch run DHCP for the phones, and have the wired computers also plug into that same switch. Then have that switch uplinked so that we have outside access to/from these machines.
How do we fix this at this point? I don't know what else to try.

In light of all of this and all we've done, we still have no way to connect these phones to the wired network in the lab. In order for any of our automation to work, these phones need to connect to the 10.252.120.xxx network that the controller PC's in that room are on. There are of course outside connections that need to be allowed too for the automation frameworks to function, but the basic problem we first hit with this lab when we moved in still stands--the phones cannot reliably obtain the connection to the same network the controller PCs are using.

What else can we do here? How do we resolve this?
Flags: needinfo?(jbarnell)
Just to clarify, we're (Web QA + our Gaia UI Tests) back on the "ateam" node, now, as our temporary "qalab" trial was only reliable for the 1st evening; overnight, we started to witness the SSID disappearing, which -- obviously -- caused our Wi-Fi tests to fail.
Ok so this morning we met about the issue. We are trying the following things:
1 turning off the a-team ssid in the lab and just using the consumer wifi ssid "qalab" to see if that improves connectivity
2 using a-team ssid and shielding the rack so that we can have some idea of the impact of cross-talk and signal confusion with the devices and that node
3 using mozilla mobile ssid (without shielding) to see if greater wifi saturation of the network may help mitigate the issues

So far we have completed testing for scenario 1. We found that the FxOS and Eideticker phones remained on the network quite well. However, (and we see this on laptops as well) there are issues where connections made through the node have very slow data rates and sometimes connections through the node simply fail (though we don't lose the network, we just cannot seem to connect to the destination on the other side). We see these throughput issues even with a laptop running on this node, and I think it has to do with the unoptimized way this piece of hardware is jacked into the larger networking infrastructure rather than anything specific to the wifi. In short, the wifi connectivity with the consumer wifi switch inside the rack *does* seem to be quite good.
Update on the testing.

We didn't get to shield anything, as fun as that would be.

After testing with the qalab ssid running without the ateam ssid, we decided to do the reverse.

So we ran a test using the phones and the laptop using ateam ssid without the qalab node active. The phones immediately exhibited the same issues: constant scanning, unable to join the network etc. This occurred on both b2g and android phones. The laptop running the test on the ateam node ran fine, however. We are now trying running the laptop test from inside the rack to see if the rack itself is providing interference.

Since the ateam test quickly showed issues, we moved on to test the mozilla mobile network since James provided us the password for it. So far, results with the mozilla mobile network are *extremely positive*. The phones are staying online, and are connecting after every single reboot WITH NO ISSUES \o/. I cannot run a full Eideticker test with these phones though because the port 20701 for the SUTagent is not available from the Eideticker PC machines (10.252.120.76 and 10.252.120.74) to the phones. But that is something we can solve later. Instead I'm running the test using my laptop connected to Mozilla ssid and logging into the phones via the agent and rebooting them that way.  The Firefox OS phones on mozilla mobile are also quite stable. I'm going to set up the reboot test I'm running on these android phones overnight to see what happens. We are also setting up one flame device to run on 1.4 with the Mozilla Mobile network overnight as well.
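For checking from a given machine whether the SUTagent port on a phone can be reached at all, a plain TCP connect attempt is enough. The sketch below is not the actual reboot script; port_reachable is a hypothetical helper, and the 5 second timeout is an arbitrary choice.

```python
import socket

SUTAGENT_PORT = 20701  # the SUTagent port mentioned above


def port_reachable(host, port=SUTAGENT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Running `port_reachable("10.252.120.78")` from an Eideticker PC and again from a laptop on the same SSID as the phone would show whether the block is between the networks rather than on the phone itself.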

Let's check back in tomorrow and see where things are at.
Flags: needinfo?(jbarnell)
Clint asked that I update with Web QA's piece, here, so, while it's not perfect, it's definitely looking *a whole lot better*.  There is a lot more testing to be done, of course, but here's the laydown:

* we switched one B2G-9 (Flame) node to using Mozilla Mobile: https://github.com/mozilla/webqa-credentials/commit/b906adc989c497829bba60e45dacb0d4d28d995b
* I ran several iterations of both the 1.4 Wi-Fi test [1] (test_settings.py - see bug 987760 for history), as well as unittests [2], and with the exception of a couple failures (likely race conditions in Gaia and/or B2G/Gecko-land), it's looking really good.  Stable, quick tests.

[1] http://selenium.qa.mtv2.mozilla.com:8080/job/b2g.flame.mozilla-b2g30_v1_4.adhoc_test_settings_wifi/
[2] http://selenium.qa.mtv2.mozilla.com:8080/job/b2g.flame.mozilla-central.unittests/

I'm going to recommend to Web QA that we switch over to the Mozilla Mobile SSID, across the board, and do more thorough testing.  But, so far so good.
The galaxy nexus rebooted and got on the Mozilla Mobile network 15 consecutive times (no failures at all) before my reboot script burned up the phone. When I was about to leave the office I checked on the phones and found the Galaxy nexus completely fried. It won't boot, and pulling the battery had no effect. The LG is still going strong and continues to grab a connection on each reboot. I'll keep the LG running overnight.

We will likely need a new Galaxy Nexus for the Eideticker machine. Raymond, do you still have that second one?
So it seems like Mozilla Mobile is working fine. I suggest we add a simple WPA2 key to the ateam SSID and see if you can now connect properly and stay connected as the only difference is that key.
When is a good time to do that?
Ok, this sounds great; do we know why this setup is working better? It would be good to know so we can keep things working and don't inadvertently break things in the future.
So, looking at my overnight run, the LG phone rebooted 170 times and was successful in getting an IP each time. It did fail to connect. Stephen or Raymond, could one of you go into the lab and see if:
A) The LG did in fact boot but connected to the incorrect network
B) The LG did boot connected to Mozilla Mobile but got a different IP than 10.252.27.147 - since we don't have static IPs on this network, we might have just lost our DHCP lease
C) The LG did not boot and is as dead as the galaxy nexus.

Thanks.

(In reply to Arzhel Younsi [:XioNoX] from comment #30)
> So it seems like Mozilla Mobile is working fine. I suggest we add a simple
> WPA2 key to the ateam SSID and see if you can now connect properly and stay
> connected as the only difference is that key.
> When is a good time to do that?
There are more differences between the two networks than the key alone. The mozilla mobile network is far more saturated through the building, not entirely sure how that last point might be helping us, maybe James can elaborate there--that was one of his theories. Most importantly, we have not seen the SSID "disappearing" from our available list of SSID's the way the ateam ssid does.

Essentially what we've uncovered is that there is just something fundamentally broken somehow with the ateam SSID -- perhaps reviewing the differences in its configuration from mozilla mobile's configuration will help enlighten us to what Wlach is asking in comment 31 and we can see why mozilla mobile is better. Both our little consumer switch and the mozilla mobile ssids are far more reliable. I don't know that it behooves us to do any further testing with the ateam ssid until we understand why it is unreliable, and I don't understand how further client based testing on it is going to help us understand that.

It's easy to do more testing, but I want to know that we're going to get benefit from it if we do it. From what I see, the next step is to review the configuration differences between the networks.
Raymond, can you help Clint out with comment 32?  Thanks!
Flags: needinfo?(mozbugs.retornam)
Supposition here ....

1.  General Observation -- It doesn't appear that the client was making "intelligent decisions" about which open network to associate with.  This means that there was no calculation happening regarding beacon or signal strength.  The closest AP was not necessarily the AP that associated.  The only criterion was that the network was open.
2.  Passphrase -- By using a secured SSID we gave the clients some common criteria to use.  In other words: I have associated with the AP before, I have authenticated to the SSID before, and secure networks are better than open networks.  We see this behavior in most wireless devices.  Think about this when you open your laptop and immediately associate to a secure network you've used previously.
3.  Density -- We're hoping the density item is nothing and that secure network is the answer.  Density gives the devices multiple points to associate with and would mask any issues.

It seems like we've had marked improvement. Let me know what you think are the next steps.  NetOps would like to proceed with the ATEAM SSID modification.  We too want this resolved and to see that you're working and I think we're all interested in getting to the root cause.
(In reply to Clint Talbert ( :ctalbert ) from comment #32)
> So, looking at my overnight run, the LG phone rebooted 170 times and was
> successful in getting an IP each time. It did fail to connect. Stephen or
> Raymond, could one of you go into the lab and see if:
> A) The LG did in fact boot but connected to the incorrect network
> B) The LG did boot connected to Mozilla Mobile but got a different IP than
> 10.252.27.147 - since we don't have static IPs on this network, we might
> have just lost our DHCP lease
> C) The LG did not boot and is as dead as the galaxy nexus.

The LG did boot and was connected to the Mozilla Mobile network with 10.252.27.147 as its IP address when I came in. I manually restarted the phone just to double check and the results were the same. The phone booted, connected to Mozilla Mobile and kept the 10.252.27.147 IP address.

> 
> Thanks.
> 
> (In reply to Arzhel Younsi [:XioNoX] from comment #30)
> > So it seems like Mozilla Mobile is working fine. I suggest we add a simple
> > WPA2 key to the ateam SSID and see if you can now connect properly and stay
> > connected as the only difference is that key.
> > When is a good time to do that?
> There are more differences between the two networks than the key alone. The
> mozilla mobile network is far more saturated through the building, not
> entirely sure how that last point might be helping us, maybe James can
> elaborate there--that was one of his theories. Most importantly, we have not
> seen the SSID "disappearing" from our available list of SSID's the way the
> ateam ssid does.
> 
> Essentially what we've uncovered is that there is just something
> fundamentally broken somehow with the ateam SSID -- perhaps reviewing the
> differences in its configuration from mozilla mobile's configuration will
> help enlighten us to what Wlach is asking in comment 31 and we can see why
> mozilla mobile is better. Both our little consumer switch and the mozilla
> mobile ssids are far more reliable. I don't know that it behooves us to do
> any further testing with the ateam ssid until we understand why it is
> unreliable, and I don't understand how further client based testing on it is
> going to help us understand that.
> 
> It's easy to do more testing, but I want to know that we're going to get
> benefit from it if we do it. From what I see, the next step is to review the
> configuration differences between the networks.
Flags: needinfo?(mozbugs.retornam)
Raymond, thanks for checking on that. I can now ping the device once more. That was an odd blip.

Ok, after speaking with James in the office here regarding his comment, I understand more of what he and Arzhel are saying. The main difference is the density in broadcast. So putting the key on the ateam node will give the ateam ssid precedence over the competing networks, and we can see the effect of the density (the ateam ssid being broadcast locally in the lab versus being broadcast throughout the building). 

So, let's go ahead and run the test once more with a secured ateam ssid so that we can judge the effect the density is having. James, please configure the ateam ssid with the same security and a PSK that mozilla mobile has and send us the key for it in email at your earliest convenience. We will re-run the test today with that secured ateam node and see what we get. The good news here is that it only takes about 2 hours for the ateam ssid's misbehavior to become evident, so once we have the secure version of the network and the tests running, we should know pretty quickly whether it is showcasing the same issues it has previously shown.

Raymond - for the Eideticker android phones, I have a couple of requests:
* LG: Please tell it to "forget" the mozilla mobile network, and then manually log it onto the new secured ateam ssid once that is available from James.
* Nexus: Can you attempt to get the phone to boot? You may be able to get it to fastboot via down volume and power at the same time and this might provide you the ability to boot back into the phone's OS. Failing that, do you have the second galaxy nexus? I'd really like to test with the galaxy nexus since it had the most problems connecting to a*team of any of the phones.
* Once you have the phones on the newly secured ateam network, let me know and I will start running the tests from here in SF.
The password has been set on ateam and emailed to Clint, Stephen and Raymond.
(In reply to Clint Talbert ( :ctalbert ) from comment #36)
> Raymond, thanks for checking on that. I can now ping the device once more.
> That was an odd blip.
> 
> Ok, after speaking with James in the office here regarding his comment, I
> understand more of what he and Arzhel are saying. The main difference is the
> density in broadcast. So putting the key on the ateam node will give the
> ateam ssid precedence over the competing networks, and we can see the effect
> of the density (the ateam ssid being broadcast locally in the lab versus
> being broadcast throughout the building). 
> 
> So, let's go ahead and run the test once more with a secured ateam ssid so
> that we can judge the affect the density is having. James please configure
> the ateam with the same security and a PSK that mozilla mobile has and send
> us the key for it in email at your earliest convenience. And we will re-run
> the test today with that secured ateam node and see what we get. The good
> news here is that it only takes about 2 hours for the ateam ssid's
> misbehavior to become evident, so once we have the secure version of the
> network and the tests running, we should know pretty quickly whether it is
> showcasing the same issues it has previously shown.
> 
> Raymond - for the Eideticker android phones, I have a couple of requests:
> * LG: Please tell it to "forget" the mozilla mobile network, and then
> manually log it onto the new secured ateam ssid once that is available from
> James.

I cleared the Mozilla Mobile network and signed on to the ateam network using the password Arzhel emailed to us. The LG is now connected with 10.252.120.78 as its IP address. I power-cycled the phone and made sure it connected to the ateam network after restart. Clint, can you please kick off the Eideticker tests now?


> * Nexus: Can you attempt to get the phone to boot? You may be able to get it
> to fastboot via down volume and power at the same time and this might
> provide you the ability to boot back into the phone's OS. Failing that, do
> you have the second galaxy nexus? I'd really like to test with the galaxy
> nexus since it had the most problems connecting to a*team of any of the
> phones.
> * Once you have the phones on the newly secured ateam network, let me know
> and I will start running the tests from here in SF.

I'm working on the Nexus now. I'll update this bug once I have it plugged in and running.
Flags: needinfo?(ctalbert)
Ok, running real eideticker tests on the LG now. Figured that was even better than doing simple reboot tests. Ping me on IRC when you have the galaxy nexus wired back up in the lab to its eideticker machine and I'll do the same there!

Thanks
Flags: needinfo?(ctalbert)
The Galaxy Nexus is now connected to the ateam network with the IP 10.252.120.217. Again I power-cycled the phone and it came back on connected to the ateam network. Clint is kicking off tests on the Nexus now.
So, the overnight tests on Eideticker didn't do nearly as well as they did on the Mozilla Mobile network. The LG fell off the network, and is unpingable. I can see (via the HDMI capture) that the agent still thinks it is connected, but there is no wifi icon in the settings bar.

The Nexus rebooted itself 20 times and came up every time except the last. It could have either run out of charge, or it too could have lost the network.

So as far as the eideticker systems go, I would say it is definitely more stable with the PSK than previously, but it is not as stable as Mozilla Mobile was.

Stephend, how did the FXOS devices fare overnight?
Raymond, can you check on the LG and the Nexus and see what state they are in?
Flags: needinfo?(stephen.donner)
Flags: needinfo?(mozbugs.retornam)
Do we have any idea on why we're seeing different behavior on the Mozilla Mobile vs. the Ateam network? I don't think we should close this bug until we understand what's going on there.

As an aside, Raymond and I have not been able to get the new Galaxy Nexus outputting to HDMI properly, though that doesn't have anything to do with the networking situation.
(In reply to Clint Talbert ( :ctalbert ) from comment #41)
> So, the overnight tests on Eideticker didn't do nearly as well as they did
> on the Mozilla Mobile network. The LG fell off the network, and is
> unpingable. I can see (via the HDMI capture) that the agent still thinks it
> is connected, but there is no wifi icon in the settings bar.
> 
> The Nexus rebooted itself 20 times and came up every time except the last.
> It could have either ran out of charge or it too could have lost the
> network. 
> 
> So as far as the eideticker systems go, I would say it is definitely more
> stable with the PSA key than previously, but it is not as stable as Mozilla
> Mobile was.
> 
> Stephend, how did the FXOS devices fare overnight?
> Raymond, can you check on the LG and the Nexus and see what state they are
> in?

Clint, the Wi-Fi on both ateam and Mozilla Mobile seems to be remarkably stable for us, at least using 1.4.
Flags: needinfo?(stephen.donner)
Attached image b2g-1-ateam.png
ateam Wi-Fi results run on a Flame with 1.4
Wlach, I've found and connected the old Galaxy Nexus phone. It is currently assigned the IP 10.252.120.77. Please try a test run and let me know if the HDMI output is recorded.
Flags: needinfo?(mozbugs.retornam) → needinfo?(wlachance)
(In reply to raymond [:retornam] (needinfo? me) from comment #45)
> Wlach I've found  and connected the old Galaxy Nexus phone. It is currently
> assigned the IP 10.252.120.77. Please try a test run and let me know if the
> HDMI output is recorded

I'm afraid this isn't working. :( I think we should try connecting the galaxy nexus to eideticker-1 (which we know works), to make sure the problem isn't with the capture card (or its configuration).
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) from comment #46)
> (In reply to raymond [:retornam] (needinfo? me) from comment #45)
> > Wlach I've found  and connected the old Galaxy Nexus phone. It is currently
> > assigned the IP 10.252.120.77. Please try a test run and let me know if the
> > HDMI output is recorded
> 
> I'm afraid this isn't working. :( I think we should try connecting the
> galaxy nexus to eideticker-1 (which we know works), to make sure the problem
> isn't with the capture card (or its configuration).

Actually, let's track the Galaxy Nexus HDMI issues in a separate bug, since they have absolutely nothing to do with the network configuration issues. Filed bug 1023369, please send followups there.
Hi Clint, we've been trying for the last few days to keep the LG G2X on the ateam wireless network, without any luck. We've reset it twice in the last two days, but it hasn't stayed on. I was going to make sure the watcher was configured correctly so it would restart in these cases, but now it won't connect to it at all. What should our next step be? If the network is reliable enough for FirefoxOS, I'm wondering if it isn't this specific device that's the problem.
Flags: needinfo?(ctalbert)
(In reply to William Lachance (:wlach) from comment #48)
> Hi Clint, we've been trying for the last few days to keep the LG G2X on the
> ateam wireless network, without any luck. We've reset it twice in the last
> two days, but it hasn't stayed on. I was going to make sure the watcher was
> configured correctly so it would restart in these cases, but now it won't
> connect to it at all. What should our next step be? If the network is
> reliable enough for FirefoxOS, I'm wondering if it isn't this specific
> device that's the problem.

The ateam wireless network is no longer working in the lab. None of the devices we have could connect to it, and it stopped broadcasting about 5 mins ago. I was able to connect to it and run some tests yesterday on my Moto G. I haven't had any success with it today.

James, did you change anything?
Flags: needinfo?(jbarnell)
Raymond, we made no changes. The AP is up and, from what I can see, still broadcasting ateam.
Flags: needinfo?(jbarnell)
These are the results from tests using the wifi SpeedTest tool for Android [1]:

SSID               Transfer (10 MB file)   Speed
ateam              Download                 5.23 Mbit/s
ateam              Download                 4.60 Mbit/s
ateam              Download                 4.92 Mbit/s
ateam              Download                 4.60 Mbit/s
Mozilla Mobile     Download                27.76 Mbit/s
Mozilla Mobile     Download                24.52 Mbit/s
Mozilla Mobile     Download                24.58 Mbit/s
Mozilla Mobile     Download                26.13 Mbit/s
Mozilla Mobile     Download                13.50 Mbit/s
Mozilla Mobile     Download                13.08 Mbit/s

The results show that Mozilla Mobile is doing better than the ateam network at this point.

[1] https://play.google.com/store/apps/details?id=com.pzolee.android.localwifispeedtester
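
For anyone who wants to compare the two SSIDs at a glance, here is a minimal sketch that averages the download figures listed above. The numbers are copied from this comment; the variable names are only illustrative, and ten samples is far too few to call this a proper benchmark.

```python
from statistics import mean

# Download speeds in Mbit/s, copied from the results above (10 MB file).
speeds = {
    "ateam": [5.23, 4.60, 4.92, 4.60],
    "Mozilla Mobile": [27.76, 24.52, 24.58, 26.13, 13.50, 13.08],
}

# Mean download speed per SSID.
means = {ssid: mean(values) for ssid, values in speeds.items()}

for ssid, values in speeds.items():
    print(f"{ssid:15s} n={len(values)} mean={means[ssid]:.2f} "
          f"min={min(values):.2f} max={max(values):.2f} Mbit/s")

# How much faster Mozilla Mobile was on average in these runs.
ratio = means["Mozilla Mobile"] / means["ateam"]
print(f"Mozilla Mobile / ateam mean ratio: {ratio:.1f}x")
```

On these samples the means work out to roughly 4.8 Mbit/s for ateam and 21.6 Mbit/s for Mozilla Mobile, even counting the two slower Mozilla Mobile runs.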
Flags: needinfo?(ctalbert)
From everything I saw last week, I think Mozilla Mobile is more stable and faster than A*team. It's also the wifi network that the Taipei phone automation is running on. And since this automation system can (and should) span offices, I think it makes sense to streamline here. We should move the wifi devices in Mtn View to Mozilla Mobile, and remove the Ateam ssid.  Do you want me to file a separate bug to open up the proper network flows between the lab PC machines and Mozilla Mobile?
Flags: needinfo?(jbarnell)
James and I chatted briefly in SF this morning, but I was on my way to another meeting. James talked about building out more density for the ateam network and that might fix the "ssid disappearing" issues. In the moment, I was totally fine with that. But thinking on it now, why are we still going down that road? We have the density and the speed on the Moz mobile network. We have the coverage, the devices connect to it with no problem (even the galaxy nexus) why are we continuing to solve for a problem we just don't have?

Why can't we just use Mozilla Mobile?
(In reply to Clint Talbert ( :ctalbert ) from comment #53)
> James and I chatted briefly in SF this morning, but I was on my way to
> another meeting. James talked about building out more density for the ateam
> network and that might fix the "ssid disappearing" issues. In the moment, I
> was totally fine with that. But thinking on it now, why are we still going
> down that road? We have the density and the speed on the Moz mobile network.
> We have the coverage, the devices connect to it with no problem (even the
> galaxy nexus) why are we continuing to solve for a problem we just don't
> have?
> 
> Why can't we just use Mozilla Mobile?

I'm fine with switching to Mozilla Mobile because it is more stable. I submitted https://github.com/mozilla/webqa-credentials/pull/134 to switch the last remaining b2g node from ateam to Mozilla Mobile. The pull request was merged 18 minutes ago. All our b2g nodes are now on the Mozilla Mobile network.
(In reply to Clint Talbert ( :ctalbert ) from comment #53)
 
> Why can't we just use Mozilla Mobile?

+1; I think our initial concerns were:
a) performance/over-saturation (though we've since learned from James that Mozilla Mobile is not widely used, yay)
b) at the time -- in the old Mountain View office -- we had to specify, through an ACL/whitelist, the specific clients' MAC addresses to be able to use Mozilla Mobile, which is no longer the case.

So, as Raymond said - the pull was merged, and we're back 100% on Mozilla Mobile, and looking good so far.
We are going to give up on Eideticker in mtn view at this time. We can't keep testing, and it will help us to just focus on the Flame 40 device b2g buildout. We may build out Eideticker in mtn view at a later date. For now, let's focus on the FxOS use case. Since I still see no reason not to use Moz Mobile, that still seems to be the best way forward.
I plan to start broadcasting the ateam SSID everywhere in the MTV2 office tomorrow at 0900 UTC (2am PST). Please let me know if you want me to reschedule.
Comment 57 is done; please try the ateam SSID again and let us know how it goes. It should now be totally identical to Mozilla Mobile.
Great! We will try to test it out today and tomorrow. 

Is the Ateam SSID broadcasting on all the same bands that Mozilla Mobile is? I think we'd turned off 802.11a or 802.11g or something like that on Ateam at one time. I'd like the Ateam SSID to broadcast on all the same bands as Mozilla Mobile.

We'll let you know how it goes!
A (5 GHz) and BG (2.4 GHz) are both enabled, like on Mozilla Mobile.

On the other hand, N (for both 5 and 2.4 GHz) is disabled, as Linux support for that standard is very bad and can only make things worse.

Clearing the needinfo for James as I think the question above has been answered.
Flags: needinfo?(jbarnell)
(In reply to Arzhel Younsi [:XioNoX] from comment #60)
> A (5Ghz) and BG (2.4Ghz) are both enabled like on Mozilla Mobile.
> 
> On the other hand, N (for both 5 and 2.4Ghz) is disabled as Linux support
> for that standard is very bad and can only make things worse.
> 
> Clearing the needinfo for James as I think the question above has been
> answered.

Great, so the tests yesterday went really well - I did some testing with the android phones since they were the most problematic, and Raymond and Dylan were doing testing on the FxOS devices. The android phones remained on the network, and operated very well all day. This morning though, both phones claim they are connected to the network (I can see the wifi indicator claiming they are connected and they claim to have the correct IP) but I cannot ping them. Since I'm remote from them today I can't get a logcat for more information. 

Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID yesterday and this morning?
Flags: needinfo?(mozbugs.retornam)
Flags: needinfo?(dwong)
Raymond and/or Dylan, would you mind rebooting the android phones to see if that fixes their issue?
(In reply to Clint Talbert ( :ctalbert ) from comment #62)
> Raymond and/or Dylan, would you mind rebooting the android phones to see if
> that fixes their issue?

I'll reboot the devices after I leave my 9AM meeting.


(In reply to Clint Talbert ( :ctalbert ) from comment #61)
> (In reply to Arzhel Younsi [:XioNoX] from comment #60)
> > A (5Ghz) and BG (2.4Ghz) are both enabled like on Mozilla Mobile.
> > 
> > On the other hand, N (for both 5 and 2.4Ghz) is disabled as Linux support
> > for that standard is very bad and can only make things worse.
> > 
> > Clearing the needinfo for James as I think the question above has been
> > answered.
> 
> Great, so the tests yesterday went really well - I did some testing with the
> android phones since they were the most problematic, and Raymond and Dylan
> were doing testing on the FxOS devices. The android phones remained on the
> network, and operated very well all day. This morning though, both phones
> claim they are connected to the network (I can see the wifi indicator
> claiming they are connected and they claim to have the correct IP) but I
> cannot ping them. Since I'm remote from them today I can't get a logcat for
> more information. 

> 
> Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID
> yesterday and this morning?

We have 3 email tests failing consistently since last night. They are tests that make requests to Microsoft's ActiveSync  servers and they have been timing out consistently since we made the change. I'll keep monitoring the tests today to see if there is any difference.
Flags: needinfo?(mozbugs.retornam)
I've gone ahead and restarted the phones. While the Nexus was technically connected to the hotspot, the wi-fi symbol was greyed out, which meant it wasn't getting internet connectivity. The LG was not connected and had ateam disabled. After restarting, the Nexus immediately connected, while the LG booted with ateam disabled. I deleted old hotspots from its history just in case. They should both be up now.
Flags: needinfo?(dwong)
The Galaxy Nexus has fallen off the network again. Or at least I can not ping it. I am not sure if we can consider this new network to be stable. Can we investigate what happened?
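
One way to pin down when a phone drops off would be to poll it and record every change in reachability. The following is only a sketch, not something that is running in the lab: it assumes a Linux host with the iputils `ping` (`-c` count, `-W` timeout in seconds) and that the phones still hold the fixed addresses listed earlier in this bug; `ping_once` and `transitions` are hypothetical helper names.

```python
import subprocess

# Assumption: the fixed addresses from the DHCP entries quoted earlier in this bug.
DEVICES = {"galaxy-nexus": "10.252.120.77", "lg-phone": "10.252.120.78"}


def ping_once(ip, timeout_s=2):
    """Return True if a single ICMP echo request gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(samples):
    """Given [(timestamp, reachable), ...] in time order, return only the
    samples at which reachability changed from the previous sample."""
    changes = []
    previous = None
    for timestamp, reachable in samples:
        if previous is not None and reachable != previous:
            changes.append((timestamp, reachable))
        previous = reachable
    return changes
```

Calling `ping_once` for each device every minute or so and feeding the (timestamp, result) pairs to `transitions` would give drop and recovery times that could be lined up against the controller logs.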
Flags: needinfo?(mozbugs.retornam)
(In reply to William Lachance (:wlach) from comment #65)
> The Galaxy Nexus has fallen off the network again. Or at least I can not
> ping it. I am not sure if we can consider this new network to be stable. Can
> we investigate what happened?

William, I've re-connected the Nexus to the WiFi network
Flags: needinfo?(mozbugs.retornam)
(In reply to raymond [:retornam] (needinfo? me) from comment #63)
> > Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID
> > yesterday and this morning?
> 
> We have 3 email tests failing consistently since last night. They are tests
> that make requests to Microsoft's ActiveSync  servers and they have been
> timing out consistently since we made the change. I'll keep monitoring the
> tests today to see if there is any difference.

How did these do on mozilla mobile?
(In reply to Clint Talbert ( :ctalbert ) from comment #67)
> (In reply to raymond [:retornam] (needinfo? me) from comment #63)
> > > Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID
> > > yesterday and this morning?
> > 
> > We have 3 email tests failing consistently since last night. They are tests
> > that make requests to Microsoft's ActiveSync  servers and they have been
> > timing out consistently since we made the change. I'll keep monitoring the
> > tests today to see if there is any difference.
> 
> How did these do on mozilla mobile?

Sorry for the spam, I should have also asked what build they are running. It could be an issue with the FxOS build or with the network. Hopefully they are still running the same build they were on when they were on moz mobile?
(In reply to Clint Talbert ( :ctalbert ) from comment #67)
> (In reply to raymond [:retornam] (needinfo? me) from comment #63)
> > > Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID
> > > yesterday and this morning?
> > 
> > We have 3 email tests failing consistently since last night. They are tests
> > that make requests to Microsoft's ActiveSync  servers and they have been
> > timing out consistently since we made the change. I'll keep monitoring the
> > tests today to see if there is any difference.
> 
> How did these do on mozilla mobile?

They did much better on Mozilla Mobile than they are currently doing on ateam.
(In reply to Clint Talbert ( :ctalbert ) from comment #68)
> (In reply to Clint Talbert ( :ctalbert ) from comment #67)
> > (In reply to raymond [:retornam] (needinfo? me) from comment #63)
> > > > Raymond & Dylan, how did the work on the FxOS devices go on the Ateam SSID
> > > > yesterday and this morning?
> > > 
> > > We have 3 email tests failing consistently since last night. They are tests
> > > that make requests to Microsoft's ActiveSync  servers and they have been
> > > timing out consistently since we made the change. I'll keep monitoring the
> > > tests today to see if there is any difference.
> > 
> > How did these do on mozilla mobile?
> 
> Sorry for the spam, I should have also asked what build are they running. -
> It could be a build issue with FxOS or the network. Hopefully they are still
> running the same build they were on when they were on moz mobile?

We are running V121-2 base builds on our Flames and pushing different commits from Gaia on top of them, so the build is not exactly the same for each test run. After switching to ateam, the tests I mentioned have been timing out when they attempt to connect to the network. See http://selenium.qa.mtv2.mozilla.com:8080/job/b2g.flame.mozilla-aurora.ui.smoketest/39/console
All Flame devices in the lab should be v122 now.

Today I ran the test_settings_wifi test locally on repeat 30 times with v122 for both v1.4 and v2.1 on both the ateam and Mozilla Mobile APs, both outside and inside the lab.

Results:
1.4 - ateam - outside: 31 Pass
2.1 - ateam - outside: 30 Pass (My logs for this are missing)
1.4 - Mozilla Mobile - outside: 29 Pass 2 Timeout Fails
2.1 - Mozilla Mobile - outside: 31 Pass (one run of 15 was missing in logs)
1.4 - ateam - lab: 29 Pass 2 Timeout Fails
2.1 - ateam - lab: 30 Pass 1 NoSuchElementException
1.4 - Mozilla Mobile - lab: 27 Pass 3 NoSuchElementException 1 Timeout Fails
2.1 - Mozilla Mobile - lab: 30 Pass 1 Crash

So the problems may be stemming from the access point itself in the lab. The NoSuchElementExceptions were due to the test being unable to find the element that indicates the currently active WiFi network. Timeouts are 60 seconds long and occur while waiting to connect to the selected network. I've attached the logs for future use.
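
To make the lab vs. outside comparison easier to see, here is a rough sketch that totals the results above by location and by SSID. The counts are copied from the list in this comment; treating timeouts, NoSuchElementExceptions and the crash all as plain failures is my simplification, and `totals_by` is just an illustrative helper.

```python
from collections import defaultdict

# (version, SSID, location, passes, failures), copied from the results above.
runs = [
    ("1.4", "ateam", "outside", 31, 0),
    ("2.1", "ateam", "outside", 30, 0),
    ("1.4", "Mozilla Mobile", "outside", 29, 2),
    ("2.1", "Mozilla Mobile", "outside", 31, 0),
    ("1.4", "ateam", "lab", 29, 2),
    ("2.1", "ateam", "lab", 30, 1),
    ("1.4", "Mozilla Mobile", "lab", 27, 4),
    ("2.1", "Mozilla Mobile", "lab", 30, 1),
]


def totals_by(index):
    """Sum passes and failures, grouped by the tuple field at `index`."""
    grouped = defaultdict(lambda: [0, 0])
    for run in runs:
        grouped[run[index]][0] += run[3]
        grouped[run[index]][1] += run[4]
    return {key: tuple(value) for key, value in grouped.items()}


by_location = totals_by(2)
by_ssid = totals_by(1)

for label, table in (("location", by_location), ("SSID", by_ssid)):
    for key, (passed, failed) in table.items():
        rate = passed / (passed + failed)
        print(f"{label}={key}: {passed} pass, {failed} fail ({rate:.1%})")
```

Grouped this way, the lab runs account for 8 of the 10 failures, while the split between ateam and Mozilla Mobile is much less lopsided (3 vs. 7).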
Attached file 1.4 ateam lab logs (obsolete) —
Attached file 1_4_mozmobile_lab.txt (obsolete) —
Attached file wifi-test-logs.zip
Sorry for the spam. I realized it's better to just zip and attach them all at once.
Attachment #8449107 - Attachment is obsolete: true
Attachment #8449108 - Attachment is obsolete: true
Clint and I discussed this during our 1:1. Our Gaia UI Tests against Mozilla Mobile and ateam seem largely the same (in terms of bandwidth, at the least), and the problems we clearly used to have with test_settings_wifi, on both 1.4 (stable Gaia) and master/m-c Gaia, are now either resolved or due to other issues. The recommendation was therefore to close this bug and open new ones for specific, targeted issues.

To paraphrase an IRC conversation on #ateam yesterday, the takeaway for Eideticker is that we can either try to stand it up again in the MTV lab, or try anew, in Toronto.  It looks like the OS is disabling the Wi-Fi network, and it's not actually a DHCP problem, as far as we're aware.

Marking as INCOMPLETE, but please do change the status if that's the wrong resolution; thanks!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
(In reply to Stephen Donner [:stephend] from comment #74)
> Clint and I discussed this during our 1:1, and given that our Gaia UI Tests
> against Mozilla Mobile and ateam seem largely the same (in terms of
> bandwidth, at the least), both our runs of 1.4 (stable Gaia) and master/m-c
> Gaia for test_settings_wifi, which we clearly used to have problems with
> even on 1.4, are now resolved or due to other issues, the recommendation was
> to close this bug, and open new ones for specific, targeted issues.
> 
> To paraphrase an IRC conversation on #ateam yesterday, the takeaway for
> Eideticker is that we can either try to stand it up again in the MTV lab, or
> try anew, in Toronto.  It looks like the OS is disabling the Wi-Fi network,
> and it's not actually a DHCP problem, as far as we're aware.
> 
> Marking as INCOMPLETE, but please do change the status if that's the wrong
> resolution; thanks!

Yes, I'm going to try to get things working in the Toronto office, in particular paying attention to this issue. In theory the watcher should be rebooting the device if it's stuck without a network connection for long enough. We need to look into why that's happening. Easiest to do that here. You can track my progress there in bug 1023369.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard