Closed Bug 759200 Opened 8 years ago Closed 8 years ago

Connection between WIFI network and physical network in hax0r is super slow

Categories

(Infrastructure & Operations :: NetOps, task)

x86_64
Linux
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: adam)

Details

Attachments

(1 file)

This bug pertains to Eideticker (see bug 748072 for context).

For some reason the connection between the physical network (that the Eideticker desktop machines are running on: 10.250.1.x) and the wireless network (that the Eideticker phones are running on: 10.250.49.x) is super slow. Ping times are in the 500ms range, and it takes more than 5 minutes to copy over a 14 meg file (needed every time we install a new build on a device for testing, a frequent occurrence).
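
To put the transfer time above in perspective, here is a rough back-of-the-envelope calculation. It assumes "14 meg" means 14 MiB and uses exactly 5 minutes, so the result is an upper bound on the real throughput (the copy actually took longer). The function name is made up for illustration.

```python
def throughput_kib_per_s(size_mib: float, seconds: float) -> float:
    """Average throughput in KiB/s for a transfer of size_mib MiB taking `seconds`."""
    return size_mib * 1024 / seconds


# 14 MiB in 5 minutes (300 s); the real transfer was slower than this.
print(round(throughput_kib_per_s(14, 300), 1))  # prints 47.8
```

That is under 50 KiB/s, far below what even a congested wireless link would normally deliver.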

Did something happen recently? It wasn't like this a week ago. This performance makes the eideticker machines basically unusable.
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
QA Contact: phong → ravi
the entire wireless network in Mountain View is experiencing high latency today - it's slightly better right now than it was this morning, but that could be partially to blame.
So a wireless device in Haxxor is trying to transfer data to/from another host also in haxxor, but attached to a switch?

Nothing has changed.
(In reply to Ravi Pina [:ravi] from comment #3)
> So a wireless device in Haxxor is trying to transfer data to/from another
> host also in haxxor, but attached to a switch?
> 
> Nothing has changed.

I'm not sure about the switch part. All I know is that the wireless devices and the physical machines are on different networks.

I don't know if anything changed specific to the configuration of the networks in haxor, but I can say that things have gotten unusable. Maybe it has to do with the wireless network issues in Mountain View that Clint mentioned.
(In reply to Clint Talbert ( :ctalbert ) from comment #2)
> the entire wireless network in Mountain View is experiencing high latency
> today - it's slightly better right now than it was this morning, but that
> could be partially to blame.

Have you filed any bugs about it?  I encourage you to refer to https://mana.mozilla.org/wiki/display/DESKTOP/Troubleshooting+Wireless before you do.

I've been in MV all day and have not observed any issues on my laptop or phone.
(In reply to Ravi Pina [:ravi] from comment #5)
> (In reply to Clint Talbert ( :ctalbert ) from comment #2)
> > the entire wireless network in Mountain View is experiencing high latency
> > today - it's slightly better right now than it was this morning, but that
> > could be partially to blame.
> 
> Have you filed any bugs about it?  I encourage you to refer to
> https://mana.mozilla.org/wiki/display/DESKTOP/Troubleshooting+Wireless
> before you do.
> 
> I've been in MV all day and have not observed any issues on my laptop or
> phone.

So this has gotten a bit better today for whatever reason, but the latency is still not quite good enough for Eideticker (ping times are anywhere between 50ms and 500ms as I write this, and it takes 30 seconds to transfer a fennec build). 

We really need to get ping times reliably below 50 ms for Eideticker to work as intended. Parts of the harness (in particular the parts we use to start/stop video capture) assume that web requests will complete in a specific amount of time, as the beginning/end of tests are keyed to certain REST requests.
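
As a minimal sketch of what "reliably below 50 ms" could mean in practice, the snippet below checks a list of round-trip samples against that budget. The sample values are hypothetical (chosen to span the 50 ms to 500 ms range reported above), and the function names are not part of the Eideticker harness.

```python
import statistics

LATENCY_BUDGET_MS = 50.0  # target stated in this comment


def within_budget(samples_ms, budget_ms=LATENCY_BUDGET_MS):
    """Return True only if there are samples and every one is below the budget."""
    return bool(samples_ms) and max(samples_ms) < budget_ms


def summarize(samples_ms):
    """Return (min, median, max) of the round-trip samples in milliseconds."""
    return min(samples_ms), statistics.median(samples_ms), max(samples_ms)


# Hypothetical samples, not measured data.
samples = [50.0, 120.0, 500.0, 75.0]
print(summarize(samples))      # prints (50.0, 97.5, 500.0)
print(within_budget(samples))  # prints False
```

Checking the maximum rather than the average is deliberate: a single slow request is enough to throw off a start/stop signal for video capture.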

I was talking a bit about this with ctalbert, and one possible solution that came to mind is installing a separate wireless router (connected to the main MV network) and running both the eideticker machines (which would be connected to it directly) and the phones off of it. Would that be possible? This would also eliminate the issue described in bug 750470 (actually, it almost looks like a similar solution is proposed there, but I'm not sure).
(In reply to William Lachance (:wlach) from comment #6)

> I was talking a bit about this with ctalbert, and one possible solution that
> came to mind is installing a separate wireless router (connected to the main
> MV network) and running both the eideticker machines (which would be
> connected to it directly) and the phones off of it. Would that be possible?

Absolutely not. Let's address the issue within our infrastructure, rather than hanging more cruft off the side.

There should be a conventional desktop SSID available inside Haxxor, and if the Eideticker machines are on the desktop VLAN, so should the mobile devices.

If the devices can't easily do 802.1x, then the right solution would be to make the Mozilla Mobile SSID available inside Haxxor.

We'd need the MAC addresses of your mobile devices to enable access.
(In reply to Zandr Milewski [:zandr] from comment #7)
> (In reply to William Lachance (:wlach) from comment #6)
> 
> > I was talking a bit about this with ctalbert, and one possible solution that
> > came to mind is installing a separate wireless router (connected to the main
> > MV network) and running both the eideticker machines (which would be
> > connected to it directly) and the phones off of it. Would that be possible?
> 
> Absolutely not. Let's address the issue within our infrastructure, rather
> than hanging more cruft off the side.
> 
> There should be a conventional desktop SSID available inside Haxxor, and if
> the Eideticker machines are on the desktop VLAN, so should the mobile
> devices.
> 
> If the devices can't easily do 802.1x, then the right solution would be to
> make the Mozilla Mobile SSID available inside Haxxor.
> 
> We'd need the MAC addresses of your mobile devices to enable access.

That's fine. I don't really care how we solve this as long as we do and quickly. The data being generated from Eideticker is being used to determine whether mobile is meeting its requirements on a day by day basis.

You can find all the MAC addresses of our mobile machines in these two bug comments:
* For Mobile startup tests: https://bugzilla.mozilla.org/show_bug.cgi?id=721482#c2
* For Eideticker tests: https://bugzilla.mozilla.org/show_bug.cgi?id=748072#c16
(In reply to Clint Talbert ( :ctalbert ) from comment #8)
> (In reply to Zandr Milewski [:zandr] from comment #7)
> > (In reply to William Lachance (:wlach) from comment #6)

> > We'd need the MAC addresses of your mobile devices to enable access.
> 
> That's fine. I don't really care how we solve this as long as we do and
> quickly. The data being generated from Eideticker is being used to determine
> whether mobile is meeting its requirements on a day by day basis.
> 
> You can find all the MAC addresses of our mobile machines in these two bug
> comments:
> * For Mobile startup tests:
> https://bugzilla.mozilla.org/show_bug.cgi?id=721482#c2
> * For Eideticker tests:
> https://bugzilla.mozilla.org/show_bug.cgi?id=748072#c16

Not to be a nag, but could we get an ETA on this? Simpler is of course better, but as I'm not 100% confident this will actually fix the issue, it would be good to know whether we're going to need to try something more complicated sooner than later.
I'm in MV tomorrow.  My fundamental concern is the network in the RF room may be being used for purposes for which it was not designed.  There have been zero issues for over a year and it seems the recent addition of whatever it is you are doing has issues.

For anyone to troubleshoot this we will need either written or in a Visio-like diagram the flows that are in question over the respective media.  I can work with someone on this if need be or just attach it to this bug.

It would also be helpful to understand how many wireless devices are in the room.  This is something I can track, but it is possible we are hitting wireless contention.  This may not be something easily solved.
Every Tegra in Haxxor has a WiFi antenna and IIRC it is active.

That would be around 225 Tegras (not sure of exact count TBH but it's more than 200 for sure.)
Here's an attempt at a diagram of the network flows. As you can see, the Android phones only require a network connection to their controlling machines, but the controlling machines need to be fully accessible inside the internal network.
(In reply to Ravi Pina [:ravi] from comment #10)
> I'm in MV tomorrow.  My fundamental concern is the network in the RF room
> may be being used for purposes for which it was not designed.  There have
> been zero issues for over a year and it seems the recent addition of
> whatever it is you are doing has issues.

Well, the Eideticker setup needs a lower latency connection than most of the mobile automation we have. Also, the tegras (as far as I know) don't actually use their wireless connections for anything, even if they're turned on.
 
> For anyone to troubleshoot this we will need either written or in a
> Visio-like diagram the flows that are in question over the respective media.
> I can work with someone on this if need be or just attach it to this bug.

Attached. Not sure if it's what you were expecting; if it isn't, let me know.
(In reply to William Lachance (:wlach) from comment #13)
> 
> Well, the Eideticker set up needs a lower latency connection than most of
> the mobile automation we have. Also, the tegras (as far as I know) don't
> actually use their wireless connections for anything, even if they're
> turned on.

The Tegras do not use the WiFi at all; I just cannot say for certain if the OS image has it turned off or if it's just idle/not-configured.
Correct. The ONLY devices in that room using the wireless network are these:
6 phones running android startup
1 mac mini serving pages for the phones doing android startup tests
3 phones running eideticker tests.

Those are all mentioned in the two links I gave out up above.

Will, one thought I just had: if it would be easier, perhaps we could solve this by placing all the static websites that the eideticker machines need onto the mac mini serving the autophone sites?  Then the phones should have a very low latency connection to those websites.

Ravi, the reason that no one has had any issues with the wifi network in the RF room is that we never actually used it in such a way that we cared about the results.  The only other system (outside of this new android stuff) that used that wifi network were the old n900's and their automation but by the time we had that automation up and running in haxxor, the mobile developers had largely moved on to android development and no longer cared at all about whether the tests that the n900's were running passed or failed.
(In reply to Clint Talbert ( :ctalbert ) from comment #15)
> Correct. The ONLY devices in that room using the wireless network are these:
> 6 phones running android startup
> 1 mac mini serving pages for the phones doing android startup tests
> 3 phones running eideticker tests.
> 
> Those are all mentioned in the two links I gave out up above.
> 
> Will, one thought I just had: if it would be easier, perhaps we could solve
> this by placing all the static websites that the eideticker machines need
> onto the mac mini serving the autophone sites?  Then the phones should have
> a very low latency connection to those websites.

I guess the mac mini is on the same wireless network as the phones? I guess that might have the same effect as putting the phones on the main office network, but it would be a bunch of extra work in the short and long term (short term effort to modify the harness to use another host for the web server, long term extra effort to keep pages in sync on the mac mini)

My vote would be to go with just putting the phones on the main office network. That seems like the least amount of work and maintenance for everyone-- we can consider other options if that doesn't work.
Assignee: network-operations → adam
Working this issue today.
SSID:Mozilla RF Room is on the releng network, vlan 500 in mtv1.  This is not a suitable network for anyone but releng.  The SSID:Mozilla is on the corp vlan, 200, but has LDAP authentication.  I don't believe you are using it.

I believe the more correct solution is to configure vlan 120, ateam, in mtv1, and create SSID:Mozilla Ateam (I already did this part), and have it along with all your wired servers be in this vlan.  This means, however, that we need to understand all the wired hosts and what access they would need outside of the Ateam vlan once they move there.
(In reply to Ravi Pina [:ravi] from comment #18)
> SSID:Mozilla RF Room is on the releng network, vlan 500 in mtv1.  This is
> not a suitable network for anyone but releng.  The SSID:Mozilla is on the
> corp vlan, 200, but has LDAP authentication.  I don't believe you are using
> it.
> 
> I believe the more correct solution is to configure vlan 120, ateam, in
> mtv1, and create SSID:Mozilla Ateam (I already did this part), and have it
> along with all your wired servers be in this vlan.  This means, however,
> that we need to understand all the wired hosts and what access they would
> need outside of the Ateam vlan once they move there.


As Clint said, we're pretty agnostic about which solution is implemented so long as it gets the job done. ;)

As for what kind of access we need for the wired hosts:

(1) They need to be accessible somehow via ssh/vnc from inside the MV VPN, so that we can maintain them. I'm not sure what the implications are of putting things inside an ateam vlan on this.
(2) There are a few external resources (github, etc.) that need to be accessible so we can keep the tests + test harnesses up to date.
We have set up the new network on VLAN 120. The IP block is 10.250.120.0/24. DHCP is available offering IPs from 10.250.120.100 - 254. Wireless is available by connecting to the "Mozilla Ateam" network, no auth is required.

Our next steps are to move the linux boxes to static IP addresses above 1 and below 100, then add the devices to the WIFI network.
FYI, currently any flows are permitted between the corp and ateam VLANs. The ateam vlan is not presently able to reach the Internet, but that can be arranged if required.
(In reply to Adam Newman [:adam] from comment #21)
> FYI, currently any flows are permitted between the corp and ateam VLANs. The
> ateam vlan is not presently able to reach the Internet, but that can be
> arranged if required.

Please do. As mentioned above, we rely on an external internet connection for a number of things. Thanks!
Can you enumerate what ports you need and which hosts need them?  It would be nice if we could restrict the access to only the devices that need it.
(In reply to Ravi Pina [:ravi] from comment #23)
> Can you enumerate what ports you need and which hosts need them?  It would
> be nice if we could restrict the access to only the devices that need it.

Only the eideticker desktop machines need external access to the internet. I'm pretty sure that HTTP, HTTPS, and GIT should be the only ports they need to access the external resources they need.
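
For anyone wanting to verify outbound access once it is enabled, here is a rough sketch of a reachability check from one of the desktop machines. It assumes the default ports for the three protocols (80, 443, and 9418 for the git protocol); the function names and the injectable `connect` parameter are illustrative only.

```python
import socket

# Default ports for the protocols named above (assumed, not confirmed in this bug).
REQUIRED_PORTS = {"http": 80, "https": 443, "git": 9418}


def can_connect(host, port, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = connect((host, port), timeout)
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
    conn.close()
    return True


def check_outbound(host, ports=REQUIRED_PORTS, connect=socket.create_connection):
    """Map each protocol name to whether its port is reachable on host."""
    return {name: can_connect(host, port, connect=connect)
            for name, port in ports.items()}
```

Calling `check_outbound("github.com")` from an Eideticker desktop machine would then show which of the three ports get through the ateam VLAN. Not every external host listens on all three, so a `False` entry is only meaningful for a service the host is known to offer.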
Wow, thanks, guys.  I'm attaching bmoss to this thread so he can help with the physical rebooting of the boxes in haxxor while I'm on vacation (after today).
adam: ravi:
if you haven't already, could you also set the vlan assignments for the following ports to vlan120.  These 3 machines are wired to those ports which are currently assigned the corp vlan.  Also, you might need to trunk vlan120 to that switch since I don't believe it is currently.

sw2.df202-2.ops:43 (ateam-eideticker-1)
sw2.df202-2.ops:44 (ateam-eideticker-2)
sw2.df202-2.ops:45 (ateam-eideticker-3)
A couple things need to happen before adjusting the phones and boxes.

First, netops (probably Adam) will need to trunk vlan120 to the switch that the eideticker linux boxes are wired to and change the vlan port assignments for those 3 ports. (see comment 26)

After that, someone (probably me) will need to move the dhcp static entries from the corp subnet to the ateam subnet and change the ip addresses to be in line with the ateam ip block.  All of these devices have dhcp and dns entries, so it's a copy-and-paste with an IP change, which should be simple and quick.

When this is done, we can physically configure the phones and mac mini to connect to the Ateam SSID.

Below are the proposed IP changes:

ateam-mstartup      10.250.50.162  -> 10.250.120.62
droid-proa          10.250.50.163  -> 10.250.120.63
samsung-gs2a        10.250.50.164  -> 10.250.120.64
nexus-onea          10.250.50.165  -> 10.250.120.65
nexus-sa            10.250.50.166  -> 10.250.120.66
samsung-gs2b        10.250.50.167  -> 10.250.120.67
nexus-sb            10.250.50.168  -> 10.250.120.68

ateam-eideticker-galnex-1        10.250.1.179  -> 10.250.120.71
ateam-eideticker-g2x-1           10.250.1.180  -> 10.250.120.72
ateam-eideticker-g2x-2           10.250.1.189  -> 10.250.120.73
ateam-eideticker-1               10.250.1.191  -> 10.250.120.74
ateam-eideticker-2               10.250.1.212  -> 10.250.120.75
ateam-eideticker-3               10.250.1.252  -> 10.250.120.76
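
As a sanity check on the table above, the sketch below verifies that each proposed address sits inside 10.250.120.0/24, falls in the static range described earlier in this bug (above .1 and below .100, which also keeps it out of the DHCP pool starting at .100), and is not assigned twice. The helper name `check_plan` is made up for illustration.

```python
import ipaddress

ATEAM_NET = ipaddress.ip_network("10.250.120.0/24")
# Static addresses are meant to be above .1 and below .100.
STATIC_FIRST = ipaddress.ip_address("10.250.120.2")
STATIC_LAST = ipaddress.ip_address("10.250.120.99")

PROPOSED = {
    "ateam-mstartup": "10.250.120.62",
    "droid-proa": "10.250.120.63",
    "samsung-gs2a": "10.250.120.64",
    "nexus-onea": "10.250.120.65",
    "nexus-sa": "10.250.120.66",
    "samsung-gs2b": "10.250.120.67",
    "nexus-sb": "10.250.120.68",
    "ateam-eideticker-galnex-1": "10.250.120.71",
    "ateam-eideticker-g2x-1": "10.250.120.72",
    "ateam-eideticker-g2x-2": "10.250.120.73",
    "ateam-eideticker-1": "10.250.120.74",
    "ateam-eideticker-2": "10.250.120.75",
    "ateam-eideticker-3": "10.250.120.76",
}


def check_plan(plan):
    """Return a list of problems found in a hostname -> new-IP mapping."""
    problems = []
    seen = {}
    for host, addr_text in plan.items():
        addr = ipaddress.ip_address(addr_text)
        if addr not in ATEAM_NET:
            problems.append(f"{host}: {addr} is outside {ATEAM_NET}")
        if not (STATIC_FIRST <= addr <= STATIC_LAST):
            problems.append(f"{host}: {addr} is outside the static range")
        if addr in seen:
            problems.append(f"{host}: {addr} already assigned to {seen[addr]}")
        seen[addr] = host
    return problems


print(check_plan(PROPOSED))  # prints []
```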
After juggling things around we got the following throughput:

http://www.speedtest.net/result/1996690452.png

We'll move the ports over Friday.
The trunk configuration is all set and tested. I moved over an unused port on sw2.202-2 in haxxor and we should be good to go.
I've changed the dns for all the devices listed in comment 27 to properly represent that they are on the ateam vlan.

*.ateam.mtv1.mozilla.com

eg. ateam-eideticker-galnex-1.ateam.mtv1.mozilla.com
We've made a significant number of improvements to the wireless controller configurations after working with the vendor as well as the core infrastructure.

Can you provide an update on the link status?
[12:15:33] <adam> what's up with 759200.
[12:16:52] <dividehex> ive got a meeting with ctalbert in 45mins.  I'll verify with him. otherwise i'm sure it is complete
[12:17:11] <adam> excellent.
[12:17:13] <adam> I'll ping back.
[12:17:51] <adam> making a note in the bug.
Verified good. resolving.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thanks for all your help here guys!!!
Product: mozilla.org → Infrastructure & Operations