Closed Bug 582336 Opened 11 years ago Closed 11 years ago

Frequent "No route to host" in RF room

Categories

(Infrastructure & Operations :: NetOps, task)

Hardware: ARM
OS: Maemo
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: dmoore)

References

Details

(Whiteboard: [troubleshooting mini][buildduty])

We were getting a lot of DNS lookup failures, so I added /etc/hosts entries and other hardcoded IPs in bug 579939.
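
For reference, the workaround just pins hostnames to addresses on each device; a minimal example entry, using the ftp.mozilla.org address from the wget output below (the other entries look the same):

# /etc/hosts entry so the device skips the DNS lookup entirely
10.2.74.10    ftp.mozilla.org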

That reduced the number of errors we were getting, but we're still hitting a lot of "No route to host".

e.g.

--2010-07-27 12:07:56--  http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/tracemonkey-maemo5-gtk/1280242537/fennec-2.0a1pre.en-US.linux-gnueabi-arm.tar.bz2
Resolving ftp.mozilla.org... 10.2.74.10
Connecting to ftp.mozilla.org|10.2.74.10|:80... failed: No route to host.

This is at all times of day across numerous devices.  These logs are streamed over the network to the buildbot master, so the devices are up and networked at the time of the failure.

This isn't just ftp.m.o; it's also graphs.m.o and sometimes production-mobile-master.build as well.

There were 25 such failures since midnight today; 39 yesterday 7/26; 11 on 7/25.  I can provide more info if needed.

We're currently guessing there's too much load on the wifi access points, but that's definitely a guess.

Not sure if there's another bug open.
32 more such errors since comment 0... really picked up.
Can we morph this into a more generic 'figure out what is wrong with the wireless network in the RF room' bug?  It's hard to figure out what we need to change if we don't know what the root of the problem is.

If it helps, I can set up an n900 that isn't in the production or staging pool to use in diagnostic tests.
Assignee: server-ops → dmoore
(In reply to comment #2)
> Can we morph this into a more generic 'figure out what is wrong with the
> wireless network in the RF room' bug?  It's hard to figure out what we need to
> change if we don't know what the root of the problem is.
> 
> If it helps, I can set up an n900 that isn't in the production or staging pool
> to use in diagnostic tests.

jhford: not sure how that would help, so let's skip that for now. 

dmoore: not sure how you plan to debug this. One idea might be to get a laptop/PC set up with ethernet (not wireless) in hexxor, so we can figure out if the problem is the phones or the access point or the network? 

Note: this happens ~20 times a day in production, and is causing > 90% of our mobile "red/burning" failures, so raising priority.
Severity: normal → major
(In reply to comment #3)
> (In reply to comment #2)
> > Can we morph this into a more generic 'figure out what is wrong with the
> > wireless network in the RF room' bug?  It's hard to figure out what we need to
> > change if we don't know what the root of the problem is.
> > 
> > If it helps, I can set up an n900 that isn't in the production or staging pool
> > to use in diagnostic tests.
> 
> jhford: not sure how that would help, so let's skip that for now. 

One theory is that we are overloading the access points.  Having a wireless device representative of our real production devices seems like it could be handy for running diagnostics.  I'll wait for someone to ask before I set it up, though.

> dmoore: not sure how you plan to debug this. One idea might be to get a
> laptop/PC set up with ethernet (not wireless) in hexxor, so we can figure out if
> the problem is the phones or the access point or the network? 

We have two machines in this room (mobile-image02.build.mozilla.org and nokimg.build.mozilla.org) that can be used for this purpose.  Both are physical, Linux-based machines plugged into wired ports on the RF room switch.  We don't use standard build passwords on mobile-image02, and I don't remember whether nokimg does.

Derek, I can email you the passwords for mobile-image02 if that would be useful.

I have been able to maintain a solid ssh connection to mobile-image02 for a day at a time while we have been experiencing these issues.  I'm not sure whether an already-established connection actually exercises whatever is failing (maybe the problems only show up when establishing a new connection).
 
> Note: this happens ~20 times a day in production, and is causing > 90% of our
> mobile "red/burning" failures, so raising priority.

I wonder if the underlying issue is also responsible for mid-run device drop offs (much less frequent)?
Now the network is very, very slow, which is killing my testing runs (plus I have 435 unread production test failure emails -- many of these appear to be "No route to host").

This has killed devices wgetting from stage and devices pinging graphs-stage.m.o.
As a diagnostic, can we try putting any spare access points we have into the RF room to see if that changes the situation?
Although it's probably unrelated, one of the access points in the RF room was offline (loose power cable).

Adding more access points will actually exacerbate the problem, since it's just going to lead to more RF interference. The b/g spectrum really can't tolerate more than 2 or 3 APs in close proximity. At any rate, we're currently only seeing around 30 clients per AP, which should be well within capacity.
I'm going to provision a Mac mini for this room, which we can use as a base for monitoring wireless network performance.
Sweet, thanks Derek!
For what it's worth, aggregate bandwidth coming out of that room is only around 2-3 Mbps. We're nowhere near saturated on any portion of the network.

The vast majority of the clients are connected at full-rate 54Mbps with a very nice SNR, so interference isn't a likely culprit.

We'll continue to investigate.
Duplicate of this bug: 576335
Duplicate of this bug: 578369
Duplicate of this bug: 579167
Blocks: mobile-pool
Whiteboard: [troubleshooting mini]
dmoore: did you get a machine set up in the room? Seeing whether it stays online continuously would help figure out if it's the router or the phones.
The testing platform (MacBook) is now set up. We'll be running some basic performance tests (smokeping) and a script that continuously attempts DNS resolution and simple TCP connections.
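
(Conceptually the connectivity script is along the lines of the sketch below; the hostnames, interval, and exact commands are illustrative guesses rather than the script actually deployed.)

#!/bin/sh
# Rough sketch of the connectivity prober: every 30s, try a DNS lookup and a
# plain TCP connect to port 80 for each host, logging failures with a timestamp.
# Hostnames and interval are illustrative, not the deployed values.
HOSTS="ftp.mozilla.org graphs.mozilla.org production-mobile-master.build.mozilla.org"
while true; do
    for h in $HOSTS; do
        ts=$(date '+%Y-%m-%d %H:%M:%S')
        # DNS resolution check
        nslookup "$h" >/dev/null 2>&1 || echo "$ts DNS FAIL $h" >> probe.log
        # Simple TCP connection check on port 80, 5 second timeout
        nc -z -w 5 "$h" 80 >/dev/null 2>&1 || echo "$ts TCP FAIL $h:80" >> probe.log
    done
    sleep 30
done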
Did you find/fix anything?

Over the last day and a half we had ~3 errors related to downloading bits, instead of dozens. (2x No route to host, 1x Name or service not known)

AFAIK the changes include a new proxy. Also, our 80 n810s are now only testing mozilla-1.9.2, so the load from the devices has shrunk considerably.  The 50 n900s are still active.
I've noticed 11 network-caused mobile test breakages in the last week, and of those, 10 were on the soon-to-be-phased-out n810s (one on an n900); again, way down.

We're going to poke at this a bit more, but lowering priority.
Severity: major → normal
Component: Server Operations → Server Operations: Netops
This happened 5 times since 2am today, including 2 n900 runs.
Definitely ramping up in frequency.
Any issues in the last ~15 days?
37 times since comment 18.
Joduinn wants me to paste in dates/times for debugging purposes; on my todo list.
Found a rogue device that was causing a lot of the breakage emails.

Once I pulled that device out of the results, we have had one obviously network-caused failure since Oct 8 at 2:54am (on Oct 14 at 1:38am PDT), which is a significant improvement.
I haven't seen anything over the last week that specifically looks like this on the n900s; the last issues were ~5:51pm on Oct 20.
Resolving; thanks for your help!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This is happening again with some regularity.  This causes the tree to turn red when there is a failure of this nature.  It seems like this started happening as soon as we started turning on the 40 new devices in the RF room.  These failures aren't limited to the new devices.

Is there anything I can do to make diagnosing this problem easier?  I can provide an N900 running the same firmware version if that would help.  They are fairly stock Linux devices running ssh and should have most or all standard Linux utilities.  Some of those are busybox implementations, but apt-get should work.

Bug 610617 is an instance of this failure turning the tree red.
Severity: normal → major
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Duplicate of this bug: 610617
Whiteboard: [troubleshooting mini] → [troubleshooting mini][buildduty]
My gut feeling is that we stop seeing this when enough devices go offline that we fall below a certain threshold, and when we image enough back up (or get a new order) this rears its ugly head again.

Could this be token-ring wifi connectivity rearing its ugly head?
If so, how do we solve it? A metal partition, with wifi access points on either side of the room?
If not, what could this be?

(Yeah, I hate this bug too.)
If you can, provide a list of the MACs for the devices.  I'm sure they all have the same vendor ID, so at the very least that would help us search the controller logs.  I'm not sure what data you can get from the device, but the next time it happens, please provide the following if able (a rough collection sketch follows the list):

- Timestamp
- MAC
- Device IP configuration (e.g. ifconfig)
- Device routing (e.g. netstat -nr)
- 802.11 protocol
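
Something like the following one-off, run on a device around the time of a failure, would cover most of that (a rough sketch only; it assumes the busybox image provides ifconfig and netstat, that a wireless-tools iwconfig binary is present, and that the wifi interface is wlan0 -- all guesses):

#!/bin/sh
# Rough sketch: dump the requested diagnostics into one timestamped file.
# Assumes busybox ifconfig/netstat and, optionally, wireless-tools iwconfig.
OUT=/tmp/netdiag-$(date +%Y%m%d-%H%M%S).log
{
  echo "== timestamp =="; date
  echo "== IP / MAC configuration (ifconfig) =="; ifconfig
  echo "== routing table (netstat -nr) =="; netstat -nr
  echo "== 802.11 info (iwconfig, if present) =="
  iwconfig wlan0 2>/dev/null || echo "iwconfig not available"
} > "$OUT"
echo "wrote $OUT"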
I *think* MAC addresses are available in the DHCP static IP configs for n900-NNN.build.mozilla.org, no?  That would let us give you the first two pieces of information fairly easily.

I think we run ifconfig in the job before we run into the no route to host errors, so we can probably get output from that; we can add netstat -nr to the command list too.

How would we get the 802.11 protocol?  I'd guess they're all b, but I'm not sure.
(In reply to comment #26)
> If you can provide a list of the MACs for the devices.  I'm sure they all have
> the same vendor ID so at the very least that would help us search the
> controller logs.  I'm not sure what data you can get from the device, but the
> next time it happens providing (if able):
> 
> - Timestamp
> - MAC
> - Device IP configuration (e.g. ifconfig)
> - Device routing (e.g. netstat -nr)
> - 802.11 protocol

Is all of this information OK to put in a publicly visible text file?

What do you mean by 802.11 protocol?  802.11b vs. g?
> Is all of this information OK to put in a publicly visible text file?
> 
> What do you mean by 802.11 protocol?  802.11b vs. g?

Yes: 802.11b, g, a, or n?
Will the wifi routers tell you?
By my hacky way of measuring (error emails), it seems we haven't hit this in the past week.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations