Closed Bug 582336 Opened 11 years ago Closed 11 years ago
Frequent "No route to host" in RF room
We were getting a lot of DNS lookup issues, so I added /etc/hosts entries and other hardcoded IPs in bug 579939. That reduced the number of errors we were getting, but we're still hitting a lot of "No route to host" failures, e.g.:

--2010-07-27 12:07:56--  http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/tracemonkey-maemo5-gtk/1280242537/fennec-2.0a1pre.en-US.linux-gnueabi-arm.tar.bz2
Resolving ftp.mozilla.org... 10.2.74.10
Connecting to ftp.mozilla.org|10.2.74.10|:80... failed: No route to host.

This happens at all times of day across numerous devices. These logs are streamed over the network to the buildbot master, so the devices are up and networked at the time of the failure. It isn't just ftp.m.o; it's also graphs.m.o and sometimes production-mobile-master.build as well. There were 25 such failures since midnight today, 39 yesterday (7/26), and 11 on 7/25. I can provide more info if needed.

We're currently guessing there's too much load on the wifi access points, but that's definitely a guess. Not sure if there's another bug open.
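As an aside, the per-day counts here come from scanning the error emails by hand; a sketch of how they could be tallied from the wget output in the buildbot logs (the log-line format is taken from the excerpt above; everything else is an assumption, not the actual setup):

```python
# Hypothetical tally of "No route to host" failures per day, keyed off
# the date in wget's "--YYYY-MM-DD HH:MM:SS--" header lines. The log
# format matches the excerpt above; nothing else here is from the
# actual buildbot configuration.
import re
from collections import Counter

def count_failures(lines):
    """Count 'No route to host' lines, grouped by the most recent wget date."""
    counts = Counter()
    date = None
    for line in lines:
        m = re.match(r"--(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}--", line)
        if m:
            date = m.group(1)
        elif "No route to host" in line and date:
            counts[date] += 1
    return counts
```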
32 more such errors since comment 0... this has really picked up.
Can we morph this into a more generic "figure out what is wrong with the wireless network in the RF room" bug? It's hard to figure out what we need to change if we don't know the root of the problem.

If it helps, I can set up an N900 that isn't in the production or staging pool to use in diagnostic tests.
(In reply to comment #2)
> Can we morph this into a more generic 'figure out what is wrong with the
> wireless network in the rf room' bug? Its hard to figure out what we need to
> change if we don't know what the root of the problem is.
>
> If it helps, I can set up an n900 that isn't in the production or staging pool
> to use in diagnostic tests.

jhford: not sure how that would help, so let's skip that for now.

dmoore: not sure how you plan to debug this. One idea might be to get a laptop/PC set up with ethernet (not wireless) in hexxor, so we can figure out whether the problem is the phones, the access point, or the network?

Note: this happens ~20 times a day in production and is causing > 90% of our mobile "red/burning" failures, so raising priority.
Severity: normal → major
(In reply to comment #3)
> jhford: not sure how that would help, so lets skip that for now.

One theory is that we are overloading the access points. Having a wireless device that is an example of our real production devices seems like it could be handy for running diagnostics. I'll wait for someone to ask me to set it up before I do, though.

> dmoore: not sure how you plan to debug this. One idea might be to get a
> laptop/pc setup with ethernet (not wireless) in hexxor, so we can figure out if
> the problem is the phones or the access point or the network?

We have two machines in this room (mobile-image02.build.mozilla.org and nokimg.build.mozilla.org) that can be used for this purpose. Both are physical, Linux-based machines plugged into wired ports in the RF room switch. We don't use standard build passwords on mobile-image02, and I don't remember about nokimg. Derek, I can email you the passwords for mobile-image02 if that's useful.

I have been able to maintain a solid ssh connection to mobile-image02 for a day at a time while we have been experiencing these issues. I'm not sure whether that exercises whatever triggers these issues (maybe they only occur while trying to establish a new connection).

> Note: this happens ~20 times a day in production, and is causing > 90% of our
> mobile "red/burning" failures, so raising priority.

I wonder if the underlying issue is also responsible for the (much less frequent) mid-run device drop-offs?
Now the network is very, very slow, killing my test runs (plus I have 435 production test-failure emails unread; many of these appear to be "no route to host"). This has killed devices wgetting from stage and devices pinging graphs-stage.m.o.
As a diagnostic, can we try putting any spare access points we have into the RF room to see if that changes the situation?
Although it's probably unrelated, one of the access points in the RF room was offline (loose power cable). Adding more access points will actually exacerbate the problem, since it's just going to lead to more RF interference. The b/g spectrum really can't tolerate more than 2 or 3 APs in close proximity. At any rate, we're currently only seeing around 30 clients per AP, which should be well within capacity.
I'm going to provision a mac mini for this room which we can use as a base for monitoring the wireless network performance.
Sweet, thanks Derek!
For what it's worth, aggregate bandwidth coming out of that room is only around 2-3 Mbps. We're nowhere near saturated on any portion of the network. The vast majority of the clients are connected at full-rate 54Mbps with a very nice SNR, so interference isn't a likely culprit. We'll continue to investigate.
dmoore: did you get a machine set up in the room to check whether it stays online continuously? That would help figure out if the problem is the router or the phones.
The testing platform (a MacBook) is now set up. We'll be running some basic performance tests (smokeping) and a script that continuously attempts DNS resolution and simple TCP connections.
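For reference, a minimal sketch of what such a probe could look like (this is an assumption about the script's shape, not the actual script deployed on the MacBook; host, port, and timeout are illustrative):

```python
# Hedged sketch of a connectivity probe: resolve a hostname, then try a
# plain TCP connection, and report which stage failed. Not the actual
# monitoring script running in the RF room.
import socket

def probe(host, port, timeout=5.0):
    """Return 'ok', 'dns-fail', or 'connect-fail' for one probe attempt."""
    try:
        addr = socket.gethostbyname(host)          # DNS resolution step
    except socket.gaierror:
        return "dns-fail"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return "ok"                            # TCP handshake succeeded
    except OSError:                                # e.g. "No route to host"
        return "connect-fail"
```

Run in a loop with a sleep and a timestamped log line per failure, this gives exactly the kind of record (time, stage, target) that makes correlating failures against AP controller logs easier.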
Did you find/fix anything? Over the last day and a half we had ~3 errors related to downloading bits, instead of dozens (2x "No route to host", 1x "Name or service not known"). AFAIK the changes include a new proxy. Also, our 80 N810s are now only testing mozilla-1.9.2, meaning the load from the devices has shrunk considerably. The 50 N900s are still active.
I've noticed 11 network-caused mobile test breakages in the last week, and of those, 10 were on the soon-to-be-phased-out n810s (one on an n900); again, way down. We're going to poke at this a bit more, but lowering priority.
Severity: major → normal
Component: Server Operations → Server Operations: Netops
This happened 5 times since 2am today, including 2 n900 runs. Definitely ramping up in frequency.
Any issues in the last ~15 days?
37 times since comment 18. Joduinn wants me to paste in dates/times for debugging purposes; on my todo list.
Found a rogue device that was causing a lot of the breakage emails. Once I pulled that device out of the results, we have had 1 obviously network-caused failure since Oct 8 at 2:54am (it was on Oct 14 at 1:38am PDT), which is a significant improvement.
I haven't seen anything over the last week that specifically looks like this on the n900s; the last issues were ~5:51pm on Oct 20. Resolving; thanks for your help!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This is happening again with some regularity, and it causes the tree to turn red when a failure of this nature occurs. It seems like this started happening as soon as we started turning on the 40 new devices in the RF room, though the failures aren't limited to the new devices.

Is there anything I can do to make diagnosing this problem easier? I can provide an N900 running the same firmware version if that would help. They are fairly stock Linux devices running ssh and should have most or all standard Linux utilities; some are implemented in busybox, but apt-get should work.

Bug 610617 is an instance of this failure turning the tree red.
Severity: normal → major
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [troubleshooting mini] → [troubleshooting mini][buildduty]
My gut feeling is that we stop seeing this when enough devices go offline that we fall below a certain threshold, and when we image enough back up (or get a new order) this rears its ugly head again. Could this be token-ring-style wifi contention, with too many clients competing for airtime? If so, how do we solve it? A metal partition, with wifi access points on either side of the room? If not, what could this be? (Yeah, I hate this bug too.)
If you can provide a list of the MACs for the devices, that would help; I'm sure they all have the same vendor ID, so at the very least that would let us search the controller logs. I'm not sure what data you can get from the device, but the next time it happens, please provide (if able):

- Timestamp
- MAC
- Device IP configuration (e.g. ifconfig)
- Device routing (e.g. netstat -nr)
- 802.11 protocol
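For illustration, a sketch of an on-device collector for those fields (modern Python for readability; the interface name wlan0 is an assumption, and in practice a 2010-era N900 would more likely run the equivalent shell commands directly):

```python
# Hypothetical collector for the diagnostic fields requested above.
# It shells out to the same commands named in the comment (ifconfig,
# netstat -nr); capturing the 802.11 protocol would still need the
# wireless tools on the device or the AP controller's view.
import subprocess
import time

def collect_diagnostics(iface="wlan0"):
    """Gather a timestamp plus interface and routing-table output."""
    report = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
    for name, cmd in [("ifconfig", ["ifconfig", iface]),
                      ("routes", ["netstat", "-nr"])]:
        try:
            report[name] = subprocess.run(
                cmd, capture_output=True, text=True, timeout=10).stdout
        except (OSError, subprocess.TimeoutExpired):
            report[name] = "(command unavailable)"
    return report
```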
I *think* the MAC addresses are available in the DHCP static IP configs for n900-NNN.build.mozilla.org, no? That would let us give you the first two pieces of information fairly easily. I think we run ifconfig in the job before we hit the "no route to host" errors, so we can probably get output from that; we can add netstat -nr to the command list too. How would we get the 802.11 protocol? I'd guess they're all B, but I'm not sure.
(In reply to comment #26)
> If you can provide a list of the MACs for the devices. I'm sure they all have
> the same vendor ID so at the very least that would help us search the
> controller logs.

Is all of this information OK to put in a publicly visible text file?

What do you mean by 802.11 protocol? 802.11B vs G?
> is all of this information ok to put in a publicly visible text file?
>
> What do you mean by 802.11 protocol? 802.11B vs G?

Yes, that's fine. And I mean: is it 802.11b, g, a, or n?
Will the wifi routers tell you?
By my hacky way of measuring (error emails), it seems we haven't hit this in the past week.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations