Closed Bug 471679 Opened 11 years ago Closed 11 years ago

Many/most build network hosts are unreachable

(mozilla.org Graveyard :: Server Operations, task, P1, blocker)

RESOLVED FIXED

(Reporter: gozer, Assigned: justdave)

(Whiteboard: post-mortem in comment 29)

I got notified this morning by nagios that the following hosts were down:

 - tb-linux-tbox.build 
 - tbnewref-win32-tbox.build

These are still down, as far as I can tell. As they are both nightly builders for Thunderbird, getting them back up and running is somewhat important.

The only suspicion I have is that it might be work on bug 460094, but I don't think it is, as I haven't given the go ahead on this.

Thanks.
It's not just Thunderbird hosts, it's everything. nagios is currently spamming us with DOWN: CRITICAL - Host Unreachable messages.
Severity: major → blocker
Priority: -- → P1
Summary: Hosts down: tb-linux-tbox.build and tbnewref-win32-tbox.build → Many/most build network hosts are unreachable
production-master and staging-master VMs are down. This closes all Firefox development.
The DHCP server on bm-admin01 stopped responding at some point during the night.  Hosts started dropping as their leases expired.  The default lease time is 7 days, though, so for this much stuff to be dropping, it's probably been broken longer than that.  We did discover a faulty entry for fx64-linux-tbox that was causing the DHCP server not to start when we sanity-checked everything after last night's upgrade.
I can't find any record of fx64-linux-tbox in any bugs, so I don't know when it was set up or by whom.
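(For reference, dhcpd's lease times are set in dhcpd.conf; a 7-day default would look something like the sketch below. The values are assumed for illustration, not copied from bm-admin01's actual config.)

    # dhcpd.conf (illustrative only)
    default-lease-time 604800;   # 7 days, in seconds
    max-lease-time     604800;   # longest lease a client may request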
Mentioned on IRC but forgot to say so here: hosts should come back on their own as soon as they next re-attempt DHCP.  The ESX boxes apparently have a long retry interval, so we're working on getting them to renew manually right now.  Two of them (bm-vmware03 and bm-vmware11) are still off the net as I type; we should have those up soon.
Assignee: server-ops → justdave
From the looks of things in nagios, I'm going to say that all the Windows and Mac boxes came back on their own; it looks like most of the Linux boxes haven't.
All of the ESX boxes should be back on the net now except bm-vmware11 to the best of our knowledge.
tb-linux-tbox.build appears to still be unreachable.
tb-linux-tbox.build is back in business, thanks!
I suspect this may have caused bug 471712
 (In reply to comment #10)
> I suspect this may have caused bug 471712

I would also suspect the same, especially as bug 471712 is now OK, having recovered at about the same time we brought most of the build machines back online.
As best as we can tell, build machines are now back up again. We're keeping the tree closed for a little longer, to let machines cycle back from burning to green, before reopening it.

If you see a specific problem with any build machine, please update this bug.


PS: the try server and try slaves were not impacted by this.
Did this affect MDC at all? There seems to be a problem with some pages on MDC for the past 2 or so days.
From the outside, it's tough for me to tell whether crazyhorse, the Tb Linux box, is down because of this or by coincidence. Filed bug 471739 for it.
Sometime between 16:03 and 16:09, bm-xserve18.build.mozilla.org failed out with:

...
pulling from http://hg.mozilla.org/mozilla-central
NEXT ERROR abort: error: Temporary failure in name resolution
...

Is this related to the DHCP problem or something else?
I've reopened the tree, as almost all machines seem to be OK now. Let's keep this bug open a little longer to watch, just in case.

It would also be good to know whether the bm-xserve18 error in comment #15 is related to this DHCP bug or not.
(In reply to comment #13)
> Did this affect MDC at all? There seems to be a problem with some pages on MDC
> for the past 2 or so days.

No, that would be a separate issue (and I think it was resolved a few hours ago based on reports on IRC)

(In reply to comment #16)
> It would also be good to know if bm-xserve18 error in comment#15 is related to
> this DHCP bug or not.

Is it still happening, or has it recovered since then?  If it's an older error message from this morning and you're just trying to figure out what caused it, then yes, the DHCP issues could easily cause that error, so it's quite likely.
(In reply to comment #17)
> (In reply to comment #13)
> > Did this affect MDC at all? There seems to be a problem with some pages on MDC
> > for the past 2 or so days.
> 
> No, that would be a separate issue (and I think it was resolved a few hours ago
> based on reports on IRC)

Nope, still broken for me when I try to visit https://developer.mozilla.org/En/XPConnect or https://developer.mozilla.org/en/CSS/-moz-margin-end just to mention a couple. Should I file a new bug?
(In reply to comment #18)
> Nope, still broken for me when I try to visit
> https://developer.mozilla.org/En/XPConnect or
> https://developer.mozilla.org/en/CSS/-moz-margin-end just to mention a couple.
> Should I file a new bug?

Yes, it's not related to this at all.
bm-xserve18, though, I think is related - it was still failing at 4pm. That would be the burning "OS X 10.5.2 mozilla-central unit test %" in the middle of the Firefox trunk tinderbox, which is going to get in my way when I try to back out the patch that's probably failing the tests on the Linux and Windows boxes (and failed once on Mac). Could you give it a thumping, and tell it to try harder to resolve hg.mozilla.org?
I can ssh to it, and DNS seems to work fine.

bm-xserve18:~ root# scutil --dns
DNS configuration

resolver #1
  domain : build.mozilla.org
  nameserver[0] : 10.2.74.125
  nameserver[1] : 10.2.74.127
  order   : 200000
(...)
bm-xserve18:~ root# host hg.mozilla.org
hg.mozilla.org has address 10.2.74.66
hg.mozilla.org has address 10.2.74.67
Well, is there an IT step for "disable a busted slave until RelEng beats some sense into it"? (Or, my favorite, say "you're no better than Windows!" and reboot it.) bm-xserve19 is doing just fine when 18 gives it a chance to get at a task: after you sshed in, I pushed a backout and another changeset to increase my odds of getting one good build. While 19 got one unit test task that it's doing fine on, 18 got both the build task, which it failed, and then the other unit test task, which it also failed. While "host" knows how to resolve things, some layer between that and "hg clone" is hanging onto its memory of not being able to resolve host names, and refusing to even think about trying again.
Yeah, the IT instructions for xserves, under "unusual problems" not listed elsewhere on the page, say "log in as cltbld, stop the slave, and notify releng to investigate."

bm-xserve18:~ cltbld$ buildbot stop /builds/moz2_slave 
buildbot process 12391 is dead

Have at it.
I had a look at bm-xserve18. hg behaved itself doing incoming and pull in one of the mozilla-1.9.1 dirs (m-c had all been clobbered by someone), so I've restarted the buildbot slave.
I think all this is probably due to the following, as posted by Reed Loden on 2008-12-30 18:34:23 -0600 on several Mozilla NGs including mozilla.dev.planning:

> *Mozilla Scheduled Downtime - 12/30/2008, 7pm - 11pm PST (0300 - 0700
> 12/31/2008 UTC)*
>
> We’ll be taking advantage of the continued holiday lull and performing
> scheduled maintenance tonight from 7:00pm to 11:00pm PST. The following
> changes will take place:
> 
> * 7:00pm PST (0300 UTC) duration 1 hour: Firmware upgrade.  We’ll be
> upgrading the firmware of the machine that hosts tinderbox.mozilla.org
> and bonsai.mozilla.org.
> 
> * 8:00pm PST (0400 UTC) duration 3 hours: RHEL5 upgrades.  We’ll be
> doing some more RHEL5 upgrades on several machines. This will
> internally affect DHCP and DNS throughout the MPT network.
> 
> * 9:00pm PST (0500 UTC) crash-stats.mozilla.com database maintenance.
> We’ll be copying the entire Breakpad database in order to set up a
> replication environment. No downtime is expected, however degraded
> performance is expected for several days while the database is copied.
> 
> Please let me know if you have any reason why we should not proceed
> with this planned maintenance. As always, we aim to keep downtime to
> as little as possible, but unexpected complications can arise causing
> longer downtime periods than expected. All systems should be
> operational by the end of the maintenance window.
> 
> Feel free to let me know if you see issues past the planned downtime.

Reed, what do you think?
(In reply to comment #25)
> I think all this is probably due to the following, as posted by Reed Loden on
> 2008-12-30 18:34:23 -0600 on several Mozilla NGs including
> mozilla.dev.planning:
...
> > * 8:00pm PST (0400 UTC) duration 3 hours: RHEL5 upgrades.  We’ll be
> > doing some more RHEL5 upgrades on several machines. This will
> > internally affect DHCP and DNS throughout the MPT network.
...
> Reed, what do you think?

Yes, the RHEL5 upgrade on bm-admin01 was the catalyst for this issue, though not the direct cause; comment #3 explains what really happened.

Everything's basically recovered, as far as I know, so marking this fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
karma needed a |/etc/init.d/network restart| to get 1.8 l10n back
(In reply to comment #27)
> karma needed a |/etc/init.d/network restart| to get 1.8 l10n back

yeah, and see also bug 471770.
Depends on: 471815
I started to post the full analysis of what happened in here last night, but I apparently didn't hit the Commit button and have since lost it... so here goes again.

1. At some point in recent history (at least since the last time bm-admin01 was rebooted before this event), someone added a second hardware line to the host block for fx64-linux-tbox.  This is not legal: the dhcpd.conf syntax only allows one hardware line per host (that line maps an ethernet address to the host).  However, dhcpd never got restarted, so it never picked up this config change.  When we brought the machine back up after the upgrade, dhcpd failed to start, citing this syntax error.  I still haven't figured out who made the change or when, as there doesn't appear to be a bug on file requesting it.
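To illustrate (with made-up MAC addresses, not the real ones), the broken host block would have looked something like this:

    # Invalid: dhcpd allows only one "hardware" statement per host block,
    # so dhcpd refuses to start when it parses this.
    host fx64-linux-tbox {
        hardware ethernet 00:16:3e:00:00:01;   # hypothetical address
        hardware ethernet 00:16:3e:00:00:02;   # the illegal second line
    }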

2. An hour or so after the upgrade, a few build machines started dropping off the net. This immediately looked like a DHCP failure, so it was investigated; the above syntax error was discovered and corrected, and dhcpd was brought back up.  The hosts we had noticed dropping off the net all came back within a few minutes.

3. After this, it was discovered that vmware-tools on bm-admin01 hadn't survived the upgrade (the machine is a VM, and it was performing horribly with vmware-tools not operating).  vmware-tools was then upgraded to the RHEL5 version.  The act of upgrading vmware-tools took dhcpd offline again, because the interface it was listening on effectively went away (vmware-tools changed the driver responsible for the network interface).

4. Not long after this, a few build hosts started dropping off the net again.  dhcpd was checked on, and the daemon was still running.  It was suggested that the alerts were an error in the nagios configuration (nagios also runs on that box, and also had some issues getting through the RHEL5 upgrade).  We did not discover at this time that although dhcpd was running, it wasn't actually working.  Bug 471828 has been filed as an action item to prevent this from being an issue again (get nagios to monitor the dhcpd servers from the network).
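(For the record, the stock Nagios plugins ship a check_dhcp, which broadcasts a DHCPDISCOVER and alerts if no offer comes back, so it catches a daemon that's running but not answering.  A minimal sketch of such a check, with hypothetical template and host names, not the actual setup being worked out in bug 471828:)

    define command {
        command_name  check_dhcp
        command_line  $USER1$/check_dhcp -s $HOSTADDRESS$
    }

    define service {
        use                  generic-service    ; hypothetical template name
        host_name            bm-admin01
        service_description  DHCP
        check_command        check_dhcp
    }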

5. This happened in that narrow stretch of time after the West Coast US goes to sleep and before the Europeans wake up.  Reed and I, who are both east/central US (and thus halfway between those two timezones, and should have been long past sleeping already), were very tired, and each of us apparently thought the other was working on what we thought was the nagios issue, so we both went to bed.  We were woken up a few hours later by pages complaining about the build network being down.  Recovery efforts went from there.

The dhcpd server that we use supposedly supports an active/standby failover configuration, where you run multiple dhcpd servers and one takes over when the other goes down.  If we'd had this in place, all of this could have been avoided, as we could have taken bm-admin01 down for the upgrade without taking DHCP completely offline with it.  I've filed bug 471830 to track investigating getting this set up for the future.
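(For anyone following along, ISC dhcpd's failover is configured with a "failover peer" block in dhcpd.conf, and each dynamic pool opts into it.  A rough sketch; every name, address, and timer below is a hypothetical placeholder, not a proposed config - bug 471830 tracks the real work:)

    # On the primary server:
    failover peer "build-dhcp" {
        primary;
        address 10.2.74.1;         # hypothetical primary address
        port 647;
        peer address 10.2.74.2;    # hypothetical standby address
        peer port 647;
        mclt 3600;                 # max client lead time, seconds
        split 128;                 # share the address space evenly
        max-response-delay 60;
        max-unacked-updates 10;
    }

    # Each dynamic pool then references the peer:
    pool {
        failover peer "build-dhcp";
        range 10.2.74.100 10.2.74.200;   # hypothetical range
    }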
Whiteboard: post-mortem in comment 29
Product: mozilla.org → mozilla.org Graveyard