Closed Bug 471830 Opened 16 years ago Closed 16 years ago

setup active/standby DHCP for build

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

All
Other

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justdave, Assigned: justdave)

Details

Another action item from bug 471679. ISC dhcpd3, which we run, supposedly has the capability to set up two dhcpd servers on the same LAN, which run in an active/standby failover configuration. If we had had this set up, all of the damage from bug 471679 could have been avoided. It does require additional hardware and/or VMs in place, and will likely cause configuration to be more complicated, so we'll need to figure out if this is worth the effort. It's probably not, in the offices, but I suspect we probably want to try to set this up in the colos. This also allows us to take one of them down for an extended period of time without taking the network with it, since the other will be there as a backup.
Assignee: server-ops → nobody
Component: Server Operations → Server Operations: Projects
The description of how this all works is in the dhcpd.conf man page. Search for "DHCP FAILOVER" within that man page.
DHCP FAILOVER

This version of the ISC DHCP server supports the DHCP failover protocol as documented in draft-ietf-dhc-failover-07.txt. This is not a final protocol document, and we have not done interoperability testing with other vendors’ implementations of this protocol, so you must not assume that this implementation conforms to the standard. If you wish to use the failover protocol, make sure that both failover peers are running the same version of the ISC DHCP server.

The failover protocol allows two DHCP servers (and no more than two) to share a common address pool. Each server will have about half of the available IP addresses in the pool at any given time for allocation. If one server fails, the other server will continue to renew leases out of the pool, and will allocate new addresses out of the roughly half of available addresses that it had when communications with the other server were lost.

It is possible during a prolonged failure to tell the remaining server that the other server is down, in which case the remaining server will (over time) reclaim all the addresses the other server had available for allocation, and begin to reuse them. This is called putting the server into the PARTNER-DOWN state.

You can put the server into the PARTNER-DOWN state either by using the omshell(1) command or by stopping the server, editing the last peer state declaration in the lease file, and restarting the server. If you use this last method, be sure to leave the date and time of the start of the state blank:

    failover peer name state {
      my state partner-down;
      peer state state at date;
    }

When the other server comes back online, it should automatically detect that it has been offline and request a complete update from the server that was running in the PARTNER-DOWN state, and then both servers will resume processing together.

It is possible to get into a dangerous situation: if you put one server into the PARTNER-DOWN state, and then *that* server goes down, and the other server comes back up, the other server will not know that the first server was in the PARTNER-DOWN state, and may issue addresses previously issued by the other server to different clients, resulting in IP address conflicts. Before putting a server into PARTNER-DOWN state, therefore, make sure that the other server will not restart automatically.

The failover protocol defines a primary server role and a secondary server role. There are some differences in how primaries and secondaries act, but most of the differences simply have to do with providing a way for each peer to behave in the opposite way from the other. So one server must be configured as primary, and the other must be configured as secondary, and it doesn’t matter too much which one is which.
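For reference, a rough sketch of what the primary side of such a pair might look like in dhcpd.conf. This is not the config that was staged for this bug: the peer name "example-failover", the 192.0.2.x addresses, the port numbers, and all timer values are placeholders chosen for illustration.

```
# Hypothetical primary-side fragment; every name, address, port, and
# timer value below is a placeholder, not taken from the real config.
failover peer "example-failover" {
  primary;                      # the partner's copy says "secondary;" instead
  address 192.0.2.10;           # this server
  port 519;
  peer address 192.0.2.11;      # the failover partner
  peer port 520;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
  mclt 3600;                    # declared on the primary only
  split 128;                    # declared on the primary only
}

subnet 192.0.2.0 netmask 255.255.255.0 {
  option routers 192.0.2.1;
  pool {
    # Addresses in this pool are shared between the two peers.
    failover peer "example-failover";
    deny dynamic bootp clients;
    range 192.0.2.100 192.0.2.200;
  }
}
```

The secondary would carry a matching declaration with "secondary;", the address and port pairs swapped, and no mclt or split lines. With split 128 the two peers share the load roughly evenly rather than one sitting idle, so a strictly active/standby arrangement would need a different split value.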
Assignee: nobody → justdave
Severity: minor → major
Priority: -- → P1
Component: Server Operations: Projects → Server Operations
We had this staged back in March, and ran out of time in the outage window and it got pushed out and not rescheduled. What was about to get pushed that night was sort of a rush job, and I've had some thoughts on a cleaner way to do it since then (that make it more in line with how we do DNS as well). I'll get the config staged the new way in SVN this weekend and we can try this again this coming week.
This needs to address the build Vlan as well.
Whiteboard: ETA 05/21 (tentative)
I still think I could pull this off tonight, but it's too close, and I'd feel much more comfortable having sufficient time to make sure it's done right before we push it out. I think we're better off waiting till next Tuesday at this point.
Whiteboard: ETA 05/21 (tentative) → ETA 05/26 (tentative)
Whiteboard: ETA 05/26 (tentative) → ETA 05/28 (tentative)
Ready to go for 06/09?
maybe 6/11. I'm too far out-of-the-loop just coming back from vacation right now and it's going to take me a day to re-remember what I was doing here since I haven't touched it in two weeks.
Whiteboard: ETA 05/28 (tentative) → ETA 06/11 (tentative)
Whiteboard: ETA 06/11 (tentative) → ETA 06/30
ok, this got pushed off to 6/30 because I was on vacation, and then pushed off again by the Firefox 3.5 release falling on 6/30. I think we'll generally be safe to do this tomorrow night (7/2).
Flags: needs-downtime+
Whiteboard: ETA 06/30 → ETA 07/02
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Still don't have active/standby for build network, correct?
yeah, that's true, and that's what actually prompted this. I get the impression from the way the config is set up that we can have more than one failover peer configured, and have different vlans use different peers. Which means we could probably set up bm-admin01 the same way as this and have it failover with nm-dhcp01 as well, just for that vlan.
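A sketch of the idea, assuming dhcpd really does accept more than one failover peer declaration per server as the config layout suggests. The peer names, addresses, and ports are made up for illustration (distinct ports per pair are used here only to be cautious), and this server is shown as secondary in both pairs purely as an example.

```
# Hypothetical fragment for one server that takes part in two failover
# pairs. All names, addresses, and ports are placeholders.
failover peer "colo-pair" {
  secondary;
  address 192.0.2.11;
  port 520;
  peer address 192.0.2.10;      # first partner
  peer port 519;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}

failover peer "build-pair" {
  secondary;
  address 198.51.100.11;
  port 522;
  peer address 198.51.100.10;   # second partner, build vlan only
  peer port 521;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}

subnet 192.0.2.0 netmask 255.255.255.0 {
  pool {
    failover peer "colo-pair";
    deny dynamic bootp clients;
    range 192.0.2.100 192.0.2.200;
  }
}

subnet 198.51.100.0 netmask 255.255.255.0 {
  pool {
    # Only this vlan's pool is tied to the second partner.
    failover peer "build-pair";
    deny dynamic bootp clients;
    range 198.51.100.100 198.51.100.200;
  }
}
```

Each pool names exactly one peer, so which partner backs up a given vlan is decided per pool rather than per server.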
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Need to investigate active/standby DHCP failover in the colos → setup active/standby DHCP for build
What has to happen to get another DHCP server on the build vlan? When can we schedule this?
Whiteboard: ETA 07/02
Whiteboard: 07/21 ?
Don't want to do this during a release.
Whiteboard: 07/21 ? → 07/23 ?
So we've got an IP address conflict on vlan71 that prevents me from bringing up nm-dhcp01 on that vlan on the IP I wanted it on... The IP it would take is currently occupied by moz2-win32-slave37. nm-dhcp01 has a last octet of 252 on every vlan it's on except vlan71, where it's using 10.2.71.3, because that was available and 252 wasn't. :)
This is all set up now. DHCP config for vlan71 is now in SVN with all the rest, and is being handled by boris and nm-dhcp01. bm-admin01 no longer has a DHCP server on it.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Some observations from Nick and me over the first couple of hours watching the logs and testing things... It appears that machines will go back to the same server they got the lease from in order to renew. Just doing a renew on a machine isn't sufficient. You have to actually do a release and renew before it'll throw a generic request back at the network to find an available dhcp server. This likely means the existing leases will have to actually expire before anything will look at the new dhcp servers. So some machines may be briefly inaccessible after their leases expire, until they reacquire a lease via the new servers. Exactly how long, I have no clue. Hopefully it's only a few seconds.
Getting lots of requests from vlan71 in the logs on both dhcp servers now, so it looks like everything eventually found its way to the new servers.
Whiteboard: 07/23 ?
Product: mozilla.org → mozilla.org Graveyard