Closed
Bug 471830
Opened 16 years ago
Closed 16 years ago
setup active/standby DHCP for build
Categories
(mozilla.org Graveyard :: Server Operations, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: justdave, Assigned: justdave)
Details
Another action item from bug 471679. ISC dhcpd3, which we run, supposedly can run two dhcpd servers on the same LAN in an active/standby failover configuration. If we had had this set up, all of the damage from bug 471679 could have been avoided. It does require additional hardware and/or VMs, and will likely make the configuration more complicated, so we'll need to figure out whether this is worth the effort. It's probably not in the offices, but I suspect we want to try to set this up in the colos. This would also allow us to take one of the servers down for an extended period of time without taking the network with it, since the other will be there as a backup.
| Assignee | ||
Updated•16 years ago
Assignee: server-ops → nobody
Component: Server Operations → Server Operations: Projects
| Assignee | ||
Comment 1•16 years ago
The description of how this all works is in the dhcpd.conf man page. Search for "DHCP FAILOVER" within that man page.
| Assignee | ||
Comment 2•16 years ago
DHCP FAILOVER
This version of the ISC DHCP server supports the DHCP failover protocol as documented in draft-ietf-dhc-failover-07.txt. This is not a final protocol document, and we have not done interoperability testing with other vendors’ implementations of this protocol, so you must not assume that this implementation conforms to the standard. If you wish to use the failover protocol, make sure that both failover peers are running the same version of the ISC DHCP server.
The failover protocol allows two DHCP servers (and no more than two) to share a common address pool. Each server will have about half of the available IP addresses in the pool at any given time for allocation. If one server fails, the other server will continue to renew leases out of the pool, and will allocate new addresses out of the roughly half of available addresses that it had when communications with the other server were lost.
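The pool-sharing behaviour described above can be sketched in a few lines of Python. This is only a toy model: the 10.2.71.100 to 10.2.71.199 range is a hypothetical example, and alternating addresses between the peers is an illustrative assumption, not how dhcpd actually balances the pool.

```python
import ipaddress

def split_pool(network, first_host, last_host):
    """Divide a range of host addresses roughly in half between two peers."""
    net = ipaddress.ip_network(network)
    hosts = [net.network_address + i for i in range(first_host, last_host + 1)]
    # Assumption for illustration: alternate addresses so each peer holds
    # about half of the free pool. The real server balances differently.
    return {"primary": hosts[0::2], "secondary": hosts[1::2]}

def allocatable_after_failure(free, survivor):
    """Addresses the survivor may give to new clients while its peer is unreachable."""
    return list(free[survivor])

# Hypothetical dynamic range on the 10.2.71.0/24 network.
free = split_pool("10.2.71.0/24", 100, 199)
survivor_pool = allocatable_after_failure(free, "primary")
print(len(free["primary"]), len(free["secondary"]), len(survivor_pool))
```

The point of the sketch is that losing one peer does not stop new allocations, it only limits them to the survivor's share until the peer returns or the survivor is put into PARTNER-DOWN.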
It is possible during a prolonged failure to tell the remaining server that the other server is down, in which case the remaining server will (over time) reclaim all the addresses the other server had available for allocation, and begin to reuse them. This is called putting the server into the PARTNER-DOWN state.
You can put the server into the PARTNER-DOWN state either by using the omshell (1) command or by stopping the server, editing the last peer state declaration in the lease file, and restarting the server. If you use this last method, be sure to leave the date and time of the start of the state blank:
    failover peer name state {
      my state partner-down;
      peer state state at date;
    }
When the other server comes back online, it should automatically detect that it has been offline and request a complete update from the server that was running in the PARTNER-DOWN state, and then both servers will resume processing together.
It is possible to get into a dangerous situation: if you put one server into the PARTNER-DOWN state, and then *that* server goes down, and the other server comes back up, the other server will not know that the first server was in the PARTNER-DOWN state, and may issue addresses previously issued by the other server to different clients, resulting in IP address conflicts. Before putting a server into PARTNER-DOWN state, therefore, make sure that the other server will not restart automatically.
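The dangerous sequence described above is easier to follow as a small simulation. The `Server` class, the integer "addresses", and the client names are all hypothetical, and the model ignores lease times and the gradual ("over time") nature of the reclaim; it only shows how the conflict can arise.

```python
class Server:
    """Toy DHCP server: tracks only free addresses and issued leases."""

    def __init__(self, name, free):
        self.name = name
        self.free = set(free)   # addresses this server believes it may allocate
        self.leases = {}        # address -> client it was issued to

    def allocate(self, client):
        addr = min(self.free)   # deterministic choice keeps the example readable
        self.free.remove(addr)
        self.leases[addr] = client
        return addr

# Hypothetical pool of four addresses, split between the two peers.
a = Server("A", {1, 2})
b = Server("B", {3, 4})

# B fails and the operator puts A into PARTNER-DOWN, so A reclaims B's share.
a.free |= b.free
for client in ("c1", "c2", "c3"):
    a.allocate(client)          # c3 ends up on address 3, taken from B's share

# A then goes down and B restarts on its own. B never heard from A, so it
# still believes addresses 3 and 4 are free.
addr = b.allocate("c4")
conflict = addr in a.leases and a.leases[addr] != "c4"
print(addr, conflict)
```

In this model the same address ends up leased to two different clients, which is why the man page says to make sure the failed server cannot restart automatically before declaring PARTNER-DOWN.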
The failover protocol defines a primary server role and a secondary server role. There are some differences in how primaries and secondaries act, but most of the differences simply have to do with providing a way for each peer to behave in the opposite way from the other. So one server must be configured as primary, and the other must be configured as secondary, and it doesn’t matter too much which one is which.
| Assignee | ||
Updated•16 years ago
Assignee: nobody → justdave
Severity: minor → major
Priority: -- → P1
Updated•16 years ago
Component: Server Operations: Projects → Server Operations
| Assignee | ||
Comment 3•16 years ago
We had this staged back in March, but we ran out of time in the outage window, and it got pushed out and never rescheduled. What was about to get pushed that night was sort of a rush job, and I've had some thoughts since then on a cleaner way to do it (one that brings it more in line with how we do DNS as well). I'll get the config staged the new way in SVN this weekend and we can try this again this coming week.
Comment 4•16 years ago
This needs to address the build vlan as well.
Updated•16 years ago
Whiteboard: ETA 05/21 (tentative)
| Assignee | ||
Comment 5•16 years ago
I still think I could pull this off tonight, but it's too close, and I'd feel much more comfortable having enough time to make sure it's done right before we push it out. I think we're better off waiting until next Tuesday at this point.
Whiteboard: ETA 05/21 (tentative) → ETA 05/26 (tentative)
| Assignee | ||
Updated•16 years ago
Whiteboard: ETA 05/26 (tentative) → ETA 05/28 (tentative)
Comment 6•16 years ago
Ready to go for 06/09?
| Assignee | ||
Comment 7•16 years ago
Maybe 6/11. I'm too far out of the loop, just coming back from vacation, and it's going to take me a day to remember what I was doing here since I haven't touched it in two weeks.
Whiteboard: ETA 05/28 (tentative) → ETA 06/11 (tentative)
Updated•16 years ago
Whiteboard: ETA 06/11 (tentative) → ETA 06/30
| Assignee | ||
Comment 8•16 years ago
OK, this got pushed off to 6/30 because I was on vacation, and then pushed off again because the Firefox 3.5 release fell on 6/30.
I think we'll generally be safe to do this tomorrow night (7/2).
Flags: needs-downtime+
Whiteboard: ETA 06/30 → ETA 07/02
Updated•16 years ago
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 9•16 years ago
Still don't have active/standby for build network, correct?
| Assignee | ||
Comment 10•16 years ago
Yeah, that's true, and that's what actually prompted this.
I get the impression from the way the config is set up that we can have more than one failover peer configured, and have different vlans use different peers. That means we could probably set up bm-admin01 the same way as this and have it fail over with nm-dhcp01 as well, just for that vlan.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
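As a rough illustration of the "more than one failover peer" idea in comment 10, the Python sketch below renders dhcpd.conf text with two `failover peer` declarations and one pool tied to the second of them. Nothing here is the actual config from SVN: the peer names `general` and `build`, the partner host `dhcp-partner.example`, the ports, the roles, the address range, and the timing values are all assumptions for illustration. The statement names themselves are the ones documented in the dhcpd.conf man page.

```python
def failover_peer(name, role, address, peer_address, port):
    """Render one 'failover peer' declaration for dhcpd.conf."""
    lines = [
        f'failover peer "{name}" {{',
        f"  {role};",                      # "primary" or "secondary"
        f"  address {address};",
        f"  port {port};",
        f"  peer address {peer_address};",
        f"  peer port {port};",
        "  max-response-delay 60;",        # illustrative values only
        "  max-unacked-updates 10;",
        "  load balance max seconds 3;",
    ]
    if role == "primary":
        # mclt and split are only set on the primary.
        lines += ["  mclt 3600;", "  split 128;"]
    lines.append("}")
    return "\n".join(lines)

def failover_pool(subnet, netmask, peer_name, first, last):
    """Render a subnet whose dynamic pool is served by the named failover peer."""
    return "\n".join([
        f"subnet {subnet} netmask {netmask} {{",
        "  pool {",
        f'    failover peer "{peer_name}";',
        f"    range {first} {last};",
        "  }",
        "}",
    ])

# Hypothetical view from nm-dhcp01: one relationship for the general vlans,
# a second one with bm-admin01 used only by the build vlan's pool.
config = "\n\n".join([
    failover_peer("general", "secondary", "nm-dhcp01", "dhcp-partner.example", 647),
    failover_peer("build", "secondary", "nm-dhcp01", "bm-admin01", 648),
    failover_pool("10.2.71.0", "255.255.255.0", "build", "10.2.71.100", "10.2.71.199"),
])
print(config)
```

Each pool names the peer relationship it belongs to, so in a layout like this a different pair of servers could back each vlan. Whether sharing nm-dhcp01 between two relationships behaves well in practice would still need testing.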
Updated•16 years ago
Summary: Need to investigate active/standby DHCP failover in the colos → setup active/standby DHCP for build
Comment 11•16 years ago
What has to happen to get another DHCP server on the build vlan? When can we schedule this?
Whiteboard: ETA 07/02
Updated•16 years ago
Whiteboard: 07/21 ?
| Assignee | ||
Comment 13•16 years ago
So we've got an IP address conflict on vlan71 that prevents me from bringing up nm-dhcp01 on that vlan on the IP I wanted it on... The IP it would take is currently occupied by moz2-win32-slave37. nm-dhcp01 has a last octet of 252 on every vlan it's on, except vlan71, where it's using 10.2.71.3, because that was available and 252 wasn't. :)
This is all set up now.
DHCP config for vlan71 is now in SVN with all the rest, and is being handled by boris and nm-dhcp01. bm-admin01 no longer has a DHCP server on it.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 14•16 years ago
Some observations from Nick and me over the first couple of hours of watching the logs and testing things...
It appears that clients go back to the same server they got their lease from in order to renew. Just doing a renew on a machine isn't sufficient. You have to actually do a release and renew before it'll throw a generic request back at the network to find an available dhcp server. This likely means the existing leases will have to actually expire before anything will look at the new dhcp servers, so some machines may be briefly inaccessible after their leases expire, until they reacquire a lease via the new servers. Exactly how long, I have no clue. Hopefully it's only a few seconds.
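For context on the renewal behaviour observed above: RFC 2131 gives clients two timers, T1 (default 0.5 of the lease time), when the client unicasts a renewal to the server that issued the lease, and T2 (default 0.875 of the lease time), when it starts broadcasting to any server. The sketch below only computes those default times. The lease start and the 86400-second lease length are hypothetical, and a server may override T1 and T2 with the renewal and rebinding time options.

```python
from datetime import datetime, timedelta

def renewal_timeline(lease_start, lease_seconds):
    """Default RFC 2131 client timers for a lease of the given length."""
    lease = timedelta(seconds=lease_seconds)
    return {
        "T1_unicast_renew": lease_start + lease * 0.5,        # asks the original server
        "T2_broadcast_rebind": lease_start + lease * 0.875,   # asks any server
        "expiry": lease_start + lease,
    }

# Hypothetical lease: one day long, obtained at midnight.
timeline = renewal_timeline(datetime(2009, 1, 1, 0, 0), 86400)
for event, when in timeline.items():
    print(f"{event}: {when}")
```

If clients follow the defaults, they should start looking for another server at T2 rather than at expiry. Whether the new servers then extend the existing lease or make the client start over depends on what they know about it, so this does not settle how long a machine might be unreachable.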
| Assignee | ||
Comment 15•16 years ago
Getting lots of requests from vlan71 in the logs on both dhcp servers now, so it looks like everything eventually found its way to the new servers.
| Assignee | ||
Updated•16 years ago
Whiteboard: 07/23 ?
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard