Closed
Bug 715006
Opened 14 years ago
Closed 13 years ago
put admin1a/b into service as DHCP servers
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: arich)
Attachments
(3 files)
These hosts should replace ns1/2 for this purpose.
Reporter
Comment 1•14 years ago
Let's not forget vlan47 here, either: we'll need to either add an interface on admin1a/b for it or keep using a DHCP helper for it. I prefer the former.
Reporter
Comment 2•14 years ago
This should get appropriate nagios monitoring, too.
Comment 3•14 years ago
I have gotten these hosts into puppet, but dhcpd fails to start because puppet is pulling in a dhcpd.conf that includes a file that is missing:
include "/etc/dhcpconfig-autodeploy/global.conf";
I need some help from rtucker as to why this is needed for dhcpd to start and how we can configure puppet to be smart enough to know when to include it.
Reporter
Comment 4•14 years ago
adding :rtucker for his input
Comment 5•14 years ago
Adding :jabba as well since he wrote the puppet module
Comment 6•14 years ago
I updated the puppet classes and the nodes manifest to add a new class parameter use_inventory.
Then in the erb template:
+<% if use_inventory == true -%>
+include "/etc/dhcpdconfig-autodeploy/global.conf";
+<% end -%>
I've passed my changes to jabba for review. Once he r+'s them, I'll push them out.
Comment 7•14 years ago
Need to find a way to test this and a way to roll it over with little to no downtime. Dustin has a tool that might help us out.
Status: NEW → ASSIGNED
Comment 8•14 years ago
rtucker is looking at getting vlan48 set up so that inventory controls it like vlan75 and vlan47. Once this is done, we can do a small test pool to verify everything is working correctly and then roll over into production.
Comment 9•14 years ago
All of the hosts except for the following were found in inventory and are able to be imported:
ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
linux64-ix-slave37 not found
linux64-ix-slave37 not found
talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-lion-075 not found
talos-r4-snow-083 not found
talos-r4-lion-076 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
rabbit1 not found
arr-client-test not found
rabbit2 not found
rabbit3 not found
dl120g7-r4-4876 not found
dl120g7-r4-4876 not found
This dhcp config file includes a group, kickstartable, which, from what I understand from talking to Dustin, isn't necessary.
We need to either get rid of the hosts not found or update the config file to reflect the hostnames as found in inventory.
Once that is complete and the hosts not found are resolved, we can start using inventory to generate this vlan like the others!
digipengi:
Can you let me know the proper host name for each of the ones not found, or whether it no longer exists?
Comment 10•14 years ago
I will get all of these in inventory
Comment 11•14 years ago
Just one thing to keep in mind: the system might be in inventory but not named correctly.
ganglia1 might be ganglia1.foo.bar.com in inventory, but in the dhcp config file it is set as just ganglia1.
Assignee
Comment 12•14 years ago
As discussed with rtucker previously, we want the releng records in inventory and dhcp to have the FQDN up to mozilla.com. For example, ganglia1 is already in inventory; it's called ganglia1.build.scl1.
For any host that does not currently match this, we want to change it. I thought that was something rtucker was going to do programmatically.
Comment 13•14 years ago
I cannot do it programmatically since there can be multiple entries with similar names.
For example, ganglia1 could be in inventory as ganglia1.build, but the systems team might have ganglia1.private.phx1 etc.
There's only a handful of them to manually audit that do not match for the import.
Once the import is completed, if we wanted to script something up to change both, that's possible, but they need to match first.
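To illustrate the ambiguity described above, here is a rough Python sketch. It is not the import code used for this bug; `match_names` and the sample names in the usage note are hypothetical. It treats a DHCP name as resolved only when exactly one inventory name equals it or extends it with further domain labels, and reports everything else for manual audit.

```python
def match_names(dhcp_names, inventory_names):
    """Split DHCP host names into resolved, ambiguous, and missing.

    A DHCP name matches an inventory name when the two are equal or when
    the inventory name is the DHCP name followed by more domain labels.
    """
    resolved, ambiguous, missing = {}, {}, []
    for name in dhcp_names:
        candidates = sorted(
            inv for inv in inventory_names
            if inv == name or inv.startswith(name + ".")
        )
        if len(candidates) == 1:
            resolved[name] = candidates[0]    # safe to rename automatically
        elif candidates:
            ambiguous[name] = candidates      # needs a human decision
        else:
            missing.append(name)              # not in inventory at all
    return resolved, ambiguous, missing
```

With an inventory holding both ganglia1.build.scl1 and ganglia1.private.phx1, the short name ganglia1 lands in `ambiguous`, which is why these hosts were audited by hand.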
Assignee
Comment 14•14 years ago
Some of them (like ganglia1) exist in multiple zones as different hosts. As long as we can go change things programmatically once the dhcp and inventory lists match up, that works. (In this case we'll need to change dhcp, for example, not inventory.)
Assignee
Comment 15•14 years ago
Changed dhcp to match inventory:
ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
arr-client-test not found
removed from dhcp (retasked to seamonkey, work was incomplete):
w32-ix-slave06 not found (retasked to seamonkey)
w32-ix-slave12 not found (retasked to seamonkey)
linux64-ix-slave37 not found (retasked to seamonkey)
looked fine:
dl120g7-r4-4876 not found
actually need to be added to inventory (still todo):
talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found
Comment 16•14 years ago
I am adding these to inventory now
talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found
Comment 17•14 years ago
arr:
Can you get me the list that maps the hostnames in the config file, as I listed above, to what they have been updated to?
Assignee
Comment 18•14 years ago
ganglia1.build.scl1
pxe1.build
install.build.scl1
dev-master01.build.scl1
slavealloc.build.scl1
buildapi01.build
redis01.build
buildbot-master23.build.scl1
buildbot-master24.build.scl1
buildbot-master25.build.scl1
buildbot-master18.build.scl1
buildbot-master21.build.scl1
hg1.build.scl1
hg2.build.scl1
rabbit2-mgmt.build.scl1
rabbit2.build.scl1
rabbit3-mgmt.build.scl1
rabbit3.build.scl1
signing2.build.scl1
arr-client-test.build
Comment 19•14 years ago
I've made all of the corrections to my staged version of the config file. I can now do the actual import into the inventory key/value database.
Comment 20•14 years ago
talos-r4-snow-ref 102-1 06030
talos-r4-snow-082 102-1 06043
talos-r4-snow-083 102-1 06059
talos-r4-lion-075 102-3 06055
talos-r4-lion-076 102-3 06069
Comment 21•14 years ago
Removed the group for the DL120's
Comment 22•14 years ago
Need to review the work rtucker did and verify that all the machines in the autogenerated config match our dhcpconfig/scl1/vlan48.conf
Comment 23•14 years ago
I ran the compare script and got the output to arr via IRC.
Results follow:
/tmp/scl1/vlan48.conf does not contain
{'hardware ethernet': '3C:07:54:72:42:0A', 'hostname': 'r5-mini-001-nic0', 'fixed-address': '10.12.52.88'}
{'hardware ethernet': '3C:07:54:72:4A:AE', 'hostname': 'r5-mini-002-nic0', 'fixed-address': '10.12.52.89'}
{'hardware ethernet': '3C:07:54:72:4E:9A', 'hostname': 'r5-mini-003-nic0', 'fixed-address': '10.12.52.90'}
{'hardware ethernet': '3C:07:54:72:4F:C1', 'hostname': 'r5-mini-004-nic0', 'fixed-address': '10.12.52.91'}
{'hardware ethernet': '3C:07:54:72:4F:96', 'hostname': 'r5-mini-005-nic0', 'fixed-address': '10.12.52.92'}
{'hardware ethernet': '3C:07:54:72:4D:8F', 'hostname': 'r5-mini-006-nic0', 'fixed-address': '10.12.52.93'}
What this means is that these hosts do not appear in my massaged file, which was generated on Jan 30. They are, however, live in inventory and correct there, so everything checks out. Amy said she would keep me posted on when the migration will happen; I will be available to do the cutover and to monitor for breakage.
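For readers who want to reproduce this kind of check, the following is a minimal Python sketch of a comparison that yields dicts shaped like the results above. It is not the compare script that was actually run; `parse_hosts` and `missing_from` are hypothetical names, and the regexes assume simple, un-nested `host NAME { ... }` stanzas.

```python
import re

# Assumes flat stanzas such as:
#   host NAME { hardware ethernet AA:BB:...; fixed-address 10.x.x.x; }
HOST_RE = re.compile(r"\bhost\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
OPTION_RE = re.compile(r"(hardware ethernet|fixed-address)\s+([^;]+);")


def parse_hosts(conf_text):
    """Return one dict per host stanza found in a dhcpd-style config."""
    hosts = []
    for name, body in HOST_RE.findall(conf_text):
        entry = {"hostname": name}
        for key, value in OPTION_RE.findall(body):
            value = value.strip()
            # Normalize MAC case so that aa:bb and AA:BB compare equal.
            entry[key] = value.upper() if key == "hardware ethernet" else value
        hosts.append(entry)
    return hosts


def missing_from(reference, candidate):
    """Entries present in `reference` but absent from `candidate`."""
    return [host for host in reference if host not in candidate]
```

Calling `missing_from(parse_hosts(generated), parse_hosts(staged))` lists the hosts that inventory generates but the staged file lacks, which is the direction reported in this comment.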
Comment 24•14 years ago
Existing Parsed Config ordered by ip address
Comment 25•14 years ago
New generated config file
Comment 26•14 years ago
Generated diff of the two lists. The differences are the hostnames after they were updated.
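One way such a diff could be produced is sketched below in Python. This is an assumption about the approach, not the tool used for the attachments; `format_entries` and `diff_configs` are hypothetical names. Entries are sorted numerically by IP address (a plain string sort would put 10.12.48.10 before 10.12.48.9) and then compared with the standard library's difflib.

```python
import difflib
import ipaddress


def format_entries(entries):
    """Render host dicts as text lines, ordered numerically by IP address."""
    ordered = sorted(
        entries, key=lambda e: ipaddress.ip_address(e["fixed-address"])
    )
    return [
        f'{e["fixed-address"]} {e["hostname"]} {e["hardware ethernet"]}'
        for e in ordered
    ]


def diff_configs(existing, generated):
    """Unified diff of two host lists with no surrounding context lines."""
    return list(
        difflib.unified_diff(
            format_entries(existing),
            format_entries(generated),
            fromfile="existing",
            tofile="generated",
            lineterm="",
            n=0,
        )
    )
```

If only hostnames were renamed, each change appears as a removed line followed by an added line that carry the same address and MAC.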
Comment 27•14 years ago
Doing this work tomorrow (Thursday) is risky, IIUC, since we might have several chemspills happening tomorrow.
I will keep you posted tomorrow morning (EST) as the timeline becomes clear.
While I am away tonight, asking someone on the release-drivers mailing list in IRC could also get you the latest news.
Comment 28•14 years ago
This is fine with me. Whenever someone lets me know to pull the trigger, I'll do so.
Assignee
Comment 29•14 years ago
After getting the go-ahead from armen, rtucker did the DHCP cutover this morning at 7:25 Pacific. We're monitoring for issues, but have seen a number of successful lease negotiations and no problems.
Comment 30•14 years ago
Are we good to close this bug out?
Comment 31•14 years ago
From my perspective everything looks good to go.
Amy?
Assignee
Comment 32•14 years ago
rtucker: the portion of getting auto-generated dhcp is done (thanks for the work!), but we haven't even moved onto admin1a/b yet, so, no, as I told matt last night, there's still a lot of work to be done on this bug. :}
Comment 33•13 years ago
Are you waiting on anything from me for admin1a/b?
Comment 34•13 years ago
Dustin: On Monday we will add vlan47. Then we should remove the iptables drops and ask netops to remove the DHCP helper from the vlans it serves, which is to say 48 and 75.
Comment 35•13 years ago
After some work, dustin and I added the interface for vlan47.
Assignee
Comment 36•13 years ago
Handing this off to dustin to do the final DHCP helper removal coordination for production.
Assignee: mlarrain → dustin
Reporter
Comment 37•13 years ago
So, tcpdump verifies that both of these hosts are servicing requests for vlan48 (!). vlan47 verifies as well, after forcing one of the hosts to renew. vlan75 verifies as well, using the spare arr-test VM.
Failover seems to be running fine:
Apr 10 16:53:52 admin1a dhcpd: balancing pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-own (+/-)2
Apr 10 16:53:52 admin1a dhcpd: balanced pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-misbal 3
and
Apr 2 20:51:22 admin1a dhcpd: failover peer dhcp-failover: I move from normal to startup
Apr 2 20:51:23 admin1a dhcpd: failover peer dhcp-failover: I move from startup to normal
and the leases files are up to date with the latest leases.
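If the "balanced pool" syslog lines above need to be checked across many pools, a small Python parser along these lines could help. It is a sketch only: `parse_balanced_pool` and `within_threshold` are hypothetical names, and reading `lts` against `max-misbal` as the rebalance threshold is my assumption about what dhcpd means by those fields.

```python
import re

# Matches the "balanced pool" syslog line format quoted above.
POOL_RE = re.compile(
    r"dhcpd: balanced pool (?P<pool>\S+)\s+(?P<subnet>\S+)\s+"
    r"total (?P<total>\d+)\s+free (?P<free>\d+)\s+backup (?P<backup>\d+)\s+"
    r"lts (?P<lts>-?\d+)\s+max-misbal (?P<max_misbal>\d+)"
)

INT_FIELDS = ("total", "free", "backup", "lts", "max_misbal")


def parse_balanced_pool(line):
    """Return the pool counters from one log line, or None if it is not one."""
    match = POOL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


def within_threshold(fields):
    """Assumed reading: the pool is fine while |lts| stays within max-misbal."""
    return abs(fields["lts"]) <= fields["max_misbal"]
```

For the 10.12.47.0/24 line above this gives free 11, backup 10, lts 0 and max_misbal 3, so the pool is within the threshold under that reading.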
What took me a while to figure out is, the iptables rules on these hosts are counting packets:
Chain INPUT (policy ACCEPT 189M packets, 23G bytes)
pkts bytes target prot opt in out source destination
205 70106 DROP udp -- bond0 any anywhere anywhere udp spts:bootps:bootpc dpts:bootps:bootpc
1313K 442M DROP udp -- bond0.48 any anywhere anywhere udp spts:bootps:bootpc dpts:bootps:bootpc
yet dhcpd is clearly replying to those packets. Raw sockets, which come before iptables kicks in, are to blame:
[root@admin1a.infra.scl1 ~]# lsof -p 13040
...
dhcpd 13040 dhcpd 4u raw 0t0 160057793 00000000:0001->00000000:0000 st=07
I removed the (superfluous) iptables rules.
All of which is to say, to the best of my knowledge we can shut off DHCP on ns1/2 and call this done. Removing the DHCP helpers can happen after that without any disruption (they'll just be forwarding broadcast packets that are ignored on receipt).
Amy, do you want to do that during the downtime on Thursday, and then reboot a few scl1 slaves just to verify they come back?
Assignee: dustin → arich
Assignee
Comment 38•13 years ago
* moved two dhcp related cron jobs in /etc/cron.d into /etc/cron.d.disable
* ran "chkconfig dhcpd off" on both hosts (ns1/2.infra.scl1.mozilla.com)
* killed running dhcp and autodeploy processes
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com
* ran "yum remove dhcp" on both hosts (ns1/2.infra.scl1.mozilla.com)
* removed /var/lib/dhcpd/dhcpd.leases.rpmsave on both hosts (ns1/2.infra.scl1.mozilla.com)
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com
Assignee
Updated•13 years ago
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations