Closed Bug 715006 Opened 14 years ago Closed 13 years ago

put admin1a/b into service as DHCP servers

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: arich)

Attachments

(3 files)

These hosts should replace ns1/2 for this purpose.
let's not forget vlan47 here, either - we'll either need to add an interface on admin1a/b for it, or keep using a DHCP helper for it. I prefer the former.
This should get appropriate nagios monitoring, too.
I have these hosts in puppet, but dhcpd fails to start because puppet is pulling in a dhcpd.conf that references a missing file: include "/etc/dhcpconfig-autodeploy/global.conf"; I need some help from rtucker on why this include is needed for dhcpd to start, and on how we can configure puppet to be smart enough to know when to include it.
adding :rtucker for his input
Adding :jabba as well since he wrote the puppet module
I updated the puppet classes and the nodes manifest to add a new class parameter, use_inventory. Then in the erb template:

+<% if use_inventory == true -%>
+include "/etc/dhcpdconfig-autodeploy/global.conf";
+<% end -%>

I've passed my changes to jabba for review. Once he gives r+, I'll push them out.
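For anyone following along, the effect of the use_inventory flag can be sketched outside of puppet. This is only an illustration in Python, not the actual module: render_dhcpd_conf and the "authoritative;" base line are made up for the example, and the include path is the one quoted from the template above.

```python
def render_dhcpd_conf(base_lines, use_inventory,
                      include_path="/etc/dhcpdconfig-autodeploy/global.conf"):
    """Return dhcpd.conf text, adding the include line only when requested.

    Hypothetical helper: it mirrors the conditional in the erb template,
    it is not part of the puppet module.
    """
    lines = list(base_lines)
    if use_inventory:
        # Only hosts whose DHCP data is generated from inventory have this file.
        lines.append('include "%s";' % include_path)
    return "\n".join(lines) + "\n"


with_inventory = render_dhcpd_conf(["authoritative;"], use_inventory=True)
without_inventory = render_dhcpd_conf(["authoritative;"], use_inventory=False)
print(with_inventory)
print(without_inventory)
```

A host without the autodeploy files gets a config with no include line, so dhcpd no longer fails on the missing path.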
Need to find a way to test this and a way to roll it over with little to no downtime. Dustin has a tool that might help us out.
Status: NEW → ASSIGNED
rtucker is looking at getting vlan48 set up so that inventory controls it like vlan75 and vlan47. Once this is done we can do a small test pool to verify everything is working correctly and then roll over into production.
All of the hosts except for the following were found in inventory and are able to be imported:

ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
linux64-ix-slave37 not found
linux64-ix-slave37 not found
talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-lion-075 not found
talos-r4-snow-083 not found
talos-r4-lion-076 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
rabbit1 not found
arr-client-test not found
rabbit2 not found
rabbit3 not found
dl120g7-r4-4876 not found
dl120g7-r4-4876 not found

This dhcp config file includes a group kickstartable which from what I understand from talking to Dustin, isn't necessary. We need to either get rid of the hosts not found, or update the config file reflecting the hostname as found in inventory. Once that is complete and the hosts not found are resolved, we can start using inventory to generate this vlan like the others!

digipengi: Can you let me know the proper host name for the ones not found or if it doesn't exist any more?
I will get all of these in inventory
Just 1 thing to keep in mind, the system might be in inventory but not named correctly. ganglia1 might be ganglia1.foo.bar.com but in the dhcp config file is set as just ganglia1.
As discussed with rtucker previously, we want the releng records in inventory and dhcp to have the FQDN up to mozilla.com (for example, ganglia1 is already in inventory. It's called ganglia1.build.scl1). For any host that does not currently match this, we want to change it. I thought that was something rtucker was going to do programmatically.
I cannot do it programmatically since there can be multiple entries with similar names. For example, ganglia1 could be in inventory as ganglia1.build, but the systems team might have ganglia1.private.phx1 etc. There's only a handful of them to manually audit that do not match for the import. Once the import is completed, if we wanted to script something up to change both, that's possible, but they need to match first.
Some of them (like ganglia1) exist in multiple zones as different hosts, so as long as we can go change things programmatically once the dhcp and inventory lists match up (in this case, we'll need to change dhcp, for example, not inventory), then that works.
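To illustrate why the short names can't simply be rewritten automatically, here is a rough sketch of the matching problem. It assumes the inventory is available as a plain list of names; match_hosts is a made-up helper, not the import script, and the sample names are the ones mentioned in this bug.

```python
def match_hosts(dhcp_names, inventory_fqdns):
    """Classify each DHCP name as matched, ambiguous, or not found.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in dhcp_names:
        # A candidate is an exact match, or an inventory name that extends
        # the DHCP name with further domain labels.
        candidates = sorted(fqdn for fqdn in inventory_fqdns
                            if fqdn == name or fqdn.startswith(name + "."))
        if len(candidates) == 1:
            results[name] = ("matched", candidates)
        elif candidates:
            results[name] = ("ambiguous", candidates)
        else:
            results[name] = ("not found", [])
    return results


inventory = ["ganglia1.build.scl1", "ganglia1.private.phx1", "pxe1.build"]
report = match_hosts(["ganglia1", "pxe1", "redis01"], inventory)
for name, (status, candidates) in sorted(report.items()):
    print(name, status, candidates)
```

Anything reported as ambiguous is the case described above: a script can flag it, but a person has to pick which inventory record the DHCP entry really refers to.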
Changed dhcp to match inventory:

ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
arr-client-test not found

removed from dhcp (retasked to seamonkey, work was incomplete):

w32-ix-slave06 not found (retasked to seamonkey)
w32-ix-slave12 not found (retasked to seamonkey)
linux64-ix-slave37 not found (retasked to seamonkey)

looked fine:

dl120g7-r4-4876 not found

actually need to be added to inventory (still todo):

talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found
I am adding these to inventory now:

talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found
arr: Can you get me the list that maps the hostnames in the config file as I listed above, to what they have been updated to?
ganglia1.build.scl1
pxe1.build
install.build.scl1
dev-master01.build.scl1
slavealloc.build.scl1
buildapi01.build
redis01.build
buildbot-master23.build.scl1
buildbot-master24.build.scl1
buildbot-master25.build.scl1
buildbot-master18.build.scl1
buildbot-master21.build.scl1
hg1.build.scl1
hg2.build.scl1
rabbit2-mgmt.build.scl1
rabbit2.build.scl1
rabbit3-mgmt.build.scl1
rabbit3.build.scl1
signing2.build.scl1
arr-client-test.build
I've made all of the corrections to my staged version of the config file. I can now do the actual import into the inventory key/value database.
talos-r4-snow-ref 102-1 06030
talos-r4-snow-082 102-1 06043
talos-r4-snow-083 102-1 06059
talos-r4-lion-075 102-3 06055
talos-r4-lion-076 102-3 06069
Removed the group for the DL120's
Need to review the work rtucker did and verify all the machines in the autogen are matching to our dhcpconfig/scl1/vlan48.conf
I ran the compare script and got the output to arr via IRC. Results follow:

/tmp/scl1/vlan48.conf does not contain
{'hardware ethernet': '3C:07:54:72:42:0A', 'hostname': 'r5-mini-001-nic0', 'fixed-address': '10.12.52.88'}
{'hardware ethernet': '3C:07:54:72:4A:AE', 'hostname': 'r5-mini-002-nic0', 'fixed-address': '10.12.52.89'}
{'hardware ethernet': '3C:07:54:72:4E:9A', 'hostname': 'r5-mini-003-nic0', 'fixed-address': '10.12.52.90'}
{'hardware ethernet': '3C:07:54:72:4F:C1', 'hostname': 'r5-mini-004-nic0', 'fixed-address': '10.12.52.91'}
{'hardware ethernet': '3C:07:54:72:4F:96', 'hostname': 'r5-mini-005-nic0', 'fixed-address': '10.12.52.92'}
{'hardware ethernet': '3C:07:54:72:4D:8F', 'hostname': 'r5-mini-006-nic0', 'fixed-address': '10.12.52.93'}

What this means is that these hosts do not appear in my massaged file, which was generated on Jan 30. They are, however, live in inventory correctly, so everything checks out on this. Amy said that she would keep me posted as to when the migration is going to be. I will be available to do the cutover and to monitor for breakage.
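For reference, the kind of comparison described above can be sketched roughly as follows. This is not the compare script that was actually run (it is not attached to this bug). It assumes both files use plain dhcpd host { ... } blocks; parse_hosts and missing_from are made-up names, and the pxe1.build MAC and address are placeholders.

```python
import re

# Matches simple, un-nested dhcpd host blocks.
HOST_RE = re.compile(r"\bhost\s+(?P<hostname>\S+)\s*\{(?P<body>[^}]*)\}")
OPTION_RE = re.compile(r"(hardware ethernet|fixed-address)\s+([^;]+);")


def parse_hosts(conf_text):
    """Return a set of (hostname, mac, ip) tuples found in conf_text."""
    entries = set()
    for match in HOST_RE.finditer(conf_text):
        options = dict(OPTION_RE.findall(match.group("body")))
        entries.add((match.group("hostname"),
                     # Normalize MAC case so 3c:07 and 3C:07 compare equal.
                     options.get("hardware ethernet", "").strip().upper(),
                     options.get("fixed-address", "").strip()))
    return entries


def missing_from(staged_text, generated_text):
    """Entries in the generated config that the staged config lacks."""
    return sorted(parse_hosts(generated_text) - parse_hosts(staged_text))


# Placeholder MAC and address for pxe1.build; the r5-mini values are from above.
staged = """
host pxe1.build { hardware ethernet 00:11:22:33:44:55; fixed-address 10.12.48.10; }
"""
generated = staged + """
host r5-mini-001-nic0 { hardware ethernet 3c:07:54:72:42:0a; fixed-address 10.12.52.88; }
"""
missing = missing_from(staged, generated)
print(missing)
```

Comparing whole (hostname, MAC, address) tuples means a renamed host shows up as a difference too, which is the behaviour wanted when the point is to confirm that names match inventory.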
Attached file Existing Parsed Config
Existing Parsed Config ordered by ip address
Attached file New Generated Config
New generated config file
Attached file Generated Diff file
Generated diff of the 2 lists. The differences are the hostnames after they were updated
Doing this work tomorrow (Thursday) is risky (IIUC), since we might have several chemspills happening then. I will keep you posted tomorrow morning (EST) as the timeline becomes clearer. Also, while I am away tonight, asking in IRC for someone on the release-drivers mailing list could get you the latest news.
This is fine with me. Whenever someone lets me know to pull the trigger, I'll do so.
After getting the go-ahead from armen, rtucker did the DHCP cutover this morning at 7:25 pacific. We're monitoring for issues, but have seen a number of successful lease negotiations and no problems.
Are we good to close this bug out?
From my perspective everything looks good to go. Amy?
rtucker: the portion of getting auto-generated dhcp is done (thanks for the work!), but we haven't even moved onto admin1a/b yet, so, no, as I told matt last night, there's still a lot of work to be done on this bug. :}
Are you waiting on anything from me for admin1a/b?
Dustin: On Monday we will add vlan47. Then we should remove the iptables drops, and then ask netops to remove the DHCP helper from the vlans it serves, which is to say 48 and 75.
After some work, dustin and I added the interface for vlan47.
Handing this off to dustin to do the final DHCP helper removal coordination for production.
Assignee: mlarrain → dustin
So, tcpdump verifies that both of these hosts are servicing requests for vlan48 (!). vlan47 verifies as well, after forcing one of the hosts to renew. vlan75 verifies as well, using the spare arr-test VM.

Failover seems to be running fine:

Apr 10 16:53:52 admin1a dhcpd: balancing pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-own (+/-)2
Apr 10 16:53:52 admin1a dhcpd: balanced pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-misbal 3

and

Apr 2 20:51:22 admin1a dhcpd: failover peer dhcp-failover: I move from normal to startup
Apr 2 20:51:23 admin1a dhcpd: failover peer dhcp-failover: I move from startup to normal

and the leases files are up to date with the latest leases.

What took me a while to figure out is that the iptables rules on these hosts are counting packets:

Chain INPUT (policy ACCEPT 189M packets, 23G bytes)
 pkts bytes target prot opt in       out source   destination
  205 70106 DROP   udp  --  bond0    any anywhere anywhere    udp spts:bootps:bootpc dpts:bootps:bootpc
1313K  442M DROP   udp  --  bond0.48 any anywhere anywhere    udp spts:bootps:bootpc dpts:bootps:bootpc

yet dhcpd is clearly replying to those packets. Raw sockets, which come before iptables kicks in, are to blame:

[root@admin1a.infra.scl1 ~]# lsof -p 13040
...
dhcpd 13040 dhcpd 4u raw 0t0 160057793 00000000:0001->00000000:0000 st=07

I removed the (superfluous) iptables rules.

All of which is to say, to the best of my knowledge we can shut off DHCP on ns1/2 and call this done. Removing the DHCP helpers can happen after that without any disruption (they'll just be forwarding broadcast packets that are ignored on receipt).

Amy, do you want to do that during the downtime on Thursday, and then reboot a few scl1 slaves just to verify they come back?
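If we want nagios (or a person) to keep an eye on the failover balance, the "balanced pool" lines are easy to parse. A rough sketch follows. It is an assumption on my part that max-misbal can be read as the tolerated difference, in leases, between free and backup; pool_status is a made-up helper.

```python
import re

POOL_RE = re.compile(
    r"balanced pool \S+ (?P<subnet>\S+)\s+total (?P<total>\d+)\s+"
    r"free (?P<free>\d+)\s+backup (?P<backup>\d+)\s+lts (?P<lts>-?\d+)\s+"
    r"max-misbal (?P<max_misbal>\d+)")


def pool_status(log_line):
    """Summarize one dhcpd 'balanced pool' log line, or return None."""
    match = POOL_RE.search(log_line)
    if match is None:
        return None
    free = int(match.group("free"))
    backup = int(match.group("backup"))
    limit = int(match.group("max_misbal"))
    return {
        "subnet": match.group("subnet"),
        "free": free,
        "backup": backup,
        # Assumption: the pool counts as balanced while the free/backup
        # difference stays within the logged max-misbal value.
        "within_limit": abs(free - backup) <= limit,
    }


line = ("Apr 10 16:53:52 admin1a dhcpd: balanced pool 7fce73f606e0 "
        "10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-misbal 3")
status = pool_status(line)
print(status)
```

For the line quoted above, free and backup differ by one lease, so the pool is within the logged limit.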
Assignee: dustin → arich
Blocks: 748814
* moved two dhcp related cron jobs in /etc/cron.d into /etc/cron.d.disable
* ran "chkconfig dhcpd off" on both hosts (ns1/2.infra.scl1.mozilla.com)
* killed running dhcp and autodeploy processes
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com
* ran "yum remove dhcp" on both hosts (ns1/2.infra.scl1.mozilla.com)
* removed /var/lib/dhcpd/dhcpd.leases.rpmsave on both hosts (ns1/2.infra.scl1.mozilla.com)
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations