715006 - put admin1a/b into service as DHCP servers

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Description

•

14 years ago

These hosts should replace ns1/2 for this purpose.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 1

•

14 years ago

let's not forget vlan47 here, either - we'll either need to add an interface on admin1a/b for it, or keep using a DHCP helper for it. I prefer the former.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 2

•

14 years ago

This should get appropriate nagios monitoring, too.

Matthew Larrain[:MaRu]

Comment 3

•

14 years ago

I have gotten these hosts into puppet, but dhcpd fails to start because puppet is pulling in a dhcpd.conf file that is missing this line:

include "/etc/dhcpconfig-autodeploy/global.conf";

I need some help from rtucker as to why this is needed for dhcpd to start and how we can configure puppet to be smart enough to know when to include it.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 4

•

14 years ago

adding :rtucker for his input

Matthew Larrain[:MaRu]

Comment 5

•

14 years ago

Adding :jabba as well since he wrote the puppet module

Rob Tucker [:rtucker]

Comment 6

•

14 years ago

I updated the puppet classes and the nodes manifest to add a new class parameter use_inventory. Then in the erb template:

+<% if use_inventory == true -%>
+include "/etc/dhcpdconfig-autodeploy/global.conf";
+<% end -%>

I've passed my changes to jabba for review. Once he gives r+, I'll push them out.
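For readers less familiar with ERB, the effect of that template change can be sketched in Python. This is only an illustration of the conditional include, not the puppet module itself; the function name render_dhcpd_conf and the base_lines argument are hypothetical, and the include path is copied from the template change quoted above.

```python
def render_dhcpd_conf(base_lines, use_inventory):
    """Return dhcpd.conf text, adding the include only when use_inventory is set.

    base_lines is a hypothetical list of the other config lines; the real
    template content is not shown in this bug.
    """
    # Path copied verbatim from the template change in the comment above.
    include_line = 'include "/etc/dhcpdconfig-autodeploy/global.conf";'
    lines = list(base_lines)
    if use_inventory:
        lines.append(include_line)
    return "\n".join(lines) + "\n"
```

A node that sets use_inventory to false gets a config without the include, so dhcpd does not try to read a file that inventory never generated for it.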

Matthew Larrain[:MaRu]

Comment 7

•

14 years ago

Need to find a way to test this and a way to roll it over with little to no downtime. Dustin has a tool that might help us out.

Status: NEW → ASSIGNED

Matthew Larrain[:MaRu]

Comment 8

•

14 years ago

rtucker is looking at getting vlan48 set up so that inventory controls it like vlan75 and vlan47. Once this is done we can do a small test pool to verify everything is working correctly and then roll over into production.

Rob Tucker [:rtucker]

Comment 9

•

14 years ago

All of the hosts except for the following were found in inventory and are able to be imported:

ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
w32-ix-slave06 not found
w32-ix-slave12 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
linux64-ix-slave37 not found
linux64-ix-slave37 not found
talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-lion-075 not found
talos-r4-snow-083 not found
talos-r4-lion-076 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
rabbit1 not found
arr-client-test not found
rabbit2 not found
rabbit3 not found
dl120g7-r4-4876 not found
dl120g7-r4-4876 not found

This dhcp config file includes a group kickstartable which, from what I understand from talking to Dustin, isn't necessary.

We need to either get rid of the hosts not found, or update the config file to reflect the hostname as found in inventory. Once that is complete and the hosts not found are resolved, we can start using inventory to generate this vlan like the others!

digipengi: Can you let me know the proper host name for the ones not found, or whether they no longer exist?

Matthew Larrain[:MaRu]

Comment 10

•

14 years ago

I will get all of these in inventory

Rob Tucker [:rtucker]

Comment 11

•

14 years ago

Just one thing to keep in mind: the system might be in inventory but not named correctly. ganglia1 might be ganglia1.foo.bar.com in inventory but set as just ganglia1 in the dhcp config file.

Amy Rich [:arr] [:arich]

Assignee

Comment 12

•

14 years ago

As discussed with rtucker previously, we want the releng records in inventory and dhcp to have the FQDN up to mozilla.com. For example, ganglia1 is already in inventory, where it's called ganglia1.build.scl1. For any host that does not currently match this, we want to change it. I thought that was something rtucker was going to do programmatically.

Rob Tucker [:rtucker]

Comment 13

•

14 years ago

I cannot do it programmatically since there can be multiple entries with similar names. For example, ganglia1 could be in inventory as ganglia1.build, but the systems team might have ganglia1.private.phx1 etc. There's only a handful of them to manually audit that do not match for the import. Once the import is completed, if we wanted to script something up to change both, that's possible, but they need to match first.
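To make the ambiguity concrete, here is a rough Python sketch of the matching problem, not the actual import tooling. The function name match_short_names and its inputs are hypothetical, and it assumes the short name is simply the first label of the inventory FQDN. A short name only maps automatically when exactly one inventory entry shares it; anything else falls out for manual audit.

```python
from collections import defaultdict


def match_short_names(short_names, inventory_fqdns):
    """Split short DHCP hostnames into unique, ambiguous, and missing matches."""
    # Index inventory entries by their first DNS label (an assumption about
    # how the short names in the dhcp config relate to inventory names).
    by_short = defaultdict(list)
    for fqdn in inventory_fqdns:
        by_short[fqdn.split(".", 1)[0]].append(fqdn)

    unique, ambiguous, missing = {}, {}, []
    for name in short_names:
        candidates = by_short.get(name, [])
        if len(candidates) == 1:
            unique[name] = candidates[0]           # safe to rename automatically
        elif candidates:
            ambiguous[name] = sorted(candidates)   # needs a human to pick
        else:
            missing.append(name)                   # not in inventory at all
    return unique, ambiguous, missing
```

With ganglia1.build.scl1 and ganglia1.private.phx1 both in inventory, ganglia1 lands in the ambiguous bucket, which is the case that blocks a fully scripted rename.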

Amy Rich [:arr] [:arich]

Assignee

Comment 14

•

14 years ago

Some of them (like ganglia1) exist in multiple zones as different hosts, so as long as we can go change things programmatically once the dhcp and inventory lists match up (in this case, we'll need to change dhcp, for example, not inventory), then that works.

Amy Rich [:arr] [:arich]

Assignee

Comment 15

•

14 years ago

Changed dhcp to match inventory:

ganglia1 not found
pxe1 not found
install not found
dev-master01 not found
slavealloc not found
buildapi01 not found
redis01 not found
buildbot-master23 not found
buildbot-master24 not found
buildbot-master25 not found
buildbot-master18 not found
buildbot-master21 not found
hg1 not found
hg2 not found
pdu1.r102-1-scl1 not found
pdu2.r102-1-scl1 not found
pdu1.r102-2-scl1 not found
pdu2.r102-2-scl1 not found
pdu1.r102-3-scl1 not found
pdu1.r102-4-scl1 not found
rabbit2 not found
rabbit3 not found
signing2 not found
arr-client-test not found

removed from dhcp (retasked to seamonkey, work was incomplete):

w32-ix-slave06 not found (retasked to seamonkey)
w32-ix-slave12 not found (retasked to seamonkey)
linux64-ix-slave37 not found (retasked to seamonkey)

looked fine:

dl120g7-r4-4876 not found

actually need to be added to inventory (still todo):

talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found

Matthew Larrain[:MaRu]

Comment 16

•

14 years ago

I am adding these to inventory now:

talos-r4-snow-ref not found
talos-r4-snow-082 not found
talos-r4-snow-083 not found
talos-r4-lion-075 not found
talos-r4-lion-076 not found

Rob Tucker [:rtucker]

Comment 17

•

14 years ago

arr: Can you get me the list that maps the hostnames in the config file as I listed above, to what they have been updated to?

Amy Rich [:arr] [:arich]

Assignee

Comment 18

•

14 years ago

ganglia1.build.scl1
pxe1.build
install.build.scl1
dev-master01.build.scl1
slavealloc.build.scl1
buildapi01.build
redis01.build
buildbot-master23.build.scl1
buildbot-master24.build.scl1
buildbot-master25.build.scl1
buildbot-master18.build.scl1
buildbot-master21.build.scl1
hg1.build.scl1
hg2.build.scl1
rabbit2-mgmt.build.scl1
rabbit2.build.scl1
rabbit3-mgmt.build.scl1
rabbit3.build.scl1
signing2.build.scl1
arr-client-test.build

Rob Tucker [:rtucker]

Comment 19

•

14 years ago

I've made all of the corrections to my staged version of the config file. I can now do the actual import into the inventory key/value database.

Matthew Larrain[:MaRu]

Comment 20

•

14 years ago

talos-r4-snow-ref 102-1 06030
talos-r4-snow-082 102-1 06043
talos-r4-snow-083 102-1 06059
talos-r4-lion-075 102-3 06055
talos-r4-lion-076 102-3 06069

Matthew Larrain[:MaRu]

Comment 21

•

14 years ago

Removed the group for the DL120's

Matthew Larrain[:MaRu]

Comment 22

•

14 years ago

Need to review the work rtucker did and verify that all the machines in the autogenerated config match our dhcpconfig/scl1/vlan48.conf

Rob Tucker [:rtucker]

Comment 23

•

14 years ago

I ran the compare script and got the output to arr via IRC. Results follow:

/tmp/scl1/vlan48.conf does not contain
{'hardware ethernet': '3C:07:54:72:42:0A', 'hostname': 'r5-mini-001-nic0', 'fixed-address': '10.12.52.88'}
{'hardware ethernet': '3C:07:54:72:4A:AE', 'hostname': 'r5-mini-002-nic0', 'fixed-address': '10.12.52.89'}
{'hardware ethernet': '3C:07:54:72:4E:9A', 'hostname': 'r5-mini-003-nic0', 'fixed-address': '10.12.52.90'}
{'hardware ethernet': '3C:07:54:72:4F:C1', 'hostname': 'r5-mini-004-nic0', 'fixed-address': '10.12.52.91'}
{'hardware ethernet': '3C:07:54:72:4F:96', 'hostname': 'r5-mini-005-nic0', 'fixed-address': '10.12.52.92'}
{'hardware ethernet': '3C:07:54:72:4D:8F', 'hostname': 'r5-mini-006-nic0', 'fixed-address': '10.12.52.93'}

What this means is that these hosts do not appear in my massaged file generated on Jan 30. They are, however, live in inventory correctly. So everything checks out on this.

Amy said that she would keep me posted as to when the migration is going to be. I will be available to do the cutover and to monitor for breakage.
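The compare script itself is not attached to this bug. As a rough idea of the kind of check involved, the following Python sketch parses host blocks out of two ISC dhcpd config texts and reports the records that one contains and the other does not. The names parse_hosts and not_contained_in are hypothetical, and the regexes assume simple, unnested host blocks.

```python
import re

# Matches simple, unnested blocks such as: host name { ...; ...; }
HOST_RE = re.compile(r"\bhost\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
# Pulls out the two statements compared in the output above.
STMT_RE = re.compile(r"(hardware ethernet|fixed-address)\s+([^;]+);")


def parse_hosts(conf_text):
    """Return a set of (hostname, MAC, fixed-address) tuples from config text."""
    records = set()
    for hostname, body in HOST_RE.findall(conf_text):
        fields = {key: value.strip() for key, value in STMT_RE.findall(body)}
        records.add((
            hostname,
            fields.get("hardware ethernet", "").upper(),  # normalise MAC case
            fields.get("fixed-address", ""),
        ))
    return records


def not_contained_in(reference_text, generated_text):
    """List records present in generated_text but absent from reference_text."""
    return sorted(parse_hosts(generated_text) - parse_hosts(reference_text))
```

Comparing whole tuples means a host whose name changed but whose MAC and address did not still shows up as a difference, which matches the diff described later in this bug.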

Rob Tucker [:rtucker]

Comment 24

•

14 years ago

Attached file Existing Parsed Config — Details

Existing Parsed Config ordered by ip address

Rob Tucker [:rtucker]

Comment 25

•

14 years ago

Attached file New Generated Config — Details

New generated config file

Rob Tucker [:rtucker]

Comment 26

•

14 years ago

Attached file Genererated Diff file — Details

Generated diff of the 2 lists. The differences are the hostnames after they were updated

Armen [:armenzg]

Comment 27

•

14 years ago

Doing this work tomorrow (Thursday) is risky, IIUC, since we might have several chemspills happening then. I will keep you posted tomorrow morning (EST) as the timeline becomes clear. While I am away tonight, asking someone on the release-drivers mailing list via IRC could also get you the latest news.

Rob Tucker [:rtucker]

Comment 28

•

14 years ago

This is fine with me. Whenever someone lets me know to pull the trigger, I'll do so.

Amy Rich [:arr] [:arich]

Assignee

Comment 29

•

14 years ago

After getting the go-ahead from armen, rtucker did the DHCP cutover this morning at 7:25 pacific. We're monitoring for issues, but have seen a number of successful lease negotiations and no problems.

Matthew Larrain[:MaRu]

Comment 30

•

14 years ago

Are we good to close this bug out?

Rob Tucker [:rtucker]

Comment 31

•

14 years ago

From my perspective everything looks good to go. Amy?

Amy Rich [:arr] [:arich]

Assignee

Comment 32

•

14 years ago

rtucker: the portion of getting auto-generated dhcp is done (thanks for the work!), but we haven't even moved onto admin1a/b yet, so, no, as I told matt last night, there's still a lot of work to be done on this bug. :}

Rob Tucker [:rtucker]

Comment 33

•

13 years ago

Are you waiting on anything from me for admin1a/b?

Matthew Larrain[:MaRu]

Comment 34

•

13 years ago

Dustin: On Monday we will add vlan47, then remove the iptables drops, and then ask netops to remove the DHCP helper from the vlans it serves, which is to say 48 and 75.

Matthew Larrain[:MaRu]

Comment 35

•

13 years ago

After some work, dustin and I added the interface for vlan47.

Amy Rich [:arr] [:arich]

Assignee

Comment 36

•

13 years ago

Handing this off to dustin to do the final DHCP helper removal coordination for production.

Assignee: mlarrain → dustin

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 37

•

13 years ago

So, tcpdump verifies that both of these hosts are servicing requests for vlan48 (!). vlan47 verifies as well, after forcing one of the hosts to renew. vlan75 verifies as well, using the spare arr-test VM.

Failover seems to be running fine:

Apr 10 16:53:52 admin1a dhcpd: balancing pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-own (+/-)2
Apr 10 16:53:52 admin1a dhcpd: balanced pool 7fce73f606e0 10.12.47.0/24 total 21 free 11 backup 10 lts 0 max-misbal 3

and

Apr 2 20:51:22 admin1a dhcpd: failover peer dhcp-failover: I move from normal to startup
Apr 2 20:51:23 admin1a dhcpd: failover peer dhcp-failover: I move from startup to normal

and the leases files are up to date with the latest leases.

What took me a while to figure out is that the iptables rules on these hosts are counting packets:

Chain INPUT (policy ACCEPT 189M packets, 23G bytes)
pkts bytes target prot opt in out source destination
205 70106 DROP udp -- bond0 any anywhere anywhere udp spts:bootps:bootpc dpts:bootps:bootpc
1313K 442M DROP udp -- bond0.48 any anywhere anywhere udp spts:bootps:bootpc dpts:bootps:bootpc

yet dhcpd is clearly replying to those packets. Raw sockets, which come before iptables kicks in, are to blame:

[root@admin1a.infra.scl1 ~]# lsof -p 13040
...
dhcpd 13040 dhcpd 4u raw 0t0 160057793 00000000:0001->00000000:0000 st=07

I removed the (superfluous) iptables rules.

All of which is to say, to the best of my knowledge we can shut off DHCP on ns1/2 and call this done. Removing the DHCP helpers can happen after that without any disruption (they'll just be forwarding broadcast packets that are ignored on receipt).

Amy, do you want to do that during the downtime on Thursday, and then reboot a few scl1 slaves just to verify they come back?
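If the pool-balance log lines quoted above were to feed the nagios monitoring requested earlier in this bug, a check could start from something like the Python sketch below. It is only a guess at a useful shape: the function name parse_pool_line and the imbalance field are hypothetical, and the regex is fitted to the two "balancing pool" / "balanced pool" lines shown here rather than to every message dhcpd can emit.

```python
import re

# Fitted to the balancing/balanced pool lines quoted in the comment above.
POOL_RE = re.compile(
    r"dhcpd: (?P<event>balancing|balanced) pool (?P<pool>\S+)\s+(?P<subnet>\S+)"
    r"\s+total (?P<total>\d+)\s+free (?P<free>\d+)\s+backup (?P<backup>\d+)"
)


def parse_pool_line(line):
    """Return pool counters from a dhcpd failover log line, or None if no match."""
    match = POOL_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("total", "free", "backup"):
        info[key] = int(info[key])
    # Hypothetical health signal: how far apart the free and backup counts are.
    info["imbalance"] = abs(info["free"] - info["backup"])
    return info
```

For the 10.12.47.0/24 line above this yields total 21, free 11, backup 10 and an imbalance of 1; a monitoring check could alert when a hypothetical threshold on that difference is exceeded.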

Assignee: dustin → arich

Mike Taylor [:bear]

Updated

•

13 years ago

Blocks: 748814

Amy Rich [:arr] [:arich]

Assignee

Comment 38

•

13 years ago

* moved two dhcp related cron jobs in /etc/cron.d into /etc/cron.d.disable
* ran "chkconfig dhcpd off" on both hosts (ns1/2.infra.scl1.mozilla.com)
* killed running dhcp and autodeploy processes
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com
* ran "yum remove dhcp" on both hosts (ns1/2.infra.scl1.mozilla.com)
* removed /var/lib/dhcpd/dhcpd.leases.rpmsave on both hosts (ns1/2.infra.scl1.mozilla.com)
* verified that machines were still getting leases via admin1a.infra.scl1.mozilla.com

Amy Rich [:arr] [:arich]

Assignee

Updated

•

13 years ago

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

12 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Existing Parsed Config 14 years ago Rob Tucker [:rtucker] 139.40 KB, text/plain		Details
New Generated Config 14 years ago Rob Tucker [:rtucker] 138.28 KB, text/plain		Details
Genererated Diff file 14 years ago Rob Tucker [:rtucker] 9.49 KB, text/plain		Details