Closed Bug 663644 Opened 13 years ago Closed 13 years ago

fix DEAD_FISH once and for all

Categories

(Infrastructure & Operations :: RelOps: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(2 files)

This seems to be fixed on centos, but is still failing on fedora (32 and 64).  See bug 661382.

I'll re-install fedora on my r3 mini and exclude it from DHCP via an iptables rule on my DHCP server, and see what it does.

In theory, we have two few protections for this already in place:
 ifcfg-eth0 should specify waiting for ever for an address
 puppet should loop 10 times and then reboot if it still can't contact the server

My guess is that the former is hanging forever waiting for dhclient, but dhclient is not actively requesting addresses.  But who knows.
Zandr describes the exact circumstances here, which were not what I thought:

15:55 <@dustin> zandr: while I have you, all you see on screen in dead fish mode is the puppet message, right?
15:56 <@zandr> nope. 
15:56 <@dustin> oh, what do you see?
15:56 <@zandr> They're sitting at the desktop
15:56 <@zandr> with a blank terminal window
15:56  * dustin re-tries to replicate
15:56 <@dustin> I'm glad I hung onto the mini!
15:57 <@zandr> to diagnose, I bring up a new terminal, type ip addr show, and verify that eth0 doesn't have an address, and bounce it
and now, without a DHCP server active, I'm seeing it hang at
  Initializing network drop monitor service
Puppet helpfully exits with a 0 status when it can't find its server.  This is fixed in newer versions, I think.

I can replicate the failure to get a DHCP address (by turning off my DHCP server).  The 'network' service waits for a while (60s?), then barrels on anyway.  However dhclient is running in the background at that point, and continues polling DHCP for addresses.  So by the time zandr's onsite to diagnose, the machine should have gotten an IP.

So I think something else is going on here - perhaps the mini is failing to start its NIC?  I assume we don't have hours-long DHCP downtimes.  Sadly, mii-tool doesn't work on these NICs, so that's hard to tell.

Amy, I'm not sure what runs the DHCP server on this subnet - would it be possible to do a tcpdump for |udp port bootpc or udp port bootps| for, say, a week?
Assignee: dustin → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
It's an odd collection of times that I see request packets from this mini:

18:04:53.615912 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:05:00.522223 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:05:16.530227 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:05:36.532833 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:05:49.524899 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:12:48.582551 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:12:54.487950 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:13:01.488570 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:13:16.495513 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:13:25.488875 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:13:34.488245 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:13:47.491331 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:16:57.565075 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:17:01.468596 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:17:11.474025 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:17:26.477970 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:17:40.476003 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff
18:17:47.468499 34:15:9e:18:56:9e > ff:ff:ff:ff:ff:ff

but it seems to be requesting indefinitely - for 15m at any rate.  I think that the DHCP server tcpdump will help..
I can fire up a tcpdump on ns1.infra.scl1.mozilla.com after the reboot tomorrow.  Do you just want  "tcpdump -i eth0 -n port 67 or port 68" or something that captures payload?
that tcpdump would be completely sufficient - I just need to see source MAC address to determine whether the next DEAD_FISH minis are requesting or not.
started on both ns1 and ns2 (in case one goes down for some reason):

tcpdump -i eth0 -w /root/tcpdump -n port 67 or port 68
Zandr found some historical logging for talos-r3-fed-006 (bug 661382 comment 17).  It was sending regular requests until the 6th, when it went "down"

(June, edited to hopefully not wrap in bugzilla)
 6 15:00:41 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 15:00:41 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
 6 15:24:50 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 15:24:50 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
 6 15:48:18 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 15:48:18 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:05:20 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:05:20 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:17:33 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:17:33 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:39:45 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
 6 16:39:45 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 17:53:28 DISCOVER from d4:9a:20:0d:74:24 via 10.12.48.1
10 17:53:28 OFFER on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 17:53:28 REQUEST for 10.12.49.133 (10.12.75.10) from d4:9a:20:0d:74:24 via 10.12.48.1
10 17:53:28 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 18:58:24 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
10 18:58:24 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 19:24:38 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
10 19:24:38 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 20:25:26 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
10 20:25:26 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1
10 21:13:02 REQUEST for 10.12.49.133 from d4:9a:20:0d:74:24 via 10.12.48.1
10 21:13:02 ACK on 10.12.49.133 to d4:9a:20:0d:74:24 via 10.12.48.1

Lease times in the config: default 86400 (one day), max 604800 (one week).

Previous DISCOVER was on May 31:

May 31 15:18:20 ns1 dhcpd: DHCPDISCOVER from d4:9a:20:0d:74:24 via 10.12.48.1

This machine had booted many a time between May 31st and June 6th - last log confirms that.  What's odd is that the last REQUEST and ACK come about an hour *after* the lease should have expired.  Why would the DHCP server have acked this?  And why did the slave not subsequently send a new DISCOVER?

Here's the system's last log, for comparison:

reboot   system boot   Mon Jun  6 14:53 - 14:59  (00:05)    
reboot   system boot   Mon Jun  6 15:00 - 15:23  (00:22)    
reboot   system boot   Mon Jun  6 15:24 - 15:46  (00:22)    
reboot   system boot   Mon Jun  6 15:48 - 16:03  (00:15)    
reboot   system boot   Mon Jun  6 16:05 - 16:16  (00:10)    
reboot   system boot   Mon Jun  6 16:17 - 16:38  (00:20)    
reboot   system boot   Mon Jun  6 16:39 - 17:02  (00:22)    
reboot   system boot   Mon Jun  6 17:03 - 17:52 (4+00:48)   
reboot   system boot   Fri Jun 10 17:53 - 18:56  (01:03)    
reboot   system boot   Fri Jun 10 18:58 - 19:23  (00:24)    
reboot   system boot   Fri Jun 10 19:24 - 20:24  (00:59)    
reboot   system boot   Fri Jun 10 20:25 - 21:11  (00:46)    
reboot   system boot   Fri Jun 10 21:12 - 00:17  (03:04)    
reboot   system boot   Sat Jun 11 00:18 - 01:26  (01:07)    

So the June 6 17:03 boot is the one that didn't get an address.  From the looks of it, it never tried to get an address.  Something's fishy here.
Attached file dhclient-logs.txt
Here is grep -i dhc /var/log/message*, edited down to the same date range.
Assignee: server-ops-releng → dustin
Here's the last successful initialization:

Jun  6 16:39:45 NetworkManager: <info>  Activation (eth0) Beginning DHCP transaction (timeout in 45 seconds)
Jun  6 16:39:45 NetworkManager: <info>  dhclient started with pid 1195
Jun  6 16:39:45 dhclient: Internet Systems Consortium DHCP Client 4.1.0p1
Jun  6 16:39:45 dhclient: Copyright 2004-2009 Internet Systems Consortium.
Jun  6 16:39:45 dhclient: All rights reserved.
Jun  6 16:39:45 dhclient: For info, please visit http://www.isc.org/sw/dhcp/
Jun  6 16:39:45 dhclient:
Jun  6 16:39:45 NetworkManager: <info>  DHCP: device eth0 state changed (null) -> preinit
Jun  6 16:39:45 dhclient: Listening on LPF/eth0/d4:9a:20:0d:74:24
Jun  6 16:39:45 dhclient: Sending on   LPF/eth0/d4:9a:20:0d:74:24
Jun  6 16:39:45 dhclient: Sending on   Socket/fallback
Jun  6 16:39:46 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Jun  6 16:39:46 dhclient: DHCPACK from 10.12.48.1
Jun  6 16:39:46 NetworkManager: <info>  DHCP: device eth0 state changed preinit -> reboot
Jun  6 16:39:46 dhclient: bound to 10.12.49.133 -- renewal in 40531 seconds.

Note that the renewal is in ~12h - about half the default lease.

On the next boot:

Jun  6 17:03:43 NetworkManager: <info>  Activation (eth0) Beginning DHCP transaction (timeout in 45 seconds)
Jun  6 17:03:43 NetworkManager: <info>  dhclient started with pid 1237
Jun  6 17:03:43 dhclient: Internet Systems Consortium DHCP Client 4.1.0p1
Jun  6 17:03:43 dhclient: Copyright 2004-2009 Internet Systems Consortium.
Jun  6 17:03:43 dhclient: All rights reserved.
Jun  6 17:03:43 dhclient: For info, please visit http://www.isc.org/sw/dhcp/
Jun  6 17:03:43 dhclient:
Jun  6 17:03:43 NetworkManager: <info>  DHCP: device eth0 state changed (null) -> preinit
Jun  6 17:03:43 dhclient: Listening on LPF/eth0/d4:9a:20:0d:74:24
Jun  6 17:03:43 dhclient: Sending on   LPF/eth0/d4:9a:20:0d:74:24
Jun  6 17:03:43 dhclient: Sending on   Socket/fallback
Jun  6 17:03:43 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Jun  6 17:03:50 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Jun  6 17:03:58 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Jun  6 17:04:03 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 10
Jun  6 17:04:13 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 13
Jun  6 17:04:26 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 10
Jun  6 17:04:29 NetworkManager: <info>  (eth0): DHCP transaction took too long, stopping it.
Jun  6 17:04:29 NetworkManager: <info>  (eth0): canceled DHCP transaction, dhcp client pid 1237

so it requests the existing address twice, then tries to discover a new one.  None of these packets have a corresponding log entry on the DHCP server, which is why (Amy) I think it will be good to have a tcpdump trace: how far did these packets (and any responses) get before dying?

That NetworkManager kills dhclient is, I suspect, a bug in NetworkManager, but most likely that's hackable or configurable.  Still, we shouldn't be in a situation where hosts wait for multiple minutes to get an address.

Zandr, to verify, did you grep the dhcp server's logs for the slave's MAC or its IP?
I was looking for the MAC address.
Amy, looks like we have some dead fish in bug 664629 and bug 666089.  Can you get the tcpdumps and DHCP server logs and drop them somewhere I can squint at them?
Well, I lost my VPN access to the logs again, but I'm not sure that this particular instance is going to prove much for us, since the failure was due to a problem on the switch, which triggered a loss of link on the talos systems and probably made them reset their NICs.

I think that the big picture is that NetworkManager is operating at a level *above* dhclient, and deciding that dhclient has had enough even though dhclient itself is happy to keep trying.  I'm not sure how much more time I want to spend proving that, rather than just fixing it - or eviscerating NetworkManager.
Gozer advised something along the lines of "ensure NetworkManager absent", which sounds a lot like a Package resource to me

On my mini here, I'm trying
 yum remove NetworkManager
 chkconfig network on
to see how that works out.
With these changes in place:

I see it continue to poll for an address via tcpdump on my DHCP server, when that server is configured to ignore its packets (iptables).  If it has a previous lease that's still valid, it will boot with that if it can't get in touch with the DHCP server.  If it doesn't have a lease on file (or, presumably, if its lease was expired), it will eventually login anyway without an IPv4 address.  I'm not sure if this is because it sets up an IPv6 address for itself (I'm running radvd).

Losing link after initialization doesn't seem to faze it.

If I boot without link, it doesn't seem to hang at the network stage, although puppetd helpfully returns a zero exit status for "Temporary failure in name resolution".  That kinda sucks, but the fix is to upgrade Puppet, and not related to this bug.

Adding link after boot doesn't make it acquire an address, sadly.  I think that's what NetworkManager did (watching udev, etc.).

All in all, it's not clear that this would be an improvement, but it's something to try.  if this doesn't work, I'd be tempted to just declare F12's DHCP fatally broken and use static assignment, which could be reasonably well-automated via puppet.
Zandr, what do you think - should I give this a try?
Zandr sez, "ship it"
Attachment #544906 - Flags: review?(bhearsum)
Attachment #544906 - Flags: feedback?(gozer)
Attachment #544906 - Flags: review?(bhearsum) → review+
Comment on attachment 544906 [details] [diff] [review]
m663644-puppet-manifests-r1-p1.patch

Review of attachment 544906 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544906 - Flags: feedback?(gozer)
This is landed and isn't causing rebooted machines to fail.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: