Fix DHCP error handling on fed and fed64 boxes

RESOLVED FIXED

Status

Release Engineering
General
RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: nthomas, Assigned: dustin)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [puppet])

Attachments

(1 attachment)

(Reporter)

Description

8 years ago
If there is no DHCP server responding to a talos-r3-fed or talos-r3-fed64 box on boot it gives up after a few attempts, and reverts to the LIE_THERE_LIKE_A_DEAD_FISH state.

bkero says we should "add PERSISTENT_DHCLIENT=1 to the end of /etc/sysconfig/network-scripts/ifcfg-eth0" to prevent this. Saves IT having to go to the data center when *ahem* DHCP fails.
Created attachment 514421 [details] [diff] [review]
m636069-puppet-manifests-r1.patch

Fedora's helpful practice of persisting interface names based on hardware address is unhelpful here.  The refimages run on eth0, but the persistence carries over to the slaves, so they all run with eth1.  If we specify PERSISTENT_DHCLIENT, then any nonexistent interfaces will have a dhclient instance polling them, which may affect talos timings.

My preferred fix is:
 - add ifcfg-eth0 with puppet
 - blow away /etc/udev/rules.d/70-persistent-net.rules in puppet
 - add a step to delete this file before imaging to the refimage instructions

The trouble is that NetworkManager helpfully watches the network-scripts directory with inotify, and reacts immediately when puppet writes the ifcfg-eth0 file by loading it up, re-ordering the variables, and writing it back to disk.  That means that this particular puppet rule will trigger on every startup, which isn't much use.  Unfortunately, NM re-writes the file with a uuid that puppet couldn't possibly hope to invent on its own, so I don't see a good way to avoid this.  Ideas?  If not, we can leave this bug open in hopes of a better fix.
Assignee: nobody → dustin
Attachment #514421 - Flags: review?(nrthomas)
Attachment #514421 - Flags: feedback?
Attachment #514421 - Flags: feedback? → feedback?(bkero)
Blocks: 636390
(Reporter)

Comment 2

8 years ago
Comment on attachment 514421 [details] [diff] [review]
m636069-puppet-manifests-r1.patch

This looks OK to me, in a rubber stamp kind of sense, but going to ask Ben for his review too. The additional step for updating ref images is likely to get missed some of the time, perhaps we can script something that tests on the hostname and blows away the network and puppet ssl dir ?

I know I filed this for talos-r3-fed* specifically, but did you have a look at the CentOS boxes too ? I think there's a template in use there.
Attachment #514421 - Flags: review?(nrthomas)
Attachment #514421 - Flags: review?(bhearsum)
Attachment #514421 - Flags: review+
There's a separate bug for the CentOS work (bug 636390).

If the refimage step gets missed, it's not such a big deal - each slave will start up without PERSISTENT_DHCLIENT on its first boot, which means that its first boot will fail if the DHCP server is down at that point.  That seems a sufficiently small likelihood that I'm not too worried about it.  A script would be difficult since NetworkManager re-writes these files.
(Reporter)

Comment 4

8 years ago
Close enough to a corner case to ignore then. I'm confused what 'add a step to delete this file before imaging to the refimage instructions' is for though, if NM is going to put it straight back.
I'm not sure how the imaging process works - is it an image of a "live" system?  If not, and the files can be deleted, that'd be good.  In the end, it's probably not that important.
Based on my googling, I _think_ we can fix this bug without PERSISTENT_DHCLIENT by adding eth0 to MANDATORY_DEVICES in /etc/sysconfig/network/config. As I understand it, doing so will cause the dhcp client to block the boot process on getting an address for it.

Would that solve the ickyness?
No, it actually makes it worse because the ethernet device is not necessarily eth0.  PERSISTENT_DHCLIENT is definitely the right tool for the job.
D'oh, I thought that it was always eth0, unless we had PERSISTENT_DHCLIENT=1.
No, the wandering definition of ethN is due to udev persisting (different use of the term) the device names in /etc/udev/rules.d/70-persistent-net.rules.  We could disable that, but I don't think that would buy us much more than a very unlikely corner case (comment 3 and comment 4), and it would risk me leaving a bunch of fedora boxes unreachable on reboot as I figure out how to make it work.  It would also leave those boxes with different performance numbers, since it would involve killing NetworkManager.  In short, it would make this bug a lot harder to finish.

So where does that leave us with the r? ?
Thanks for the explanation.

My only concern here is the same as Nick's: that we're going to often miss the pre-imaging step. Can we add a shutdown job to delete this file rather than relying on memory?
Shutdown job to remove the udev rules?

I would rather fix the root cause (there's a udev rule with the ref machine's MAC in it) than add another moving part to work around it.

Dealing with the udev issue on the image is a cost of using images rather than install+puppet, but we should fix it at the point of origin, rather than add another kludge.
Just a shutdown job to blow away this file, so that we ensure it never messes up the ref image:

>  - blow away /etc/udev/rules.d/70-persistent-net.rules in puppet
>  - add a step to delete this file before imaging to the refimage instructions
And blow it away on every reboot on every production machine, right? That's the gratuitous moving part I object to.
Is there any harm in doing so?
It's hackish, it's another moving part that could go wrong, it's something else to maintain. 

It also means that every host regenerates that rule on boot, which is yet another moving part that could go wrong and knock the machine off the net.

If we fix the image, than this file gets built once, while we're deploying the machine and looking at it.
Under the current plan, we have to remember to delete the file every time we shut down the ref image for imaging -- that's something that we *will* forget to do sometimes, and if I understand correctly, means that newly imaged machines may have issues on first boot.

What if we deleted the file at shutdown only on the ref images?
I think we agreed earlier that this was of vanishingly small importance (see comment 4). Consider that someone usually watches a machine do its first boot after a re-image, and that if DHCP is down IT is unlikely to be working on a re-image).

Adding more moving parts to fix something that will likely never bite us is not worth it, IMHO.  The better fix is to get away from refimages, but I'm not going to propose that here :)

I think that the cause of all this furor is my suggesting a new step to the snapshot process that we might forget.  The impact of forgetting is close to nul.  So let's just not add that step.  Problem solved!
Comment on attachment 514421 [details] [diff] [review]
m636069-puppet-manifests-r1.patch

Thanks for reminding me that this only affects us if DHCP is down, I missed that part of the earlier comments :(.
Attachment #514421 - Flags: review?(bhearsum) → review+
Comment on attachment 514421 [details] [diff] [review]
m636069-puppet-manifests-r1.patch

8fd2b7012a9a
Attachment #514421 - Flags: checked-in+
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Comment on attachment 514421 [details] [diff] [review]
m636069-puppet-manifests-r1.patch

I'm going to assume fb+ from bkero on this one :)
Attachment #514421 - Flags: feedback?(bkero) → feedback+

Comment 21

7 years ago
Looks good to me!  Sorry for the lack of feedback.
This re-occurred in bug 658414.  I think the F12 DHCP client is just bogus.  This should be fixed by some work related to bug 658678 to re-vamp the puppet/buildbot startup process on all POSIX hosts.
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.