Closed Bug 858840 Opened 11 years ago Closed 11 years ago

ns{1,2}.private.phx1 using dhcp

Categories

(Infrastructure & Operations :: Change Requests, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: limed, Assigned: bhourigan)

Attachments

(1 file)

This host is grabbing its IP via DHCP, which sounds like bad mojo. We should change it to use a static address.
Proposed patch to address network configuration
Assignee: server-ops-infra → bhourigan
Status: NEW → ASSIGNED
I'll land this Monday after our r/o window closes
Adding a note on what we talked about over IRC: unless I am going crazy, I had actually tried to change this to a static assignment the last time phx1 had issues with DHCP. For some reason it changed back to DHCP, which I suspect was caused by dhclient, so we might want to take a look at that too.
I've changed ns2.private.phx1 to static IP assignment, and made sure dhclient isn't running.
The network changes to ns1 are staged, but I'll need some input on when this can be done. I expect 6 seconds of downtime to restart networking. The default libresolv timeout is 5 seconds. Production services could see query failures.

:jabba,

Please advise how I should proceed
Flags: needinfo?(jdow)
I think the best course of action would be to:

1) swap the /etc/resolv.conf entries on clients so that ns2 is preferred, which will minimize the impact during the cutover and
2) do the cutover during the June 1st (or possibly June 2nd) releng maintenance window to further minimize possible impact
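
A minimal sketch of step 1, assuming the client resolv.conf lists both servers as plain `nameserver` lines. The function name `prefer_nameserver` and the 192.0.2.x addresses are hypothetical placeholders, not the real phx1 values; this only reorders text and does not touch any file.

```python
def prefer_nameserver(resolv_conf_text: str, preferred_ip: str) -> str:
    """Return resolv.conf text with preferred_ip in the first nameserver slot.

    Non-nameserver lines (search, options, comments) keep their positions.
    """
    lines = resolv_conf_text.splitlines()
    # Indexes of well-formed "nameserver <ip>" lines.
    ns_indexes = [
        i for i, line in enumerate(lines)
        if len(line.split()) >= 2 and line.split()[0] == "nameserver"
    ]
    servers = [lines[i].split()[1] for i in ns_indexes]
    if preferred_ip not in servers:
        return resolv_conf_text  # nothing to reorder

    # Move the first occurrence of the preferred server to the front.
    pos = servers.index(preferred_ip)
    reordered = [servers[pos]] + servers[:pos] + servers[pos + 1:]
    for i, server in zip(ns_indexes, reordered):
        lines[i] = f"nameserver {server}"
    return "\n".join(lines) + "\n"


# Hypothetical addresses: 192.0.2.1 stands in for ns1, 192.0.2.2 for ns2.
sample = "search example.invalid\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
swapped = prefer_nameserver(sample, "192.0.2.2")
```

Clients that cache the server list would still need a restart to pick up the new order.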
Flags: needinfo?(jdow)
Also, bring it up in next CAB meeting.
CAB NOTES:

I'd like to make a simple change to ns1.private.phx1 so that the IP is configured statically rather than allocated by DHCP. It will involve an edit to /etc/sysconfig/network-scripts/ifcfg-bond0 and a 'service network restart'.
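
For illustration, the ifcfg edit could look roughly like the sketch below. The attached patch is not reproduced in this bug, so this is an assumption about its shape: it swaps `BOOTPROTO=dhcp` for `BOOTPROTO=none` and appends `IPADDR`, `NETMASK` and `GATEWAY`. The helper name `to_static` and the 192.0.2.x values are hypothetical.

```python
def to_static(ifcfg_text: str, ipaddr: str, netmask: str, gateway: str) -> str:
    """Return ifcfg text with DHCP replaced by a static address assignment."""
    managed = {
        "BOOTPROTO": "none",  # "none" means no DHCP/BOOTP on this interface
        "IPADDR": ipaddr,
        "NETMASK": netmask,
        "GATEWAY": gateway,
    }
    kept = []
    for line in ifcfg_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in managed:
            continue  # drop the old value; the new one is appended below
        kept.append(line)
    kept.extend(f"{key}={value}" for key, value in managed.items())
    return "\n".join(kept) + "\n"


# Hypothetical starting file and addresses, not the real ns1 configuration.
before = "DEVICE=bond0\nBOOTPROTO=dhcp\nONBOOT=yes\n"
after = to_static(before, "192.0.2.10", "255.255.255.0", "192.0.2.1")
```

The result only takes effect after the 'service network restart', and a running dhclient would have to be stopped separately.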

It will require ~6s of downtime on ns1.private.phx1. The default libresolv timeout is 5s, but the default retry is 2. I estimate it will take 10s of downtime for applications to observe a failed query. I don't expect any problems, but there is a possibility that some DNS queries will fail during this time.
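
The arithmetic above can be sketched as follows. This is a simplified model, not the actual libresolv logic: it assumes a single nameserver, a fixed 5s timeout per attempt, 2 attempts, and that any query sent while ns1 is down is lost. All function names are hypothetical.

```python
def attempt_times(start_s: float, timeout_s: float = 5.0, attempts: int = 2) -> list:
    """Times at which the resolver sends each attempt for one lookup."""
    return [start_s + i * timeout_s for i in range(attempts)]


def failure_observed_after(timeout_s: float = 5.0, attempts: int = 2) -> float:
    """Seconds from the first attempt until the application sees a failure."""
    return timeout_s * attempts


def query_fails(start_s: float, outage_start_s: float, outage_end_s: float,
                timeout_s: float = 5.0, attempts: int = 2) -> bool:
    """A lookup fails only if every attempt is sent while the server is down."""
    return all(
        outage_start_s <= t < outage_end_s
        for t in attempt_times(start_s, timeout_s, attempts)
    )


# 6s outage starting at t=0: a lookup started at t=0.5 retries at t=5.5,
# still inside the outage, so it fails about 10s after it started.
early = query_fails(0.5, 0.0, 6.0)
# A lookup started at t=2.0 retries at t=7.0, after ns1 is back.
later = query_fails(2.0, 0.0, 6.0)
```

Under these assumptions only lookups that start in roughly the first second of the outage would fail outright; the rest would just be delayed by one timeout.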

I would prefer to avoid mucking with resolv.conf globally. While we can remove the IP from resolv.conf, many applications (such as Zeus) cache the server list, so we would have to restart many services twice: once to remove the IP, and again to re-add it.

I'm flexible on timing. I'm thinking early some morning like 5AM PST, or during another maintenance window when folks are already anticipating a blip.
Component: Server Operations: Infrastructure → Server Operations: Change Requests
QA Contact: jdow → shyam
Flags: cab-review?
Approved to ride along on the treeclosing window of June 1st.
Flags: cab-review? → cab-review+
Group: infra
Blocks: 878494
Changes applied, ns1 is no longer using dhcp. Is nice!
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Change Request: --- → approved
Flags: cab-review+