Closed Bug 666757 Opened 14 years ago Closed 14 years ago

fix naming problems on seamicro.phx1 nodes

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nmaul, Assigned: nmaul)

Details

Nodes 10-19 and 21-29 are known to be wrong. That is, DNS and the server disagree... the hostname is one number higher than DNS. Apart from being confusing, this is probably also causing an off-by-one error with puppet.

1. Remove the <node30 nodes from the engagement cluster (drain them on http/https).
2. Remove node1-node29 from RHN (or at least the affected ones).
3. Fix the hostnames for node0-node29 (currently showing as node1-node30).
4. Re-register them with RHN.
5. Re-puppetize them, just to make sure we have the names/certs right.
6. Un-drain them.

Pro tip: puppetizing these takes forever. If you do them in a serial loop, be sure to run it inside a screen session (see the sketch below).
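A rough sketch of what I mean by a serial loop, for the record. The hostname pattern/domain and the plain "puppet agent --test" invocation are placeholders here, not necessarily the exact puppetization procedure for these hosts:

#!/usr/bin/env python
# Serially re-puppetize a range of nodes over SSH. Run this inside a
# screen session, since the loop takes a long time end-to-end.
# Assumptions: passwordless root SSH, and that "puppet agent --test"
# is the right invocation for these hosts.
import subprocess

NODES = ["node%d.seamicro.phx1.mozilla.com" % n for n in range(0, 30)]

for host in NODES:
    print("puppetizing %s ..." % host)
    rc = subprocess.call(["ssh", "root@" + host, "puppet agent --test"])
    print("%s finished with exit code %d" % (host, rc))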
Assignee: server-ops → nmaul
1. Done.
2. Done.
3. Done. Rebooted the nodes to put the change into effect (some daemons don't like changing hostnames, and this seemed like the easiest way to get everything fixed). This caused chaos with the Seamicro, though. I think there is a race condition where nodes rebooting simultaneously can screw it up... spacing the reboots out by 15+ seconds seemed to work fine (see the sketch below).
4. In progress.
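For reference, spacing out the reboots was just a loop with a sleep in it, roughly like this (hostnames and the exact delay are placeholders):

#!/usr/bin/env python
# Reboot the renamed nodes one at a time with a delay between them,
# since simultaneous reboots seem to confuse the Seamicro chassis.
import subprocess
import time

NODES = ["node%d.seamicro.phx1.mozilla.com" % n for n in range(0, 30)]

for host in NODES:
    print("rebooting %s" % host)
    subprocess.call(["ssh", "root@" + host, "shutdown -r now"])
    time.sleep(20)  # space the reboots out by 15+ seconds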
Status: NEW → ASSIGNED
4. Done.
5. In progress (screen with 2 windows on dp-nagios01... doing 2 at a time).
5. Done.
6. Not started yet... I want to double-check at least the edge cases first, to make sure they look like they're going to work properly and got the right puppet configs (quick consistency check sketched below).
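A check along these lines should catch any remaining off-by-one (the hostname pattern and domain are placeholders):

#!/usr/bin/env python
# Verify that each node's running hostname matches its DNS name, i.e.
# that the off-by-one problem is really gone everywhere.
import socket
import subprocess

NODES = ["node%d.seamicro.phx1.mozilla.com" % n for n in range(0, 30)]

for dns_name in NODES:
    try:
        socket.gethostbyname(dns_name)  # confirm the DNS record exists
    except socket.gaierror:
        print("NO DNS RECORD: %s" % dns_name)
        continue
    # Ask the node what it thinks its own FQDN is.
    reported = subprocess.check_output(
        ["ssh", "root@" + dns_name, "hostname -f"]).decode().strip()
    if reported == dns_name:
        print("ok: %s" % dns_name)
    else:
        print("MISMATCH: DNS says %s, node says %s" % (dns_name, reported))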
(In reply to comment #0)
> pro tip: puppetizing these takes forever. If you do them in a serial loop
> make sure and screen your session.

I just puppetized node81 and 82 yesterday... and had no problems (it took the "usual" amount of time). Could it be that these nodes have storage-related issues, or something else that's making them slow?
It actually wasn't too bad... I think it just seems bad if you do a bunch in serial. Of course, they're just little Atom CPUs, so they should be a bit slower than a good blade would be.

Anyway, I found another problem with the engagement cluster. For some reason:

- engagement1-20 line up with node10-29.seamicro
- engagement21 is a blade
- engagement22-30 line up with node210-218.seamicro
- engagement31-41 line up with node229-239.seamicro

There was a gap... node219-228 are assigned to the Engagement cluster in puppet, but did not have an 'engagementXX' CNAME. I wedged them in there, which shifted 31-41 down a bunch. I updated commanderconfig.py on ip-admin02.phx, and fixed root's known_hosts file to have all the proper keys for each host (a sketch for checking the CNAME-to-node mapping is below).

6. Done as well. Marking this as resolved... I don't see any more strangeness with engagement naming.
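Something like this is enough to confirm the engagementNN CNAMEs land on the seamicro nodes they're supposed to (the mapping is reconstructed from the pre-fix ranges above, and the domain suffix is a placeholder):

#!/usr/bin/env python
# Check that each engagementNN CNAME resolves to the same address as the
# seamicro node it is expected to map to.
import socket

DOMAIN = ".phx1.mozilla.com"  # assumed suffix, adjust to the real zone

def expected_pairs():
    # engagement1-20 -> node10-29.seamicro (engagement21 is a blade, skipped)
    for i in range(1, 21):
        yield "engagement%d" % i, "node%d.seamicro" % (i + 9)
    # engagement22-30 -> node210-218.seamicro
    for i in range(22, 31):
        yield "engagement%d" % i, "node%d.seamicro" % (i + 188)
    # engagement31-41 -> node229-239.seamicro (mapping before the gap fix)
    for i in range(31, 42):
        yield "engagement%d" % i, "node%d.seamicro" % (i + 198)

for cname, node in expected_pairs():
    try:
        a = socket.gethostbyname(cname + DOMAIN)
        b = socket.gethostbyname(node + DOMAIN)
    except socket.gaierror as e:
        print("lookup failed for %s / %s: %s" % (cname, node, e))
        continue
    print("%-14s -> %-22s %s" % (cname, node, "ok" if a == b else "MISMATCH"))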
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard