Closed Bug 1014703 Opened 11 years ago Closed 10 years ago

image 64 seamicro machines for production

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: arich)

References

Details

Attachments

(2 files)

In bug 1002634 we successfully tested 3 seamicro machines as windows builders. Let's move the remaining into production as well.
Blocks: 1002634
Depends on: 1014700
Assignee: relops → mcornmesser
:catlee: we're going to need to know how many machines to allocate for try vs build. Do you have a count for us? We'll also need to reclaim the first node and rebuild it once the ssds come in, since it's on a 1T SATA drive right now.
Let's go with 50% try, 50% prod.
catlee: so to be clear, you want 7 32G wintry, 7 32G winbuild, 25 16G wintry, and 25 16G winbuild?
Flags: needinfo?(catlee)
Depends on: 1017126
This is the seamicro server config for all 64 nodes (0 - 63). They are split so that we have 25 16G wintry nodes, 25 16G winbuild nodes, 7 32G wintry nodes, and 7 32G winbuild nodes. Information has not yet been added to inventory/dns/dhcp.
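As a quick sanity check on the split described above, the following minimal sketch tallies the four groups and confirms they add up to 64 nodes with an even try/build division. The dictionary layout and the function name `summarize` are illustrative assumptions, not part of any relops tooling.

```python
from collections import Counter

# (pool, RAM in GB) -> node count, taken from the comment above.
ALLOCATION = {
    ("wintry", 16): 25,
    ("winbuild", 16): 25,
    ("wintry", 32): 7,
    ("winbuild", 32): 7,
}


def summarize(allocation):
    """Return the total node count and the per-pool node counts."""
    per_pool = Counter()
    for (pool, _ram_gb), count in allocation.items():
        per_pool[pool] += count
    return sum(allocation.values()), dict(per_pool)


total, per_pool = summarize(ALLOCATION)
print(total, per_pool)
```

This matches the 50% try, 50% prod request: 32 nodes in each pool.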
bhearsum: since the seamicros are chassis and don't have individual ipmi interfaces like the 1U machines, have you given thought to how automatic reboots will be handled? The docs talk about XML-RPC API support, but relops has not investigated this at all. I suspect you want to look at SeaMicro_Config_2.7.0.0_18-Mar-2012_Edition-1.pdf to get a sense of what's possible.
Flags: needinfo?(bhearsum)
(In reply to Amy Rich [:arich] [:arr] from comment #5)
> bhearsum: since the seamicros are chassis and don't have individual ipmi
> interfaces like the 1U machines, have you given thought to how automatic
> reboots will be handled? The docs talk about XML-RPC API support, but
> relops has not investigated this at all.

I've moved away from SlaveAPI work for the time being, so I'm redirecting this to Callek.
Flags: needinfo?(bhearsum) → needinfo?(bugspam.Callek)
In the meantime, slaveapi will file "unreachable" dcops bugs for these hosts if ipmi is not specified/reachable in inventory and we can't connect directly to the host. For slaveapi's needs, I'd love a set of docs on "how to remotely reboot a specific node", with any unique information needed for that stored in inventory, so we don't need to replicate a dict like devices.json in order to do so.
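The fallback described above can be sketched roughly as follows. This is not slaveapi's actual code: the `Host` fields, the action labels, and the order in which the options are tried are all assumptions made for illustration. Only the final rule (no usable ipmi and no direct connection means an "unreachable" dcops bug) comes from the comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    """Hypothetical view of a host as recorded in inventory."""
    name: str
    ipmi_address: Optional[str] = None  # None when inventory has no ipmi entry
    ipmi_reachable: bool = False
    directly_reachable: bool = False


def recovery_action(host: Host) -> str:
    """Pick a recovery step; the labels returned are illustrative only."""
    if host.directly_reachable:
        return "reboot-via-host"
    if host.ipmi_address is not None and host.ipmi_reachable:
        return "reboot-via-ipmi"
    # Neither a direct connection nor ipmi is available, which is the
    # seamicro case: the nodes have no individual ipmi interfaces.
    return "file-unreachable-dcops-bug"


seamicro_node = Host(name="b-2008-sm-0001")
print(recovery_action(seamicro_node))
```

Under these assumptions, every hung seamicro node ends up as a dcops bug until a remote reboot path for the chassis is documented.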
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> catlee: so to be clear, you want 7 32G wintry, 7 32G winbuild, 25 16G
> wintry, and 25 winbuild?

yes
Flags: needinfo?(catlee)
64 hostnames, SREG records, and CNAMEs updated in inventory to match the chassis description and server id designations: https://inventory.mozilla.org/en-US/core/search/#q=b-2008-sm
(In reply to Justin Wood (:Callek) from comment #7)

That might quickly become untenable if there are a lot of reboots required. I'm not sure how to remotely reboot these machines (without using the GUI) as it stands. I suspect the first step will probably involve moving the OOB interface and/or figuring out how to set up the fabric inband without screwing with the IP allocations.
catlee: when can we delete the three existing machines and rebuild them on the new ssds?
Flags: needinfo?(catlee)
Depends on: 1020424
Any time. Coordinate with buildduty to take them out of production first.
Flags: needinfo?(catlee)
Summary: Move remaining seamicro machines into production → image 64 seamicro machines for production
Depends on: 1018450
The three existing machines were disabled in slavealloc and their disk configurations have been wiped and re-done with the proper SSD partitions. I've configured the disks for all but nodes 4-12 (which are waiting on an SSD swap), and have kicked off installs for all of the machines that have disks. It looks like we are having issues talking to the domain controllers, though, so getting the machines functional is blocked on some help from netops.
All new SSDs have been configured, and we're attempting installs on all nodes. We might still be blocked by flow issues, so I'll report back when all installs are successful.
Assignee: mcornmesser → arich
After a great deal of debugging, hacking, and hand-holding, I think I've gotten all but 0004 and 0031 installed. Those two are having tftp issues which are probably the fault of the seamicro's weird internal networking. I'll bang on them more next week.
mgerva: catlee says you're the one who can start smoke testing these before we get them into production, just to make sure that things are working and we're seeing the performance we expected (we went from half of a 1T SSD to 1/4 of a 1T ssd).
Flags: needinfo?(mgervasini)
Thanks arr! b-2008-sm-000{1..3} have been re-enabled in slavealloc. b-2008-sm-0033 should be ready soon.
Flags: needinfo?(mgervasini)
b-2008-sm-0033 has been enabled on slavealloc too.
All the seamicros (except for 0004 and 0031) have been enabled on slavealloc. Most of the seamicros are already accepting jobs, but some of them, after a reboot, just show the following message when connecting with RDP: "Please wait for the Group Policy Client..."

Here's the list of machines that show the above message:

(try)
b-2008-sm-0013
b-2008-sm-0016
b-2008-sm-0017
b-2008-sm-0020
b-2008-sm-0021
b-2008-sm-0024
b-2008-sm-0025
b-2008-sm-0026
b-2008-sm-0027
b-2008-sm-0030

(build)
b-2008-sm-0044
b-2008-sm-0059
b-2008-sm-0060
b-2008-sm-0063
b-2008-sm-0064
Going to split out the issues with 0004 and 0031 into a different bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Flags: needinfo?(bugspam.Callek)