Closed Bug 671415 Opened 13 years ago Closed 13 years ago

Get 169 'rev4' mini slaves deployed in scl1 and ready to accept images from RelEng

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: jhford)

References

Details

160 Minis (plus one deployment server and one ref machine) have been ordered, and will need to be deployed in scl1. This bug will be RESOLVED/FIXED when they are ready for an image (or better puppetization) from Releng.
Assignee: server-ops-releng → zandr
per irc conversation with zandr: minis have been received and are in process
We were able to get all the memory into the minis and now we just need to get them put into the custom server mounts. Once that is done I can get them racked in a few hours. We are currently working on getting power into the racks, but that should be working by Friday if everything goes to plan.
(In reply to Matthew Larrain [:digipengi] from comment #2)
> We were able to get all the memory into the minis and now we just need to
> get them put into the custom server mounts. Once that is done I can get
> them racked in a few hours. We are currently working on getting power into
> the racks, but that should be working by Friday if everything goes to plan.

digipengi: in the RelEng / IT meeting yesterday, there was talk about some physical problem with racks, and possibly needing a redo. Is this work in addition to the work you describe in comment #2?
After the meeting yesterday I spoke with Erica and we decided to rack the HPs and minis in a different fashion, so we no longer need to get new racks. I will be at SCL1 today to finish racking the HPs, and tomorrow the interns and I will be adding the minis to their chassis and mounting/cabling them. As far as power is concerned, Erica has the PDUs on order.
Per meeting with IT yesterday:

RelEng is on the hook to determine whether a dongle is required for talos on these minis.

jhford: did you happen to check out the dongle situation on the rev4 mini you had?
(In reply to Chris Cooper [:coop] from comment #5)
> Per meeting with IT yesterday:
>
> RelEng is on the hook to determine whether a dongle is required for talos
> on these minis.
>
> jhford: did you happen to check out the dongle situation on the rev4 mini
> you had?

I haven't had a chance to test whether a dongle is required. I believe that the dongle is currently installed, but I'd like to have verification on that. Justin, do you remember whether the dongle is installed/could you check?
I seem to remember us installing it, but I honestly can't remember for sure. I'm not at the office today, but let me see if I can get someone to go check.
As I mentioned in the meeting yesterday, the machine in MTV1 has a dongle attached.
Cross posting from bug 681726, but we will need to have dongles attached to the minis as 1600x1200x32 is not an available mode at boot without the dongle.
Per meeting with IT yesterday:

* Erica said these machines were waiting on cabling, and a subset (40?) should be online today.
After further investigation this week, it turned out that we needed to unrack and dismantle the enclosures, take down all of the serial/asset information again, fix the enclosures as we reassembled them, and rerack all the machines. Matt, Erica, Dustin and I spent a number of hours working on that this week.

As of now, a number of the hosts are still without power cables (back ordered), and we need additional switches (on order) and more ethernet cables (I will check on the ordering status of those). All but 10 of the hosts have yet to be cabled for network or added to the inventory (we have most of the information necessary for that; it's just a matter of data entry/import at this point). Those that do not have power cords (I would guesstimate about 30 of them) have not been racked yet because it's difficult to get the power cords in after racking.
I put some additional information in another bug, but I'll add it here as well. At this point, to my knowledge we are blocked waiting on the following hardware requirements:

# cable management for all four racks - we had the trays, but the mounting brackets we could find when we were there last week were the wrong size.
# additional PDUs for rack 102-1 and 102-2 - I believe these are on order? We did get in another batch of power cords and powered on all the minis we had outlets for. We should verify we have sufficient cords for the remaining ones (including the ones in MTV1 where the chassis are waiting to be fixed).
# switches for rack 102-1 and 102-2 - I believe these are on order?
# Network cables. 5' for sure (3' is too short), possibly more 7' as well, but we may be good there.
# hooking up networking for all but the 10 minis that jhford is using for testing (blocked on cable management and, for some, switches and PDUs and network cables).
# dongles/resistors for all but the 10 minis that jhford is using for testing - zandr was going to order more dongles before he left. I'm not sure if that was done.

Other than the hardware requirements, we need to:

# record inventory information for switch ports (I don't believe anything but the 10 machines jhford is using have been entered into inventory at all yet. dustin?)
# get all of the hosts into dns/dhcp (if we have hostnames we know we're going to use, this will save some effort in renaming later in this step and the one above)
# figure out a way to deploy the new minis in an automated fashion (digipengi is looking at using ARD for this now, but it will likely require a change in infrastructure because we already have a PXE server that handles everything else on the build vlan).

Erica/Dustin, did I miss anything you know of?
(In reply to Amy Rich [:arich] from comment #12)
> # record inventory information for switch ports (I don't believe anything
> but the 10 machines jhford is using have been entered into inventory at all
> yet. dustin?)

Not even those, yet, although I will do that shortly. I'm waiting on the rest until we have switchport information, since it's a lot easier to bulk-import than to go back and edit that info in later.

> # get all of the hosts into dns/dhcp (if we have hostnames we know we're
> going to use, this will save some effort in renaming later in this step and
> the one above)

Once stuff goes into inventory, this will be straightforward.

> # figure out a way to deploy the new minis in an automated fashion
> (digipengi is looking at using ARD for this now, but it will likely require
> a change in infrastructure because we already have a PXE server that
> handles everything else on the build vlan).

It's also a Netboot server, which is how it remote-boots the minis, but yeah.
Adding dmoore in case there's some shared state he has. Also grabbing this bug.
Assignee: zandr → mrz
3 48-port switches are on order (one for rack 2-1; two for rack 2-2). A request for an additional outlet for the final PDU has been made to Internap. I can get a timeframe on that later in the day today. Once that work has been completed, I will mount the final PDU in rack 2-2. Custom brackets are on order for the PDUs. I am ordering crossbars for cable management for the mac minis and will either find a bracket that works with the plastic (main) cable management or will order something else next week.
Also, the 10 dongles that were provided by relops were connected to the mac minis. These mac minis as well as all of the HP DL120s are located in racks 2-3 and 2-4 and are not impacted by the power request for rack 2-2.
Erica: I believe there are about 10 minis in MTV that are waiting to have the stripped chassis screws dealt with, as well as the ones in SCL1 that haven't been racked yet. With the minis already racked plus the ones that are yet to be racked, I think we may be missing infra in rack 102-1.

* I believe we're going to need a total of 2 switches for rack 201-1 as well as 202-2.
* I believe we also need an additional PDU for rack 102-1. And we want to verify we have enough power cables for all the minis.
* Do we have 5' cables on order? The 3' cables are too short since the minis are not full depth. As a stopgap for the machines we brought up for jhford, we took some 5' cables from Castro.

And the one thing I missed in my initial count of things we still need: I think the IP addressable power strips are not currently cross connected, so that also needs to be done.

Zandr was ordering more dongles, and there are a number of resistors on digipengi's desk. I'll see if I can catch him while he's on vacation to see what the status of the dongles is.
The racked minis are now in inventory. The spreadsheet is updated with an "in inv?" column. If you change something for a column that has a "Y" there, you'll need to update inventory, too.
Depends on: 683715
mrz has ordered dongles.
An update from emux on hardware:

I have been working with Internap to get the extra whip (basically, an outlet for the PDUs) in racks 2-1 and 2-2. They schedule the electrical work with contractors and agree that it is 1 hour's worth of work, once they get it scheduled. I escalated to our Account Manager and was told that I will have a timeframe for the work today, for sure. If I am not near a computer when I get the estimate, I'll text you right away. This blocks 35 of the 177 mac minis at scl1.

The switches are scheduled to arrive tomorrow. All of the cables are staged and ready to go. The cross-connects are in place. I am assigning the ticket for switch configuration to Dustin.

Two issues:

1. I cannot find the resistors. They are not on Matt's desk, nor in scl1. It is possible that they are *in* one of his desk drawers. I'll wait to hear back from him before invading his desk.
2. For racks 2-3 and 2-4, the primary and management NICs are both cabled into the same switch. This needs to be changed to primary -> primary switch and management -> secondary switch. It's a quick change to make. However, it will require about 5 minutes of downtime for the 10 mac minis that releng is already using. The management NICs for the HPs will be offline for 20 minutes. It would be nice to have a one-hour maintenance window to take care of it, in case there are any interruptions.

Blockers are:

1. Power for 35 mac minis. [will have an estimate on this today]
2. Dongles/resistors for 167 mac minis
3. 4 switches for 129 mac minis [Wednesday]
4. Switch configuration
5. Cable maintenance
The giant roll'o'resistors is sitting on top of his rollcart thingy under his desk.
(In reply to Amy Rich [:arich] from comment #20)
> However, it will require about 5 minutes of downtime for the
> 10 mac minis that releng is already using.

These aren't in production. If I am around, I can shut them down for you. If I am not around, please feel free to shut them down cleanly.
John: they went down at about 8:30 today. I haven't had confirmation, but I believe that's finished now.
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> John: they went down at about 8:30 today. I haven't had confirmation, but I
> believe that's finished now.

These machines are all showing an uptime of 10 days; was the maintenance only done to the network? If so, can you confirm that the work is complete?
Now that RelEng is also responsible for former-MoMo Thunderbird machines, it no longer makes sense to set up two separate pools of identical machines with identical toolchains. Instead we'll set up a single new shared pool. Therefore closing bug 688801 as a DUP of bug 671415, and tweaking the summary to match reality. (It's still unclear what we'll do with the existing MoMo machines in their own MoMo network, but at least for new machines going forward, consolidation like this buys us economy of scale, simpler configs/network/IT load, better burst capability, and seems a no-brainer.)
Summary: Get 160 'rev4' minis deployed in scl1 and ready to accept images from Releng → Get 183 'rev4' minis deployed in scl1 and ready to accept images from RelEng
(In reply to John Ford [:jhford] from comment #24)
> (In reply to Dustin J. Mitchell [:dustin] from comment #23)
> > John: they went down at about 8:30 today. I haven't had confirmation, but I
> > believe that's finished now.
>
> These machines are all showing an uptime of 10 days; was the maintenance
> only done to the network? If so, can you confirm that the work is complete?

The machines did not go down, only the network cables were swapped. And, yes, I believe that this work is complete. If not, we'll reschedule.
In fact, it is not complete. I started work and was interrupted with the PDU/cabling emergency at the office. Sorry guys, we do need to reschedule.
I've changed the bug summary to reflect the actual number of mini slaves (there's also one server for spec ops) we should have for scl1 (160+1+22 - rev4-testing, rev4-testing2, and 6 foopies)
Summary: Get 183 'rev4' minis deployed in scl1 and ready to accept images from RelEng → Get 176 'rev4' mini slaves deployed in scl1 and ready to accept images from RelEng
Blocks: 690236
7 more minis to be allocated as foopies in bug 689937 brings our count in scl1 down to 169.
Summary: Get 176 'rev4' mini slaves deployed in scl1 and ready to accept images from RelEng → Get 169 'rev4' mini slaves deployed in scl1 and ready to accept images from RelEng
From emux:

I worked with the electrician at scl1 early this morning... (4:30am) then left to get some rest [...]

The new outlets are installed. PDU mounted and plugged in. More Mac minis can be plugged in.

Regarding cross-connects: I am going to get some lunch and head back to scl1 to make sure everything is plugged in so Dustin can configure the switches.
All of talos-r4-snow-{001..080} are in DNS and DHCP now. Almost - there are two for which we don't have MAC addresses, which we'll need to track down with a crash cart. I'm updating inventory now.
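Most of the DNS/DHCP work here is mechanical, so for illustration only, here's a minimal sketch of generating the entries from a hostname/MAC/IP list. The ISC dhcpd and BIND formats, the CSV input file, and the zone name are all assumptions on my part, not what IT actually used:

#!/usr/bin/env python
# Hypothetical helper, not the actual tooling used here: emit ISC dhcpd
# host stanzas and BIND A records for the talos-r4-snow pool from a CSV
# of "hostname,mac,ip" rows (file format and domain are assumptions).

import csv
import sys

DOMAIN = "build.scl1.mozilla.com"  # assumed zone; adjust to the real one

def dhcp_stanza(host, mac, ip):
    # One fixed-address reservation per mini, keyed on its MAC.
    return ("host %s {\n"
            "    hardware ethernet %s;\n"
            "    fixed-address %s;\n"
            "}\n" % (host, mac, ip))

def dns_record(host, ip):
    # Matching forward record for the zone file.
    return "%-30s IN A    %s\n" % ("%s.%s." % (host, DOMAIN), ip)

def main(path):
    with open(path) as f:
        rows = [r for r in csv.reader(f) if r]
    for host, mac, ip in rows:
        sys.stdout.write(dhcp_stanza(host, mac, ip))
    sys.stdout.write("\n; zone file fragment\n")
    for host, mac, ip in rows:
        sys.stdout.write(dns_record(host, ip))

if __name__ == "__main__":
    main(sys.argv[1])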
The number of slaves has changed from 160 to 183, then to 176, and now to 169. I am going to file bugs for a total of 88 rev4 minis to become 10.6 test slaves. I am picking a static number to avoid confusion stemming from the changing total number of slaves. Once we have 88 of these machines in production, let's evaluate how well this is meeting our demand and adjust the final number of 10.6 rev4 testing slaves accordingly. These 88 slaves will be comprised of the 10 set up in bug 683718 and 78 newly imaged machines, as set up in bug 690951 (once the GO is given to image those machines).
Blocks: 690951
As an update, all of the minis are now in DNS (albeit some with temporary names) and in DHCP for those we have MAC addresses for. Inventory has the correct hostnames, but switch ports are incorrect because the API does not allow editing those :(
Switch ports are fixed in inventory.

I've nominated a machine as a reference machine. John, how would you like to handle setting that up? New bug? It's not online yet - it will need to be installed first.

We also need to label the remaining 70 talos-r4-snow-XXX's. The organization is a little bit logical, but confusing enough that I figured a diagram can help. In the following, the left column is the rack position, and the two remaining columns give the machine number; on the labels, these should be preceded by "talos-r4-snow-", e.g., the left side of rack position 12 in 102-1 is "talos-r4-snow-029".

102-1
-----
12  029 030
11  027 028
10  025 026
 9  023 024
 8  021 022
 7  019 020
 6  017 018
 5  015 016
 4  013 014
 3  011 012

102-2
-----
12  049 050
11  047 048
10  045 046
 9  043 044
 8  041 042
 7  039 040
 6  037 038
 5  035 036
 4  033 034
 3  031 032

102-3
-----
34  069 070
33  067 068
32  065 066
31  063 064
30  061 062
29  059 060
28  057 058
27  055 056
26  053 054
25  051 052

102-4
-----
36  ref
35  010 (server)
34  001 002
33  003 004
32  005 006
31  007 008
30  009 010
29  079 080
28  077 078
27  075 076
26  073 074
25  071 072

When labeling these, please spot-check a few asset tags -- maybe the top, middle, and bottom of each rack -- to make sure the real world corresponds to my spreadsheets and inventory.
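For whoever ends up printing labels, a throwaway script like the one below could expand the diagram into one label string per mini. Purely illustrative; the layout dict is just a transcription of the diagram above (only 102-1 shown), not authoritative data:

#!/usr/bin/env python
# Illustrative only: expand the rack diagram above into label text.
# The LAYOUT dict transcribes the diagram; it is not authoritative.

PREFIX = "talos-r4-snow-"

# rack -> list of (rack position, [machine numbers at that position])
LAYOUT = {
    "102-1": [(12, ["029", "030"]), (11, ["027", "028"]), (10, ["025", "026"]),
              (9, ["023", "024"]), (8, ["021", "022"]), (7, ["019", "020"]),
              (6, ["017", "018"]), (5, ["015", "016"]), (4, ["013", "014"]),
              (3, ["011", "012"])],
    # ... 102-2 through 102-4 transcribed the same way ...
}

def labels():
    for rack, rows in sorted(LAYOUT.items()):
        for position, machines in rows:
            for side, num in zip(("left", "right"), machines):
                yield "%s U%02d %-5s %s%s" % (rack, position, side, PREFIX, num)

if __name__ == "__main__":
    for line in labels():
        print(line)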
digipengi vidyoed us in so we could get a look at scl1, and I have a few more updates:

We have all of the necessary CDUs connected and powered up in racks 1, 3, and 4 now. We have 2 out of 3 CDUs in rack 2. We have cable management installed on the sides of the racks except in between racks 1 and 2. All of the switches are racked, and Matt and Dustin are going to work on getting the rest of them powered, networked, and configured tonight. Time permitting, Matt is also going to work on connecting more cabling to the existing minis.

mrz has solder cups on order for the dongles. Once we have those, we'll hand the cups and the resistors over to the contractor to build the end piece of the dongles.

Spec ops has determined that we can take images of 10.6.8 and 10.6.4 with the existing DS software. We are waiting on jhford for a disk image or, preferably, sign-off on instructions to create a ref machine from scratch (so we know that we can repeat the process from a restore disc).
Per irc conversation with coop, the 8 or so minis that are left from the thunderbird pool (we've been borrowing machines to use as foopies, prototype machines, and will need a couple more as ref images) are to be folded into the general releng pool.
I'm not sure which of the bugs is being used to image these machines, but a FW800 cable will be required. John, do you know if there's one in scl1 now? If not, where should Matt find one?
Here's the list of tasks that remain for the machines in scl1:

1. [emux] The HPs need to be re-plugged in. Erica is at the dc now working on that this morning (this does not impact deployment of minis, but is in the same overarching project in scl1).
2. [emux] PDUs need to be configured so they can be managed remotely, and information needs to be provided on what machines are plugged into which outlets.
3. [emux] Label the console cable running to the Core Switch.
4. [emux] Dongles: Erica left a voicemail for Dao. We have all the parts for soldering. She should hear back from him today and will schedule pickup for the supplies.
5. [emux] We are still waiting on the final PDU, but everything that is racked is powered. We have enough outlets for 16 more mac minis to be plugged in.
6. [emux/digipengi] Rack the rest of the minis (does not impact the 10.6 tester rollout since we have enough for those already).
7. [jhford] provide image for the 10.6 (and later 10.7) minis
8. [digipengi] deploy one mini using jhford's image/instructions
8. [digipengi] take an image of said machine (do we need to borrow a dongle from one of the existing 10 machines for this step?)
9. [jhford] test the newly deployed machine, make adjustments as necessary (goto 7 until production-ready image is verified).
10. [digipengi] image remaining 69 10.6 testers.
11. [emux/digipengi] add finished dongles to the minis.
12. [digipengi] identify any non-functional minis and send them for repair
13. [digipengi] image the remaining 10.7 testers.
Update on #1: HPs in rack 4 are powered and networked. HPs in rack 3 are being worked on.

Update on #3: DONE: Console cable labeled.

Update on #4 & #11: I spoke to emux, and dao is picking up all the materials for the dongles at 2:00PM today and will deliver them Monday. emux says we should have them all attached to the minis by Tuesday night.

Update on #7-#10: digipengi and jhford worked on getting a good image this morning, and digipengi will be imaging more machines this afternoon. I expect that with someone in the datacenter imaging full time, it will take us roughly two days to complete the 10.6 machines (assuming no unforeseen circumstances).
(In reply to Amy Rich [:arich] [:arr] from comment #39)
> Here's the list of tasks that remain for the machines in scl1:
> 7. [jhford] provide image for the 10.6 (and later 10.7) minis
> 8. [digipengi] deploy one mini using jhford's image/instructions
> 8. [digipengi] take an image of said machine (do we need to borrow a dongle
> from one of the existing 10 machines for this step?)
> 9. [jhford] test the newly deployed machine, make adjustments as necessary
> (goto 7 until production-ready image is verified).

Steps 7-9 are done. Thankfully we only needed to take one reference image. All imaged hosts are syncing properly with puppet.

> 10. [digipengi] image remaining 69 10.6 testers.

Matt made lots of progress. There are about 20 machines left to image from what I can see.

> 11. [emux/digipengi] add finished dongles to the minis.
> 12. [digipengi] identify any non-functional minis and send them for repair
> 13. [digipengi] image the remaining 10.7 testers.
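One way to sanity-check "syncing properly with puppet" across the pool is something like the sketch below. This is illustrative only: it assumes passwordless ssh from a trusted host and a puppet client on the slaves' PATH, neither of which this bug confirms, and the host range is made up.

#!/usr/bin/env python
# Sketch: confirm each imaged mini can complete a no-op puppet run.
# Assumes resolvable short hostnames, passwordless ssh, and a "puppet"
# client on the remote PATH -- all assumptions, not confirmed here.

import subprocess

HOSTS = ["talos-r4-snow-%03d" % n for n in range(11, 31)]  # example range

def check(host):
    # --noop evaluates the catalog without changing anything; any
    # non-zero exit (errors or pending changes) is flagged for follow-up.
    cmd = ["ssh", "-o", "ConnectTimeout=10", host,
           "sudo puppet agent --test --noop"]
    try:
        subprocess.check_call(cmd)
        return True
    except subprocess.CalledProcessError:
        return False

if __name__ == "__main__":
    failed = [h for h in HOSTS if not check(h)]
    if failed:
        print("puppet no-op run flagged: %s" % ", ".join(failed))
    else:
        print("all %d hosts synced cleanly" % len(HOSTS))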
is this done now?
(In reply to Dustin J. Mitchell [:dustin] from comment #42)
> is this done now?

If all the machines are racked, powered, networked, dongled, and ready to be imaged, my concerns here are done.
(re comment #42 - I was thinking this bug was for the snow testers only .. oops!)

Regarding lion testers, I'm planning out the resource requirements. I think that most of the hardware/IT prerequisites will be done easily in time, but I have some questions:

Will we be following the same process as for the snow minis, where we snapshot a fully puppeted reference machine rather than a fresh install?

What are the time parameters you expect on building a reference machine that we can snapshot to begin deployment? This would include any necessary pre-snapshot testing.

Is it possible to image all of the machines at once? Doing so minimizes the number of DC trips, which is a better use of our limited on-the-ground resources. I know there's *some* danger of a fatal flaw in the image requiring us to go back and do it again, but I'm hoping that chance is small, and that any necessary fixes can be accomplished with puppet, rather than a reimage.
Assignee: mrz → dustin
Severity: normal → major
(catching up from being on vacation) Step 10 was DONE earlier this week. The talos-r4-snow minis are all imaged and in production (with the exception of the few that are in the scl1 reboots bug). Per bug 693723 we need to reimage the original 10 with the same process.
Step 1 is DONE (all HPs are plugged in).
Step 11 is DONE (all minis have dongles).

This leaves:

2. [emux] CDUs need to be configured so they can be managed remotely, and information needs to be provided on what machines are plugged into which outlets. Erica will be working on this today.

5. [emux] We are still waiting on the final CDU, but everything that is racked is powered. We have enough outlets for all mac minis to be plugged in. The vendor is having issues tracking down our last CDU, and Erica will be talking with them again. This does not block the rollout of any more of the r4 minis we currently have slated for scl1, so for the purposes of this bug, I'm going to call this DONE.

6. [emux/digipengi] Rack the rest of the minis (does not impact the 10.6 tester rollout since we have enough for those already). Matt has 6 more chassis to rack (12 minis). We have power cords, network cables, and dongles for all 12, so it's just a matter of having time to get this done. Since it's not a higher priority than tegras at the moment, we've been allocating those resources elsewhere.

12. [digipengi] identify any non-functional minis and send them for repair. Since the minis come 2 to a rackmount chassis, we'll need to schedule downtime for the second functioning mini in any chassis that contains a defective mini.

13. [digipengi] image the remaining 10.7 testers. Still blocked on getting an image from releng.
Assigning to jhford for questions in comment #44.
Assignee: dustin → jhford
(In reply to Dustin J. Mitchell [:dustin] from comment #44)
> (re comment #42 - I was thinking this bug was for the snow testers only ..
> oops!)

No worries!

> Regarding lion testers, I'm planning out the resource requirements. I think
> that most of the hardware/IT prerequisites will be done easily in time, but
> I have some questions:
>
> Will we be following the same process as for the snow minis, where we
> snapshot a fully puppeted reference machine rather than a fresh install?

At this point, it is looking like we will be doing a reference image of a machine that is fully configured with puppet.

> What are the time parameters you expect on building a reference machine that
> we can snapshot to begin deployment? This would include any necessary
> pre-snapshot testing.

Until I get tests running consistently green, I won't be able to give accurate time estimates, but my goal is to have the machine setup finalised this week. Once I have things set up as best as I can in SF, the machines will need to be wiped and set up in MV/SCL1 to be able to build the reference image with puppet. Once this is done I will be requesting that 10 machines be imaged.

> Is it possible to image all of the machines at once? Doing so minimizes the
> number of DC trips, which is a better use of our limited on-the-ground
> resources. I know there's *some* danger of a fatal flaw in the image
> requiring us to go back and do it again, but I'm hoping that chance is
> small, and that any necessary fixes can be accomplished with puppet, rather
> than a reimage.

I asked for the 10/70 split for two reasons. One is that it gets me a workable pool to start finding issues much quicker than getting 80 machines up. The other is that, as you mention, limiting to 10 minis minimizes the work required if the image is completely broken. Unlike the 10.6 rev4 machines, we are working with a totally new OS this time. It's the first Lion deployment in releng infra. The risk of the image being broken is a lot higher. My preference is to do another 10-machine medium-scale test before rolling out to all 80 machines.
OK, I'll plan to have a refimage in place in mtv1 a week from Wednesday, as a safe bet. I'll also plan for a 10/70 split.
I updated my project plan. The current overall ETA is early to mid November, but that's assuming we have two Matts - one for tegras and one for r4's. Since we only have one, that date is likely to slip a bit until tegras are done. When the reference machine is done, it should go directly to scl1, as we probably won't have the necessary deploystudio equipment in mtv1 to snapshot it.
colo-trip: --- → mtv1
(In reply to Dustin J. Mitchell [:dustin] from comment #50)
> I updated my project plan. The current overall ETA is early to mid
> November, but that's assuming we have two Matts - one for tegras and one for
> r4's. Since we only have one, that date is likely to slip a bit until
> tegras are done.
>
> When the reference machine is done, it should go directly to scl1, as we
> probably won't have the necessary deploystudio equipment in mtv1 to snapshot
> it.

talos-r4-snow-ref is a machine in scl1 and is the machine that will be used as the reference image.
Blocks: 696507
The current state, as posted to the other bugs as well, is that the reference image works fine, DeployStudio works fine, and this is running on talos-r4-lion-{001..010} right now. {011..080} are racked, powered, networked, etc., and ready to receive an image. Once releng signs off on these ten, we'll just need some time onsite to get the remainder imaged.
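Before sign-off on the first ten, a quick reachability/OS-version pass could look like the sketch below. This assumes ssh access to the slaves and that any 10.7.x version is acceptable; it is not part of any existing releng tooling, just an illustration:

#!/usr/bin/env python
# Sketch: spot-check the first ten lion minis before sign-off.
# Assumes resolvable short hostnames and ssh access; the expected
# version prefix is an illustrative assumption.

import subprocess

EXPECTED = "10.7"  # assumed: any 10.7.x build is acceptable
HOSTS = ["talos-r4-lion-%03d" % n for n in range(1, 11)]

def product_version(host):
    # sw_vers is the stock OS X tool for reporting the product version.
    cmd = ["ssh", "-o", "ConnectTimeout=10", host, "sw_vers -productVersion"]
    try:
        return subprocess.check_output(cmd).strip().decode()
    except (subprocess.CalledProcessError, OSError):
        return None

if __name__ == "__main__":
    for host in HOSTS:
        version = product_version(host)
        if version is None:
            print("%s: unreachable" % host)
        elif not version.startswith(EXPECTED):
            print("%s: unexpected OS version %s" % (host, version))
        else:
            print("%s: OK (%s)" % (host, version))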
It sounds like the remaining work here is going to be tracked in bug 698603. Are the machines that aren't talos-r4-snow-XXX or talos-r4-lion-XXX all ready for imaging as well? If bug 698603 doesn't track the remaining work, what is left here?
Blocks: 698603
I'm not sure how this bug *blocks* bug 698603, but I certainly don't see any additional work here, beyond cleanup, that isn't covered in bug 698603, and you're the only other person who would know of such work, so I think that means RESO/FIXED?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The block was that you can't image machines that aren't ready to be imaged. This bug tracked getting machines ready to be imaged; the other bug tracks actually imaging them. There are 9 slaves that I haven't requested as lion or snowleopard testers. Do you know what they are set aside for?
There are a number of minis that are still broken, stuck in chassis, etc., and need to be chased to ground. We'll get those straightened out - maybe while I'm in town - and then figure out what to do with them. Aside from the Mini Server, they aren't set aside for anything.
As far as the other minis, there was mention of using some of them to integrate thunderbird. I would also recommend that some be set aside to work on 10.7 builders since we don't have r5s in the colo yet.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations