668395 - Heatsinks, Fans, and RAM for all scl1 iX systems

Reporter

Description

14 years ago

This is the final "final solution" for iX, referenced in bug 596366 comment 118. Zandr outlines the process well in that bug:

----
iX will set up with a couple of guys on-site in scl1. Mozilla will provide 3-4 folks to manage moving machines in and out of racks.

Dustin will take a set of machines out of service to 'prime the pump'. Once those are ready to come out, we'll start pulling machines out of the rack and handing them off to iX. iX will install the new HSF and the upgraded memory, and hand the machines back to us. We'll get them racked back up, and hand them off to arr/Dustin for a quick smoke test and return to service.

As those machines come online, new machines will be finishing builds and ready to come out for upgrade. I expect the downtime for any given machine will be on the order of 15-30 minutes, and in the name of pipelining we might have 5-10 machines off at a time. We can pull machines in any order as they become idle.

Paul (from iX) and I did 8 machines like this in something like 45min. If we can get two workflows running in parallel, we should be able to get this work done in one or two days. ("We" in this case is Zandr plus one or two folks from ops, staffing TBD)
----

Note that we are *not* replacing drives.

I will take care of feeding machines in order of priority: begin with known-bad, then follow with a mix of linux, linux64, and win32, being careful not to leave any one silo with too few builders. I'll fill in with w64 machines as necessary to keep the onsite people busy. I'll try to order by rack, where possible.

Known-bad machines will need a new image, and we'll take care of that as possible - it's not practical to start a remote re-image in 10-15 minutes, since it requires a bunch of reboots to catch the BIOS screen.

We'll start known-good machines up in a disabled state, and test things briefly where possible - meaning hdparm on Linux - before pushing them back to the production whence they came. Anything that doesn't pass muster will be punted into its own bug and handled later as appropriate.

Tracking at https://spreadsheets.google.com/ccc?key=0AksdMKVHqpC3dDlNMnJyRGZHLXdEV2hicEpIcUU1dGc&hl=en&authkey=CIXTof4M

A few important points:
- this is an all-or-nothing project. All IX systems should have a new heatsink and some RAM by the end.
- this will *not* require a downtime, although it may lead to elevated build wait times during the repairs.

TBD:
- when mtv1 happens (lower priority, since hosts in mtv1 are acting fine, but part of this project)
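The feeding order described above (known-bad first, then an even mix of linux, linux64, and win32, with w64 as filler) could be sketched roughly as follows. This is only an illustration, not the process actually used: the function name, the machine names, and the data layout are hypothetical, the w64 machines are simply placed last rather than interleaved on demand, and rack ordering and the per-silo minimum are left out.

```python
from collections import deque


def feed_order(machines, known_bad, fill_type="w64"):
    """Return machine names in the order they might be handed to the crew.

    machines:  dict mapping a (hypothetical) machine name to its slave type.
    known_bad: set of machine names already known to be faulty.
    """
    # Known-bad machines go first.
    first = sorted(name for name in machines if name in known_bad)

    # Group the remaining machines into one queue per slave type.
    queues = {}
    for name in sorted(machines):
        if name not in known_bad:
            queues.setdefault(machines[name], deque()).append(name)

    # Filler machines (w64 in the description) are held back.
    fill = queues.pop(fill_type, deque())

    # Round-robin across the other types so no single silo is drained at once.
    mixed = []
    while any(queues.values()):
        for slave_type in sorted(queues):
            if queues[slave_type]:
                mixed.append(queues[slave_type].popleft())

    return first + mixed + list(fill)
```

A real schedule would also have to watch which machines are idle at any given moment, which this sketch ignores.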

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 1

14 years ago

(In reply to comment #1)
> Note that we are *not* replacing drives.

In case some machines have bad drives, can you have some pre-imaged linux and some pre-imaged win32 drives on hand so if a machine is determined to have a bad drive, it can be swapped immediately, while everyone is still at the colo? (Instead of needing a second trip, or needing to wait to image after installing a blank disk).

Zandr Milewski [:zandr]

Assignee

Comment 2

14 years ago

(In reply to comment #1)
> (In reply to comment #1)
> > Note that we are *not* replacing drives.
> In case some machines have bad drives, can you have some pre-imaged linux
> and some pre-imaged win32 drives on hand

We will have a supply of replacement drives on hand, but they won't be pre-imaged.

Predicting how many of which machines are likely to fall out, building a pre-imaging rig, then spending time doing the imaging on spec just isn't a good use of time.

If we encounter a bad drive, we'll replace it, and we can image the machines remotely.

Amy Rich [:arr] [:arich]

Updated

14 years ago

Assignee: server-ops-releng → zandr

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 3

14 years ago

I've set up the spreadsheet to track both state and priority of each machine. The states are:

1 = running
2 = disabled, still running
3 = off & ready for upgrade
4 = unracked/upgrading
5 = re-racked & powered
6 = back in production
8 = back, needs setup/tests
9 = back, error (see notes)
10 = in mtv1

and the priorities are designed to work through the various slave types evenly.

Remaining prep work:
- turn off buildbot-master5 (we can disable a few days beforehand)
- replace scl-production-puppet (bug 659005 and its follow-ons)
- add rack position to spreadsheet

Possible blocker:
- We can't get a 32-bit image to run on upgraded hardware, possibly due to PAE bugs in the ancient kernel we're running? Details and investigation are on bug 668376.
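For anyone scripting against an export of the spreadsheet, the state codes listed above could be captured in a small enum. This is a sketch only: the member names and the `summarize` helper are hypothetical, and the export format (a name-to-number mapping) is assumed. The numbering follows the comment, which lists no state 7.

```python
from collections import Counter
from enum import IntEnum


class MachineState(IntEnum):
    RUNNING = 1
    DISABLED_STILL_RUNNING = 2
    OFF_READY_FOR_UPGRADE = 3
    UNRACKED_UPGRADING = 4
    RERACKED_POWERED = 5
    BACK_IN_PRODUCTION = 6
    # No state 7 is listed in the comment, so none is defined here.
    BACK_NEEDS_SETUP_TESTS = 8
    BACK_ERROR = 9
    IN_MTV1 = 10


def summarize(states):
    """Count machines per state from a name -> state-number mapping.

    Raises ValueError if a number is not one of the listed states.
    """
    counts = Counter(MachineState(value) for value in states.values())
    return {state.name: counts[state] for state in MachineState if counts[state]}
```

Converting through the enum means a mistyped state number fails loudly instead of being silently counted.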

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 4

14 years ago

Remaining prep work:
- turn off buildbot-master5 (we can disable a few days beforehand)
- replace scl-production-puppet (bug 659005 and its follow-ons)
- add rack position to spreadsheet
- halt all of the state-3 machines before D-day

Post-D-Day:
- update inventory to indicate the status of every machine: new RAM, new drive, HSF

Zandr Milewski [:zandr]

Assignee

Comment 5

14 years ago

Proposing Tuesday, 19 Jul for D-Day. I'm waiting for confirmation back from iX right now.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 6

14 years ago

D-day proceeded *very* smoothly. I'll summarize briefly here; please ask if there are further needs.

We (iX and Mozilla) un-racked, added HSF and RAM, and re-racked 146 machines in scl1, including three buildbot masters and a puppet master. No machines in mtv1 were affected. None of the failures we were concerned about occurred.

All of the machines that were in production yesterday were returned to production and verified. Conversely, all of the machines that were not in production yesterday are verified as far as possible, but *not* moved into production; in other words, we didn't do any reimaging or other manual set-up today.

All of this work occurred in just over 8 hours, including a lunch break, with no downtime. Only one job was burned, and that was a hung build where waiting for the timeout failure served no purpose.

On the machines running linux, Amy verified the read speeds. They're all what we'd like - most >100MB/s, and all at least 80MB/s. Details are in the spreadsheet.

The state as of this writing is exactly the state as of 24 hours ago, except that all of the affected machines have a new heatsink/fan and 8GB of RAM.

I'm going to work through the remaining problems and open bugs. Some may still need to go back to iX, but we can work that out at our leisure. In the interim, this bug is finished. I'll leave it open for the moment for questions.

The parent bug (bug 596366) is effectively complete (finally) at this point.
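The read-speed check could be scripted along these lines. This is a sketch under assumptions, not what was actually run: the bug only says hdparm was used, so the use of `hdparm -t` (buffered disk reads) is assumed, and the function names and labels are hypothetical. The 80 and 100 MB/s thresholds are the figures quoted above.

```python
import re

# Matches the summary line that `hdparm -t` prints for buffered disk reads.
_BUFFERED_RATE = re.compile(r"buffered disk reads:.*=\s*([0-9.]+)\s*MB/sec")


def read_speed_mb_s(hdparm_output):
    """Extract the buffered read rate in MB/s from hdparm's text output."""
    match = _BUFFERED_RATE.search(hdparm_output)
    if match is None:
        raise ValueError("no buffered read rate found in hdparm output")
    return float(match.group(1))


def classify(speed, floor=80.0, target=100.0):
    """Label a read speed against the thresholds mentioned in the comment."""
    if speed > target:
        return "good"
    if speed >= floor:
        return "acceptable"
    return "slow"
```

For example, feeding it an illustrative line (not taken from this bug) such as ` Timing buffered disk reads: 322 MB in  3.01 seconds = 106.98 MB/sec` would give a rate of 106.98 and the label "good".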

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 7

14 years ago

Let's limit this bug to scl1 and open a new one for mtv1.

Summary: Heatsinks, Fans, and RAM for all iX systems → Heatsinks, Fans, and RAM for all scl1 iX systems

Zandr Milewski [:zandr]

Assignee

Comment 8

14 years ago

For the benefit of those trying to play the home game: There are a number of comments in the spreadsheet that are out of date. Don't jump to any conclusions based on anything you might read in the 'Status' column before we get that cleaned up.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 9

14 years ago

So the current situation, based on the project spreadsheet, is:

108 slaves up and running (including those in bug 673436)
  8 slaves with hardware problems (bug 672973)
 30 future w64 slaves
---
146

(there are also 74 slaves in mtv1 that aren't part of this bug)

Given that the main project is done and all of the lingering machines are now on other bugs, this one is done!

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

12 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: dustin, Assigned: zandr)

Security

(public)
