Closed Bug 758275 Opened 12 years ago Closed 12 years ago

reimage w32 builders as w64 builders

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Details

According to coop, releng is close to getting all windows builds functional on w64 servers.  This means that we'll be reimaging most w32 machines as w64 machines.

There are two groups of servers that will be affected, w32 servers in scl1 and w32 servers in mtv1.  

w32-ix-slave01
w32-ix-slave02
w32-ix-slave03
w32-ix-slave04
w32-ix-slave05
w32-ix-slave07
w32-ix-slave08
w32-ix-slave09
w32-ix-slave10
w32-ix-slave11
w32-ix-slave13
w32-ix-slave14
w32-ix-slave15
w32-ix-slave16
w32-ix-slave17
w32-ix-slave18
w32-ix-slave19
w32-ix-slave20
w32-ix-slave21
w32-ix-slave22
w32-ix-slave23
w32-ix-slave24
w32-ix-slave25
w32-ix-slave26
w32-ix-slave27
w32-ix-slave28
w32-ix-slave29
w32-ix-slave30
w32-ix-slave31
w32-ix-slave32
w32-ix-slave33
w32-ix-slave34
w32-ix-slave35
w32-ix-slave36
w32-ix-slave37
w32-ix-slave38
w32-ix-slave39
w32-ix-slave40
w32-ix-slave41
w32-ix-slave42
w32-ix-slave43
w32-ix-slave44

mw32-ix-slave11
mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24
mw32-ix-slave25
mw32-ix-slave26

In order to reimage them, the following needs to happen:

Prep work, can be done now:
* Create dns/dhcp entries for the machines in the winbuild domain.  Take the MAC info from inventory for these to create new entries for w64-ix-slave43 - w64-ix-slave84 for the w32-ix machines and w64-ix-slave85 - w64-ix-slave100 for the mw32 machines.  Create entries for both primary and management interfaces for every machine. (matt)
* verify that rack space with power and network are available in scl1 (bug 758245)
* configure VLAN for primary (and, if necessary) management interfaces for machines moving from mtv1 (dustin)
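
As a rough sanity check on the naming plan in the prep list, the sketch below rebuilds the two old host lists and the new w64 name ranges and confirms that the counts line up (42 machines in scl1, 16 in mtv1). It assumes the mw32 machines take w64-ix-slave85 through w64-ix-slave100 and that management interfaces use a `-mgmt` suffix, as the DNS entries later in this bug show. The helper and variable names are made up for illustration.

```python
# Hypothetical helper names; the host numbering is copied from the lists in this bug.
W32_SCL1 = [n for n in range(1, 45) if n not in (6, 12)]  # w32-ix-slave01..44, no 06 or 12
MW32_MTV1 = list(range(11, 27))                           # mw32-ix-slave11..26


def new_names(first, last):
    """Return primary and management hostnames for w64-ix-slave<first>..<last>."""
    names = []
    for n in range(first, last + 1):
        names.append(f"w64-ix-slave{n}")
        names.append(f"w64-ix-slave{n}-mgmt")
    return names


scl1_entries = new_names(43, 84)
mtv1_entries = new_names(85, 100)

# Two entries (primary + mgmt) are needed per physical machine.
assert len(scl1_entries) == 2 * len(W32_SCL1)
assert len(mtv1_entries) == 2 * len(MW32_MTV1)
print(len(W32_SCL1), len(MW32_MTV1))
```

A mismatch in either assertion would mean the reserved name range does not cover the machines being reimaged.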

Reimaging work for machines in scl1 (42 machines), must wait till we're ready to pull the trigger:
* change VLAN for primary (and, if necessary) management interfaces (dustin)
* verify that primary and mgmt interfaces work (matt/dustin)
* modify nagios (arr)
* modify inventory (matt)
* modify corp dns (remove old A and PTR records, update bmo CNAME) (arr)
* push new image (matt)

The machines in mtv1 will need to be moved to scl1 and require a hardware upgrade before being put into service there, so this will take longer and should be done in a separate batch.

Reimaging work for machines in mtv1 (16 machines), must wait till we're ready to pull the trigger:
* unrack machines (matt/jake)
* perform hardware upgrades (matt/jake)
* rack machines in scl1 (matt/jake)
* verify that primary and mgmt interfaces work (matt/jake/dustin)
* modify nagios (arr)
* modify inventory (matt)
* modify corp dns (remove old A and PTR records, update bmo CNAME) (arr)
* push new image (matt)
Armen and I are frantically trying to fix the w64 image. We have narrowed the problem down to the buildbot update Python script (it was affected by the SCL3 move, I think), and we should have that done by tomorrow if things go smoothly. I will start the dns/dhcp entries tomorrow or over the weekend. The actual imaging will only take me a day once we have everything else hammered out.
(In reply to Amy Rich [:arich] [:arr] from comment #0)
> Reimaging work for machines in scl1 (42 machines), must wait till we're
> ready to pull the trigger:
> * change VLAN for primary (and, if necessary) management interfaces (dustin)
> * verify that primary and mgmt interfaces work (matt/dustin)
> * modify nagios (arr)
> * modify inventory (matt)
> * modify corp dns (remove old A and PTR records, update bmo CNAME) (arr)
> * push new image (matt)

We're ready to start converting the w32 ix machines in scl1 to w64 on our end. How soon can this be done? What items on the IT punch list above are still outstanding?

I can coordinate with whoever will be doing the re-imaging (MaRu, I assume) to take batches of w32 slaves offline as required for the rest of the week.

> The machines in mtv1 will need to be moved to scl1 and require a hardware
> upgrade before being put into service there, so this will take longer and
> should be done in a separate batch.

We have another merge coming next Monday, after which Aurora will also be building on w64 and we should be clear to move 16 of the 26 machines from mtv -> scl. 

What portion of these prelim steps could be performed while the machines are still in mtv? 

Would we upgrade the hardware on the machines staying (for now) in mtv, or wait until they are also ready to move (esr17)?
The only thing we can do ahead of time is pre-populate DNS and DHCP on the windows domain controller.  Everything else has to wait till we take the machine out of commission as a w32 builder because it will be disruptive to DNS, DHCP, the hardware, or the network for that machine and builds will fail.

The DNS/DHCP pre-population will be done today.

Maru will be the one doing the reimaging of the w32 machines, and he'll need to coordinate with dustin to get the switch ports cut over.  We should come up with batches of machines and track this via etherpad.

As far as upgrading the hardware that's staying in mtv1, it's time consuming (and there's limited space) to unrack, upgrade, and rerack them, so we'd hoped to do that when the physical machine moves happen.  Is there a reason you're looking to do that now?
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> As far as upgrading the hardware that's staying in mtv1, it's time consuming
> (and there's limited space) to unrack, upgrade, and rerack them, so we'd
> hoped to do that when the physical machine moves happen.  Is there a reason
> you're looking to do that now?

No, just looking for any potential time savings.
(In reply to Chris Cooper [:coop] from comment #4)
> (In reply to Amy Rich [:arich] [:arr] from comment #3)
> > As far as upgrading the hardware that's staying in mtv1, it's time consuming
> > (and there's limited space) to unrack, upgrade, and rerack them, so we'd
> > hoped to do that when the physical machine moves happen.  Is there a reason
> > you're looking to do that now?
> 
> No, just looking for any potential time savings.

This would be the opposite of time saving.  :}
Depends on: 760129
I've only looked at slave43 and slave45 so far, but I've seen some problems:

* both were logged in as administrator
* both had the old VNC password
* both had the old cltbld password
* neither has an e:\builds\moz2_slave dir
Depends on: 760371
All the w32-ix-slaves in scl1 have been re-imaged now. Leaving open to track the mw32-ix machines in mtv.
Do we have an ETA on when we are moving and upgrading these machines?
Moving any of the mw32 machines is on hold until we get the power problem in 3/mdf fixed.
Depends on: 761237
Coop, are we moving forward with the planned relocation/reimage of the 16 machines in mtv1 now that the power issue has been fixed?
(In reply to Amy Rich [:arich] [:arr] from comment #11)
> Coop, are we moving forward with the planned relocation/reimage of the 16
> machines in mtv1 now that the power issue has been fixed?

Now that merge day has passed, let's start the process of moving the 16 iX machines and getting them upgraded/reimaged.
For the record, yesterday we also moved TB jobs to win64 slaves. I think up to mozilla-beta (only mozilla-release and esr are left).
Depends on: 774829
Matt needs to pre-load the domain controllers (old and new) with the DHCP info for the remaining 16 machines (mw32-ix-slave11 - mw32-ix-slave26).
DNS entries have been made for

w64-ix-slave85
w64-ix-slave85-mgmt
w64-ix-slave86
w64-ix-slave86-mgmt
w64-ix-slave87
w64-ix-slave87-mgmt
w64-ix-slave88
w64-ix-slave88-mgmt
w64-ix-slave89
w64-ix-slave89-mgmt
w64-ix-slave90
w64-ix-slave90-mgmt
w64-ix-slave91
w64-ix-slave91-mgmt
w64-ix-slave92
w64-ix-slave92-mgmt
w64-ix-slave93
w64-ix-slave93-mgmt
w64-ix-slave94
w64-ix-slave94-mgmt
w64-ix-slave95
w64-ix-slave95-mgmt
w64-ix-slave96
w64-ix-slave96-mgmt
w64-ix-slave97
w64-ix-slave97-mgmt
w64-ix-slave98
w64-ix-slave98-mgmt
w64-ix-slave99
w64-ix-slave99-mgmt
w64-ix-slave100
w64-ix-slave100-mgmt
DHCP entries have been made as well.
And now I've added the DHCP and DNS entries for winbuild
The mapping of the old to new hostnames for this batch:

w64-ix-slave85                  mw32-ix-slave11
w64-ix-slave86                  mw32-ix-slave12
w64-ix-slave87                  mw32-ix-slave13
w64-ix-slave88                  mw32-ix-slave14
w64-ix-slave89                  mw32-ix-slave15
w64-ix-slave90                  mw32-ix-slave16
w64-ix-slave91                  mw32-ix-slave17
w64-ix-slave92                  mw32-ix-slave18
w64-ix-slave93                  mw32-ix-slave19
w64-ix-slave94                  mw32-ix-slave20
w64-ix-slave95                  mw32-ix-slave21
w64-ix-slave96                  mw32-ix-slave22
w64-ix-slave97                  mw32-ix-slave23
w64-ix-slave98                  mw32-ix-slave24
w64-ix-slave99                  mw32-ix-slave25
w64-ix-slave100                 mw32-ix-slave26

w64-ix-slave85-mgmt             mw32-ix-slave11-mgmt
w64-ix-slave86-mgmt             mw32-ix-slave12-mgmt
w64-ix-slave87-mgmt             mw32-ix-slave13-mgmt
w64-ix-slave88-mgmt             mw32-ix-slave14-mgmt
w64-ix-slave89-mgmt             mw32-ix-slave15-mgmt
w64-ix-slave90-mgmt             mw32-ix-slave16-mgmt
w64-ix-slave91-mgmt             mw32-ix-slave17-mgmt
w64-ix-slave92-mgmt             mw32-ix-slave18-mgmt
w64-ix-slave93-mgmt             mw32-ix-slave19-mgmt
w64-ix-slave94-mgmt             mw32-ix-slave20-mgmt
w64-ix-slave95-mgmt             mw32-ix-slave21-mgmt
w64-ix-slave96-mgmt             mw32-ix-slave22-mgmt
w64-ix-slave97-mgmt             mw32-ix-slave23-mgmt
w64-ix-slave98-mgmt             mw32-ix-slave24-mgmt
w64-ix-slave99-mgmt             mw32-ix-slave25-mgmt
w64-ix-slave100-mgmt            mw32-ix-slave26-mgmt
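
The table above follows a fixed offset: each new w64 number is the old mw32 number plus 74, and the `-mgmt` suffix carries over unchanged. A minimal sketch of that rule, with a hypothetical function name, might look like this:

```python
OFFSET = 74  # w64 number minus mw32 number, read off the table above


def old_to_new(old_name):
    """Map e.g. 'mw32-ix-slave11' or 'mw32-ix-slave11-mgmt' to its w64 name."""
    prefix = "mw32-ix-slave"
    if not old_name.startswith(prefix):
        raise ValueError(f"not an mw32 host: {old_name}")
    rest = old_name[len(prefix):]
    # Split "11-mgmt" into number "11", separator "-", suffix "mgmt";
    # for a bare "11" the separator and suffix are empty strings.
    number, sep, suffix = rest.partition("-")
    return f"w64-ix-slave{int(number) + OFFSET}{sep}{suffix}"


# Rebuild the primary-interface half of the table for slaves 11..26.
mapping = {f"mw32-ix-slave{n}": old_to_new(f"mw32-ix-slave{n}") for n in range(11, 27)}
for old, new in mapping.items():
    print(f"{new:<32}{old}")
```

This only applies to the mw32 batch; the old-to-new mapping for the scl1 w32-ix machines is not listed in this bug.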
Requesting input on which of the machines in comment #18 to leave "as is". We need to retain 2 from the group of 11-15, 20, 26. Since this work requires unracking the machines, no sense making it harder or riskier than need be.

Optional - if we do work in batches, the "other 2" can be physically upgraded, they just would not be reimaged. Our business need is to keep 10 w32 core builders online.

Please propose a list of 14 boxes to move, and we can coordinate on taking those out of service, so this work can move forward. Thanks.
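
To make the selection constraint concrete, here is a small sketch that enumerates the possible pairs to retain from the group 11-15, 20, 26 and derives the matching list of 14 boxes to move. The names are hypothetical, and the pair passed in at the end is only an example input, not a recommendation.

```python
from itertools import combinations

ALL_MW32 = list(range(11, 27))                    # mw32-ix-slave11..26
RETAIN_CANDIDATES = [11, 12, 13, 14, 15, 20, 26]  # group named in this comment
RETAIN_COUNT = 2


def move_list(retained):
    """Return the hostnames to move, given the slave numbers left as w32."""
    retained = set(retained)
    if len(retained) != RETAIN_COUNT or not retained <= set(RETAIN_CANDIDATES):
        raise ValueError("must retain exactly 2 machines from the candidate group")
    return [f"mw32-ix-slave{n}" for n in ALL_MW32 if n not in retained]


# Every way of picking 2 machines to keep from the 7 candidates.
options = list(combinations(RETAIN_CANDIDATES, RETAIN_COUNT))
example = move_list((11, 12))
print(len(options), len(example))
```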
Depends on: 780022
Can we get:
 a) an update on when the selection will be made
 b) an ETA for the hardware upgrade and reimage

Thanks!
Blocks: 784891
No longer blocks: PGOSilverBullet
No longer depends on: 780022
Hal, I've re-asked your question in the appropriate (hardware) bug 774829.
Hal in response to your items

a) a list has been given to DCOps to verify
b) I will get a time frame from DCOps
774829 has the list of boxes, DCOps is ready to get started on this.  I'll coordinate with Hal
Assignee: mlarrain → arich
All of the slated mw32 machines have been imaged as w64.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Woot! And now they're in production! Thanks!
Can someone review this sql statement to delete these slaves from slavealloc?
delete from slaves where notes like '%bug 774829%';

mysql> select name from slaves where notes like '%bug 774829%';
+-----------------+
| name            |
+-----------------+
| mw32-ix-slave13 |
| mw32-ix-slave14 |
| mw32-ix-slave15 |
| mw32-ix-slave16 |
| mw32-ix-slave17 |
| mw32-ix-slave18 |
| mw32-ix-slave19 |
| mw32-ix-slave20 |
| mw32-ix-slave21 |
| mw32-ix-slave22 |
| mw32-ix-slave23 |
| mw32-ix-slave24 |
| mw32-ix-slave25 |
| mw32-ix-slave26 |
+-----------------+
14 rows in set (0.01 sec)
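
One cautious way to run a delete like this is to do it inside a transaction and commit only if the affected row count matches the 14 rows the select returned. The sketch below illustrates the idea with Python's built-in `sqlite3` module and an in-memory table standing in for the real MySQL slavealloc database. The two-column schema and the sample rows are assumptions made for the example, not the actual slavealloc schema. The statement uses the `DELETE FROM ... WHERE ...` form, which both MySQL and SQLite accept.

```python
import sqlite3

# In-memory stand-in for slavealloc; the schema here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slaves (name TEXT, notes TEXT)")
rows = [(f"mw32-ix-slave{n}", "retasked in bug 774829") for n in range(13, 27)]
rows.append(("w64-ix-slave85", "unrelated"))  # a row that must survive the delete
conn.executemany("INSERT INTO slaves VALUES (?, ?)", rows)
conn.commit()

EXPECTED = 14  # row count reported by the select above
pattern = "%bug 774829%"

# sqlite3 opens a transaction implicitly before the DELETE, so the change
# stays uncommitted until commit() and can still be rolled back.
cur = conn.execute("DELETE FROM slaves WHERE notes LIKE ?", (pattern,))
if cur.rowcount == EXPECTED:
    conn.commit()
else:
    conn.rollback()

remaining = conn.execute("SELECT COUNT(*) FROM slaves").fetchone()[0]
print(remaining)
```

Against MySQL the same pattern would need a transactional storage engine such as InnoDB; with a non-transactional engine a rollback would not undo the delete.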
Armen: those are the ones that got retasked, yes.
(In reply to Amy Rich [:arich] [:arr] from comment #27)
> Armen: those are the ones that got retasked, yes.

I've screwed up mysql statements before and I was hoping for someone to double check my delete statement :)
(In reply to Armen Zambrano G. [:armenzg] from comment #28)
> (In reply to Amy Rich [:arich] [:arr] from comment #27)
> > Armen: those are the ones that got retasked, yes.
> 
> I've screwed up mysql statements before and I was hoping for someone to
> double check my delete statement :)

Never mind. It seems that we're not going to remove decommissioned slaves from slavealloc; instead, the UI will be changed so that it stops showing them.
That sounds kinda silly.  What's the bug for that?
(In reply to Dustin J. Mitchell [:dustin] from comment #30)
> That sounds kinda silly.  What's the bug for that?

I don't know. Perhaps it is not filed.

coop, do we have a bug for hiding decommissioned slaves by default?
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations