Closed Bug 670761 Opened 13 years ago Closed 13 years ago

Clone remaining win64 slaves

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows Server 2008
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mlarrain)

Attachments

(4 files)

We are going to use bug 645024 as the bug to get things done with the first 9 slaves.

Once we are done with the first 9 slaves, we would like to clone all the other win64 machines whenever feasible.
Assignee: server-ops-releng → mlarrain
Waiting for the new vlan to go into effect before the rest of the machines are imaged using the new WDS/MDT deploy method.
Depends on: 668521
We are going to have to switch from VS2008 to VS2010, since win32 is moving forward to VS2010.

We should get this in before cloning more machines. I have filed bug 672799 for myself.

digipengi: I haven't asked this before; what thoughts/plans do you have on how to deploy changes post-cloning? FTR on Win32 we have OPSI (which I failed to set up for win64).
For this bug to start happening:
"digipengi: [...] we need to get WDS1 and DC1 working correctly and then get some new builds pushed out into vlan40 and turned into production machines then we will cycle over what is in production now"
From today at 11:11AM:
"digipengi: armenzg, just trying to clean things up with the new vlan and then I will be able to image new builders"
I need to get 1 or 2 machines into vlan40 for testing to verify the new system is built out properly.
Opened bug 674932 to get two machines moved over.
Depends on: 674932
So my plan for this week:
 - move all non-production w64 builders over to winbuild
 - start imaging 'em

There are complications, of course (I think the big one will be getting the netops DNS delegating properly to the windows DNS, which is gated on setting up a second DC, which is something Matt will have to do - that will block putting these into production, but not imaging).

Based on this page:
 http://www.clintmcguire.com/2011/01/26/add-dhcp-reservations-with-a-script/
and the netops DHCP configs, I added a reservation for every IPMI and host MAC in the fleet -- with the exception of win64-ix-ref, which we can deal with later.
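A scripted reservation pass like the one described above could be sketched roughly as follows. This is only an illustration, not the script that was actually run: the server name `\\dhcp1`, the scope `10.12.40.0`, and the MAC address are placeholders, and the `netsh dhcp server ... add reservedip` form is assumed from the approach in the linked page.

```python
def normalize_mac(mac):
    """Strip separators so the MAC is 12 bare hex digits."""
    digits = "".join(ch for ch in mac if ch not in ":-.").lower()
    if len(digits) != 12 or any(ch not in "0123456789abcdef" for ch in digits):
        raise ValueError("not a MAC address: %r" % (mac,))
    return digits


def reservation_command(server, scope, ip, mac, name):
    """Build one netsh command line that reserves `ip` for `mac`."""
    return "netsh dhcp server {0} scope {1} add reservedip {2} {3} {4}".format(
        server, scope, ip, normalize_mac(mac), name)


if __name__ == "__main__":
    # Placeholder data: the server name, scope, and MAC are assumptions.
    hosts = [("10.12.40.22", "00:11:22:33:44:55", "w64-ix-slave02")]
    for ip, mac, name in hosts:
        print(reservation_command(r"\\dhcp1", "10.12.40.0", ip, mac, name))
```

Printing the commands rather than running them leaves room to review the list against the netops DHCP configs before anything is applied.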

Next up, I need to assign static winbuild IPs to all of the IPMI interfaces (lots of clicking) and get zandr to validate a few of the switchports (tonight), then reconfig the switches to move those ports to the winbuild VLAN.
I also dialed the leases back to 4 hours, just in case we need to move them around.  It was previously set to 8 days, which may be a bit long.  The setting should match infra's lease duration.
All of the non-production hosts are now moved over to vlan 40.  Most still have static IPs for their IPMI interfaces, but I'll fix that (DHCP will give them the same address anyway).

I built an entirely new scope that contains both ipmi and host macs - it looks like scopes are tied to subnets.  When I brought up an IPMI interface with the two scopes (both active), it ignored its reservation in the ipmi scope, and got a dynamic address in the host scope.  Things seem to work better with one scope.

It looks like win64-ix-ref and w64-ix-slave02 got mixed up, too.  I'll straighten that out.

I'm going to adjust the ops DNS to match the new reality, although I haven't done so yet.
Assignee: mlarrain → dustin
Ops DNS and DHCP/w64/jjjj changes are done.  Note that the ops DNS is currently authoritative for winbuild.scl1 and the 10.12.{40,41} zones.

I edited the DNS/DHCP for the w64-ix-slaveNN that are in production, too, by mistake.  That's back to where it should be now (for slaves 10,12,17,19-24).
I used commands like

 dnscmd /recordadd winbuild.scl1.mozilla.com w64-ix-slave02 /createptr A 10.12.40.22

to update the windows DNS, even though it's not delegated to yet.
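For a whole batch of hosts, the same `dnscmd` line can be generated from a host-to-IP table. The sketch below only builds command strings in the format shown above; the helper name `recordadd_command` is made up for illustration, and only the w64-ix-slave02 address comes from this bug.

```python
import ipaddress

ZONE = "winbuild.scl1.mozilla.com"


def recordadd_command(host, ip, zone=ZONE):
    """Return a dnscmd line adding an A record (and PTR) for host."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return "dnscmd /recordadd {0} {1} /createptr A {2}".format(zone, host, ip)


if __name__ == "__main__":
    # Only this one mapping is taken from the comment above.
    records = {"w64-ix-slave02": "10.12.40.22"}
    for host, ip in sorted(records.items()):
        print(recordadd_command(host, ip))
```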
The full list of w64-ix-slaveNN yet to come up is:

w64-ix-slave02
w64-ix-slave06
w64-ix-slave07
w64-ix-slave08
w64-ix-slave09
w64-ix-slave11
w64-ix-slave13
w64-ix-slave14
w64-ix-slave15
w64-ix-slave16
w64-ix-slave18
w64-ix-slave25
w64-ix-slave26
w64-ix-slave27
w64-ix-slave28
w64-ix-slave29
w64-ix-slave30
w64-ix-slave31
w64-ix-slave32
w64-ix-slave33
w64-ix-slave34
w64-ix-slave35
w64-ix-slave36
w64-ix-slave37
w64-ix-slave38
w64-ix-slave39
w64-ix-slave40
w64-ix-slave41
w64-ix-slave42

Aki, can you get those into slavealloc, buildbot-configs, and graphs now (probably as staging, in which case you might as well enable them), so that they can get started as they come up?  06 and 09 are already ready to go if you want to reboot to check everything out.

I'll post a list over the weekend of the fully-imaged machines, before handing this back to matt on Monday.
Hm, I don't see any ix boxes in graphs.
addCodesighsSteps might just be posting codesighs based on builder name.

http://hg.mozilla.org/build/buildbotcustom/file/c4217bc408ab/process/factory.py#l1474
Attachment #551179 - Flags: review?(bear) → review+
When trying to add the slaves to slavealloc,

Traceback (most recent call last):
  File "/tools/slavealloc-4b555ecdedcf/lib/python2.6/site-packages/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/tools/slavealloc-4b555ecdedcf/lib/python2.6/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/tools/slavealloc-4b555ecdedcf/lib/python2.6/site-packages/twisted/internet/base.py", line 426, in _continueFiring
    callable(*args, **kwargs)
  File "/tools/slavealloc-4b555ecdedcf/src/tools/lib/python/slavealloc/scripts/main.py", line 44, in do_command
    d = defer.maybeDeferred(func, args)
--- <exception caught here> ---
  File "/tools/slavealloc-4b555ecdedcf/lib/python2.6/site-packages/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/tools/slavealloc-4b555ecdedcf/src/tools/lib/python/slavealloc/scripts/dbimport.py", line 113, in main
    for row in masters ])
exceptions.KeyError: 'nickname'
(In reply to Aki Sasaki [:aki] from comment #16)
> exceptions.KeyError: 'nickname'

This is because the CSV file doesn't have a 'nickname' column.
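A quick pre-flight check on the CSV header would surface this kind of mismatch before the import runs. This is a hedged sketch, not slavealloc's own code: the function name and the sample column names other than 'nickname' are hypothetical.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [col for col in required if col not in header]


if __name__ == "__main__":
    # Hypothetical slave data fed to an importer that expects master data.
    sample = io.StringIO("name,pool\nw64-ix-slave06,staging\n")
    missing = missing_columns(sample, ["nickname"])
    if missing:
        print("CSV is missing columns: " + ", ".join(missing))
```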
OK, the following are imaged and ready to roll:

w64-ix-slave06
w64-ix-slave08
w64-ix-slave09
w64-ix-slave11
w64-ix-slave13
w64-ix-slave14
w64-ix-slave15
w64-ix-slave16
w64-ix-slave18
w64-ix-slave25
w64-ix-slave26
w64-ix-slave27
w64-ix-slave29

I have a few more running, but I'm done with this (fun!) project for the day.  Matt, catch me to debrief on where I got to and what still needs to be done.
Releng: when these are all either imaged or have a hardware-failure bug, we'll make a new bring-these-machines-up bug.  In the interim, pre-flighting the configs as Aki has done, and moving these into staging, is a great way to shave off a few days.
Assignee: dustin → mlarrain
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> (In reply to Aki Sasaki [:aki] from comment #16)
> > exceptions.KeyError: 'nickname'
> 
> This is because the CSV file doesn't have a 'nickname' column.

There is no mention of a nickname column here:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Add_a_slave_to_slavealloc

There is no nickname here:
http://slavealloc.build.mozilla.org/ui/#slaves

What is the nickname supposed to be?
(In reply to Aki Sasaki [:aki] from comment #20)
> (In reply to Dustin J. Mitchell [:dustin] from comment #17)
> > (In reply to Aki Sasaki [:aki] from comment #16)
> > > exceptions.KeyError: 'nickname'
> > 
> > This is because the CSV file doesn't have a 'nickname' column.
> 
> There is no mention of a nickname column here:
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Add_a_slave_to_slavealloc
> 
> There is no nickname here:
> http://slavealloc.build.mozilla.org/ui/#slaves
> 
> What is the nickname supposed to be?

n/m, user error.
--slave-data rather than --master-data.  That's done.

What master do we test these on?
Put them in the dev/pp pool, and they'll select one of the preprod masters.
Comment on attachment 551493 [details] [diff] [review]
[checked-in]  add win64-ix boxen to preprod

Looks good to me.
Attachment #551493 - Flags: review+
Comment on attachment 551493 [details] [diff] [review]
[checked-in]  add win64-ix boxen to preprod

http://hg.mozilla.org/build/buildbot-configs/rev/0da5787c01ae
Attachment #551493 - Flags: review?(catlee)
(In reply to Dustin J. Mitchell [:dustin] from comment #18)
> OK, the following are imaged and ready to roll:
> 
> w64-ix-slave06
> w64-ix-slave08
> w64-ix-slave09
> w64-ix-slave11
> w64-ix-slave13
> w64-ix-slave14
> w64-ix-slave15
> w64-ix-slave16
> w64-ix-slave18
> w64-ix-slave25
> w64-ix-slave26
> w64-ix-slave27
> w64-ix-slave29
> 
These slaves are connected now to staging-master:8040 and taking jobs without human interaction besides adding them to slave alloc and rebooting them.

The builds should fail to upload but that is alright. I will move them tomorrow to production if they are doing well.

Let me know when you have the remaining ones and I will put them through staging.
according to my list math, these are still pending:

w64-ix-slave02
w64-ix-slave07
w64-ix-slave28
w64-ix-slave30
w64-ix-slave31
w64-ix-slave32
w64-ix-slave33
w64-ix-slave34
w64-ix-slave35
w64-ix-slave36
w64-ix-slave37
w64-ix-slave38
w64-ix-slave39
w64-ix-slave40
w64-ix-slave41
w64-ix-slave42
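The list math can be reproduced as a set difference between the "yet to come up" list and the "imaged and ready" list from the earlier comments. A minimal sketch, with the host numbers transcribed from those two lists:

```python
def slave_names(numbers):
    """Expand slave numbers into w64-ix-slaveNN hostnames."""
    return ["w64-ix-slave%02d" % n for n in numbers]


# Hosts listed as yet to come up.
to_come_up = slave_names(
    [2, 6, 7, 8, 9, 11, 13, 14, 15, 16, 18] + list(range(25, 43)))

# Hosts reported as imaged and ready to roll.
imaged = slave_names([6, 8, 9, 11, 13, 14, 15, 16, 18, 25, 26, 27, 29])

# Zero-padded names sort correctly as plain strings.
pending = sorted(set(to_come_up) - set(imaged))

if __name__ == "__main__":
    print("\n".join(pending))
```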
The most up-to-date list is:

Imaged & ready:
w64-ix-slave06
w64-ix-slave08
w64-ix-slave09
w64-ix-slave11
w64-ix-slave13
w64-ix-slave14
w64-ix-slave15
w64-ix-slave16
w64-ix-slave18
w64-ix-slave25
w64-ix-slave26
w64-ix-slave27
w64-ix-slave29
w64-ix-slave30
w64-ix-slave31
w64-ix-slave32
w64-ix-slave33

Pending:
w64-ix-slave34
w64-ix-slave35
w64-ix-slave36
w64-ix-slave37
w64-ix-slave38
w64-ix-slave39
w64-ix-slave40
w64-ix-slave41
w64-ix-slave42

Need matt's TLC:
w64-ix-slave02
w64-ix-slave07
w64-ix-slave28

In production, need to move to the winbuild VLAN:
w64-ix-slave10
w64-ix-slave12
w64-ix-slave17
w64-ix-slave19
w64-ix-slave20
w64-ix-slave21
w64-ix-slave22
w64-ix-slave23
w64-ix-slave24

I'm working on the pending list, but it's veeeeeery slow going.
Blocks: 671647
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #26)
> (In reply to Dustin J. Mitchell [:dustin] from comment #18)
> > OK, the following are imaged and ready to roll:
> > 
> > w64-ix-slave06
> > w64-ix-slave08
> > w64-ix-slave09
> > w64-ix-slave11
> > w64-ix-slave13
> > w64-ix-slave14
> > w64-ix-slave15
> > w64-ix-slave16
> > w64-ix-slave18
> > w64-ix-slave25
> > w64-ix-slave26
> > w64-ix-slave27
> > w64-ix-slave29
> > 
> These slaves are connected now to staging-master:8040 and taking jobs
> without human interaction besides adding them to slave alloc and rebooting
> them.
> 
> The builds should fail to upload but that is alright. I will move them
> tomorrow to production if they are doing well.
> 
> Let me know when you have the remaining ones and I will put them through
> staging.

80-90% of the jobs taken by these slaves failed due to lost connections.
I can't put these slaves in production until we figure out why.
I think that got solved in bug 677992.  We'll know for sure in a few hours.
After a lot of pain inflicted by myself I managed to get all slaves connected to dev-master01 (I had been trying to switch them from staging-master to the new masters).

I noticed that slave w64-ix-slave29 was named w64-ix-slave19 by mistake. I fixed it and rebooted.

After they finish their compilation step they should fail on the upload step. They should not time out now that dustin has resolved bug 677992.

I will report back in a few hours.
All the slaves failed at make buildsymbols (except one that failed at compilation).
They all failed with a lost connection around 59 minutes into the step.
I am very confused, as dustin's fix should have done the trick with regard to the dropped connections.
I wonder if it is the master that is not behaving.

I rebooted all the slaves (as some had not been able to connect back) and triggered more jobs.

I have also noticed weird times in twistd.log on some of them but not sure how it is related:
2011-08-11 21:07:22-0700 
^ several hours into the future
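One way to check a slave for this kind of clock skew is to parse the twistd.log timestamp, including its UTC offset, and compare it with a trusted clock. The sketch below is an assumption-laden illustration: the reference time and the five-minute tolerance are made up, and only the timestamp format comes from the log line above.

```python
from datetime import datetime, timedelta, timezone


def parse_twistd_timestamp(text):
    """Parse a timestamp like '2011-08-11 21:07:22-0700'."""
    return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S%z")


def is_in_future(stamp, now, tolerance=timedelta(minutes=5)):
    """True if stamp is ahead of now by more than the tolerance."""
    return stamp - now > tolerance


if __name__ == "__main__":
    stamp = parse_twistd_timestamp("2011-08-11 21:07:22-0700")
    # Hypothetical trusted time; in practice, read it from a reliable host.
    pacific = timezone(timedelta(hours=-7))
    now = datetime(2011, 8, 11, 17, 0, 0, tzinfo=pacific)
    print(is_in_future(stamp, now))
```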
I set up a manual connection with 'telnet' and 'nc', and looking at it I see:

dustin@fw1a.scl1> show security flow session source-prefix 10.12.40.47/32 destination-prefix 10.12.49.129 
node0:
--------------------------------------------------------------------------

Session ID: 23851, Policy name: trust-to-trust/8, State: Active, Timeout: 1700, Valid
  In: 10.12.40.47/50469 --> 10.12.49.129/9999;tcp, If: reth1.40, Pkts: 3, Bytes: 144
  Out: 10.12.49.129/9999 --> 10.12.40.47/50469;tcp, If: reth1.48, Pkts: 2, Bytes: 92
Total sessions: 1

so sure enough, the policy is not being applied.  I think I see the problem there (trust-to-trust) - I'll confer with netops to see what the fix is.
Status update
#############
* (relops/netops) once bug 677992 is solved we can move the batch of 13 slaves armenzg has on staging to production masters
* (relops) imaging of 8 slaves from "Pending" section in comment 28
* (relops) TLC for 3 slaves in comment 28
* (armenzg) put slaves w64-ix-slave[30..33] on staging
* (armenzg+dustin) disable the production slaves currently in vlan48, move them to vlan40, and re-enable
I finished imaging the rest of the machines. There are two that are bad and need to go back to iX. w64-ix-slave02 and w64-ix-slave41.
It seems that I am not having the timeouts anymore. Thanks Dustin!
I will be moving the first batch of 13 slaves to production.

I will put on staging the slaves that digipengi cloned.

Out of the slaves that need TLC I know this:
* w64-ix-slave02 needs to be sent to iX - bug#?
* w64-ix-slave07 - status?
* w64-ix-slave28 - status?
* w64-ix-slave41 needs to be sent to iX - bug#?

(In reply to Matthew Larrain[:digipengi] from comment #37)
> I finished imaging the rest of the machines. There are two that are bad and
> need to go back to iX. w64-ix-slave02 and w64-ix-slave41.
Do you have a bug to keep track of them or spreadsheet?

BTW what is the status of slaves 01,03,04 and 05? I don't see them mentioned anywhere.

Status update
#############
* [DONE] (relops/netops) once bug 677992 is solved we can move the batch of 13 slaves armenzg has on staging to production masters
* [DONE-c37] (relops) imaging of 8 slaves from "Pending" section in comment 28
* (relops) TLC for 4 slaves in comment 38
** 2 of them need to go to iX
* (armenzg) put slaves w64-ix-slave[30..33] on staging
** wip
* (armenzg+dustin) disable the production slaves currently in vlan48, move them to vlan40, and re-enable
** once we have the other slaves on production
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #38)

> BTW what is the status of slaves 01,03,04 and 05? I don't see them mentioned
> anywhere.

They were stolen at various times to become things like buildbot-master2 and scl-production-puppet.

As we replace those with more appropriate hardware for services with large failure domains like those, we'll return them to the pool.
zandr thanks for clarifying that. I had lost track of it.

I believe that 4 slaves have not been re-imaged like the other ones as they have the auto-logon issue:
* w64-ix-slave34
* w64-ix-slave36
* w64-ix-slave40
* w64-ix-slave42

digipengi (or whoever can) could you please have a look at them and see what makes them different?
I will take a look at it right now.
Spending part of the day to review the logs from C:\MININT\SMSOSD\OSDLOGS to see why the autologin didn't work.
dustin, all of the jobs that I triggered today failed at the compilation step (rather than "make buildsymbols"). I have triggered a few more to check tomorrow.

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

Perhaps unrelated but I have also noticed that the clobberer step fails (not when I use mw64-ix-slave01 which is not on the winbuild network):
> E:
> cd builds\moz2_slave\m-cen-w64\tools\clobberer
> python clobberer.py -s tools -t 168 http://build.mozilla.org/stage-clobberer/index.php mozilla-central "WINNT 6.1 x86-64 mozilla-central build" m-cen-w64 w64-ix-slave27 http://dev-master01.build.scl1.mozilla.com:8040/
...
Checking clobber URL: http://build.mozilla.org/stage-clobberer/index.php?master=http%3A%2F%2Fdev-master01.build.scl1.mozilla.com%3A8040%2F&slave=w64-ix-slave27&builddir=m-cen-w64&branch=mozilla-central&buildername=WINNT+6.1+x86-64+mozilla-central+build
Error contacting server
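To test reachability from a slave without running a full build, the clobber URL can be rebuilt by hand. The sketch below is not clobberer.py's own code; the parameter names and their order are inferred from the logged URL above, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def clobber_check_url(base, master, slave, builddir, branch, buildername):
    """Rebuild the clobber-check URL in the order seen in the log."""
    query = urlencode([
        ("master", master),
        ("slave", slave),
        ("builddir", builddir),
        ("branch", branch),
        ("buildername", buildername),
    ])
    return base + "?" + query


if __name__ == "__main__":
    print(clobber_check_url(
        "http://build.mozilla.org/stage-clobberer/index.php",
        "http://dev-master01.build.scl1.mozilla.com:8040/",
        "w64-ix-slave27",
        "m-cen-w64",
        "mozilla-central",
        "WINNT 6.1 x86-64 mozilla-central build",
    ))
```

Fetching the printed URL from a host on the winbuild network and from one outside it would show whether the failure depends on the network.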
The clobberer problem is probably due to ACL's on build.m.o.  That should be fixed now.

I don't know what to tell you about the compile problems.  Was there active log traffic up until the end?
Pushing a single image to w64-ix-slave42 to see if network drops are causing the poor imaging. Will report back soon.
As pointed out by Dustin in IRC it could also be an issue with the KVM IO bandwidth. Will need to work with bkero to review IO performance as well as keep an eye on network traffic.
digipengi, slaves 40 & 42 seem to have been reimaged this time. I have put them on staging and they have taken jobs. I will report back.

Did you reimage slaves 34 & 36 again? They seem to still need the 2nd reimaging.

(In reply to Dustin J. Mitchell [:dustin] from comment #44)
> 
> I don't know what to tell you about the compile problems.  Was there active
> log traffic up until the end?
I believe there was low log activity but the problem might now be gone. I will let you know today.

Status update
#############
* [DONE] (armenzg) put slaves w64-ix-slave[30..42] on staging
* [DONE] (dustin) fixed ACL issue with stage-clobberer
* (armenzg) move slaves from staging to production
** landed patches; waiting for slaves to do green runs to move them
* (relops) TLC for 4 slaves in comment 38
** 2 of them need to go to iX - TODO we need a bug to track it
** 2 of them need status update - TODO get status update
* (relops) 4 of the slaves did not get reimaged properly (probably bandwidth issues on KVM IO)
** slaves 40 & 42 have been reimaged and put successfully on staging
** slaves 34 & 36 need to be reimaged
* (armenzg+dustin) disable the production slaves currently in vlan48, move them to vlan40, and re-enable
** not to be done yet. once we have the other slaves in production
Yeah I tried to reimage them last night but IPMI was acting up. I am doing it right now if IPMI plays nicely.
I have put a bunch of the slaves into production (just left a few on staging that did not convince me yet).
w64-ix-slave{13,15,18,25,26,27,29,30,31,32,33,35,37,38,39,40,42}
I will report back tomorrow to see how they do.
NOTE: no try & fixing symbols
Slaves 34 and 36 are now good to go, that is the last of the w64 machines I had to get ready to go into staging/production.
(In reply to Dustin J. Mitchell [:dustin] from comment #28)
> In production, need to move to the winbuild VLAN:
> w64-ix-slave10
> w64-ix-slave12
> w64-ix-slave17
> w64-ix-slave19
> w64-ix-slave20
> w64-ix-slave21
> w64-ix-slave22
> w64-ix-slave23
> w64-ix-slave24
> 

I have disabled these slaves on slavealloc, taken a note, gracefully shut down these slaves and rebooted them once.
dustin feel free to move these slaves at any given time.
Attachment #553563 - Flags: review?(nrthomas)
moving them now.
Status update
#############
* (armenzg) move slaves from staging to production
** w64-ix-slaves[07,11,20,28,33] are left to be moved to production (I will get to it)
* (relops) w64-ix-slaves[02,41]  need to go to iX - bug 673972
* (relops) some slaves did not get reimaged properly (probably bandwidth issues on KVM IO)
** [DONE] slaves 34, 36, 40 & 42 have been reimaged and are now on production
** w64-ix-slaves[07,28] need to be reimaged due to failed reimage (KVM IO suspect)
* (relops) move production slaves currently in vlan48 to winbuild
** armenzg has disabled them in comment 53
** armenzg to put them back in production once moved
* (armenzg) enable win64 in all branches
Depends on: 673972
per meeting with IT yesterday (and followup with armen this morning):

* all available machines imaged as of yesterday. Thanks digipengi.
* now up to armen to finish setup.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #55)

> * (relops) some slaves did not get reimaged properly (probably bandwidth
> issues on KVM IO)
> ** w64-ix-slaves[07,28] need to be reimaged due to failed reimage (KVM IO
> suspect)

I've moved wds01 off to one of the newly rebuilt kvm nodes, so we should see better performance there.
OK
* the remaining vlan48 hosts are moved to the proper IPs
* inventory (oob IP and switch port) is updated for all w64-ix-slaveNN
* netops' DNS and DHCP are updated with these changes
Now that everything has been moved this bug can be closed.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 553563 [details] [diff] [review]
enable win64 for all branches plus fix symbols (do not enable for try)

I bet philor would say we should just do bug 679809 and enable everywhere in one go, so this is r+ if you promise to resolve bug 679809 quickly.
Attachment #553563 - Flags: review?(nrthomas) → review+
(In reply to Nick Thomas [:nthomas] from comment #60)
> Comment on attachment 553563 [details] [diff] [review]
> enable win64 for all branches plus fix symbols (do not enable for try)
> 
> I bet philor would say we should just do bug 679809 and enable everywhere in
> one go, so this is r+ if you promise to resolve bug 679809 quickly.

I do. It is my highest priority but I want the win64 build machines to be exercising while I look into it (I want to see what is required to enable it on Try beyond _just_ enabling it). Thanks Nick.
Comment on attachment 553563 [details] [diff] [review]
enable win64 for all branches plus fix symbols (do not enable for try)

Checked-in on default as:
http://hg.mozilla.org/build/buildbot-configs/rev/463457bad080
Attachment #553449 - Attachment description: add first batch of win64 slaves to production → [checked-in] add first batch of win64 slaves to production
Attachment #551493 - Attachment description: add win64-ix boxen to preprod → [checked-in] add win64-ix boxen to preprod
Attachment #551179 - Attachment description: add win64-ix boxen to staging → [checked-in] add win64-ix boxen to staging
Status update
#############
* (armenzg) move slaves from staging to production
** w64-ix-slaves[06,14] are left to be moved to production (I will get to it)
* (relops) w64-ix-slaves[07,28] need to be reimaged
* [DONE] (relops) move production slaves currently in vlan48 to winbuild
** [DONE] armenzg to put them back in production once moved
* [DONE] (armenzg) enable win64 in all branches

I am moving any status updates to bug 558448.

From my side my main focus right now will be to add Try support in bug 679809.
Fixed slave07 and 28. Also was able to switch from multicast to unicast to increase speed of deployments as well as possibly fixing the issue with pushing multiple images.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations