Closed Bug 645024 Opened 13 years ago Closed 13 years ago

clone w64-ix-slave[10,12,17,19-24].b.m.o off win64-ix-ref.bm.o

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows Server 2008
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mlarrain)

Attachments

(2 files)

I have set up win64-ix-ref with everything that is needed for it to be in production (including nagios).

The machine to image *from* is win64-ix-ref.b.m.o (10.250.49.229) (physically in 650castro).

I would like slaves w64-ix-slave* 01 through 20 to be imaged from it (these slaves are in SCL).
The first seven were reused for buildbot master and that is tracked in bug 637973.
We can clone slaves 08 through 27 if it makes things easier.

The only thing to note is that I need the 28.83 GB of unallocated space to be added to the E drive. zandr said that this could be done through DeployStudio.

FTR this machine had Win2008 installed in bug 641940 and the instructions followed to set this machine up are https://wiki.mozilla.org/ReferencePlatforms/Win64

NOTE that IPMI access to the machine is not available and that it cannot be reached by hostname, only by IP address (see bug 642893).

FTR this is a Q1 goal which it seems we will manage to hit thanks to the fast response of IT on all my blocking issues. Thank you guys!
Actually, let's do 12-31, since that's the first contiguous block after the machines that have been appropriated or sent out for repair.

I'll make the ref image and then check that the partitions get expanded correctly when deployed. Failing that, we'll figure out how to reclaim that space on the ref machine.
Assignee: server-ops-releng → zandr
Summary: clone w64-ix-slave[01-20].b.m.o off win64-ix-ref.bm.o → clone w64-ix-slave[12-31].b.m.o off win64-ix-ref.bm.o
This WFM! thanks! 
Exciting times :)
I've fixed up the partitioning, that wasn't a big problem.

I'm still trying to figure out how to clone these machines. Open to suggestions, but DeployStudioPC isn't working. I can clone the machine, but when I copy the clone onto a new box, it won't boot.

I can run the windows repair tools, which take hours, suggest that they may have fixed the problem on the first pass, then declare that the bootloader is missing and admit defeat on the second pass.
Not sure if this would help.
Would trying to clone mw64-ix-slave01 just to see if the results are different help?

zandr, let me know if you need me to redo the work or if there are any other ways I can help. I don't have suggestions for DeployStudioPC. I will look in the forums in the morning and see if there are any similar issues.
Blocks: 647287
Sure, I can try cloning mw64-ix-slave01. Will advise.
Blocks: 637973
No longer depends on: 637973
I don't know much about the area of image deployment.

I have found Windows Deployment Services as an alternative (but closed source) and I have posted a question about Windows 2008 support on DeployStudio (in the guide I only read about XP and Win7 support).
OS: Mac OS X → Windows Server 2008
Hardware: x86 → x86_64
What is the latest? Any chance this could get done next week?
Blocks: 650335
Current status - 

DeployStudio & Clonezilla fail to clone these images - the cloned machine fails to boot.  Still working through some way to image these. 

The not-so-pleasant alternative is to manually build each one.
For reference, this is the procedure for building from bare metal:

https://wiki.mozilla.org/ReferencePlatforms/Win64
Thanks for the status update.

What is the exact problem? Do we know if anyone is having similar issues with DeployStudio?

We might want to look into fixing OPSI or setting up puppet on Win64 so we make sure the setup is the same across the board.
Is this from the out-of-the-box image? (rather than win64-ix-ref)
No longer blocks: 637973
Backing up multiple IRC conversations with Armen, various email threads, etc.

Deploystudio does NOT successfully image Win2k8r2/64 machines.

I have not seen any evidence of others' success or failure.

I have had a consultant on deck and have now hired a windows guy (start date TBD).

The next step is to rebuild the reference machine. I had to reinstall the base OS to get the partitions right at setup time. Trying to rearrange them after the fact leads to all sorts of problems.

I have, multiple times, sent word through arr to the Releng meeting, and talked to Armen directly on IRC with the following next step:

Either Armen or I, whoever has time first, needs to rebuild the reference machine.
No longer blocks: 565402, 661010
Depends on: 661010
Depends on: 661566
For clarity, this is now a metabug, all the action is in bug 661010 (figure out how to clone w64) and 661566 (build something to clone)
Status: NEW → ASSIGNED
To make sure bugs matches understanding, I believe there are 3 tasks here: 

(In reply to comment #13)
> For clarity, this is now a metabug, 
bug#645024 is for the work of cloning *onto* w64-ix-slave{12...31}?

> all the action is in bug 661010 (figure
> out how to clone w64) and 661566 (build something to clone)
bug#661566 is done, and the machine to clone *from* is rebuilt (again). 
bug#661010 is pending Matt arriving Monday.
(In reply to comment #14)
> To make sure bugs matches understanding, I believe there are 3 tasks here: 
> 
> (In reply to comment #13)
> > For clarity, this is now a metabug, 
> bug#645024 is for the work of cloning *onto* w64-ix-slave{12...31}?

I think it is unsafe to assume that the specific machines will be 12-31 given the hardware woes we've been working through. We will clone as many w64 machines as we have working as quickly as we can, once we resolve the other two bugs. Trying to predict which ones those are given the iX machine churn is not a good use of time.

> > all the action is in bug 661010 (figure
> > out how to clone w64) and 661566 (build something to clone)
> bug#661566 is done, and the machine to clone *from* is rebuilt (again). 
> bug#661010 is pending Matt arriving Monday.

Correct.
digipengi got me the following cloned slaves (cloned from slave21) to be put onto staging:
w64-ix-slave20
w64-ix-slave21
w64-ix-slave22
w64-ix-slave23
w64-ix-slave24
I checked the slaves and digipengi is figuring out why we have a D drive rather than an E drive. The script is marked to create an E drive.
Bug 647287 will be used to shepherd these machines through staging into production.
They didn't have 2 CD/DVD drives beforehand; I wonder if one of those bumped it out of being the E:\ drive. I am checking now.
Armen, I checked and it looks like the CD/DVD drives had bumped the E:\ drive around for some reason; it shouldn't happen again. I changed the drive letters to the correct alignment. Let me know if there is anything else that needs tending to.
digipengi I don't see the registry changes for auto-login [1] in slaves #21 & #22 (I haven't checked the others).

It is very important for me to determine why the registry change is not there; otherwise every cloned machine would require this manual step.

Do you want to use this bug to debug all issues?

I also see the slaves with different names than expected:
* w64-ix-slave22 named as MININT-F3G0Q56 (10.12.48.132).
* w64-ix-slave23 named as MININT-EP0GDB5 (10.12.48.133). (NOTE it has a D drive)
* w64-ix-slave24 named as MININT-RIEOPE9 (10.12.48.134).

[1] https://bug428123.bugzilla.mozilla.org/attachment.cgi?id=367599
Assignee: zandr → mlarrain
The names thing is my bad when I pushed those I didn't name them so that is an easy fix. 

What is the registry key that should be there? I can go into the WIM image and verify it is there.
(In reply to comment #22)
> The names thing is my bad when I pushed those I didn't name them so that is
> an easy fix. 
> 
> What is the registry key that should be there? I can go into the WIM image
> and verify it is there.

I pasted it on the previous comment [1]. The asterisks get replaced by cltbld's current password. 

I can see that other registry changes have been maintained:
> [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Python\Pythoncore\2.6\InstallPath]
> @="C:\\mozilla-build\\python"

[1] https://bug428123.bugzilla.mozilla.org/attachment.cgi?id=367599
digipengi has adjusted his scripts to deal with the auto-login registry.
While he works on another set of machines I will test this set by manually adjusting the auto-login and setting the right hostnames.
I wrote a command line script that adds the registry key:
 
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"DefaultUserName"="cltbld"
"DefaultPassword"="*****"
"AutoAdminLogon"="1"

I am also working on a script to make sure the drive letters stay consistent regardless of CD/DVD drives being added. By default, Windows assigns drive letters in sequence. Since we don't have a standard D:\ drive, I am using a script to force the builds partition onto the E:\ drive on first deploy.
I have adjusted the auto-login registry on slaves 20 & 21 but it doesn't seem to auto-login.

I have manually logged in to slave #21 to see how it fares (it should only take one job and reboot).

I don't know why they are not auto-logging in even after fixing the registry. I am very confused; I will look in the morning with a fresh mind.
Could it be;

Start Menu -> 'netplwiz' -> Uncheck "Users must enter a username and password", click "OK", Enter Password
(In reply to comment #27)
> Could it be;
> 
> Start Menu -> 'netplwiz' -> Uncheck "Users must enter a username and
> password", click "OK", Enter Password

You are right. This means that either the machine we cloned from was set up differently or we lose that setting when we clone it.

I will investigate why this step is now needed when it wasn't in the ref docs.

We are going to have to deploy another runslave.py since the one we tested was pre-adding the slaves to slavealloc.
digipengi I have the new runslave.py to be deployed.

The instructions to follow are:
- RDP as Administrator
- open cmd.exe
- cd C:\
- rm runslave.py
- C:\mozilla-build\wget\wget.exe -Orunslave.py hg.mozilla.org/build/puppet-manifests/raw-file/1bbb72f7dcea/modules/buildslave/files/runslave.py

or whatever gets C:\runslave.py replaced with the newer one ;-)


Where are we now?
#################
- digipengi fixed his deployment scripts
- I have slaves 20 and 22 connected to staging (21 is being silly)
- I will use slaves 23 and 24 to determine the exact cause of auto-login not working off the bat when I import logon.reg

Things to check for next batch of cloned slaves
###############################################
* Windows is activated
** I don't think I have hit this issue but just as a reminder
* we have the correct E drive
* we have the correct hostname
* we should be able to auto-login
** I believe comment 27 is the solution but I have to determine why this new step is needed and why logon.reg was not good enough
* once the machine is rebooted the path E:\builds\moz2_slave should exist (even if it doesn't connect to buildbot)
** to connect to buildbot I have to enable the slave on slavealloc's web ui

Can you think of anything else I am missing?
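
Two of the checklist items above (correct hostname, correct E drive path) can be sketched as a small Python check. This is only an illustration: the function name and the idea of passing in the observed hostname and directories are assumptions, and the Windows activation and auto-login items are not covered here.

```python
import ntpath

# Path from the checklist; ntpath is used so the sketch behaves the same on any OS.
BUILD_DIR = ntpath.join("E:\\", "builds", "moz2_slave")


def check_cloned_slave(expected_name, hostname, existing_dirs):
    """Return a list of problems found on a freshly cloned slave.

    expected_name: the name the slave should have, e.g. "w64-ix-slave22"
    hostname:      the name the machine actually reports
    existing_dirs: directories observed on the machine (hypothetical input;
                   how they are collected is left to the caller)
    """
    problems = []

    # Un-named clones in this bug came up with MININT-* names.
    if hostname != expected_name:
        problems.append("hostname is %r, expected %r" % (hostname, expected_name))

    # The builds directory must be on E:, not D:.
    if BUILD_DIR not in existing_dirs:
        problems.append("%s does not exist" % BUILD_DIR)

    return problems


if __name__ == "__main__":
    # Example values taken from the slaves listed earlier in this bug.
    print(check_cloned_slave("w64-ix-slave23", "MININT-EP0GDB5",
                             {"D:\\builds\\moz2_slave"}))
```

An empty list means both checks passed; anything else is a message to act on before the slave goes to staging.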
This is what I have done to slave #23.
1) wget -Ologon.reg --no-check-certificate https://bug428123.bugzilla.mozilla.org/attachment.cgi?id=367599
2) and imported logon.reg with the right password [1]
3) rebooted
4) I checked that netplwiz showed the check unchecked
5) After reboot I see that some values have been reset [2]
6) I import logon.reg once more
7) This time I don't see "AutoLogonCount" and I reboot
8) After reboot I can see that cltbld has logged in

In normal circumstances we should be good to go after step #3. We have to create a logon.reg that will ensure that the "AutoLogonCount" value is removed. I have not determined how that value got into the registry.

[1] The only difference in the registry when I compare it with a working slave is that there are two values I had not seen before "AutoLogonCount":0 and DefaultDomainName (which is empty)
[2] After reboot I don't see "DefaultPassword" and "AutoAdminLogon" is back to 0. "AutoLogonCount" is gone and "DefaultUserName" is still cltbld. At this point slave 21 and 23 match except for DefaultDomainName as indicated in [1].
I edited my drive letter script and am now testing it. 
The problem was that the "Sel (select)" command doesn't have a "noerr" parameter, so if there is no existing volume E, then this command will fail and diskpart will not process any further commands in the script.
The diskpart script below works fine.

Diskpart.txt
===========
rem Volume 0 is always the CD/DVD drive.
Sel volume 0
Assign letter=F noerr
Sel disk 0
Sel partition 2
Assign letter=E noerr
I found this:
http://www.msfn.org/board/topic/77903-autologoncount/
"If AutoLogonCount = 0, then WinLogon deletes AutoAdminLogon, AutoLogonCount, and DefaultPassword from the registry. During the next reboot, the user must log on manually."
and perhaps this too:
http://support.microsoft.com/kb/889715
"You cannot automatically log on by using the AutoLogonCount and AdminPassword options after you use Sysprep.exe on Windows Server 2003"

I am not sure I understand this correctly but I believe AutoLogonCount gets added when sysprep is run (not sure if that is used inside of MDT). This means that when AutoLogonCount reaches 0 it removes the DefaultPassword and sets AutoAdminLogon to 0.

This would explain why the slaves are not able to log in automatically.
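
To make the reasoning above concrete, here is a minimal Python sketch that takes the Winlogon values as a plain dict and lists what would stop auto-login. The function name and the dict input are assumptions for illustration; the rules encoded are only the ones observed or quoted in this bug, not a full description of Winlogon.

```python
def autologon_blockers(winlogon):
    """List the reasons the given Winlogon values would not auto-login.

    `winlogon` maps value names to values, for example as copied by hand
    from the Winlogon registry key of a slave.
    """
    blockers = []

    if str(winlogon.get("AutoAdminLogon", "0")) != "1":
        blockers.append("AutoAdminLogon is not 1")
    if not winlogon.get("DefaultUserName"):
        blockers.append("DefaultUserName is missing")
    if not winlogon.get("DefaultPassword"):
        blockers.append("DefaultPassword is missing")

    # Per the msfn.org quote: with a count of 0, Winlogon removes the
    # auto-login values, so the next boot needs a manual logon.
    if "AutoLogonCount" in winlogon and int(winlogon["AutoLogonCount"]) == 0:
        blockers.append("AutoLogonCount is 0; Winlogon will clear auto-login")

    return blockers


if __name__ == "__main__":
    # State seen on slave 23 after the first reboot (see [2] above).
    print(autologon_blockers({"DefaultUserName": "cltbld", "AutoAdminLogon": "0"}))
```

Run against the values from [1], the only blocker reported is the AutoLogonCount one, which matches the behaviour seen after the first reboot.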
I have prepared a patch to deal with this.
Attachment #544857 - Flags: review?(mlarrain)
Where are we now?
#################
- the auto-login and drives issues have been resolved
- slaves 20 & 22-24 are running on staging and are having green builds

What is coming next?
####################
* I can't figure out why slave21's buildbot is not authorized to connect to staging
* digipengi is/will create more slaves with his modified scripts
** I will put such slaves on staging whenever I get them
*** this would require: 1) enable the slave on slave alloc, 2) put staging keys and 3) reboot
Armen, for the reg file, should I deploy it as:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"DefaultUserName"="cltbld"
"DefaultPassword"="*****"
"AutoAdminLogon"="1"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"AutoLogonCount"=-


or

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"DefaultUserName"="cltbld"
"DefaultPassword"="[Actual password for cltbld]"
"AutoAdminLogon"="1"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"AutoLogonCount"=-
Hey how about a vbs script instead? 

http://pastebin.mozilla.org/1267734
And the VBS script saves the day. It is now assigning the drive letters like it should. Now to add Armen's logon.reg file
(In reply to comment #37)
> And the VBS script saves the day. It is now assigning the drive letters like
> it should. Now to add Armen's logon.reg file

Sweet :) I have no experience with VBS but I am all in for saving the day ;)

For logon.reg use it with the actual cltbld password as long as it ends up on a non-public repo.
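
One way to keep the real password out of the checked-in file is to generate logon.reg at deploy time. The Python sketch below is an assumption about how that could look, not what digipengi's scripts do; `render_logon_reg` and the `CLTBLD_PASSWORD` environment variable are hypothetical names. It also writes the key name in square brackets and uses `=-` to delete "AutoLogonCount".

```python
import os

WINLOGON_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"


def reg_escape(value):
    """Escape backslashes and double quotes for a .reg string value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render_logon_reg(username, password):
    """Return the text of a .reg file that enables auto-login for `username`."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % WINLOGON_KEY,  # key names go inside square brackets
        '"DefaultUserName"="%s"' % reg_escape(username),
        '"DefaultPassword"="%s"' % reg_escape(password),
        '"AutoAdminLogon"="1"',
        '"AutoLogonCount"=-',  # "=-" deletes the value on import
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # CLTBLD_PASSWORD is a hypothetical variable; falls back to a placeholder.
    print(render_logon_reg("cltbld", os.environ.get("CLTBLD_PASSWORD", "*****")))
```

The output can be redirected to logon.reg on the target machine and imported as before.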
I have put 4 of these slaves with mw64-ix-slave01 to do production builds as they did well on staging over the weekend.
Updated and saved the logon.reg script; pushing it to 5 more machines, then getting those over to Armen to test/deploy.
I have 4 more machines I am deploying right now *the last of the ones I was given for testing* and then I will ask netops to move them over.
Created Bug 670741 to deal with moving the machines to the right vlan for armen to add them to staging/prod.
Blocks: 670761
digipengi and I are going to use this bug just to clone this first set of 9 machines that he got to get things going.

The remaining win64 slaves will come from bug 670761.

### Status update
* once slaves from bug 670741 are moved to the right vlan I will put them on production
* I hope to declare the first 5 slaves ready tomorrow
* digipengi cloned w64-ix-slave[10,12,17,19]
* digipengi requested NetOps to move those last 4 slaves into vlan48
Summary: clone w64-ix-slave[12-31].b.m.o off win64-ix-ref.bm.o → clone w64-ix-slave[10,12,17,19-24].b.m.o off win64-ix-ref.bm.o
Slaves 10, 12, 17 and 19 don't seem to have had the logon.reg applied.
Do you want to have a look at them and see what happened?
### Status update
* digipengi to check slaves 10,12,17 & 19 to see why new logon.reg did not get applied
* slaves 20,22,23 & 24 are on production taking jobs
* slave21 still needs debugging by moi
* clone remaining win64 slaves in bug 670761 (this depends on figuring out issue #1)
As far as the logs show, it should have run the script. I am removing the /s switch from the script so that when we push another image it will prompt you with its success or failure.
Attached file setup the E drive
Depends on: 671369
(In reply to comment #45)
> ### Status update
> * digipengi to check slaves 10,12,17 & 19 to see why new logon.reg did not
> get applied
> * slaves 20,22,23 & 24 are on production taking jobs
> * slave21 still needs debugging by moi
> * clone remaining win64 slaves in bug 670761 (this depends on figuring out
> issue #1)

The only difference on this update is that issue#1 is being debugged in newly filed bug 671369.
(In reply to comment #48)
> > * slave21 still needs debugging by moi
The weirdest thing ever.

slave-alloc delivers the tac but locally it has the name of the slave in upper case and therefore it gets an unauthorized message.

I will take care of putting it through staging and see how it fares.

I will also have to determine why an upper case name was used locally while the tac had the value lowercase.
(In reply to comment #49)
> (In reply to comment #48)
> > > * slave21 still needs debugging by moi
I figured this out.

The machine was named "W64-IX-SLAVE21" rather than "w64-ix-slave21".

This means that slavealloc was being called with the upper case url rather than the lower case one.
http://slavealloc.build.mozilla.org/gettac/W64-IX-SLAVE21
http://slavealloc.build.mozilla.org/gettac/w64-ix-slave21

I will file a bug to fix this by either making slavealloc always return the lower-case version of the slave name or changing runslave.py to lower-case the name before calling slavealloc. I believe the first one makes more sense.

I will put the slave on staging and see how it fares.
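
For reference, the client-side option (lower-casing the name before calling slavealloc) could look roughly like the Python sketch below. This is not the actual runslave.py code; `gettac_url` is a hypothetical helper, and only the gettac URL shape is taken from this comment.

```python
GETTAC_BASE = "http://slavealloc.build.mozilla.org/gettac/"


def gettac_url(hostname, base=GETTAC_BASE):
    """Build the gettac URL, normalising the slave name to lower case.

    Windows may report the computer name as "W64-IX-SLAVE21"; lower-casing it
    makes the request match the lower-case slave name.
    """
    name = hostname.strip().lower()
    if not name:
        raise ValueError("empty hostname")
    return base + name


if __name__ == "__main__":
    print(gettac_url("W64-IX-SLAVE21"))
```

With this in place both spellings of the machine name produce the same request.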
I have caught another step I forgot in the ref machine.
We have to enable PING.

This can be accomplished by running "netsh firewall set icmpsetting 8".

digipengi said that he will add it to his list.

For the 4 slaves that are in production I ssh'ed into them and ran this command:
> runas /user:Administrator "netsh firewall set icmpsetting 8"
Depends on: 671647
I added the command to the images so it will run inline with a new install.
Hacking in the ability to get logon.reg to run on its own now.
digipengi did you include "netsh firewall set icmpsetting 8" as part of the list of automated steps?

##### status update
* I have moved slaves 12,17 & 19 to staging
** manually adjusted auto-login
** move to production once they are good
* digipengi has recreated slave #10 with logon.reg script changes
* armenzg will put slave #10 on staging once NetOps move the slave to vlan48
* filed bug 671647 to add first 4 slaves into nagios
* once slave#10 is tested good we will be ready to clone the rest of the pool
##### status update
* I have moved slaves 12,17,19 & 21 to *production*
** we now have 8 production slaves in total
* digipengi has resolved the logon.reg issue
* NetOps has moved slave 10 to the correct vlan
* armenzg will put slave 10 through staging
* bug 671647 - add slaves into nagios

Once slave 10 is on production and the nagios bug is closed we will close this bug and move to bug 670761, for which we will most likely have to wait on bug 668521.
digipengi, rebooting slave10 was good enough to get it to connect to staging.
I am moving slave10 to production and will report back.

##### status update
* the initial batch of 9 machines is all in production
* only bug 671647 left to add nagios checks

Once these two are done we will move to bug 670761.
##### status update
* vlan40 is still being worked out on bug 668521
* bug 671647 left to add nagios checks

Once these two issues are done we will move to bug 670761.
Attachment #544857 - Flags: review?(mlarrain)
Let's close this bug as there is no more to be done besides waiting for bug 671647 to be fixed.

We will soon start to see action happening in bug 670761.
Please subscribe to it to see updates.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 671647
No longer depends on: 671647
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations