Closed Bug 1253088 Opened 9 years ago Closed 9 years ago

b-2008 loan process doesn't complete

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: nthomas, Assigned: grenade)

References

Details

(Whiteboard: [windows][aws])

Attachments

(2 files)

I've tried this a couple of times and it seems to hang:

On aws-manager2, per https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_2008_machines:

$ python cloud-tools/scripts/aws_create_instance.py -c cloud-tools/configs/b-2008 -r us-east-1 -s aws-releng --loaned-to nthomas@mozilla.com --bug 1251112 -k secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i cloud-tools/instance_data/us-east-1.instance_data_prod.json b-2008-ec2-nthomas
2016-03-02 00:05:55,587 - INFO - Sanity checking DNS entries...
2016-03-02 00:05:55,589 - INFO - Checking name conflicts for b-2008-ec2-nthomas
2016-03-02 00:06:34,920 - INFO - waiting for workers
2016-03-02 00:06:35,107 - INFO - Using IP 10.134.52.38
2016-03-02 00:06:35,323 - INFO - subnet subnet-2ba98340
2016-03-02 00:06:59,869 - INFO - instance Instance:i-7ab5bce2 created, waiting to come up
2016-03-02 00:07:29,186 - INFO - assimilating Instance:i-7ab5bce2
2016-03-02 00:07:29,404 - INFO - waiting for instance to shut down

I can connect to the instance (at b-2008-ec2-nthomas.build.mozilla.org aka 10.134.52.38) with VNC, and it appears to be idle; up since 3/2/2016 1:45:48am Pacific.

Mark, could you investigate? What I really want is a production-spec slave which still has runner present and starts up buildbot on boot. I would swap out the ssh keys for the staging set, and maybe some credentials files, and connect to a staging master (via slavealloc), but otherwise it would be production with runner etc. Should I be creating the instance differently?

This is blocking release testing ahead of 46.0b1 kicking off next week, and adds a little risk to the release promotion project because we won't know whether Windows builds are working OK if we need to fall back to the old system.
Flags: needinfo?(mcornmesser)
:nthomas Last week, when we tried to loan you a b-2008-ec2 slave, we noticed that the script waits for the instance to be stopped ("INFO - waiting for instance to shut down"), but the user-data script doesn't include that step, the shutdown command, at the end, so the process gets stuck. I spoke with grenade and he said he will add the shutdown command at the end of the user-data script so that it works.

Another thing we saw is that when the slave is puppetized, the password for the cltbld/root user is generated randomly; it is not a static password like on the Linux slaves or the Windows slaves in the DC. As buildduty, when we loan a slave out we will never know the new password for the cltbld/root user. I suggest using the password from loanerou.txt.gpg.
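For reference, a minimal sketch of what that fix might look like, assuming the user-data ends in a PowerShell script; the commands and messages below are illustrative, not the actual cloud-tools change:

```powershell
# Hypothetical tail of the Windows user-data script: once the loaner setup has
# finished, power the instance off so that aws_create_instance.py's
# "waiting for instance to shut down" step can complete instead of hanging.
Write-Output 'user-data complete, shutting down so assimilation can finish'

# /s = shut down (not reboot), /t 0 = no delay, /f = force-close applications
& shutdown.exe /s /t 0 /f
```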
(In reply to Nick Thomas [:nthomas] from comment #0)
> Mark, could you investigate? What I really want is a production-spec slave
> which still has runner present and starts up buildbot on boot. I would swap
> out the ssh keys for the staging set, and maybe some credentials files, and
> connect to a staging master (via slavealloc), but otherwise it would be
> production with runner etc. Should I be creating the instance differently?

In that case a general loan will not work. There is a process to spin up an ec2 slave which will work, but I am not sure which cloud-tools script does that. In the interests of time, I will spin up a builder this morning and drop the info in this bug.
Flags: needinfo?(mcornmesser)
(In reply to Vlad Ciobancai [:vladC] from comment #1)
> I spoke with grenade and he said he will add the shutdown command at the
> end of the user-data script so that it works.
> ...
> As buildduty, when we loan a slave out we will never know the new password
> for the cltbld/root user. I suggest using the password from loanerou.txt.gpg.

Grenade: ^^
Flags: needinfo?(rthijssen)
(In reply to Mark Cornmesser [:markco] from comment #2)
> In that case a general loan will not work. There is a process to spin up an
> ec2 slave which will work, but I am not sure which cloud-tools script does
> that. In the interests of time, I will spin up a builder this morning and
> drop the info in this bug.

These are the steps: https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_2008_machines
vladC: ty, I will bookmark that for future reference.

nthomas: b-2008-ec2-0001 / 10.134.54.123 is consistent with the standard slave spec, including runner and buildbot. Currently it has all the standard production passwords and keys. It also already has an entry in slavealloc.
(In reply to Mark Cornmesser [:markco] from comment #5)
> nthomas: b-2008-ec2-0001 / 10.134.54.123 is consistent with the standard
> slave spec, including runner and buildbot. Currently it has all the standard
> production passwords and keys. It also already has an entry in slavealloc.

From what I understood from grenade, if an instance is created as b-2008-ec2-<number>, the secrets and passwords remain untouched. If a slave is created as b-2008-ec2-<name>, the secrets are removed and the passwords are changed. The buildduty team also verifies this before the slave is loaned to anybody.
markco, thanks for setting up the instance. DNS is incorrect but I can connect directly using the IP of 10.134.54.123. Unfortunately the instance thinks it has computer name b-2008-spot-031, and signs onto buildbot with that. If I change it in Windows to b-2008-ec2-0001, then it is put back to b-2008-spot-031 after the required reboot. So not usable just at the moment. I've left b-2008-spot-031 disabled in slavealloc for now.

Vlad, I've kind of mixed up two issues here. Probably I should have continued in bug 1251112 asking for a production-like slave, and this could just be about the missing shutdown after userdata. Too late now.
grenade helped me a bunch on IRC. We ended up modifying the instance user data to change the definition of [string] $hostname = 'b-2008-ec2-0001', then started it back up. This fixes up the hostname. It turns out that ssh keys aren't managed in this situation (grenade says this is a bug), but it suits me just fine. He also spun up b-2008-ec2-0002, which I may use too.
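For anyone hitting the same hostname problem, a rough sketch of the kind of user-data edit involved; only the $hostname assignment reflects what was actually changed, the rename logic around it is assumed:

```powershell
# Hypothetical fragment of the instance user-data; only the $hostname
# assignment reflects the change that was actually made.
[string] $hostname = 'b-2008-ec2-0001'   # was resolving to the spot-style name

# If Windows still reports the old computer name, rename it via WMI (works on
# the stock PowerShell of Server 2008) and reboot so that buildbot signs on as
# the loaner rather than as b-2008-spot-031.
if ($env:COMPUTERNAME -ne $hostname) {
    $null = (Get-WmiObject -Class Win32_ComputerSystem).Rename($hostname)
    Restart-Computer -Force
}
```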
Today we received another loan request, bug 1254017, and I created a windows-2008-ec2 instance. As I mentioned in comment #1, the script from https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_2008_machines waits for a shutdown, but the slave never shuts itself down. The password is also generated randomly rather than being a static password. Kmoir, can you please look over this issue and talk with the team?
Flags: needinfo?(kmoir)
I'll take this bug as I know how to fix the scripting issue with waiting for a shutdown. Also, if Vlad or someone can provide me with an email address to mail auto-generated passwords to (and a public key to encrypt them with), I can make sure that those responsible for loans get an email with the new credentials for the loaner.
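To make the intent concrete, a rough sketch of what that hand-off could look like; the addresses, file names and mail relay below are hypothetical placeholders, not an implemented mechanism:

```powershell
# Illustrative only: generate a loaner password, encrypt it with the
# requester's public key and mail it to whoever handles the loan.
Add-Type -AssemblyName System.Web
$password = [System.Web.Security.Membership]::GeneratePassword(16, 4)

# Import the requester's key (e.g. the vlad.ciobancai.pub attachment) and
# encrypt the password so only the requester can read it (assumes gpg is on PATH).
& gpg --import 'vlad.ciobancai.pub'
$password | & gpg --batch --trust-model always --armor --encrypt `
    --recipient 'requester@example.com' --output 'loaner-password.asc'

# Mail the encrypted blob to buildduty.
Send-MailMessage -From 'relops@example.com' -To 'buildduty@example.com' `
    -Subject 'loaner credentials' -Body 'Encrypted loaner password attached.' `
    -Attachments 'loaner-password.asc' -SmtpServer 'smtp.example.com'
```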
Assignee: relops → rthijssen
Flags: needinfo?(vlad.ciobancai)
Flags: needinfo?(rthijssen)
Flags: needinfo?(kmoir)
Cancelling my needinfo since Rob will fix the issues.
Flags: needinfo?(vlad.ciobancai)
(In reply to Rob Thijssen (:grenade - GMT) from comment #10)
> Also, if Vlad or someone can provide me with an email address to
> mail auto-generated passwords to (and a public key to encrypt them with), I
> can make sure that those responsible for loans get an email with the new
> credentials for the loaner.

I'm sorry, but why are we trying to complicate the entire process instead of using the same password as in the DC or on the rest of the Linux slaves?
(In reply to Vlad Ciobancai [:vladC] from comment #12)
> I'm sorry, but why are we trying to complicate the entire process instead of
> using the same password as in the DC or on the rest of the Linux slaves?

Because the shared password mechanism is insecure. Everyone who has ever taken a loaner can log into anyone else's loaner using those shared credentials. We don't want to use shared passwords on any platform; Windows loaners are just one of the first to be corrected.
Flags: needinfo?(vlad.ciobancai)
Attached file vlad.ciobancai.pub
Flags: needinfo?(vlad.ciobancai)
(In reply to Rob Thijssen (:grenade - GMT) from comment #13)
> Because the shared password mechanism is insecure. Everyone who has ever
> taken a loaner can log into anyone else's loaner using those shared
> credentials. We don't want to use shared passwords on any platform; Windows
> loaners are just one of the first to be corrected.

Understood now, thank you. I have attached my public key and also the public key for Alin.
Another issue we ran into: when the instance is created successfully, the VNC connection does not work (the reason we found was that the password is not set up). To enable VNC we set the password via c:\program files\uvnc bvba\UltraVnc\uvnc_settings.exe; after we set the password and rebooted the instance, VNC worked.
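Not a fix for the automation side, but once the password has been set in the GUI, restarting the UltraVNC service may be enough to pick it up without a full reboot; a sketch, with the default UltraVNC service name assumed:

```powershell
# After setting the VNC password with uvnc_settings.exe, bounce the UltraVNC
# service so it rereads ultravnc.ini. "uvnc_service" is the UltraVNC default
# service name and is an assumption here.
Restart-Service -Name 'uvnc_service'
Get-Service -Name 'uvnc_service' | Format-List Name, Status
```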
:grenade, I have added some suggestions in the comments above; when you have time, can you please look over them?
Flags: needinfo?(rthijssen)
Hi Vlad. I actually spent a fair bit of time trying to solve the VNC issue. The problem is that the VNC password encryption mechanism is proprietary, and the command-line tool for encrypting the passwords doesn't actually work. The GUI does, but we don't have access to that in automation. I spent some time trying to reverse engineer the tool to work out how to create a working encryption tool whose passwords VNC would be able to decrypt, but didn't reach a solution. Since the days of these buildbot slaves are quite limited, and the usefulness of such a tool would be short-lived, I didn't want to spend more time on it.
Flags: needinfo?(rthijssen)
Whiteboard: [windows][aws]
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE