Closed Bug 1002624 Opened 10 years ago Closed 10 years ago

Setup Mac Mini for testing DeployStudio implementation for qa.scl3.mozilla.com

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: vinh)

Attachments

(1 file)

We need a new Mac Mini set up that we can use to test the DeployStudio installation done in bug 997230. This is only a temporary solution, needed for verification before we apply it to real, existing machines.

It doesn't matter which system is installed; it will be overwritten anyway. The only requirement, AFAIC, is that the machine is located in the QA VLAN.

Thanks
colo-trip: --- → scl3
You could steal https://inventory.mozilla.org/systems/show/4993/ and just switch the VLAN and DNS/DHCP.  Just mark it "temporary" as described in the notes.
I've set up a temp mac mini for your test.

https://inventory.mozilla.org/en-US/systems/show/4788/

qa-deploystudio1.qa.scl3.mozilla.com
10.22.73.46
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Vinh, did deploystudio actually work to do this install?  From irc it sounded like there were problems.
Dustin,
I went ahead and set up a spare mac mini that we have in storage.
Sorry to just be getting back to this.  I tried to do a DS install of this host, and it went away and has not come back - I can't ping it.

From what I can see, it's not on a PDU, so I think I'm helpless.  Can you see what state it's in, and maybe try netbooting it?

Keep in mind it's not a working mac that we need -- it's a working install of DeployStudio.  So if you can give me any info on what might be wrong with the DS server in that VLAN, that'd be a lot more helpful than getting this test mac back online.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
:dustin - I've tried netbooting but getting the following error:

unknown host - The address 'tester1.local' is not valid or the host is unreachable
(In reply to Vinh Hua [:vinh] from comment #6)
> :dustin - I've tried netbooting but getting the following error:
> 
> unknown host - The address 'tester1.local' is not valid or the host is
> unreachable

This is from the netboot image not being configured or set up properly.  I've started building a new image, but it will take a couple of hours to complete.
The image has been rebuilt and set to default (see https://bugzilla.mozilla.org/show_bug.cgi?id=997230#c18).  :vinh, please go ahead and try this again when you have time.  Ping me if you run into problems.  At the very least, it should log in to the DS server, mount the repo, and display the workflow list.
Dustin, I need an OS X host for testing my proxy patch. What is the status here? Can we get this finalized?
Blocks: 997721
Status as I have it is in comment 8.
Flags: needinfo?(vhua)
Last state I know of:

The netboot image worked, but instead of auto-selecting the deploystudio.qa server, it required a manual selection.  It provided a dropdown to select either deploystudio.qa or tester1.local.  My suspicion is that the DS runtime is auto-discovering tester1.local even though I unchecked the Bonjour settings during the netboot creation.  I would suggest disabling any zeroconf services on deploystudio.qa and/or making sure all the names on deploystudio.qa match its fqdn (there might also be another host with the name tester1.qa).  By names, I mean 'computername', 'hostname', and 'fqdn' as reported and set by scutil.

Other than that, nothing should be stopping you from testing the workflows.  It will just require manual intervention on the client side for the time being.
Attached image photo-2.JPG
Henrik - In the meantime, which image do you want me to select?
Flags: needinfo?(vhua)
Well, the purpose of that mac mini is to get deploystudio working, so no sense installing an image on it.

deploystudio1:~ root# scutil --get ComputerName
deploystudio1.qa.scl3.mozilla.com
deploystudio1:~ root# scutil --get LocalHostName
deploystudio1qascl3mozillacom
deploystudio1:~ root# scutil --get HostName
deploystudio1.qa

so I suspect that there's some other system on this VLAN that's running bonjour.  Henrik, do you know what 'tester1' might be?
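The suggestion to compare the names reported by scutil against the server's FQDN could be sketched roughly as follows. This is an illustration only, not something that was run in this bug: `check_name` is a hypothetical helper, and `EXPECTED_FQDN` is an assumption based on the ComputerName shown above. LocalHostName is left out because it apparently cannot contain dots (note the output above), so it would never equal the FQDN.

```shell
#!/bin/sh
# Sketch only: compare the names reported by scutil with an expected FQDN.
# EXPECTED_FQDN is an assumption; adjust it for the host being checked.
EXPECTED_FQDN="deploystudio1.qa.scl3.mozilla.com"

# check_name KEY VALUE EXPECTED (hypothetical helper)
# Prints "KEY: ok" when VALUE equals EXPECTED, otherwise a mismatch line.
check_name() {
    if [ "$2" = "$3" ]; then
        echo "$1: ok"
    else
        echo "$1: mismatch (got '$2', expected '$3')"
    fi
}

# scutil only exists on macOS, so skip the live check elsewhere.
if command -v scutil >/dev/null 2>&1; then
    for key in ComputerName HostName; do
        check_name "$key" "$(scutil --get "$key")" "$EXPECTED_FQDN"
    done
fi
```

With the values pasted above, this kind of check would flag HostName (deploystudio1.qa) as not matching the FQDN, while ComputerName would pass.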
Dustin, not sure I understand. If we want to test that Deployment studio is working, we might have to also test that installing an image on this machine works. Or am I wrong? Maybe we should talk about that tool, given that it's a bit hard to understand for me right now.

Also, I would like to have a Mini I can use for testing puppet. Ideally we would use exactly this mini, so that if something gets screwed up we can easily re-image it.

(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> deploystudio1:~ root# scutil --get ComputerName
> deploystudio1.qa.scl3.mozilla.com
> deploystudio1:~ root# scutil --get LocalHostName
> deploystudio1qascl3mozillacom
> deploystudio1:~ root# scutil --get HostName
> deploystudio1.qa
> 
> so I suspect that there's some other system on this VLAN that's running
> bonjour.  Henrik, do you know what 'tester1' might be?

Where did you get this from? I don't see any reference for 'tester1' in the output above.
I don't want to start using that mini for testing puppet until we're sure deploystudio works.  If deploystudio is requiring user interaction, then it doesn't work yet.  If we install an image and start playing with puppet, then Vinh's just going to reboot it out from under you next time he's in scl3 and tries deploystudio.

As for tester1, yes that was exactly my observation.  I think that means that some *other* host on the VLAN is identifying itself as 'tester1.local' via Bonjour, and causing the deploystudio runtime to prompt.

And for the record, this is pretty much the normal level of annoyance and frustration that deploystudio brings.
Vinh, I talked with Dustin and we agreed that we can share the box. Given that I'm in Europe, I would like to use the box for testing puppetagain. Later in the day you could use it for the final testing of deployserver. Would that work for you? The only rule would be to leave the mini up and running.

So I would suggest that we get OS X 10.7.5 installed on that box.

(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> As for tester1, yes that was exactly my observation.  I think that means
> that some *other* host on the VLAN is identifying itself as 'tester1.local'
> via Bonjour, and causing the deploystudio runtime to prompt.

Where did you get this information from? I cannot find it.
:whimboo - Just so there's no confusion, you want me to install OS X 10.7.5 on the mini?
Yes, via deploystudio if possible. Thanks.
So it turns out that `dig @224.0.0.251 -p 5353` allows you to query mDNS.  I used that to try to find tester1.local, with no luck.  I also used it to do a reverse lookup of every IP in the VLAN, and while it turned up a bunch of minis, none gave the name 'tester1.local'.  I wonder if tester1 is actually the netboot image?
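For reference, the mDNS lookups described above might look something like the sketch below. The exact commands used were not posted, so this is an assumption-laden illustration: it assumes the VLAN is the /24 around 10.22.73.46, `vlan_hosts` is a hypothetical helper, and `RUN_MDNS_SCAN` is a hypothetical guard so the script does not send multicast queries unless asked.

```shell
#!/bin/sh
# vlan_hosts PREFIX (hypothetical helper)
# Prints every host address in a /24, given its first three octets.
vlan_hosts() {
    i=1
    while [ "$i" -le 254 ]; do
        echo "$1.$i"
        i=$((i + 1))
    done
}

# Only query when explicitly requested, since this sends multicast traffic.
if [ "${RUN_MDNS_SCAN:-0}" = "1" ]; then
    # Forward lookup of the mystery name via the mDNS multicast address.
    dig +short @224.0.0.251 -p 5353 tester1.local

    # Reverse lookup of each address in the (assumed) /24.
    for ip in $(vlan_hosts 10.22.73); do
        dig +short +time=1 +tries=1 @224.0.0.251 -p 5353 -x "$ip"
    done
fi
```

Running it with `RUN_MDNS_SCAN=1` from a host inside the VLAN would print whatever names answer; the short timeout keeps silent addresses from stalling the loop.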

As a side-note, deploystudio1 is kicking errors like this every 10s, and has a high load average:

May 28 10:21:16 deploystudio1.qa collabd[240]: [CSConnectionPool.m:196 fa7d000 +9998ms] Could not open a connection to Postgres. Please make sure it is running and has the correct access.
May 28 10:21:16 deploystudio1.qa collabd[240]: [CSXCWorkSchedulerService.m:196 fa7d000 +0ms] Failed to open DB connection, retrying in 10s: [CSDatabaseError] Connection to DB failed

I stopped collabd:

deploystudio1:~ root# serveradmin stop collabd
collabd:state = "STOPPED"

On Jake's advice, I disabled Bonjour on the DS server:

deploystudio1:~ root# launchctl unload -w /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
deploystudio1:~ root# launchctl unload -w /System/Library/LaunchDaemons/com.apple.mDNSResponderHelper.plist
deploystudio1:~ root# ps aux | grep mDNS
root            22029   0.0  0.0  2432784    604 s000  S+   10:27AM   0:00.00 grep mDNS

So, vinh, let's see if DS works now, and if it doesn't, just get 10.7.5 on there for whimboo's work overnight.  Then we can try again in the morning.
DS did not work.  Now I'm at the DS runtime screen with only "http://tester1.local:60080" as the drop down option.  I do not recall what the other path address is to enter manually.  

Should I try to install 10.7.5 via DVD install disc?
(In reply to Vinh Hua [:vinh] from comment #20)
> DS did not work.  Now I'm at the DS runtime screen with only
> "http://tester1.local:60080" as the drop down option.  I do not recall what
> the other path address is to enter manually.  
> 
> Should I try to install 10.7.5 via DVD install disc?

This really makes me believe there is another netinstall(netboot) service running somewhere on qa or dhcp is being fwd'd to one.
(In reply to Vinh Hua [:vinh] from comment #20)
> Should I try to install 10.7.5 via DVD install disc?

If it would be possible, that would help me a lot. But if you think the above problems can be solved before the weekend, you don't have to do it. I will be back on Monday.
So I think we got DS working.  Turning off the NetInstall service showed that the mini was netbooting to the correct server, but I don't think it was getting the correct image.  I moved the old NB images to root to ensure DSR-2001 was the only image being served.  I also had to re-enable Bonjour, since the DS service was having a fit about not being able to publish to it.  According to :vinh, the mini now boots right into the workflow list.

I don't think there is a 10.7.5 image though.  So it might be best to install 10.7.5 with media and then capture an image and add it to a new workflow.
Once that's captured and in a workflow, I can set up the puppety automation on that workflow.
I'm in the process of installing 10.7.5 now.
10.7.5 was installed.  Currently capturing the image, titled "osx-10.7.5_05_29_14".
Great news all! I just want to add that for our purpose we need net boot images for 10.6, 10.7, 10.8, and 10.9. All those systems are still used for testing. But that can be done later. It would be good to see 10.7.5 working first. Thanks.
So for my reference, DS is now working, and this bug has wandered into "capture 10.7.5".  That failed overnight, but vinh will try again.

We haven't really discussed, why not just use the 10.7.2 that we already have an image for?  If there's a good reason for that (enough for vinh and relops to spend precious time on it), please open a new bug and close this one.  If there's not a good reason, then let's just use 10.7.2 (and still close this bug).
The mini has frozen again during the image creation process; twice.
per irc with Jake, looks like the image creation completed.  What's the next course of action?


<vinh> Hey Jake I came back to check on the image and the mini froze again.
14:01 <vinh> So I tried attempt #3 and it says filename already exists.
14:02 <vinh> I'm hoping attempt #2 finished successfully before the mini froze?
14:30 <dividehex> i think #2 succeeded
14:30 <dividehex> i see files for them
10.7.2 would be fine for now. Finally we should have the latest versions for each version of OS X, right?
Perhaps, but we won't get to "finally" by trying to do everything at once :)

So, it sounds like we now have a 10.7.5 snapshot, but it's not yet set up for puppet deployments, nor tested.  However, that means that the mini in question can probably have 10.7.2 installed on it using the existing workflow.

In the DS Admin, I created a "10.7.2 PuppetAgain" group and moved the mini into it.  I then set Automation -> Start workflow automatically for that group to "Restore bld-lion-r5-puppetagain".  Of the words in that name, "lion" and "puppetagain" are the important bits.

I don't seem to be able to connect to the test mini with either VNC or SSH, so I can't netboot it, but at this point I *believe* that it will "just work" if Vinh netboots it.  "Just work" should mean that it comes up, puppetizes, and can be logged into as root.

I've verified the deploypass is correct and checked the inventory/DNS/DHCP info for the host.  Henrik, please add a node definition for it ASAP, or it won't successfully run puppet (in which case you'll need to use the kickstart root password if you need to get in).

And then let's close this long, winding bug :)
I've netbooted the mini but the workflow did not start automatically.  After manually selecting "restore bld-lion-r5-puppetagain", RAID creation failed because the mini only has one 500gb hard drive.  I am swamped with the SCL1 move today so will not be able to install the 2nd hard drive.
OK, definitely don't add a new HDD.  What is the configuration of the other minis in this VLAN?  We can make a new workflow for their hardware configuration.
Ok, I have added this host to the qa nodes in qa.scl3.mozilla.com:
https://hg.mozilla.org/qa/puppet/rev/948a0d168128

(In reply to Dustin J. Mitchell [:dustin] from comment #34)
> OK, definitely don't add a new HDD.  What is the configuration of the other
> minis in this VLAN?  We can make a new workflow for their hardware
> configuration.

None of the Minis are using RAID. All have a single HDD included. And there is actually no need for a RAID system.
We have to make some progress here given that I cannot test anything related to our puppetagain configuration without any node being available. So what can we do in the short term here?
Flags: needinfo?(vhua)
Flags: needinfo?(jwatkins)
I believe jwatkins and dustin can best answer, unless you need another mini racked and configured I can jump on it.
Flags: needinfo?(vhua)
Vinh, if we can get this machine set up so that it is reachable and has 10.7.5 installed, that would be totally fine. We can surely delay the testing of deploystudio.
:whimboo - The mini is now running 10.9.3
Dustin, what would I have to do to prepare the mini for puppet? I installed puppet and facter. But when I try to run the agent it reports an error because "this master is not a CA". What does it mean, and how can I get it working? I would like to test the latest changes for our QA org in Puppetagain. Thanks.
Well, I may have to add this node to the qa config in my environment. Will test this in a bit.
Actually this host is already part of the qa nodes manifest file. Could it be that we have problems with the certificate, given that the machine has been reinstalled?
Sorry, but I accidentally shut down this box. Now I'm not able to re-connect to it. Vinh, can you please bring it back online? Thanks.
Flags: needinfo?(jwatkins)
Flags: needinfo?(vhua)
Mini should be on now.
Flags: needinfo?(vhua)
Thanks Vinh! So I did a run of puppetize.sh on that box, which seemed to be successful. But the puppet command right after failed again:

$ sudo ./puppetize.sh 
Password:
Contacting puppet server puppet
deploypass: 
23 Jun 15:14:45 ntpdate[288]: no server suitable for synchronization found
Certificate request for qa-deploystudio1.qa.scl3.mozilla.com
Certificates are ready; run puppet now.
qa-deploystudio1:~ mozauto$ sudo puppet agent --test
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
Exiting; failed to retrieve certificate and waitforcert is disabled

Dustin, do you have an idea what's missing here?
The default ssldir is wrong on OS X, so you need to pass --ssldir=/var/lib/puppet/ssl.  That's in puppetize.sh if you want to just copy/paste it.
Damn. Now that you are saying that! I can remember. Could we add this to the following wiki page?

https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Re-issue_Certificates_for_a_host

Or something better? You may have a suggestion here.
Sure, feel free, but note that it's only required on the first install.  Once puppet runs, it updates puppet.conf to contain the correct path, and everything is happy.
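As a rough sketch of the first agent run described above: the only thing taken from this bug is the `--ssldir` value; `first_puppet_run` and the `DRY_RUN` switch are hypothetical names added for illustration, and the real invocation lives in puppetize.sh.

```shell
#!/bin/sh
# Sketch only. SSLDIR is the value quoted in this bug; DRY_RUN is a
# hypothetical switch that prints the command instead of running it.
SSLDIR=/var/lib/puppet/ssl

# first_puppet_run (hypothetical helper)
# Runs the agent once with an explicit ssldir, as needed on a fresh install.
first_puppet_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "puppet agent --test --ssldir=$SSLDIR"
    else
        puppet agent --test --ssldir="$SSLDIR"
    fi
}

# Usage, as root on the freshly puppetized host:
#   first_puppet_run
```

Per the comment above, later runs should not need the flag, because the first successful run updates puppet.conf with the correct path.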
Ok, I got all sorted out with the help from Dustin on IRC. I have added all the necessary hiera secrets for root and the builder user. Puppet agent works now and can successfully initiate the system with the current state of QA slaves.

Moving bug over to QA infrastructure for continued testing of puppet and then the deployserver.
Assignee: server-ops-dcops → nobody
Status: REOPENED → ASSIGNED
Component: Server Operations: DCOps → Infrastructure
Product: mozilla.org → Mozilla QA
QA Contact: dmoore
Version: other → unspecified
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> Sure, feel free, but note that it's only required on the first install. 
> Once puppet runs, it updates puppet.conf to contain the correct path, and
> everything is happy.

Right. So actually I updated the following page under 'puppetize.sh' to mention that option for the first puppet agent call.

https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Puppetization_Process#puppetize.sh
Depends on: 1029614
I created a new workflow named "Restore bld-r5-lion-puppetagain without raid" which should work on a non-RAID host like qa-puppetagain1.  However, I didn't try imaging the host.  When you get a chance, try netbooting it again (bless --netboot --server bsdp://10.22.73.45; reboot) and let's see what happens.
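The netboot command quoted above could be wrapped as in the following sketch. The `bless` invocation and the server address come from this bug; `netboot_from` and the `DRY_RUN` switch are hypothetical names, added so the command can be previewed before it is actually run.

```shell
#!/bin/sh
# netboot_from SERVER_IP (hypothetical helper)
# Points the next boot at a BSDP netboot server; DRY_RUN=1 only prints
# the command that would be run.
netboot_from() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bless --netboot --server bsdp://$1"
    else
        bless --netboot --server "bsdp://$1"
    fi
}

# Usage on the client, as root:
#   netboot_from 10.22.73.45 && reboot
```

Keeping the reboot outside the helper makes it harder to reboot a shared test machine by accident.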
Thanks Dustin. I may not do this in the next few days, given that I want to finish the usual puppet flow first. That is more important for us, and I don't want to risk losing this testing Mac. It's too valuable for me at the moment.
Cannot actually test all the bits until proxies are set. Moving dependency to blocking.
No longer blocks: 997721
Depends on: 997721
Depends on: 1008880, 1008879
(In reply to Dustin J. Mitchell [:dustin] from comment #51)
> I created a new workflow named "Restore bld-r5-lion-puppetagain without
> raid" which should work on a non-RAID host like qa-puppetagain1.  However, I
> didn't try imaging the host.  When you get a chance, try netbooting it again
> (bless --netboot --server bsdp://10.22.73.45; reboot) and let's see what
> happens.

Henrik -- up to you.
Ok, I triggered such an imaging process with the command given above. Let's see how it works, and if the box comes back.
The box hasn't come back so far. Dustin or Vinh, can one of you please have a look? I would need this box tomorrow for testing Flash. Thanks.
You can remote into the box that's being installed from the DeployStudio console.  Generally the error will be apparent.

However, VNC's not working for me (either with Chicken, which rejects the password, or Screen Sharing, which gives me a black screen), so I can't do this.  I assume you've adjusted VNC settings based on our irc conversation, so I'll leave the remoting to you.
Sorry, but I only changed the screen resolution on that host. Sadly you will have to reboot to allow VNC to work. Now the screen is visible.

Sadly I cannot find any way to control that host, and there are no logs available for it either. Maybe Vinh should have a look at the hardware itself, and reboot if possible.
Flags: needinfo?(vhua)
The mini is getting "cannot find a valid disk to partition" error during netboot.
Flags: needinfo?(vhua)
I think that's something for Dustin then. Looks like the partition setup is still not correct. Dustin, just out of interest, where is this code all located?
It's not code, it's clicky-pointy stuff.  It's in the workflow defined in the DeployStudio admin console.

Since it's getting an error about partitioning, then the client node is booting to the DS netboot image.  So I'm not sure why it ended up at a gray screen (which was how vinh found it, and I'm guessing how it looks now after he re-tried a netboot).  It didn't boot back to the original OS, either.

So, we need to change the target disk for the partitioning step in the workflow.  Your guess is as good as mine as to what it should be.  IIRC, there's an option that just selects the "first" disk, which would probably work.  Try making that change?  I'm on my Linux laptop today and can't connect to the deploystudio box.  Once that's done, we'll need another manual touch in scl3 to try netbooting it again.
(In reply to Dustin J. Mitchell [:dustin] from comment #61)
> work.  Try making that change?  I'm on my Linux laptop today and can't
> connect to the deploystudio box.  Once that's done, we'll need another
> manual touch in scl3 to try netbooting it again.

With Remmina you can perfectly connect to this box. I would have to deep dive into all that first, which would take a while to get confident with. :/ Maybe you can have a try?
Wait. Is that the Restore a master on a volume step? If yes, there was user selection active for the target volume! So this might have been the reason. I changed it now to first volume as you said.

So Vinh has to reboot the machine again?
Maybe -- I think it was from the partition step that failed, but you seem to have found a smoking gun.

You and I are both at about the same level of deep-diving on deploystudio.. and I don't know what Remmina is, but vinagre didn't work.

So yes, with that change, let's try a new netboot.
(In reply to Dustin J. Mitchell [:dustin] from comment #64)
> You and I are both at about the same level of deep-diving on deploystudio..
> and I don't know what Remmina is, but vinagre didn't work.

FYI: http://sourceforge.net/projects/remmina/

> So yes, with that change, let's try a new netboot.

Vinh, can you please try again? It would be great to get the Mini back for testing. :)
Flags: needinfo?(vhua)
Still seeing the same valid disk partition error.
Flags: needinfo?(vhua)
per IRC, vinh booted this back to the on-disk OS, and it should be usable in that state in the interim.  We'll have to figure out what's wrong with the DS workflow next week.
Yeah would be good. Please be aware that I will be away from 07/19 to 08/03.
Something went bad with this mini again. It is not reachable via screen sharing nor SSH. Vinh, can you please reboot it?
Flags: needinfo?(vhua)
The host was hung so I had to reboot it.  However it is auto netbooting into Deploy Studio.
Flags: needinfo?(vhua)
(In reply to Vinh Hua [:vinh] from comment #70)
> The host was hung so I had to reboot it.  However it is auto netbooting into
> Deploy Studio.

Auto netbooting is an indication that the previous Deploy Studio workflow did not finish.  One of the last things the finalize.sh step of the workflow does is bless the disk to boot.
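For illustration, blessing the local disk by hand might look like the sketch below. This is not the content of finalize.sh, which is not shown in this bug; `bless_local_disk` and `DRY_RUN` are hypothetical names, and the default volume path is an assumption that would need to match the actual volume name on the mini.

```shell
#!/bin/sh
# bless_local_disk [VOLUME] (hypothetical helper)
# Makes the given mounted volume the startup disk. The default path is an
# assumption; DRY_RUN=1 only prints the command that would be run.
bless_local_disk() {
    volume="${1:-/Volumes/Macintosh HD}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bless --mount \"$volume\" --setBoot"
    else
        bless --mount "$volume" --setBoot
    fi
}

# Usage from the netbooted environment, as root:
#   bless_local_disk "/Volumes/Macintosh HD" && reboot
```

If this works as hoped, the machine should boot from its own disk on the next restart instead of returning to the netboot image.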
This host is still not working, Vinh. I cannot connect via ssh. Just to be clear, I'm talking about 10.22.73.46. I need it up, given that we want to upgrade the 10.7 box in mozmill-ci to 10.10 soon.
Flags: needinfo?(vhua)
:dustin - Would you be able to help out with what Jake suggested in comment 71?
Flags: needinfo?(vhua)
I can't get the mouse to work via VNC, so I can't even login, let alone look at the workflows.  We know what the issue is (comment 66), just not how to fix it.  I assume that's why it's still blessed for the netboot.

I suspect that just blessing the disk would fix the problem with repeated netboots.  I'm not sure how best to do that, but if there's a "Startup Items" control panel item in the deploystudio runtime, that would do the trick.

By the way, 10.10 doesn't work with DeployStudio yet -- Jake's working on that in another bug.
Ok, so please wait. Let's recap what this bug was for. The original request was to set up a testing machine for deploystudio. Meanwhile it has diverged into a near catch-all for deploystudio issues. I would say we close this bug, given that the test mini exists, and follow up with anything else on bug 997230, which is about deploystudio itself.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → vhua
Product: Mozilla QA → Mozilla QA Graveyard