Closed
Bug 969261
Opened 10 years ago
Closed 10 years ago
Create Linux VM for hosting Mozmill-CI test reports based on CouchDB
Categories
(Infrastructure & Operations :: Virtualization, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: whimboo, Assigned: ericz)
Details
(Whiteboard: [qa-automation-wanted])
Attachments
(1 file)
95.46 KB,
image/jpeg
Right now we are using iriscouch.com for hosting the Mozmill test results produced via mozmill-ci. Sadly, it has turned out a couple of times already that this service is not stable enough, and it went down during critical times. The same thing happened yesterday during update tests for beta and releasetest. See bug 969052.

Given that we can never fully rely on an external service, I request that we bring the couch databases back in house. Therefore we need a simple Linux server with about 100GB of data storage. It can be a minimal Linux installation (not Ubuntu); not sure what IT recommends in such cases. The only applications which need to be installed are CouchDB (http://couchdb.apache.org/) and probably nginx, so that we can offer the service via port 80.

We will have multiple dashboards hosted on this server, so we would also need multiple CNAME entries to access them via virtual hosts. This DNS request might have to go into a separate bug later, but I wanted to mention it now. Also, the host has to be accessible from the internet, so a firewall exception has to be added.
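For the nginx part of the request, a minimal reverse-proxy vhost along the following lines would expose CouchDB (default port 5984) on port 80. This is only a sketch; the server_name value is a placeholder for whatever CNAME gets set up in the later DNS bug:

```nginx
# /etc/nginx/conf.d/mozmill-dashboard.conf (sketch; hostname is a placeholder)
server {
    listen 80;
    server_name dashboard.example.mozilla.com;  # real CNAME to be requested separately

    location / {
        proxy_pass http://127.0.0.1:5984;       # CouchDB listening locally
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

One such server block per dashboard CNAME would cover the multiple-virtual-hosts requirement.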
Reporter | ||
Comment 1•10 years ago
Two months without any response here :(. Who can actually create such a virtual machine?
Severity: normal → major
Flags: needinfo?(dparsons)
Comment 2•10 years ago
I'm sorry, I have no idea how this slipped through our fingers. I can get you a RHEL6 VM with whatever minimal specs you think are appropriate. (2 cpu, 2G RAM?) 100G is not a problem. Do you want that as part of the / or a separate partition to be named? Also, do you have a VLAN/hostname in mind? CJK
Assignee: server-ops-virtualization → cknowles
Flags: needinfo?(dparsons)
Reporter | ||
Comment 3•10 years ago
(In reply to Chris Knowles [:cknowles] from comment #2)
> I can get you a RHEL6 VM with whatever minimal specs you think are

Yes, please use a system which is fully supported by IT/RelOps. I assume it will end up being a puppetized host? I ask because we are currently working on that for our CI infrastructure in scl3.

> appropriate. (2 cpu, 2G RAM?) 100G is not a problem. Do you want that as

Better to use 4GB of RAM, and we can keep a single core. Otherwise 100GB sounds fine.

> part of the / or a separate partition to be named?

I think it would be best to have it as a separate partition, ideally with LVM?

> Also, do you have a VLAN/hostname in mind?

Oh right, sorry that I forgot about that. I think we should use db1.qa.scl3.mozilla.com.
Comment 4•10 years ago
So, the main specs sound good. The VM templates are not currently LVM, so instead of "partition" I should have asked: you asked for 100G of data storage - would you like that as part of /, or in a separate mount point? CJK
Reporter | ||
Comment 5•10 years ago
Let's get this added as /data, so we keep the data separate.
Comment 6•10 years ago
Alright, that VM is created:

db1.qa.scl3.mozilla.com
RHEL6_x86-64, 1 CPU core, 4G RAM
Default disk for / and 100G in /data

Initial puppetization and nagios have been added, and it should be ready for any loving customization - let me know if you need further help.
Comment 7•10 years ago
So, as far as enterprise Linux distros go, PuppetAgain supports CentOS-6.{2,5}, while infra puppet supports RHEL6. The two have different install processes, too. It looks like this is set up for infra puppet now. If that's not appropriate, speak now :)
Reporter | ||
Comment 8•10 years ago
Hm, I didn't know that we have different puppet systems! So given the info from Dustin, I think we should really change that, so we use PuppetAgain here too. I don't see why a single VM out of our >50 VMs should be based on Infra puppet while all the others get managed via PuppetAgain. CentOS should be perfectly fine; AFAIR we already ran CouchDB on CentOS a while ago. Sorry Chris.
Comment 9•10 years ago
I believe that the PuppetAgain CentOS kickstart process is documented (by gcox). The CentOS kickstart process seems to produce aligned VMs (whereas Ubuntu does not), but this is probably worth verifying after creation.
Comment 10•10 years ago
Alright, I'll give it a shot at puppetagain... dustin, if any problems occur, I'll be knocking on your door. :) My confusion here was that the IT/Relops note gave me a conflicting signal that led me towards IT, rather than relops. I'll bring down the db1 here in a moment... CJK
Comment 11•10 years ago
Alright, spun up a puppetagain VM, and it seemed successful - however, when either Henry or I try to log in, it doesn't accept our keys and asks for a password, and even for root it doesn't take any of the usual suspects (from an IT side). Can you let me know what I've done wrong?

Also, I was unable to find CentOS-specific puppetagain docs. There are the Ubuntu ones that I wrote with your help, but if there's something special to CentOS, I'm unable to find it. Any help is appreciated.
Comment 12•10 years ago
CentOS and Ubuntu processes differ only in which PXE menu selection to make. As for what's wrong, there was a problem with certificates, and now a syntax error in the manifests, both handled in bug 1008872.
Reporter | ||
Comment 13•10 years ago
I fixed the syntax error, which I introduced yesterday. :/ So we should try again.
Comment 14•10 years ago
Great. It works now, I am able to login to puppetmaster
Comment 15•10 years ago
Alright, I've tried spinning up that VM again, and I'm still prompted for a password when trying to ssh to it as myself or root, and none of the usual suspects that I have works for it. Is there some manifest that the server needs to be added to, or is there still a problem somewhere? Any assistance at this point is appreciated. CJK
Comment 16•10 years ago
I'll put the PuppetAgain kickstart password in the sysadmins GPG file, but you're correct that there's no node definition in place, and the node won't get the usual set of SSH keys until that's corrected. Andrei should be able to take it from here (by adding the node definition).
Comment 17•10 years ago
I'm pretty sure the password (I've been using the QA deploy password in the PUPPET_PASS=xxx section) is already in passwords.

The machine is up and running; if the node were added to the node definition, would it just "pick things up", or should I respin once the node definition is added?

Andrei - let me know when the node def is added. I still have to do some tweaking, like the data volume and updated ESX tools, before I can release this VM. CJK
Comment 18•10 years ago
(In reply to Chris Knowles [:cknowles] from comment #17)
> I'm pretty sure the password (I've been using the QA deploy password in the
> PUPPET_PASS=xxx section) is already in passwords.

The default root password is different from the deploy password.

> The machine is up and running; if the node were added to the node
> definition, would it just "pick things up", or should I respin once the
> node definition is added?

It will just pick things up. It runs puppet every 10m.

> Andrei - let me know when the node def is added. I still have to do
> some tweaking, like the data volume and updated ESX tools, before I can
> release this VM.

I'll put the kickstart password in GPG now, and ping you.
Reporter | ||
Comment 19•10 years ago
(In reply to Andrei Eftimie from comment #14)
> Great. It works now, I am able to login to puppetmaster

Btw, the comment from Andrei was not meant for this bug but for bug 1008872. :)
Comment 20•10 years ago
Alright, given the update from Dustin - I've logged into the VM with the root password and set up the VMware tools as well as the extra volume at /data. At this point, I'm not sure if you need anything further from me. Let me know.
Reporter | ||
Comment 21•10 years ago
Dustin, would we be blocked on anything related to puppet here? Do we have support for proxies in Mint? I don't think so. So the fix on bug 997721 would also be necessary here?
Comment 22•10 years ago
Mint?? We don't support Mint at all..
Reporter | ||
Comment 23•10 years ago
Oops, CentOS is what I meant. That's what has been installed above, right?
Comment 24•10 years ago
That sounds much better. And yes, proxies work on CentOS already.
Reporter | ||
Comment 25•10 years ago
Dustin, I tried to log into that machine but it failed. I'm using the same key config as for any other host in qa.scl3.mozilla.com. So maybe the last puppet run was not successful?
Comment 26•10 years ago
Well, that's odd -- it looks like it had CentOS installed, but not puppetagain. I'll figure out what happened with cknowles.
Comment 27•10 years ago
Discussed with Dustin - did the PXE boot, puppetagain on ESX, CentOS install, using the QA deployment puppet_pass. Per Dustin, that should have worked. Per IRC, Dustin's going to try it again and update on what I need to change in my process, or what else went wrong. CJK
Comment 28•10 years ago
Well, that was not what I expected. For unknown reasons, it's trying to use mirrorlist.centos.org:

dustin@cerf ~ $ host mirrorlist.centos.org
mirrorlist.centos.org has address 64.235.47.134
mirrorlist.centos.org has address 72.232.223.58
mirrorlist.centos.org has address 204.15.73.243
mirrorlist.centos.org has IPv6 address 2a02:2498:1:3d:5054:ff:fed3:e91a

For reasons I can't explain, it chooses the v6 address even though v6 isn't configured on this network (I think..). Then it fails to connect and doesn't fall back to the other three v4 addresses. It hangs indefinitely in this state. So I suspect that the VM was rebooted out of this state somehow, and came up in the state we discovered this morning.
Comment 29•10 years ago
Adding further mystery, the kickstart profile says

# blow away any existing repositories and add our own
# (note that this assumes that the repo server resolves as 'repos')
mkdir -p /etc/yum.repos.d
rm -f /etc/yum.repos.d/*

yet:

[root@db1 yum.repos.d]# ls -al
total 32
drwxr-xr-x.  2 root root 4096 May 20 07:25 .
drwxr-xr-x. 76 root root 4096 May 20 07:35 ..
-rw-r--r--.  1 root root 1926 Nov 30 16:07 CentOS-Base.repo
-rw-r--r--.  1 root root  638 Nov 30 16:07 CentOS-Debuginfo.repo
-rw-r--r--.  1 root root  630 Nov 30 16:07 CentOS-Media.repo
-rw-r--r--.  1 root root 4528 Nov 30 16:07 CentOS-Vault.repo
-rw-r--r--.  1 root root  577 May 20 07:25 init.repo

where those CentOS-* files are the ones containing "mirrorlist.centos.org". They are coming from the "yum upgrade -y" that occurs before the attempt to install puppet, as confirmed by looking in /var/log/yum.log after an aborted kickstart. Access to mirrorlist.centos.org might work fine for the releng networks, since outgoing HTTP is allowed. Or maybe centos.org just added IPv6. Dunno. I've added a line to blow away the repos at that time, and re-started the kickstart.
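The extra line described at the end would slot into the %post ordering roughly like this (a sketch of the ordering only, not the actual kickstart profile):

```
%post
# first cleanup, as already present in the profile
mkdir -p /etc/yum.repos.d
rm -f /etc/yum.repos.d/*
# ... write the internal repo file, then upgrade ...
yum upgrade -y
# the upgrade can reinstall the stock CentOS-*.repo files (pointing at
# mirrorlist.centos.org), so blow them away again before installing puppet
rm -f /etc/yum.repos.d/CentOS-*.repo
%end
```

The key point is that the cleanup has to happen after the upgrade, since the centos-release package reinstalls the mirrorlist-based repo files.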
Comment 30•10 years ago
OK, that worked up to the point where it runs puppet. There's no node definition in place, so the puppet runs are failing. Once you fix that, it should puppetize right up.
Comment 31•10 years ago
(I determined that it was at this point by seeing the CentOS startup screen on the console, then SSH'ing in as root with the kickstart password and seeing:

[root@db1 ~]# cat puppetize.log
Contacting puppet server puppet
20 May 08:02:18 ntpdate[4995]: no server suitable for synchronization found
Certificate request for db1.qa.scl3.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
)

so puppetize is still running. It's trying every 10 minutes, and as soon as it has a successful run, it will set everything up.
Reporter | ||
Comment 32•10 years ago
Problem here is that the node cannot be found. So I added it for now as toplevel::base, until we have custom manifests done. https://hg.mozilla.org/build/puppet/rev/79775af720be I will wait for the next puppet run, and close this bug as fixed if it worked.
Status: NEW → ASSIGNED
Assignee | ||
Comment 33•10 years ago
This box is in Infra Nagios, but we can't log in to it because it is a PuppetAgain box. Should we perhaps move it to Releng Nagios? It doesn't do us much good here.
Reporter | ||
Comment 34•10 years ago
I think that would be a good thing to do, but I will wait for what Dustin thinks. Given that PuppetAgain is under RelEng control, we might also want to use their Nagios system.
Reporter | ||
Comment 35•10 years ago
Eric, you mentioned nagios alerts on IRC. Can you please tell us which specific issues you are seeing? Thanks.
Comment 36•10 years ago
No, releng is not maintaining QA's systems, so the alerts would not be useful there, either. The options are:

* Don't use monitoring
* Use infra nagios and build a proper playbook for SREs to follow
* Set up a QA nagios instance (not worth the trouble yet)

For the second option, you'll need to add the SREs to $admin_users so they can log in. I can take care of that in another bug. The best option is probably a blend of 1 and 2: just monitor hosts for basic things like being up and having the SSH port open, where the remediation for the SREs is fairly straightforward.
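In Nagios object-configuration terms, the blended approach amounts to something like the following (illustrative only; the template names here are hypothetical, not the actual infra config):

```
# basic up/down check for the host itself
define host {
    use        generic-host              ; hypothetical template name
    host_name  db1.qa.scl3.mozilla.com
    address    db1.qa.scl3.mozilla.com
}

# is the SSH port open? standard check_ssh plugin, no nrpe needed
define service {
    use                  generic-service ; hypothetical template name
    host_name            db1.qa.scl3.mozilla.com
    service_description  SSH
    check_command        check_ssh
}
```

Checks like these need no agent on the host, which keeps the PuppetAgain box and infra Nagios decoupled.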
Reporter | ||
Comment 37•10 years ago
Sounds like a good idea. So yes, let's get it filed separately. I will leave this bug open until we get a response from Eric regarding the failures. Not sure if that affects my further work in setting up puppetagain via bug 717161.
Flags: needinfo?(eziegenhorn)
Assignee | ||
Comment 38•10 years ago
Ok, so if this is a one-off not monitored by releng, it should likely go into Infra Puppet, since we're the default bucket. So if :whimboo and/or :dustin can get us root/sudo access, we can re-puppetize it and set up basic monitoring for the host. If we're against infra puppet for this, then if we can keep the MOC's keys up to date and give them sudo, we can at least do basic monitoring for you.
Flags: needinfo?(eziegenhorn)
Reporter | ||
Comment 39•10 years ago
Eric, as mentioned above, all of our upcoming hosts will be driven by puppetagain. Given that we want to have a homogeneous environment, this host should also be covered there. But as Dustin said, if we can use the infra nagios and give you the admin access, it should all work, right?
Assignee | ||
Comment 40•10 years ago
Ok, then yes I believe we can do basic checks and alert handling if we have ssh and sudo access.
Reporter | ||
Comment 41•10 years ago
Great. Dustin, can you make those modifications please? Not sure if that should be global or just for us.
Comment 42•10 years ago
I landed the change on your default branch. You can merge if it looks OK. http://hg.mozilla.org/qa/puppet/rev/888a98c8c451
Reporter | ||
Comment 43•10 years ago
Thanks Dustin. Alphabetically sorted names would have been good, but that's minor and can be fixed later. Changes landed as: https://hg.mozilla.org/qa/puppet/rev/bbe9c54a2228
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 44•10 years ago
Sorry, we still have the nagios stuff to do here. So reopening and over to Eric.
Assignee: cknowles → eziegenhorn
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 45•10 years ago
Verified db1.qa.scl3.mozilla.com is in infra Nagios with basic checks. I still cannot ssh in. Once I verify access, I'll close this bug.
Reporter | ||
Comment 46•10 years ago
You should be able to login as root. In this case your added SSH key will be used.
Assignee | ||
Comment 47•10 years ago
:whimboo, I can't login as root either.
Comment 48•10 years ago
You should be good now. Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10 only lists db1 as toplevel::base, which doesn't include any kind of puppet run (either at boot or every 30 minutes). As a DB host, it should probably be toplevel::server.
Reporter | ||
Comment 49•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10
> only lists db1 as toplevel::base, which doesn't include any kind of puppet
> run (either at boot or every 30 minutes). As a DB host, it should probably
> be toplevel::server.

Makes sense. I will take care of it when I start working on the couchdb module over on bug 717161.

Eric, please mark the bug verified if all is working fine for you now. Thanks.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 50•10 years ago
:whimboo, no, I still can't ssh in as root. Also, does this use the standard infra root password in case we need to access the console?
Comment 51•10 years ago
Sorry, the root thing is my fault -- the change I suggested in comment 48 is the right fix. Not the infra root, but the QA root password is in GPG.
Assignee | ||
Comment 53•10 years ago
db1.qa.scl3 just rebooted and a few things happened:

1) The Nagios server was prevented from connecting to nrpe due to nrpe configuration, causing a bunch of alerts
3) There are checks that are missing because we're combining a PuppetAgain host with Infra nagios. They must've been there before it seems, because it just started alerting this evening. Maybe PuppetAgain removed them?
4) Some members of the MOC don't seem to have access to this server

The way I see it, having a host in PuppetAgain but infra Nagios is just a bad idea. If we want to keep working on it we'll have to get the appropriate nagios checks, nrpe configs and keys on it.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comment 54•10 years ago
Agreed - let's get a new bug opened for that, and work on it when some other stuff has settled. Most likely we'll come up with a hostgroup with only a few checks configured. Bug 926468 should fix the issue of some MOC people not having access -- it's been a while since I've done a manual refresh (which is why you had an old SSH key in there).
Reporter | ||
Comment 55•10 years ago
Does that mean we can close this bug now? Or do we first have to remove traces of infra nagios on this host? Let's do it so I can start working on this box next week.
Assignee | ||
Comment 56•10 years ago
I pared down the nagios checks for db1.qa.scl3 to just a simple ping up/down check. There is no issue with it not being infra-puppetized then, so yes we can close this now.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations