Closed Bug 969261 (Opened 10 years ago, Closed 10 years ago)

Create Linux VM for hosting Mozmill-CI test reports based on CouchDB

Categories

(Infrastructure & Operations :: Virtualization, task)

Platform: All
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: whimboo, Assigned: ericz)

References

()

Details

(Whiteboard: [qa-automation-wanted])

Attachments

(1 file)

Right now we are using iriscouch.com for hosting the Mozmill test results produced via mozmill-ci. Sadly it has already turned out a couple of times that this service is not stable enough, and it went down during critical times, the same as what happened yesterday during the update tests for beta and releasetest. See bug 969052.

Given that we can never fully rely on an external service, I request that we bring the couch databases back in house. For that we need a simple Linux server with about 100GB of data storage. It can be a minimal Linux installation (not Ubuntu); I'm not sure what IT recommends in such cases. The only applications which need to be installed are CouchDB (http://couchdb.apache.org/) and probably nginx, so that we can offer the service via port 80.

We will have multiple dashboards hosted on this server, so we would also need multiple CNAME entries to access them via virtual hosts. That request might have to go into a separate bug later, but I wanted to mention it now. The host also has to be accessible from the internet, so a firewall exception has to be added.
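
For illustration only, a minimal sketch of the kind of nginx virtual host this setup would need - the server_name is a placeholder and CouchDB's default port 5984 on localhost is assumed, nothing here is decided yet:

server {
    listen 80;
    # placeholder name; the real CNAMEs would come from the later DNS bug
    server_name dashboard1.qa.scl3.mozilla.com;

    location / {
        # assumes CouchDB listening on its default port 5984, bound to localhost
        proxy_pass http://127.0.0.1:5984;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}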
Two months without any response here :(. Who can actually create such a virtual machine?
Severity: normal → major
Flags: needinfo?(dparsons)
I'm sorry, I have no idea how this slipped through our fingers.

I can get you a RHEL6 VM with whatever minimal specs you think are appropriate. (2 cpu, 2G RAM?)  100G is not a problem.  Do you want that as part of the / or a separate partition to be named?

Also, do you have a VLAN/hostname in mind?

CJK
Assignee: server-ops-virtualization → cknowles
Flags: needinfo?(dparsons)
(In reply to Chris Knowles [:cknowles] from comment #2)
> I can get you a RHEL6 VM with whatever minimal specs you think are

Yes, please use a system which is fully supported by IT/RelOps. I assume it will end up being a puppetized host? I ask because we are currently working on that for our CI infrastructure in scl3.

> appropriate. (2 cpu, 2G RAM?)  100G is not a problem.  Do you want that as

Better to use 4GB of RAM, and we can keep a single core. Otherwise 100GB sounds fine. 

> part of the / or a separate partition to be named?

I think it would be best to have it as a separate partition, ideally with LVM?

> Also, do you have a VLAN/hostname in mind?

Oh right, sorry that I forgot about that. I think we should use db1.qa.scl3.mozilla.com.
So, the main specs sound good.

The VM templates are not currently LVM; instead of "partition" I should have asked: you asked for 100G of data storage - would you like that as part of /, or in a separate mount point?

CJK
Let's get this added as /data, so we keep the data separated.
Alright, that VM is created.
db1.qa.scl3.mozilla.com
RHEL6_x86-64
1 core CPU
4G RAM.
Default disk for / and 100G in /data

initial puppetization and nagios have been added, and it should be ready for any loving customization - let me know if you need further help.
So, as far as enterprise Linux distros go, PuppetAgain supports CentOS-6.{2,5}, while infra puppet supports RHEL6.  The two have different install processes, too.  It looks like this is set up for infra puppet now.  If that's not appropriate, speak now :)
Hm, I didn't know that we have different puppet systems! So given the info from Dustin, I think we should really change that, so that we use PuppetAgain here too. I don't see why a single VM out of our >50 VMs should be based on infra puppet while all the others get managed via PuppetAgain. CentOS should be perfectly fine, and AFAIR we already ran CouchDB on CentOS a while ago. Sorry Chris.
I believe that the PuppetAgain CentOS kickstart process is documented (by gcox).  The CentOS kickstart process seems to produce aligned VMs (whereas Ubuntu does not), but this is probably worth verifying after creation.
Alright, I'll give it a shot at puppetagain... dustin, if any problems occur, I'll be knocking on your door.  :)

My confusion here was that the IT/Relops note gave me a conflicting signal that led me towards IT, rather than relops.  

I'll bring down the db1 here in a moment...

CJK
Alright, I spun up a puppetagain VM and it seemed successful - however, when either Henry or I try to log in, it doesn't accept our keys and asks for a password, and even for root it doesn't take any of the usual suspects (from the IT side).

Can you let me know what I've done wrong?  

Also, I was unable to find CentOS-specific puppetagain docs; there are the Ubuntu ones that I wrote with your help, but if there's something special to CentOS, I'm unable to find it.

Any help is appreciated.
CentOS and Ubuntu processes differ only in which PXE menu selection to make.

As for what's wrong, there was a problem with certificates, and now a syntax error in the manifests, both handled in bug 1008872.
I fixed the syntax error, which I introduced yesterday. :/ So we should try again.
Great. It works now; I am able to log in to the puppetmaster.
Alright, I've tried spinning up that VM again, and I'm still prompted for a password when trying to ssh to it as myself or root, and none of the usual suspects that I have work for it.

Is there some manifest that the server needs to be added to, or is there still a problem somewhere?

Any assistance at this point is appreciated.  

CJK
I'll put the PuppetAgain kickstart password in the sysadmins GPG file, but you're correct that there's no node definition in place, and the node won't get the usual set of SSH keys until that's corrected.  Andrei should be able to take it from here (by adding the node definition).
I'm pretty sure the password (I've been using the QA deploy password in the PUPPET_PASS=xxx section) is already in passwords.

The machine is up and running. If the node were added to the node definitions, would it just "pick things up", or should I respin once the node definition is added?

Andrei - let me know when the node def is added; I still have to do some tweaking, like the data volume and updated ESX tools, before I can release this VM.

CJK
(In reply to Chris Knowles [:cknowles] from comment #17)
> I'm pretty sure the password (I've been using the QA deploy password in the
> PUPPET_PASS=xxx section) is already in passwords.

The default root password is different from the deploy password.

> The machine is up, and running, if the node were added to the node
> definition, would it just "pick things up"?  or should I respin once the
> node definition is added.

It will just pick things up.  It runs puppet every 10m.

> Andrei - let me know know when the node def is added, I still have to do
> some tweaking, like the data volume and updated ESX tools, before I can
> release this VM.

I'll put the kickstart password in GPG now, and ping you.
(In reply to Andrei Eftimie from comment #14)
> Great. It works now, I am able to login to puppetmaster

Btw. the comment from Andrei was not meant for this bug but for bug 1008872. :)
Alright, given the update from Dustin - I've logged into the VM with the root password and set up the VMware tools as well as the extra volume at /data - at this point, I'm not sure if you need anything further from me.

Let me know.
Dustin, would we be blocked on anything related to puppet here? Do we have support for proxies in Mint? I don't think so. So the fix on bug 997721 would also be necessary here?
Mint?? We don't support Mint at all..
Oops, CentOS is what I meant. That's what has been installed above, right?
That sounds much better.  And yes, proxies work on CentOS already.
Blocks: 997738
Dustin, I tried to log into that machine but it fails. I'm using the same key config as for any other host in qa.scl3.mozilla.com. So maybe the last puppet run was not successful?
Well, that's odd -- it looks like it had CentOS installed, but not puppetagain.  I'll figure out what happened with cknowles.
Discussed with Dustin - did the pxe boot, puppetagain on ESX, centos install, using the QA deployment puppet_pass - per dustin, that should have worked.  

per IRC, dustin's going to try it again and update on what I need to change in process, or what else went wrong.

CJK
Well, that was not what I expected.

For unknown reasons, it's trying to use mirrorlist.centos.org.

dustin@cerf ~ $ host mirrorlist.centos.org
mirrorlist.centos.org has address 64.235.47.134
mirrorlist.centos.org has address 72.232.223.58
mirrorlist.centos.org has address 204.15.73.243
mirrorlist.centos.org has IPv6 address 2a02:2498:1:3d:5054:ff:fed3:e91a

For reasons I can't explain, it chooses the v6 address even though v6 isn't configured on this network (I think..).  Then it fails to connect and doesn't fall back to the other three v4 addresses.

It hangs indefinitely in this state.  So I suspect that the VM was rebooted out of this state somehow, and came up in the state we discovered this morning.
Adding further mystery, the kickstart profile says

# blow away any existing repositories and add our own
# (note that this assumes that the repo server resolves as 'repos')
mkdir -p /etc/yum.repos.d
rm -f /etc/yum.repos.d/*

yet

[root@db1 yum.repos.d]# ls -al
total 32
drwxr-xr-x.  2 root root 4096 May 20 07:25 .
drwxr-xr-x. 76 root root 4096 May 20 07:35 ..
-rw-r--r--.  1 root root 1926 Nov 30 16:07 CentOS-Base.repo
-rw-r--r--.  1 root root  638 Nov 30 16:07 CentOS-Debuginfo.repo
-rw-r--r--.  1 root root  630 Nov 30 16:07 CentOS-Media.repo
-rw-r--r--.  1 root root 4528 Nov 30 16:07 CentOS-Vault.repo
-rw-r--r--.  1 root root  577 May 20 07:25 init.repo

where those CentOS-* are the files containing "mirrorlist.centos.org".

That's coming from the "yum upgrade -y" that occurs before the attempt to install puppet, as confirmed by looking in /var/log/yum.log after an aborted kickstart.

Access to mirrorlist.centos.org might work fine for the releng networks, since outgoing HTTP is allowed.  Or maybe centos.org just added IPv6.  Dunno.

I've added a line to blow away the repos at that time, and re-started the kickstart.
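
Roughly speaking, the added cleanup is of this shape (a sketch, not the exact line that landed):

# remove the stock CentOS-*.repo files (which point at mirrorlist.centos.org)
# right before the upgrade, so only the internal init.repo is consulted
rm -f /etc/yum.repos.d/CentOS-*.repo
yum clean all
yum upgrade -y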
OK, that worked up to the point where it runs puppet.  There's no node definition in place, so the puppet runs are failing.  Once you fix that, it should puppetize right up.
(I determined that it was at this point by seeing the CentOS startup screen on the console, then SSH'ing as root with the kickstart password and seeing

[root@db1 ~]# cat puppetize.log
Contacting puppet server puppet
20 May 08:02:18 ntpdate[4995]: no server suitable for synchronization found
Certificate request for db1.qa.scl3.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
)

so puppetize is still running.  It's trying every 10 minutes, and as soon as it has a successful run, it will set everything up.
The problem here is that the node cannot be found. So I added it for now as toplevel::base, until we have the custom manifests done.

https://hg.mozilla.org/build/puppet/rev/79775af720be

I will wait for the next puppet run, and close this bug as fixed if it worked.
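
The node entry is roughly of this shape - a sketch of standard PuppetAgain node syntax; the exact contents are in the revision linked above:

# qa-nodes.pp - toplevel::base is a stop-gap until a couchdb module exists
node "db1.qa.scl3.mozilla.com" {
    include toplevel::base
}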
Status: NEW → ASSIGNED
This box is in Infra Nagios, but we can't log in to it because it is a PuppetAgain box.  Should we perhaps move it to Releng Nagios?  It doesn't do us much good here.
I think that would be a good thing to do; I will wait to see what Dustin thinks. But given that PuppetAgain is under RelEng control, we might also want to use their Nagios system.
Eric, you mentioned nagios alerts on IRC. Can you please tell us which specific issues you are seeing? Thanks.
No, releng is not maintaining QA's systems, so the alerts would not be useful there, either.

The options are:
 * Don't use monitoring
 * Use infra nagios and build a proper playbook for SREs to follow
 * Set up a QA nagios instance (not worth the trouble yet)

For the second option, you'll need to add the SREs to $admin_users so they can login.  I can take care of that in another bug.

The best option is probably a blend of 1 and 2: just monitor hosts for basic things like being up and having the SSH port open, where the remediation for the SREs is fairly straightforward.
Sounds like a good idea. So yes, let's get it filed separately. I will leave this bug open until we get a response from Eric regarding the failures. Not sure if that affects my further work on setting up puppetagain via bug 717161.
Flags: needinfo?(eziegenhorn)
Ok, so this is a one-off not monitored by releng; it should likely go into infra puppet, since we're the default bucket.  So if :whimboo and/or :dustin can get us root/sudo access, we can re-puppetize it and set up basic monitoring for the host.

If we're against infra puppet for this, then as long as we can keep the MOC's keys up to date and give them sudo, we can at least do basic monitoring for you.
Flags: needinfo?(eziegenhorn)
Eric, as mentioned above, all of our upcoming hosts will be driven by puppetagain. Given that we want to have a homogeneous environment, this host should also be covered there. But as Dustin said, if we can use the infra nagios and give you the admin access, it should all work, right?
Ok, then yes I believe we can do basic checks and alert handling if we have ssh and sudo access.
Great. Dustin, can you make those modifications please? Not sure if that should be global or just for us.
I landed the change on your default branch.  You can merge if it looks OK.
  http://hg.mozilla.org/qa/puppet/rev/888a98c8c451
Thanks Dustin. Alphabetically sorted names would have been good, but it's minor and can be fixed later.

Changes landed as:
https://hg.mozilla.org/qa/puppet/rev/bbe9c54a2228
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sorry, we still have the nagios stuff to do here. So reopening and over to Eric.
Assignee: cknowles → eziegenhorn
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Verified db1.qa.scl3.mozilla.com is in infra Nagios with basic checks.  I still cannot ssh in.  Once I verify access, I'll close this bug.
You should be able to log in as root; in that case your added SSH key will be used.
:whimboo, I can't login as root either.
You should be good now.

Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10 only lists db1 as toplevel::base, which doesn't include any kind of puppet run (either at boot or every 30 minutes).  As a DB host, it should probably be toplevel::server.
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10
> only lists db1 as toplevel::base, which doesn't include any kind of puppet
> run (either at boot or every 30 minutes).  As a DB host, it should probably
> be toplevel::server.

Makes sense. I will take care of it when I start working on the couchdb module over on bug 717161.
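
The change suggested in comment 48 would then be roughly a one-line swap in that node definition (sketch only):

node "db1.qa.scl3.mozilla.com" {
    # toplevel::server instead of toplevel::base, so the host gets regular puppet runs
    include toplevel::server
}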

Eric, please mark the bug verified if all is working fine for you now. Thanks.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
:whimboo, no I still can't ssh in as root.  Also, does this use the standard infra root password, in case we need to access the console?
Sorry, the root thing is my fault -- the change I suggested in comment 48 is the right fix.

Not the infra root, but the QA root password is in GPG.
Works now, thanks!
Status: RESOLVED → VERIFIED
db1.qa.scl3 just rebooted and a few things happened:

1) The Nagios server was prevented from connecting to nrpe due to the nrpe configuration, causing a bunch of alerts
2) There are checks missing because we're combining a PuppetAgain host with infra Nagios.  They must've been there before, it seems, because it just started alerting this evening.  Maybe PuppetAgain removed them?
3) Some members of the MOC don't seem to have access to this server

The way I see it, having a host in PuppetAgain but infra Nagios is just a bad idea.  If we want to keep working on it we'll have to get the appropriate nagios checks, nrpe configs and keys on it.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Agreed - let's get a new bug open for that, and work on it when some other stuff has settled.  Most likely we'll come up with a hostgroup with only a few checks configured.

Bug 926468 should fix the issue of some MOC people not having access -- it's been a while since I've done a manual refresh (which is why you had an old SSH key in there).
Does that mean we can close this bug now? Or do we first have to remove the traces of infra nagios on this host? Let's get it sorted so I can start working on this box next week.
I pared down the nagios checks for db1.qa.scl3 to just a simple ping up/down check.  There is no issue with it not being infra-puppetized then, so yes we can close this now.
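
For reference, a ping-only host check in Nagios typically looks something like this; the values here are illustrative, not the actual configuration:

define host {
    use                  generic-host
    host_name            db1.qa.scl3.mozilla.com
    alias                db1.qa.scl3
    address              db1.qa.scl3.mozilla.com
    check_command        check-host-alive
    max_check_attempts   3
}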
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Thanks Eric!
Status: RESOLVED → VERIFIED
Product: mozilla.org → Infrastructure & Operations