Closed Bug 969261 (Opened 10 years ago, Closed 10 years ago)

Create Linux VM for hosting Mozmill-CI test reports based on CouchDB

Categories

(Infrastructure & Operations :: Virtualization, task)

Platform: All
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: whimboo, Assigned: ericz)

References

()

Details

(Whiteboard: [qa-automation-wanted])

Attachments

(1 file)

Right now we are using iriscouch.com for hosting the Mozmill test results produced via mozmill-ci. Sadly it has already turned out a couple of times that this service is not stable enough, and it went down during critical times, the same as what happened yesterday during the update tests for beta and releasetest. See bug 969052.

Given that we can never fully rely on an external service, I request that we bring the couch databases back in house. For that we need a simple Linux server with about 100GB of data storage. It can be a minimal Linux installation (not Ubuntu); I'm not sure what IT recommends in such cases. The only applications which need to be installed are CouchDB (http://couchdb.apache.org/) and probably nginx, so that we can offer the service via port 80.

We will have multiple dashboards hosted on this server, so we would also need multiple CNAME entries to access them via virtual hosts. That request might have to go into a separate bug later, but I wanted to mention it now. The host also has to be accessible from the internet, so a firewall exception has to be added.
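
For illustration only, a minimal sketch of the kind of nginx virtual host this setup would need - the server_name is a placeholder and CouchDB's default port 5984 on localhost is assumed, nothing here is decided yet:

server {
    listen 80;
    # placeholder name; the real CNAMEs would come from the later DNS bug
    server_name dashboard1.qa.scl3.mozilla.com;

    location / {
        # assumes CouchDB listening on its default port 5984, bound to localhost
        proxy_pass http://127.0.0.1:5984;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}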
Two months without any response here :(. Who can actually create such a virtual machine?
Severity: normal → major
Flags: needinfo?(dparsons)
I'm sorry, I have no idea how this slipped through our fingers.

I can get you a RHEL6 VM with whatever minimal specs you think are appropriate. (2 cpu, 2G RAM?)  100G is not a problem.  Do you want that as part of the / or a separate partition to be named?

Also, do you have a VLAN/hostname in mind?

CJK
Assignee: server-ops-virtualization → cknowles
Flags: needinfo?(dparsons)
(In reply to Chris Knowles [:cknowles] from comment #2)
> I can get you a RHEL6 VM with whatever minimal specs you think are

Yes, please use a system which is fully supported by IT/RelOps. I assume it will end up being a puppetized host? I ask because we are currently working on that for our CI infrastructure in scl3.

> appropriate. (2 cpu, 2G RAM?)  100G is not a problem.  Do you want that as

Better to use 4GB of RAM, and we can keep a single core. Otherwise 100GB sounds fine. 

> part of the / or a separate partition to be named?

I think it would be best to have it as a separate partition, ideally with LVM?

> Also, do you have a VLAN/hostname in mind?

Oh right, sorry that I forgot about that. I think we should use db1.qa.scl3.mozilla.com.
So, the main specs sound good.

The VM templates are not currently LVM; instead of "partition" I should have asked: you asked for 100G of data storage - would you like that as part of /, or in a separate mount point?

CJK
Let's get this added as /data, so we keep the data separated.
Alright, that VM is created.
db1.qa.scl3.mozilla.com
RHEL6_x86-64
1 core CPU
4G RAM.
Default disk for / and 100G in /data

initial puppetization and nagios have been added, and it should be ready for any loving customization - let me know if you need further help.
So, as far as enterprise Linux distros go, PuppetAgain supports CentOS-6.{2,5}, while infra puppet supports RHEL6.  The two have different install processes, too.  It looks like this is set up for infra puppet now.  If that's not appropriate, speak now :)
Hm, I didn't know that we have different puppet systems! So given the info from Dustin, I think we should really change that, so that we use PuppetAgain here too. I don't see why a single VM out of our >50 VMs should be based on infra puppet while all the others get managed via PuppetAgain. CentOS should be perfectly fine, and AFAIR we already ran CouchDB on CentOS a while ago. Sorry Chris.
I believe that the PuppetAgain CentOS kickstart process is documented (by gcox).  The CentOS kickstart process seems to produce aligned VMs (whereas Ubuntu does not), but this is probably worth verifying after creation.
Alright, I'll give it a shot at puppetagain... dustin, if any problems occur, I'll be knocking on your door.  :)

My confusion here was that the IT/Relops note gave me a conflicting signal that led me towards IT, rather than relops.  

I'll bring down the db1 here in a moment...

CJK
Alright, I spun up a puppetagain VM and it seemed successful - however, when either Henry or I try to log in, it doesn't accept our keys and asks for a password, and even for root it doesn't take any of the usual suspects (from the IT side).

Can you let me know what I've done wrong?  

Also, I was unable to find CentOS-specific puppetagain docs; there are the Ubuntu ones that I wrote with your help, but if there's something special to CentOS, I'm unable to find it.

Any help is appreciated.
CentOS and Ubuntu processes differ only in which PXE menu selection to make.

As for what's wrong, there was a problem with certificates, and now a syntax error in the manifests, both handled in bug 1008872.
I fixed the syntax error, which I introduced yesterday. :/ So we should try again.
Great. It works now; I am able to log in to the puppetmaster.
Alright, I've tried spinning up that VM again, and I'm still prompted for a password when trying to ssh to it as myself or root, and none of the usual suspects that I have work for it.

Is there some manifest that the server needs to be added to, or is there still a problem somewhere?

Any assistance at this point is appreciated.  

CJK
I'll put the PuppetAgain kickstart password in the sysadmins GPG file, but you're correct that there's no node definition in place, and the node won't get the usual set of SSH keys until that's corrected.  Andrei should be able to take it from here (by adding the node definition).
I'm pretty sure the password (I've been using the QA deploy password in the PUPPET_PASS=xxx section) is already in passwords.

The machine is up and running. If the node were added to the node definitions, would it just "pick things up", or should I respin once the node definition is added?

Andrei - let me know when the node def is added; I still have to do some tweaking, like the data volume and updated ESX tools, before I can release this VM.

CJK
(In reply to Chris Knowles [:cknowles] from comment #17)
> I'm pretty sure the password (I've been using the QA deploy password in the
> PUPPET_PASS=xxx section) is already in passwords.

The default root password is different from the deploy password.

> The machine is up, and running, if the node were added to the node
> definition, would it just "pick things up"?  or should I respin once the
> node definition is added.

It will just pick things up.  It runs puppet every 10m.

> Andrei - let me know know when the node def is added, I still have to do
> some tweaking, like the data volume and updated ESX tools, before I can
> release this VM.

I'll put the kickstart password in GPG now, and ping you.
(In reply to Andrei Eftimie from comment #14)
> Great. It works now, I am able to login to puppetmaster

Btw. the comment from Andrei was not meant for this bug but for bug 1008872. :)
Alright, given the update from Dustin - I've logged into the VM with the root password and set up the VMware tools as well as the extra volume at /data - at this point, I'm not sure if you need anything further from me.

Let me know.
Dustin, would we be blocked on anything related to puppet here? Do we have support for proxies in Mint? I don't think so. So the fix on bug 997721 would also be necessary here?
Mint?? We don't support Mint at all..
Oops, CentOS is what I meant. That's what has been installed above, right?
That sounds much better.  And yes, proxies work on CentOS already.
Blocks: 997738
Dustin, I tried to log into that machine but it fails. I'm using the same key config as for any other host in qa.scl3.mozilla.com. So maybe the last puppet run was not successful?
Well, that's odd -- it looks like it had CentOS installed, but not puppetagain.  I'll figure out what happened with cknowles.
Discussed with Dustin - did the pxe boot, puppetagain on ESX, centos install, using the QA deployment puppet_pass - per dustin, that should have worked.  

per IRC, dustin's going to try it again and update on what I need to change in process, or what else went wrong.

CJK
Well, that was not what I expected.

For unknown reasons, it's trying to use mirrorlist.centos.org.

dustin@cerf ~ $ host mirrorlist.centos.org
mirrorlist.centos.org has address 64.235.47.134
mirrorlist.centos.org has address 72.232.223.58
mirrorlist.centos.org has address 204.15.73.243
mirrorlist.centos.org has IPv6 address 2a02:2498:1:3d:5054:ff:fed3:e91a

For reasons I can't explain, it chooses the v6 address even though v6 isn't configured on this network (I think..).  Then it fails to connect and doesn't fall back to the other three v4 addresses.

It hangs indefinitely in this state.  So I suspect that the VM was rebooted out of this state somehow, and came up in the state we discovered this morning.
Adding further mystery, the kickstart profile says

# blow away any existing repositories and add our own
# (note that this assumes that the repo server resolves as 'repos')
mkdir -p /etc/yum.repos.d
rm -f /etc/yum.repos.d/*

yet

[root@db1 yum.repos.d]# ls -al
total 32
drwxr-xr-x.  2 root root 4096 May 20 07:25 .
drwxr-xr-x. 76 root root 4096 May 20 07:35 ..
-rw-r--r--.  1 root root 1926 Nov 30 16:07 CentOS-Base.repo
-rw-r--r--.  1 root root  638 Nov 30 16:07 CentOS-Debuginfo.repo
-rw-r--r--.  1 root root  630 Nov 30 16:07 CentOS-Media.repo
-rw-r--r--.  1 root root 4528 Nov 30 16:07 CentOS-Vault.repo
-rw-r--r--.  1 root root  577 May 20 07:25 init.repo

where those CentOS-* are the files containing "mirrorlist.centos.org".

That's coming from the "yum upgrade -y" that occurs before the attempt to install puppet, as confirmed by looking in /var/log/yum.log after an aborted kickstart.

Access to mirrorlist.centos.org might work fine for the releng networks, since outgoing HTTP is allowed.  Or maybe centos.org just added IPv6.  Dunno.

I've added a line to blow away the repos at that time, and re-started the kickstart.
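
Roughly speaking, the added cleanup is of this shape (a sketch, not the exact line that landed):

# remove the stock CentOS-*.repo files (which point at mirrorlist.centos.org)
# right before the upgrade, so only the internal init.repo is consulted
rm -f /etc/yum.repos.d/CentOS-*.repo
yum clean all
yum upgrade -y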
OK, that worked up to the point where it runs puppet.  There's no node definition in place, so the puppet runs are failing.  Once you fix that, it should puppetize right up.
(I determined that it was at this point by seeing the CentOS startup screen on the console, then SSH'ing as root with the kickstart password and seeing

[root@db1 ~]# cat puppetize.log
Contacting puppet server puppet
20 May 08:02:18 ntpdate[4995]: no server suitable for synchronization found
Certificate request for db1.qa.scl3.mozilla.com
securely removing deploypass
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
Running puppet agent against server 'puppet'
Puppet run failed; re-trying after 10m
)

so puppetize is still running.  It's trying every 10 minutes, and as soon as it has a successful run, it will set everything up.
The problem here is that the node cannot be found. So I added it for now as toplevel::base, until we have the custom manifests done.

https://hg.mozilla.org/build/puppet/rev/79775af720be

I will wait for the next puppet run, and close this bug as fixed if it worked.
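
The node entry is roughly of this shape - a sketch of standard PuppetAgain node syntax; the exact contents are in the revision linked above:

# qa-nodes.pp - toplevel::base is a stop-gap until a couchdb module exists
node "db1.qa.scl3.mozilla.com" {
    include toplevel::base
}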
Status: NEW → ASSIGNED
This box is in Infra Nagios, but we can't log in to it because it is a PuppetAgain box.  Should we perhaps move it to Releng Nagios?  It doesn't do us much good here.
I think that would be a good thing to do; I will wait to see what Dustin thinks. But given that PuppetAgain is under RelEng control, we might also want to use their Nagios system.
Eric, you mentioned nagios alerts on IRC. Can you please tell us which specific issues you are seeing? Thanks.
No, releng is not maintaining QA's systems, so the alerts would not be useful there, either.

The options are:
 * Don't use monitoring
 * Use infra nagios and build a proper playbook for SREs to follow
 * Set up a QA nagios instance (not worth the trouble yet)

For the second option, you'll need to add the SREs to $admin_users so they can login.  I can take care of that in another bug.

The best option is probably a blend of 1 and 2: just monitor hosts for basic things like being up and having the SSH port open, where the remediation for the SREs is fairly straightforward.
Sounds like a good idea. So yes, let's get it filed separately. I will leave this bug open until we get a response from Eric regarding the failures. Not sure if that affects my further work on setting up puppetagain via bug 717161.
Flags: needinfo?(eziegenhorn)
Ok, so this is a one-off not monitored by releng; it should likely go into infra puppet, since we're the default bucket.  So if :whimboo and/or :dustin can get us root/sudo access, we can re-puppetize it and set up basic monitoring for the host.

If we're against infra puppet for this, then as long as we can keep the MOC's keys up to date and give them sudo, we can at least do basic monitoring for you.
Flags: needinfo?(eziegenhorn)
Eric, as mentioned above, all of our upcoming hosts will be driven by puppetagain. Given that we want to have a homogeneous environment, this host should also be covered there. But as Dustin said, if we can use the infra nagios and give you the admin access, it should all work, right?
Ok, then yes I believe we can do basic checks and alert handling if we have ssh and sudo access.
Great. Dustin, can you make those modifications please? Not sure if that should be global or just for us.
I landed the change on your default branch.  You can merge if it looks OK.
  http://hg.mozilla.org/qa/puppet/rev/888a98c8c451
Thanks Dustin. Alphabetically sorted names would have been good, but it's minor and can be fixed later.

Changes landed as:
https://hg.mozilla.org/qa/puppet/rev/bbe9c54a2228
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sorry, we still have the nagios stuff to do here. So reopening and over to Eric.
Assignee: cknowles → eziegenhorn
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Verified db1.qa.scl3.mozilla.com is in infra Nagios with basic checks.  I still cannot ssh in.  Once I verify access, I'll close this bug.
You should be able to log in as root; in that case your added SSH key will be used.
:whimboo, I can't login as root either.
You should be good now.

Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10 only lists db1 as toplevel::base, which doesn't include any kind of puppet run (either at boot or every 30 minutes).  As a DB host, it should probably be toplevel::server.
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> Henrik, http://hg.mozilla.org/qa/puppet/file/tip/manifests/qa-nodes.pp#l10
> only lists db1 as toplevel::base, which doesn't include any kind of puppet
> run (either at boot or every 30 minutes).  As a DB host, it should probably
> be toplevel::server.

Makes sense. I will take care of it when I start working on the couchdb module over on bug 717161.
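
The change suggested in comment 48 would then be roughly a one-line swap in that node definition (sketch only):

node "db1.qa.scl3.mozilla.com" {
    # toplevel::server instead of toplevel::base, so the host gets regular puppet runs
    include toplevel::server
}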

Eric, please mark the bug verified if all is working fine for you now. Thanks.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
:whimboo, no I still can't ssh in as root.  Also, does this use the standard infra root password, in case we need to access the console?
Sorry, the root thing is my fault -- the change I suggested in comment 48 is the right fix.

Not the infra root, but the QA root password is in GPG.
Works now, thanks!
Status: RESOLVED → VERIFIED
db1.qa.scl3 just rebooted and a few things happened:

1) The Nagios server was prevented from connecting to nrpe due to the nrpe configuration, causing a bunch of alerts
2) There are checks missing because we're combining a PuppetAgain host with infra Nagios.  They must've been there before, it seems, because it just started alerting this evening.  Maybe PuppetAgain removed them?
3) Some members of the MOC don't seem to have access to this server

The way I see it, having a host in PuppetAgain but infra Nagios is just a bad idea.  If we want to keep working on it we'll have to get the appropriate nagios checks, nrpe configs and keys on it.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Agreed - let's get a new bug open for that, and work on it when some other stuff has settled.  Most likely we'll come up with a hostgroup with only a few checks configured.

Bug 926468 should fix the issue of some MOC people not having access -- it's been a while since I've done a manual refresh (which is why you had an old SSH key in there).
Does that mean we can close this bug now? Or do we first have to remove the traces of infra nagios on this host? Let's get it sorted so I can start working on this box next week.
I pared down the nagios checks for db1.qa.scl3 to just a simple ping up/down check.  There is no issue with it not being infra-puppetized then, so yes we can close this now.
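
For reference, a ping-only host check in Nagios typically looks something like this; the values here are illustrative, not the actual configuration:

define host {
    use                  generic-host
    host_name            db1.qa.scl3.mozilla.com
    alias                db1.qa.scl3
    address              db1.qa.scl3.mozilla.com
    check_command        check-host-alive
    max_check_attempts   3
}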
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Thanks Eric!
Status: RESOLVED → VERIFIED
Product: mozilla.org → Infrastructure & Operations