Closed Bug 734123 Opened 12 years ago Closed 12 years ago

set up puppet dashboard on puppetagain servers

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)


Details

We should get a puppet dashboard set up on the puppetagain servers.

It would be great if the dashboard could appear on a single server, rather than having to hunt down the master for a particular host.

This is probably best experimented with on relabs-puppet and relabs07/08 if anyone wants to try it.
Assignee: server-ops-releng → dustin
So, dashboard requires a backend MySQL database and seems to use quite a bit of CPU power - it should be a separate VM from any of the puppet masters.

We could, potentially, run the frontend on the releng cluster, with a new VM just for workers.

I want to talk to Jabba about this before I dive in: is it worth the trouble?  When does MySQL performance begin to suffer?
Based on the horsepower needed, this might be a good use of the hg mirror hardware currently sitting in scl1 (which, I think, may no longer be in use for hg?).  Dev services guys, can we reclaim that hardware?
And, sheeri, what do you think about running this on an existing DB cluster, given your experience with the infra puppet dashboard?
If we put it on an existing DB cluster, it has to be one that's OK with replication falling several hours behind on a weekly basis, because defragmenting the puppet dashboard database takes a while.

We found we needed it to be defrag'd weekly, and we still had space issues (Jabba did a lot of work to avoid putting too much data in the database, but there was still a ton of information and a lot of pruning work...)
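
For a sense of scale, the weekly defrag boils down to something like the following from cron - the database name and credentials path here are assumptions, not our actual setup:

  # /etc/cron.weekly/puppet-dashboard-optimize (illustrative path)
  # OPTIMIZE rewrites each table to reclaim space; the slave's replication
  # lags behind until this finishes.
  mysqlcheck --defaults-extra-file=/root/.my.cnf --optimize dashboard_production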

That's also a lot of disk I/O, which might end up not playing nicely with the other VMs on the machine.
Is there such a cluster?  Should we set up a dedicated cluster for dashboard backends?

Which machines do you mean by VMs - DB servers, or the puppet workers?
We have a puppetdashboard DB cluster, but it's having a ton of problems at the moment. Jabba was investigating other puppet dashboard solutions, and we have a bug to add more disk to the existing ones. 

The disk I/O comment was intended for the case where you put the db on a VM.
OK, thanks.  It seems like we should put these two eggs in the same basket - they certainly shouldn't be in baskets with any other eggs.  So I'll wait to see how the sysadmins puppet stuff plays out.

Puppetagain load is pretty low right now, but will likely grow within 6 months or so to be similar to what sysadmins puppet is doing today.
Depends on: 786651
So, rough architecture plan is this:

workers = releng-puppet-dashN.private.scl3
UI = releng cluster (scl3)
report acceptance = releng cluster (scl3)
db = puppetdashboard{1,2}.db.phx1

flows:
 masters -> report acceptance tcp/3000
 report acceptance -> db tcp/3306 (cross-DC flow)
 UI -> db tcp/3306 (cross-DC flow)
 workers -> db tcp/3306

I'd very much like to use the existing releng cluster for the web stuff, since we can embed it in secure.pub.b.m.o.  I'm aware it will be slow using a phx1 backend.  If necessary, we can work around this (most likely requiring an additional SSL cert).
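
Concretely, the masters -> report acceptance flow amounts to something like this on each master (a sketch assuming the stock "http" report processor; the URL is whatever ends up fronting report acceptance):

  [master]
      reports   = store,http
      reporturl = http://puppetdash.pvt.build.mozilla.org/reports/upload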
On Amy's advice to not do cross-DC flows, we're going to try to set up a separate DB cluster in scl3, using the old hg-mirror hardware.  So, I'll get some bugs filed for that, and close out the IP allocation and VM bugs.
No longer blocks: PuppetAgain
OK, bug 771121 tracks decomming the HG mirrors, which will free up that hardware by Sep 10.

That hardware is
 HP DL360G7 E5645 Base US Svr
 RAM: 6GB
 CPU: Intel® Xeon® E5645 (2.40GHz/6-core/12MB/80W, DDR3-1333, HT Turbo 1/1/1/1/2/3)
 Disk: Smart Array P410i
    with six
      physicaldrive 1I:1:1 (port 1I:box 1:bay *, SAS, 146 GB, OK)
    in a single
      logicaldrive 1 (683.5 GB, RAID 5, OK)

Sheeri, what do you think of using these as MySQL servers under load similar to puppetdashboard*.db.phx1?  Should we change the disk config?  Is it worth stalling long enough to get more RAM?
Depends on: 771121
Depends on: 788605
Depends on: 788630
puppetdash1/2.db.scl3 are kickstarting now.  I need to update DHCP in inventory.  I'll also need to check whether these will be using bonding, and how to encode the mgmt NIC.
Depends on: 791023
If this is running on the releng cluster, then we need to disable diffs entirely so we don't leak secrets - bug 791102.
Depends on: 791102
OK, this is pretty much working.
  https://secure.pub.build.mozilla.org/puppetdash/

'course, you'll need to use the new secure vhost; in /etc/hosts:

63.245.215.57 secure.pub.build.mozilla.org


I verified that disabling the report vhost doesn't cause production failures:

Sep 14 10:34:17 releng-puppet1 puppet-master[21846]: Unable to submit report to http://puppetdash.pvt.build.mozilla.org/reports/upload [403] Forbidden

so I'll include this as a workaround for any issues where db/webhead slowness affects production (sketched below).
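
Roughly, the workaround is just (the conf filename is illustrative):

  # take report acceptance offline; masters log the 403 above but runs succeed
  mv /etc/httpd/conf.d/puppetdash-reports.conf{,.disabled}
  service httpd reload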


There are some lingering problems with full URL paths - puppet dashboard *mostly* works at a sub-URI, but not quite.  They seem harmless enough so far (you have to auth twice, and some image links are broken), and I'll work to fix this upstream.


What remains:

 - monitoring for DB servers
 - monitoring for workers
 - review and update docs
OK, this is done.  We may need to revisit as we see how this scales.  It may need more workers, for example, or the db servers may need more tuning.

There were some issues with the webheads yesterday and today, but I strongly suspect those were due to a bogus master/master configuration of the databases.  I think this app is not master/master capable, so I switched the DBs to a master/slave configuration, with both DBs in the ro pool and only the master in the rw pool.
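
For reference, the slave side of that switch looks roughly like this (host, user, and log coordinates are placeholders, not the real values):

  -- run on the slave
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST='puppetdash1.db.scl3.mozilla.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='...',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4;
  START SLAVE;
  SET GLOBAL read_only = ON;  -- slave serves the ro pool only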

The installed webapp still has some absolute paths.  If we're patient, we'll wait until a new version is released with my patches applied; otherwise, we *could* patch this locally with puppet or build some custom RPMs.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations