Closed Bug 648913 Opened 14 years ago Closed 14 years ago

Set up graphite

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jbalogh, Assigned: oremj)

References

Details

Attachments

(1 file)

Attached file graphite install notes
First things first: graphite is a pain in the ass to set up. Now that we have the moaning out of the way, let's install graphite! We're going to use it for application-level metrics like Etsy does[1]. It's not competing with ganglia or nagios; it's a new thing to make our systems better. I've attached my notes from setting it up on khan, but they're not perfect. You guys know where to find me on IRC so we can figure it out together. Getting this puppetized would be great.

[1] http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
I just emailed Corey about this Friday afternoon. We'd also like to start using Graphite+Statsd to start measuring and tracking pieces of SUMO. I had a lot more luck with the Python Statsd implementation[1] than the Node.js implementation[2] from Etsy, but Jeff's experience may differ, and we have easy connections to the authors of both. (Statsd is a simple server/network daemon that lives in front of graphite, possibly on each web head, and acts as an intermediary between the app and graphite.)

[1] https://github.com/sivy/py-statsd
[2] https://github.com/etsy/statsd
(In reply to comment #1)
> I had a lot more luck with the Python Statsd implementation[1] than the
> Node.js implementation[2] from Etsy

Further experiments locally have revealed issues with the Python implementation. The node implementation has its issues, too, but I suspect those are easier to fix. Spending some time on this today.
I haven't had any problems with the js version, and I'm suspicious of the quality of the python version (the example client from that author was not good). The python version may have perf issues from the global interpreter lock. I'd rather start with statsd.js since it's good enough for etsy.
What kind of hardware does this need?
Graphite uses an RRD-style store like ganglia, so it probably needs a decent disk. It's also a webapp, and I read somewhere that it tries to cache in memory, so we shouldn't be stingy there. If it's easier to start in a VM we can upgrade later.
Does it listen on the internet or on a private IP? Asking, because I'm not sure if I need to set up one in each datacenter.
We're going to talk to graphite over a private IP and the website can be behind VPN. Python is going to talk to statsd over UDP so it won't block, but we may talk directly to graphite without statsd sometimes. I prefer separate datacenters to reduce latency and so that we're not sharing with sumo (which will scale better).
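To illustrate the "UDP so it won't block" point: a send is fire-and-forget, so a down or slow statsd can't stall a web request. A minimal sketch (the metric name, host, and port here are made-up examples; the "name:value|type" packet format follows Etsy's statsd convention):

```python
import socket

def send_stat(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a statsd packet such as 'pamo.hits:1|c'.

    A UDP sendto() never waits for the server, so metrics can't
    block or slow down the request that emits them.
    """
    packet = "%s:%s|%s" % (name, value, kind)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    except socket.error:
        pass  # metrics are best-effort; never break the request
    finally:
        sock.close()
    return packet  # returned only so the sketch is easy to inspect
```

If statsd isn't listening, the packet is simply dropped, which is exactly the failure mode you want from a metrics path.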
So is this bug for setting up a single instance of graphite/statsd or multiple? I think we could use multiple instances:
* amo
* sumo
* input
* mozilla.com/org
Is there a recommended place to run statsd? It's probably easiest if we run one instance on each graphite box, but because of its "queue and flush" model it might be less intense to run it on each webhead.
(In reply to comment #9)
> Is there a recommended place to run statsd? It's probably easiest if we run
> one instance on each graphite box, but because of its "queue and flush" model
> it might be less intense to run it on each webhead.

Seems like the former is better - I agree the queue/flush is nice, but having to set up node on each webhead seems ridiculous. You could even set it up on another VM altogether.
<jbalogh> jsocol: statsd-per-box wouldn't work
<jbalogh> if you're timing something that happens on all the boxes they'd overwrite each other
<jbalogh> or collide in some way, I don't know what graphite does there

Another VM is possible but we're probably good on the graphite box, until we're not, anyway.
Any sense of an ETA for this? Do you want us to file separate bugs per datacenter/consumer?
Phong, can you set up a server for this and assign back to me?
Assignee: jeremy.orem+bugs → phong
Phong, or anyone else, is there any way to get an estimate of when this might be ready? I'm not asking for it to be top priority or anything, just some kind of ETA we can roughly plan around.
I have to check and see if we have a spare server for this.
I'm fine setting up amo's graphite in phx since we're moving soon, and that's where sumo is already.
Can this run side-by-side with ganglia? I'm just in the process of building out a pair of new Ganglia VMs (one for each data center) and perhaps could piggyback this app on them?
(In reply to comment #17)
> Can this run side-by-side with ganglia?

I don't know how graphite will perform once we start pushing it harder, but it's probably worth a shot. The graphite docs say it's close enough to C speed to get the job done at Orbitz.
This shouldn't have been assigned to phong. And the importance has apparently changed. Re-assigning. Pete, if you have puppet classes (or any other tips) for this, could you punt them to rtucker please?
Assignee: phong → rtucker
Severity: normal → major
Assignee: rtucker → jdow
I'll start working on this today.
btw. is this something that needs to be publicly accessible, or would it be ok to put behind ldap auth?
How about just putting it on VPN for now?
So I see graphite.mozilla.org is up, behind LDAP, but afaict it's in SJC. What's the plan for PHX?
I'm still trying to work that out. I have a host in phx that is pretty much identical to the one behind graphite.mozilla.org. I'm trying to decide whether I want to do something like graphite.mozilla.org/phx and graphite.mozilla.org/sjc (requires some apache rewrites and proxypass magic), or something more like graphite-sjc.mozilla.org and graphite-phx.mozilla.org to make things easier. I'm leaning towards the first option, as I like having a one-stop-shop for all things monitoring. Also, I'm only half done setting it up. The web UI is up and carbon is running, but I don't have statsd running or any further dashboard configuration done yet.
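For what it's worth, the path-based option could look roughly like this. This is only a sketch with placeholder internal hostnames; the real rules would also depend on how graphite-web copes with being served under a URL prefix:

```apache
<VirtualHost *:443>
    ServerName graphite.mozilla.org
    # Placeholder backends; the real internal hostnames would differ.
    ProxyPass        /sjc http://graphite1.dmz.sjc.internal/
    ProxyPassReverse /sjc http://graphite1.dmz.sjc.internal/
    ProxyPass        /phx http://graphite1.dmz.phx.internal/
    ProxyPassReverse /phx http://graphite1.dmz.phx.internal/
</VirtualHost>
```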
So, the next step is to install node.js and statsd, correct? If I understand it correctly, statsd needs to run as a daemon on this host and listen on some UDP port, and your webapps will send it stuff? Or do we need to configure a statsd client of some sort to put on the webheads to send the UDP packets? A lot of the documentation around this seems vague and more developer-friendly than sysadmin-friendly, so I'm seeing a lot of "run ./startserver" and not a lot of "/etc/init.d/server start". I don't mind writing those scripts and making this work in a scalable fashion, I'm just not sure I've fully understood the overall architecture yet.
The app talks to statsd over UDP, statsd talks to graphite over TCP. statsd is a daemon running on node. It doesn't matter where statsd lives but it seems easy to stick it next to graphite. There's one statsd for one graphite, and all the app servers talk to one statsd. statsd buffers and aggregates numbers from all the app servers and dumps that to graphite every 10 seconds.
Assignee: jdow → jeremy.orem+bugs
Looks like to finish this up we just need some firewall work. I think the easiest thing to do would be just to allow all internal hosts to access ganglia1.dmz.sjc1.mozilla.com port 8125 over UDP.
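As an illustration only, on a host firewall the requested rule would look something like the following; the actual change here is on the network firewalls, and the source range is a placeholder, not our real internal ranges:

```shell
# Allow internal hosts to reach statsd's UDP port on this box.
iptables -A INPUT -p udp -s 10.0.0.0/8 --dport 8125 -j ACCEPT
```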
Assignee: jeremy.orem+bugs → network-operations
Component: Server Operations → Server Operations: Netops
This can be assigned back to me after the firewall changes are made. Also, the same thing should be done in phx except with ganglia1.dmz.phx1.mozilla.com
Assignee: network-operations → ahill
This should be done for most vlans in sjc. Let me know if I missed any. Phx will take a little more time since it is Juniper and the access lists work differently.
Firewall changes are complete in Phx.
Assignee: ahill → jeremy.orem+bugs
We need these settings to know how to talk to statsd on preview:

STATSD_HOST = '??'
STATSD_PORT = 8125
STATSD_PREFIX = 'pamo'
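For context, the prefix just namespaces every metric name before it goes over the wire, along these lines. This is a hypothetical helper with a made-up metric name; the real client code may differ:

```python
STATSD_PORT = 8125
STATSD_PREFIX = 'pamo'

def statsd_packet(name, value, kind='ms', prefix=STATSD_PREFIX):
    # Prefixing keeps each consumer (amo, sumo, input, ...) in its own
    # graphite namespace even if they share a single statsd instance.
    return '%s.%s:%s|%s' % (prefix, name, value, kind)
```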
Added STATSD config to preview. What else needs to be done to close out this bug?
(In reply to comment #32)
> Added STATSD config to preview. What else needs to be done to close out this
> bug?

We're done, I see it working from preview.amo. Thanks everyone! I'm sure james will be filing a followup for sumo soon.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard