Closed
Bug 648913
Opened 14 years ago
Closed 14 years ago
Set up graphite
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jbalogh, Assigned: oremj)
References
Details
Attachments
(1 file)
1.39 KB, text/plain
First things first: graphite is a pain in the ass to set up.
Now that we have the moaning out of the way, let's install graphite! We're going to use it for application-level metrics like etsy[1]. It's not competing with ganglia or nagios, it's a new thing to make our systems better.
I've attached my notes from setting it up on khan, but they're not perfect. You guys know where to find me on irc so we can figure it out together. Getting this puppetized would be great.
[1] http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
Comment 1•14 years ago
I just emailed Corey about this Friday afternoon. We'd also like to start using Graphite+Statsd to measure and track pieces of SUMO.
I had a lot more luck with the Python Statsd implementation[1] than the Node.js implementation[2] from Etsy, but Jeff's experience may differ, and we have easy connections to the authors of both. (Statsd is a simple server/network daemon that lives in front of graphite--possibly on each web head--and acts as an intermediary between the app and graphite.)
[1] https://github.com/sivy/py-statsd
[2] https://github.com/etsy/statsd
Comment 2•14 years ago
(In reply to comment #1)
> I had a lot more luck with the Python Statsd implementation[1] than the Node.js
> implementation[2] from Etsy,
Further experiments locally have revealed issues with the Python implementation. The node implementation has its issues, too, but I suspect those are easier to fix. Spending some time on this today.
| Reporter | ||
Comment 3•14 years ago
I haven't had any problems with the js version, and I'm suspicious of the quality of the python version (the example client from that author was not good). The python version may have perf issues from the global interpreter lock. I'd rather start with statsd.js since it's good enough for etsy.
| Assignee | ||
Comment 4•14 years ago
What kind of hardware does this need?
| Reporter | ||
Comment 5•14 years ago
Graphite stores its data in an rrd-style database, like ganglia, so it probably needs a decent disk. It's also a webapp, and I read somewhere that it tries to cache in memory, so we shouldn't be stingy there.
If it's easier to start in a VM we can upgrade later.
| Assignee | ||
Comment 6•14 years ago
Does it listen on the internet or on a private IP? Asking, because I'm not sure if I need to set up one in each datacenter.
| Reporter | ||
Comment 7•14 years ago
We're going to talk to graphite over a private IP and the website can be behind vpn. Python is going to talk to statsd over UDP so it won't block, but we may talk directly to graphite without statsd sometimes.
I prefer separate datacenters to reduce latency and so that we're not sharing with sumo (which will scale better).
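A minimal sketch of the two paths described here, assuming the usual statsd datagram format (`name:value|type`) and carbon's plaintext line format (`path value timestamp`). The metric names are made up for illustration, and the function names are not from any particular client library.

```python
import socket
import time


def statsd_packet(name, value, kind):
    """Format one statsd datagram; kind is 'c' for a counter, 'ms' for a timer."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_statsd(packet, host, port=8125):
    """Fire-and-forget UDP send: no connection is made and no reply is awaited."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    except OSError:
        pass  # metrics are best-effort; never fail a request over them
    finally:
        sock.close()


def carbon_line(path, value, timestamp=None):
    """Format one line for talking to graphite directly, skipping statsd."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n".encode("ascii")


# Hypothetical metric name; 127.0.0.1 stands in for the real statsd host.
send_statsd(statsd_packet("amo.response.200", 1, "c"), "127.0.0.1")
```

The UDP path never waits on the metrics server, which is why it is safe to call from request-handling code. The direct path would write `carbon_line(...)` to a TCP connection to carbon, so it is better suited to cron jobs and other places where blocking is acceptable.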
Comment 8•14 years ago
So is this bug for setting up a single instance of graphite/statsd or multiple?
I think we could use multiple instances:
* amo
* sumo
* input
* mozilla.com/org
Comment 9•14 years ago
Is there a recommended place to run statsd? It's probably easiest if we run one instance on each graphite box, but because of its "queue and flush" model it might be less intense to run it on each webhead.
Comment 10•14 years ago
(In reply to comment #9)
> Is there a recommended place to run statsd? It's probably easiest if we run one
> instance on each graphite box, but because of its "queue and flush" model it
> might be less intense to run it on each webhead.
Seems like the former is better - I agree the queue/flush is nice, but having to set up node on each webhead seems ridiculous. You could even set it up on another VM altogether.
Comment 11•14 years ago
<jbalogh> jsocol: statsd-per-box wouldn't work
<jbalogh> if you're timing something that happens on all the boxes they'd overwrite each other
<jbalogh> or collide in some way, I don't know what graphite does there
Another VM is possible but we're probably good on the graphite box, until we're not, anyway.
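To make the collision concrete, here is a toy model. It assumes graphite keeps a single value per metric per time slot and that a later write to the same slot replaces the earlier one; that is an assumption about the storage behaviour, not something confirmed in this bug. The metric name, timestamps, and numbers are invented.

```python
def flush(store, metric, slot, value):
    """Toy stand-in for a graphite write: one value per (metric, time slot)."""
    store[(metric, slot)] = value  # a later write to the same slot wins


# Two hypothetical per-box statsd daemons flush the same timer for the same slot.
per_box = {}
flush(per_box, "amo.view.mean", 1300000000, 120.0)  # from web1
flush(per_box, "amo.view.mean", 1300000000, 480.0)  # from web2
# Only web2's number is left; web1's is replaced, not averaged in.

# With one shared statsd, samples from both boxes are combined before the flush.
samples = [120.0, 480.0]
shared = {}
flush(shared, "amo.view.mean", 1300000000, sum(samples) / len(samples))
```

Under that assumption the per-box setup silently reports one box's numbers, while the shared daemon reports a figure that covers every box.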
Comment 12•14 years ago
Any sense of an ETA for this? Do you want us to file separate bugs per datacenter/consumer?
| Assignee | ||
Comment 13•14 years ago
Phong, can you set up a server for this and assign back to me?
Assignee: jeremy.orem+bugs → phong
Comment 14•14 years ago
Phong, or anyone else, is there any way to get an estimate of when this might be ready? I'm not asking for it to be top priority or anything, just some kind of ETA we can roughly plan around.
Comment 15•14 years ago
I have to check and see if we have a spare server for this.
| Reporter | ||
Comment 16•14 years ago
I'm fine setting up amo's graphite in phx since we're moving soon, and that's where sumo is already.
Comment 17•14 years ago
Can this run side-by-side with ganglia? I'm just in the process of building out a pair of new Ganglia VMs (one for each data center) and perhaps could piggyback this app on them?
| Reporter | ||
Comment 18•14 years ago
(In reply to comment #17)
> Can this run side-by-side with ganglia?
I don't know how graphite will perform when we start pushing it harder, but it's probably worth a shot. The graphite docs say it's close enough to C speed to get the job done at Orbitz.
Comment 19•14 years ago
This shouldn't have been assigned to phong, and the importance has apparently changed. Re-assigning.
Pete, if you have puppet classes (or any other tips) for this could you punt them to rtucker please?
Assignee: phong → rtucker
Severity: normal → major
Updated•14 years ago
Assignee: rtucker → jdow
Comment 20•14 years ago
I'll start working on this today.
Comment 21•14 years ago
btw. is this something that needs to be publicly accessible, or would it be ok to put behind ldap auth?
Comment 22•14 years ago
How about just putting it on VPN for now?
Comment 23•14 years ago
So I see graphite.mozilla.org is up, behind LDAP, but afaict it's in SJC. What's the plan for PHX?
Comment 24•14 years ago
I'm still trying to work that out. I have a host in phx that is pretty much identical to the one behind graphite.mozilla.org.
I'm trying to decide whether I want to do something like graphite.mozilla.org/phx and graphite.mozilla.org/sjc (requires some apache rewrites and proxypass magic), or something more like graphite-sjc.mozilla.org and graphite-phx.mozilla.org to make things easier. I'm leaning towards the first option, as I like having a one-stop shop for all things monitoring.
Also, I'm only half done setting it up. The web ui is up and carbon is running, but I don't have statsd running or any further dashboard configuration done yet.
Comment 25•14 years ago
So, the next step is to install node.js and statsd, correct?
If I understand it correctly, statsd needs to run as a daemon on this host and listen on some UDP port, and your webapps will send it stuff? Or do we need to set up a statsd client of some sort on the webheads to send the UDP packets?
A lot of the documentation around this seems vague and more developer-friendly than sysadmin-friendly, so I'm seeing a lot of "run ./startserver" and not a lot of "/etc/init.d/server start". I don't mind writing those scripts and making this work in a scalable fashion, I'm just not sure I've fully understood the overall architecture yet.
| Reporter | ||
Comment 26•14 years ago
The app talks to statsd over UDP, statsd talks to graphite over TCP. statsd is a daemon running on node. It doesn't matter where statsd lives but it seems easy to stick it next to graphite. There's one statsd for one graphite, and all the app servers talk to one statsd.
statsd buffers and aggregates numbers from all the app servers and dumps that to graphite every 10 seconds.
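The buffer-and-flush behaviour can be sketched roughly as below. This is a toy model in Python, not statsd's actual code: it handles counters only, ignores sample rates, and writes plain totals, where the real daemon's output naming and per-second rates differ. The class and metric names are hypothetical.

```python
import time
from collections import defaultdict


class TinyAggregator:
    """Toy model of a statsd-style buffer that is flushed on a timer."""

    def __init__(self):
        self.counters = defaultdict(int)

    def handle(self, datagram):
        """Accept one 'name:value|c' datagram from any app server."""
        name, rest = datagram.decode("ascii").split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":  # anything that is not a plain counter is ignored here
            self.counters[name] += int(value)

    def flush(self, timestamp=None):
        """Return one 'path value timestamp' line per counter, then reset."""
        if timestamp is None:
            timestamp = int(time.time())
        lines = [
            f"{name} {total} {timestamp}"
            for name, total in sorted(self.counters.items())
        ]
        self.counters.clear()
        return lines


agg = TinyAggregator()
agg.handle(b"amo.hits:1|c")  # from one app server
agg.handle(b"amo.hits:2|c")  # from another
lines = agg.flush(timestamp=1300000000)  # the real daemon does this every 10 seconds
```

However many datagrams arrive during the interval, graphite receives one line per metric per flush, which keeps the write load on the graphite box predictable.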
Updated•14 years ago
Assignee: jdow → jeremy.orem+bugs
| Assignee | ||
Comment 27•14 years ago
Looks like to finish this up we just need some firewall work.
I think the easiest thing to do would be just to allow all internal hosts to access ganglia1.dmz.sjc1.mozilla.com port 8125 over udp.
Assignee: jeremy.orem+bugs → network-operations
Component: Server Operations → Server Operations: Netops
| Assignee | ||
Comment 28•14 years ago
This can be assigned back to me after the firewall changes are made.
Also, the same thing should be done in phx except with ganglia1.dmz.phx1.mozilla.com
Comment 29•14 years ago
This should be done for most vlans in sjc. Let me know if I missed any. Phx will take a little more time since it is Juniper and the access lists work differently.
| Reporter | ||
Comment 31•14 years ago
We need these settings to know how to talk to statsd on preview:
STATSD_HOST = '??'
STATSD_PORT = 8125
STATSD_PREFIX = 'pamo'
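For context, a client might consume these three settings roughly as follows. This is a sketch, not the actual client code: the host below is a loopback placeholder because the real value is still unknown here, and `prefixed` and `incr` are hypothetical helper names.

```python
import socket

STATSD_HOST = "127.0.0.1"  # placeholder only; the real host is not given in this bug
STATSD_PORT = 8125
STATSD_PREFIX = "pamo"


def prefixed(stat, prefix=STATSD_PREFIX):
    """Namespace a stat so one site's numbers stay apart from another's."""
    return f"{prefix}.{stat}" if prefix else stat


def incr(stat, count=1):
    """Send one counter increment over UDP and return the bytes that were sent."""
    payload = f"{prefixed(stat)}:{count}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
    return payload
```

With a prefix like this, several sites can share one statsd and one graphite while their metrics land under separate top-level names.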
| Assignee | ||
Comment 32•14 years ago
Added STATSD config to preview. What else needs to be done to close out this bug?
| Reporter | ||
Comment 33•14 years ago
(In reply to comment #32)
> Added STATSD config to preview. What else needs to be done to close out this
> bug?
We're done, I see it working from preview.amo.
Thanks everyone! I'm sure james will be filing a followup for sumo soon.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Infrastructure & Operations
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard