Closed Bug 921801 Opened 11 years ago Closed 10 years ago

Eliminate SPOF in graphite

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ericz, Assigned: ericz)

References

Details

Attachments

(2 files)

Graphite runs on a single blade which has redundant disks but is otherwise a single point of failure.  Investigate options for getting more hardware to eliminate this risk.
graphite6.private.scl3 is what I was referring to.  We don't yet have a blade in PHX1, but are working on it.
As per IRC, we'll likely duplicate SCL3 to PHX1 for now and down the road for redundancy we'll add more graphite servers.
Depends on: 919043
Corey,

Are we going to buy more blades/storblades for this?
Flags: needinfo?(cshields)
Tell me what you need and we will figure it out.   I'd suggest some SSDs.
Flags: needinfo?(cshields)
Eric,

Once we have phx1 running with the tuned FS etc, let's tell Corey what you need to make this redundant.
Flags: needinfo?(eziegenhorn)
SSDs would be a good bet; the node can't keep up with writes. No wonder asking it to do reads fails.
IO write/read times.

Sorry for jumping in. I really appreciate IT setting up this service, but poor perf makes it really frustrating & time-consuming to get any data out of this.
Attachment #8372524 - Attachment description: render.png → cpu vs disk io(too much cpu, not enough disk speed)
Attachment #8372526 - Attachment description: download.png → Extremely poor read/write latencies
(In reply to Taras Glek (:taras) from comment #7)

> Sorry for jumping in. I really appreciate IT setting up this service, but
> poor perf makes it really frustrating & time-consuming to get any data out
> of this.

Taras, we're working on scaling this...unfortunately, the easy™ solution here (SSDs) is going to cost $91k per machine ;) So we're working on a scalable solution...please bear with us while we figure out the best way to fix this.
(In reply to Shyam Mani [:fox2mike] from comment #8) 
> Taras, we're working on scaling this...unfortunately, the easy™ solution
> here (SSDs) is going to cost $91k per machine ;) So we're working on a
> scalable solution...please bear with us while we figure out the best way to
> fix this.

I see a few options:
1. Blindly throwing SSDs at the problem: 32TB of 1TB SSDs costs ~$16-20K.
2. 32TB of hard drives: ~$1.2K.
3. Keeping 1 month of data on SSDs and the rest on hard drives: ~$2K.
4. Keeping long-term data in S3 and the current days of incoming data on SSD in EC2: <$36K/year.
5. Setting up a machine with 1TB of SSDs and warehousing the rest of the data in a Ceph storage cluster (mirroring the AWS model above): ~$10K.

Options 4 and 5 would require a couple of weeks of development work, but would give us effectively infinite scalability by separating storage from analysis.

None of these options are close to 91k per machine.
Just for kicks, I looked at using the hosted AWS CloudWatch service for our needs.
We seem to be gathering 70 metrics per machine (overkill?), which works out to
70 metrics * $0.50 * 500 instances = $17,500/month.
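As a quick sanity check, the same estimate in a couple of lines of Python; the per-metric price and host count are the assumptions above, not confirmed AWS pricing:

# Rough CloudWatch cost estimate -- inputs are the assumptions above
metrics_per_host = 70
price_per_metric_per_month = 0.50  # USD, assumed custom-metric rate
hosts = 500

monthly = metrics_per_host * price_per_metric_per_month * hosts
print("~$%d/month" % monthly)  # ~$17500/month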
(In reply to Taras Glek (:taras) from comment #9)

> None of these options are close to 91k per machine.

Because these were priced with "enterprise" grade SSDs from HP.
(In reply to Shyam Mani [:fox2mike] from comment #11)
> (In reply to Taras Glek (:taras) from comment #9)
> 
> > None of these options are close to 91k per machine.
> 
> Because these were priced with "enterprise" grade SSDs from HP.

I think we disagree on the class of hardware best suited for data warehousing. That's fine.

Given that graphite is happy to have multiple carbon hosts behind a single frontend: why not keep releng data in a carbon instance in EC2? This avoids purchasing hardware, reduces the amount of data transferred between EC2 and our datacenter, etc. Besides, it sounds like we aren't saving money over AWS by purchasing hardware. Spinning up an AWS instance would give us increased processing capacity and buy us time to decide on a proper solution.
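To make that concrete, a rough sketch of how the split could be expressed in carbon's relay-rules.conf (this assumes RELAY_METHOD = rules in carbon.conf; the hostname and the releng. metric prefix are placeholders, not our actual naming):

# relay-rules.conf sketch -- hostname and the 'releng.' prefix are placeholders
[releng]
pattern = ^releng\.
destinations = releng-carbon.ec2.example.com:2004

[default]
default = true
destinations = 127.0.0.1:2004

The webapp would also need that EC2 host listed in CLUSTER_SERVERS so the single frontend can still render those metrics.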
Someone I know who runs large ganglia boxes says that the I/O rates burn out SSDs in months. Doesn't sound like a good plan on enterprise or other grade SSDs.
I'm not a graphite expert, nor do I have any idea how much data needs to be retained for viewing stats, but it might be cheaper (and better performance-wise) to simply load the recently accessed data into RAM (cache it) and keep everything else on spinning disks, perhaps with an SSD buffer for the middle tier (think of L2ARC in ZFS).

      
     / \
    /   \
   / ram \
  / SSDs  \
 /  disks  \
/___________\
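For the RAM layer specifically, carbon-cache already buffers datapoints in memory and serves queries for recent data from that cache; a sketch of the relevant carbon.conf knobs (values are illustrative, not tuned for our load):

# carbon.conf [cache] sketch -- values illustrative only
[cache]
MAX_CACHE_SIZE = 10000000        # datapoints held in RAM before writes must flush
MAX_UPDATES_PER_SECOND = 1000    # throttle writes so the spindles aren't thrashed
MAX_CREATES_PER_MINUTE = 50      # limit bursts of new whisper file creation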
(In reply to Peter Radcliffe [:pir] from comment #13)
> Someone I know who runs large ganglia boxes says that the I/O rates burn out
> SSDs in months. Doesn't sound like a good plan on enterprise or other grade
> SSDs.

Spoke to a few people about graphite deployments. Everyone swears by SSDs. A well-known Portland startup collects 600,000 metrics at 30s resolution with 1 year of retention. They use a single big server with SSDs. They said it's been working fine since they switched to SSDs; they had nothing but IO trouble with their prior hard-drive deployment.
No SPOF elimination there, because metrics are also not core to their business.
Albert, Shyam, Eric, Corey  -where are we with this test of the hosted graphite and improving performance on the existing services?
Flags: needinfo?(cshields)
Flags: needinfo?(avillarde)
(In reply to SylvieV from comment #16)
> Albert, Shyam, Eric, Corey  -where are we with this test of the hosted
> graphite and improving performance on the existing services?

Sylvie,

I'll own this moving forward (too many things on Corey's plate). I'll sync up with Eric/Amy/Taras and give you an update.
Flags: needinfo?(eziegenhorn)
Flags: needinfo?(cshields)
Flags: needinfo?(avillarde)
(In reply to Shyam Mani [:fox2mike] from comment #17)
> (In reply to SylvieV from comment #16)
> > Albert, Shyam, Eric, Corey  -where are we with this test of the hosted
> > graphite and improving performance on the existing services?
> 
> Sylvie,
> 
> I'll own this moving forward (too many things on Corey's plate). I'll sync
> up with Eric/Amy/Taras and give you an update.

Thanks - please post the plan of action and timelines in this bug.
* Current issues

1) Too slow for the giant releng queries.
2) SSDs with the current hardware are expensive.

SSDs are the easier option, but at the capacity we need they're not quite
cost-effective. We can try cheaper disks, but we're not sure how reliable
those would be.

* Steps taken so far

Graphite stores data in fixed-size "Whisper" files. We were storing per-minute
data for 1.5 years; that is now becoming per-minute data for 40 days plus
5-minute samples for 1.5 years.
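For reference, the new retention roughly corresponds to a storage-schemas.conf entry like the one below (the section name and pattern are placeholders, and 545d is ~1.5 years):

# storage-schemas.conf sketch -- section name and pattern are placeholders
[default_1min_for_40days]
pattern = .*
retentions = 60s:40d,5m:545d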

1) We have already started reducing retention
2) We have about 704,000 “whisper” files
3) We are reducing the size of these files from 9MB to 2MB
4) This will reduce disk usage by almost 5TB
5) We’re about 12% done; this has been running for about 2 days
6) Expected time to finish = about 16 days

Eric is already looking into whether we can parallelize this to make it go
faster.
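Not the actual script, but a rough sketch of how the resize pass could be parallelized with a worker pool (this assumes whisper-resize.py from the whisper package is on PATH and that the data lives under /opt/graphite/storage/whisper; both are assumptions):

#!/usr/bin/env python
# Rough sketch only: resize all whisper files in parallel with a worker pool.
import os
import subprocess
from multiprocessing import Pool

WHISPER_ROOT = "/opt/graphite/storage/whisper"   # assumed data root
NEW_RETENTIONS = ["60s:40d", "5m:545d"]

def find_whisper_files(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".wsp"):
                yield os.path.join(dirpath, name)

def resize(path):
    # --nobackup avoids doubling disk usage while converting
    cmd = ["whisper-resize.py", path] + NEW_RETENTIONS + ["--nobackup"]
    return path, subprocess.call(cmd)

if __name__ == "__main__":
    pool = Pool(processes=8)  # tune to whatever the disks can actually take
    for path, rc in pool.imap_unordered(resize, find_whisper_files(WHISPER_ROOT)):
        if rc != 0:
            print("failed: %s" % path)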

* Sending data to a hosted solution

collectd is the daemon that runs on each client and decides where to send
the data.

1) This could be a client-side change.
2) relops can make this change whenever they want and send metrics to
hostedgraphite.com; it will only need firewall flows. This doesn't require
any server-side changes from us (see the collectd sketch below).
3) Alternatively, we make a change to carbon-relay on the server side to send
the data to both our store and the hosted service.
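A rough sketch of what the client-side option could look like in collectd.conf, using the write_graphite plugin with two targets (the hostnames and the API-key prefix are placeholders):

# collectd.conf fragment (sketch) -- hostnames and API key are placeholders
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "internal">
    Host "graphite-relay.example.mozilla.com"
    Port "2003"
    Protocol "tcp"
  </Node>
  <Node "hosted">
    Host "carbon.hostedgraphite.com"
    Port "2003"
    Protocol "tcp"
    # Hosted Graphite expects the API key as a metric prefix
    Prefix "HG-API-KEY."
  </Node>
</Plugin>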


* Possible future steps

This depends on how much we're willing to spend, etc., and isn't fully
fleshed out yet.

1) Come up with a storage solution that has better I/O?
2) Replace whisper? (https://github.com/graphite-project/ceres)
3) Federated storage (add more nodes, split the data) - rough sketch below
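For option 3, a minimal sketch of what federation could look like on the webapp side (hostnames are placeholders; each backend would run its own carbon-cache, with carbon-relay hashing metrics across them):

# graphite-web local_settings.py sketch -- hostnames are placeholders
CLUSTER_SERVERS = [
    "graphite-node1.example.mozilla.com:80",
    "graphite-node2.example.mozilla.com:80",
]

# and on the ingest side, carbon.conf [relay] would spread the writes, e.g.:
#   RELAY_METHOD = consistent-hashing
#   DESTINATIONS = graphite-node1:2004:a, graphite-node2:2004:a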
Had a chat with Amy, Taras and Jake on IRC.

Jake is going to see how much puppet work it is to send data to hostedgraphite.com (tracked in Bug 975227)

I will check on how much our AWS bills might be affected by doing the above. 

Once we have these two, we might start a 30 day trial of hostedgraphite and see where that goes.
(In reply to Shyam Mani [:fox2mike] from comment #20)

> I will check on how much our AWS bills might be affected by doing the above. 

This has been scrapped. Once Jake has an estimate, we'll be good to move on this.
Shyam, just for due diligence, have we looked at our internal NAS or other options with available capacity?

This bug is now covering a few issues not just SPOF.  Can you and Eric prioritize which is most concerning and create blocker bugs?  By the sound of it, disk capacity and/or I/O seems to be most pressing. 

Is it possible to trim the hosts running collectd to those that we know must run it (ie our critical systems)?  Or can we spin up another graphite for less critical hosts on lesser HW?
(In reply to Albert from comment #22)
> Shyam, just for due diligence, have we looked at our internal NAS or other
> options with available capacity?

Yes, we'll do that. Spinning off some stuff to hostedgraphite gives us the time to evaluate options and build out the best possible solution while not hindering data collection.

> This bug is now covering a few issues not just SPOF.  Can you and Eric
> prioritize which is most concerning and create blocker bugs?  By the sound
> of it, disk capacity and/or I/O seems to be most pressing. 

Sure. 

> Is it possible to trim the hosts running collectd to those that we know must
> run it (ie our critical systems)?  Or can we spin up another graphite for
> less critical hosts on lesser HW?

We were looking into scaling graphite by having more hosts in the mix.
Looks like someone got a similar graphite instance running on hard drives: https://answers.launchpad.net/graphite/+question/178969
Shyam, what is the status of the on-prem graphite performance improvements as well as the hosted testing?
Albert, Shyam - this bug has not made progress. Taras deployed a new hostedgraphite setup without collectd: https://bugzilla.mozilla.org/show_bug.cgi?id=968381

What have we learned from this, and how do we apply it to our Graphite needs in IT?
Flags: needinfo?(shyam)
Now that Q1 is done, I'm hoping to spend more time with Eric, Albert and David and review where we are and what improvements can be made. I'll report back here by next week.
Flags: needinfo?(shyam)
Severity: normal → enhancement
We have redundant storage but not redundant hosts, due to the somewhat hefty hardware required for our volume of metrics. There doesn't seem to be much interest in purchasing another host in scl3 and phx1 for host redundancy, so I'm going to close this for now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → mozilla.org Graveyard