Closed Bug 921801 Opened 11 years ago Closed 10 years ago

Eliminate SPOF in graphite

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ericz, Assigned: ericz)

References

Details

Attachments

(2 files)

Graphite runs on a single blade which has redundant disks but is otherwise a single point of failure.  Investigate options for getting more hardware to eliminate this risk.
graphite6.private.scl3 is what I was referring to.  We don't yet have a blade in PHX1, but are working on it.
As per IRC, we'll likely duplicate SCL3 to PHX1 for now and down the road for redundancy we'll add more graphite servers.
Depends on: 919043
Corey,

Are we going to buy more blades/storblades for this?
Flags: needinfo?(cshields)
Tell me what you need and we will figure it out.   I'd suggest some SSDs.
Flags: needinfo?(cshields)
Eric,

Once we have phx1 running with the tuned FS etc, let's tell Corey what you need to make this redundant.
Flags: needinfo?(eziegenhorn)
SSDs would be a good bet; the node can't keep up with writes. No wonder asking it to do reads fails.
IO write/read times.

Sorry for jumping in. I really appreciate IT setting up this service, but poor perf makes it really frustrating & time-consuming to get any data out of this.
Attachment #8372524 - Attachment description: render.png → cpu vs disk io(too much cpu, not enough disk speed)
Attachment #8372526 - Attachment description: download.png → Extremely poor read/write latencies
(In reply to Taras Glek (:taras) from comment #7)

> Sorry for jumping in. I really appreciate IT setting up this service, but
> poor perf makes it really frustrating & time-consuming to get any data out
> of this.

Taras, we're working on scaling this...unfortunately, the easy™ solution here (SSDs) is going to cost $91k per machine ;) So we're working on a scalable solution...please bear with us while we figure out the best way to fix this.
(In reply to Shyam Mani [:fox2mike] from comment #8) 
> Taras, we're working on scaling this...unfortunately, the easy™ solution
> here (SSDs) is going to cost $91k per machine ;) So we're working on a
> scalable solution...please bear with us while we figure out the best way to
> fix this.

I see a few options:
1. Blindly throwing SSDs at the problem: 32TB of 1TB SSDs costs ~$16-20K.
2. 32TB of hard drives: ~$1.2K.
3. Keeping 1 month of data on SSDs and the rest on hard drives: ~$2K.
4. Keeping long-term data in S3 and the current days of incoming data on SSD in EC2: <$36K/year.
5. Setting up a machine with 1TB of SSDs and warehousing the rest of the data in a Ceph storage cluster (mirroring the AWS model above): ~$10K.

Options 4 and 5 would require a couple of weeks of development work, but would give us effectively infinite scalability by separating storage from analysis.

None of these options are close to 91k per machine.
Just for kicks, I looked at using the hosted AWS CloudWatch service for our needs.
We seem to be gathering 70 metrics per machine (overkill?), which works out to
70 metrics * $0.50 * 500 instances = $17,500/month.
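As a quick sanity check, the same estimate in a couple of lines of Python; the per-metric price and host count are the assumptions above, not confirmed AWS pricing:

# Rough CloudWatch cost estimate -- inputs are the assumptions above
metrics_per_host = 70
price_per_metric_per_month = 0.50  # USD, assumed custom-metric rate
hosts = 500

monthly = metrics_per_host * price_per_metric_per_month * hosts
print("~$%d/month" % monthly)  # ~$17500/month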
(In reply to Taras Glek (:taras) from comment #9)

> None of these options are close to 91k per machine.

Because these were priced with "enterprise" grade SSDs from HP.
(In reply to Shyam Mani [:fox2mike] from comment #11)
> (In reply to Taras Glek (:taras) from comment #9)
> 
> > None of these options are close to 91k per machine.
> 
> Because these were priced with "enterprise" grade SSDs from HP.

I think we disagree on the class of hardware best suited for data warehousing. That's fine.

Given that graphite is happy to have multiple carbon hosts behind a single frontend: why not keep releng data in a carbon instance in EC2? This avoids purchasing hardware, reduces the amount of data transferred between EC2 and our datacenter, etc. Besides, it sounds like we aren't saving money over AWS by purchasing hardware. Spinning up an AWS instance would give us increased processing capacity and buy us time to decide on a proper solution.
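To make that concrete, a rough sketch of how the split could be expressed in carbon's relay-rules.conf (this assumes RELAY_METHOD = rules in carbon.conf; the hostname and the releng. metric prefix are placeholders, not our actual naming):

# relay-rules.conf sketch -- hostname and the 'releng.' prefix are placeholders
[releng]
pattern = ^releng\.
destinations = releng-carbon.ec2.example.com:2004

[default]
default = true
destinations = 127.0.0.1:2004

The webapp would also need that EC2 host listed in CLUSTER_SERVERS so the single frontend can still render those metrics.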
Someone I know who runs large ganglia boxes says that the I/O rates burn out SSDs in months. Doesn't sound like a good plan on enterprise or other grade SSDs.
I'm not a graphite expert, nor do I have any idea how much data needs to be retained for viewing stats, but it might be cheaper (and better performance-wise) to simply load the recently accessed data into RAM (cache it) and keep everything else on spinning disks, perhaps with an SSD buffer for the middle tier (think of L2ARC in ZFS).

      
     / \
    /   \
   / ram \
  / SSDs  \
 /  disks  \
/___________\
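For the RAM layer specifically, carbon-cache already buffers datapoints in memory and serves queries for recent data from that cache; a sketch of the relevant carbon.conf knobs (values are illustrative, not tuned for our load):

# carbon.conf [cache] sketch -- values illustrative only
[cache]
MAX_CACHE_SIZE = 10000000        # datapoints held in RAM before writes must flush
MAX_UPDATES_PER_SECOND = 1000    # throttle writes so the spindles aren't thrashed
MAX_CREATES_PER_MINUTE = 50      # limit bursts of new whisper file creation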
(In reply to Peter Radcliffe [:pir] from comment #13)
> Someone I know who runs large ganglia boxes says that the I/O rates burn out
> SSDs in months. Doesn't sound like a good plan on enterprise or other grade
> SSDs.

Spoke to a few people about graphite deployments. Everyone swears by SSDs. A well-known Portland startup collects 600,000 metrics at 30s resolution with 1 year of retention. They use a single big server with SSDs. They said it's been working fine since they switched to SSDs; they had nothing but IO trouble with their prior hard-drive deployment.
No SPOF elimination there, because metrics are also not core to their business.
Albert, Shyam, Eric, Corey  -where are we with this test of the hosted graphite and improving performance on the existing services?
Flags: needinfo?(cshields)
Flags: needinfo?(avillarde)
(In reply to SylvieV from comment #16)
> Albert, Shyam, Eric, Corey  -where are we with this test of the hosted
> graphite and improving performance on the existing services?

Sylvie,

I'll own this moving forward (too many things on Corey's plate). I'll sync up with Eric/Amy/Taras and give you an update.
Flags: needinfo?(eziegenhorn)
Flags: needinfo?(cshields)
Flags: needinfo?(avillarde)
(In reply to Shyam Mani [:fox2mike] from comment #17)
> (In reply to SylvieV from comment #16)
> > Albert, Shyam, Eric, Corey  -where are we with this test of the hosted
> > graphite and improving performance on the existing services?
> 
> Sylvie,
> 
> I'll own this moving forward (too many things on Corey's plate). I'll sync
> up with Eric/Amy/Taras and give you an update.

Thanks - please post the plan of action and timelines in this bug.
* Current issues

1) Too slow for the giant releng queries.
2) SSDs with the current hardware are expensive.

SSDs are the easier option, but at the capacity we need they're not quite
cost-effective. We can try cheaper disks, but we're not sure how reliable
those would be.

* Steps taken so far

Graphite stores data in fixed-size "Whisper" files. We were storing per-minute
data for 1.5 years; that is now becoming per-minute data for 40 days plus
5-minute samples for 1.5 years.
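For reference, the new retention roughly corresponds to a storage-schemas.conf entry like the one below (the section name and pattern are placeholders, and 545d is ~1.5 years):

# storage-schemas.conf sketch -- section name and pattern are placeholders
[default_1min_for_40days]
pattern = .*
retentions = 60s:40d,5m:545d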

1) We have already started reducing retention
2) We have about 704,000 “whisper” files
3) We are reducing the size of these files from 9MB to 2MB
4) This will reduce disk usage by almost 5TB
5) We’re about 12% done; this has been running for about 2 days
6) Expected time to finish = about 16 days

Eric is already looking into whether we can parallelize this to make it go
faster.
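Not the actual script, but a rough sketch of how the resize pass could be parallelized with a worker pool (this assumes whisper-resize.py from the whisper package is on PATH and that the data lives under /opt/graphite/storage/whisper; both are assumptions):

#!/usr/bin/env python
# Rough sketch only: resize all whisper files in parallel with a worker pool.
import os
import subprocess
from multiprocessing import Pool

WHISPER_ROOT = "/opt/graphite/storage/whisper"   # assumed data root
NEW_RETENTIONS = ["60s:40d", "5m:545d"]

def find_whisper_files(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".wsp"):
                yield os.path.join(dirpath, name)

def resize(path):
    # --nobackup avoids doubling disk usage while converting
    cmd = ["whisper-resize.py", path] + NEW_RETENTIONS + ["--nobackup"]
    return path, subprocess.call(cmd)

if __name__ == "__main__":
    pool = Pool(processes=8)  # tune to whatever the disks can actually take
    for path, rc in pool.imap_unordered(resize, find_whisper_files(WHISPER_ROOT)):
        if rc != 0:
            print("failed: %s" % path)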

* Sending data to a hosted solution

collectd is the daemon that runs on each client and decides where to send
the data.

1) This could be a client-side change.
2) relops can make this change whenever they want and send metrics to
hostedgraphite.com; it will only need firewall flows. This doesn't require
any server-side changes from us (see the collectd sketch below).
3) Alternatively, we make a change to carbon-relay on the server side to send
the data to both our store and the hosted service.
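A rough sketch of what the client-side option could look like in collectd.conf, using the write_graphite plugin with two targets (the hostnames and the API-key prefix are placeholders):

# collectd.conf fragment (sketch) -- hostnames and API key are placeholders
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "internal">
    Host "graphite-relay.example.mozilla.com"
    Port "2003"
    Protocol "tcp"
  </Node>
  <Node "hosted">
    Host "carbon.hostedgraphite.com"
    Port "2003"
    Protocol "tcp"
    # Hosted Graphite expects the API key as a metric prefix
    Prefix "HG-API-KEY."
  </Node>
</Plugin>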


* Possible future steps

This depends on how much we're willing to spend, etc., and isn't fully
fleshed out yet.

1) Come up with a storage solution that has better I/O?
2) Replace whisper? (https://github.com/graphite-project/ceres)
3) Federated storage (add more nodes, split the data) - rough sketch below
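For option 3, a minimal sketch of what federation could look like on the webapp side (hostnames are placeholders; each backend would run its own carbon-cache, with carbon-relay hashing metrics across them):

# graphite-web local_settings.py sketch -- hostnames are placeholders
CLUSTER_SERVERS = [
    "graphite-node1.example.mozilla.com:80",
    "graphite-node2.example.mozilla.com:80",
]

# and on the ingest side, carbon.conf [relay] would spread the writes, e.g.:
#   RELAY_METHOD = consistent-hashing
#   DESTINATIONS = graphite-node1:2004:a, graphite-node2:2004:a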
Had a chat with Amy, Taras and Jake on IRC.

Jake is going to see how much puppet work it is to send data to hostedgraphite.com (tracked in Bug 975227)

I will check on how much our AWS bills might be affected by doing the above. 

Once we have these two, we might start a 30 day trial of hostedgraphite and see where that goes.
(In reply to Shyam Mani [:fox2mike] from comment #20)

> I will check on how much our AWS bills might be affected by doing the above. 

This has been scrapped. Once Jake has an estimate, we'll be good to move on this.
Shyam, just for due diligence, have we looked at our internal NAS or other options with available capacity?

This bug is now covering a few issues not just SPOF.  Can you and Eric prioritize which is most concerning and create blocker bugs?  By the sound of it, disk capacity and/or I/O seems to be most pressing. 

Is it possible to trim the hosts running collectd to those that we know must run it (ie our critical systems)?  Or can we spin up another graphite for less critical hosts on lesser HW?
(In reply to Albert from comment #22)
> Shyam, just for due diligence, have we looked at our internal NAS or other
> options with available capacity?

Yes, we'll do that. Spinning off some stuff to hostedgraphite gives us the time to evaluate options and build out the best possible solution while not hindering data collection.

> This bug is now covering a few issues not just SPOF.  Can you and Eric
> prioritize which is most concerning and create blocker bugs?  By the sound
> of it, disk capacity and/or I/O seems to be most pressing. 

Sure. 

> Is it possible to trim the hosts running collectd to those that we know must
> run it (ie our critical systems)?  Or can we spin up another graphite for
> less critical hosts on lesser HW?

We were looking into scaling graphite by having more hosts in the mix.
Looks like someone got a similar graphite instance running on hard drives: https://answers.launchpad.net/graphite/+question/178969
Shyam, what is the status of the on-prem graphite performance improvements as well as the hosted testing?
Albert, Shyam - this bug has not made progress. Taras deployed a new hostedgraphite setup without collectd: https://bugzilla.mozilla.org/show_bug.cgi?id=968381

What have we learned from this, and how do we apply it to our Graphite needs in IT?
Flags: needinfo?(shyam)
Now that Q1 is done, I'm hoping to spend more time with Eric, Albert and David and review where we are and what improvements can be made. I'll report back here by next week.
Flags: needinfo?(shyam)
Severity: normal → enhancement
We have redundant storage but not redundant hosts, due to the somewhat hefty hardware required for our volume of metrics. There doesn't seem to be much interest in purchasing another host in scl3 and phx1 for host redundancy, so I'm going to close this for now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → mozilla.org Graveyard