Closed Bug 971883 Opened 10 years ago Closed 10 years ago

Reduce number of datapoints sent to graphite

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: taras.mozilla, Assigned: dividehex)

Details

Attachments

(3 files, 1 obsolete file)

Some of this may be due to us using collectd instead of Diamond (which seems to be the shiny new thing).

for EC2:
* Our CPU metrics are per core. This is not helpful: it costs a lot in disk space and query time and adds no new information. It also adds a lot of pointless metrics on multisocket machines. Diamond offers per-core and per-socket tweaks, e.g. https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector
* I have no use for the vmem metric. There might be something useful in there, but I don't see what.
* It might make sense to report EC2 machines of a certain class as one machine (e.g. average them out at the statsd level). The Graphite architecture does not handle hosts coming and going at all, especially where whisper files are concerned.
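
To put a rough number on the disk-space point, here is a back-of-the-envelope sketch. The retention schema, the core count, and the figure of 8 CPU states per core are assumptions for illustration, not values taken from this deployment. The byte sizes are those of the whisper file layout (16-byte header, 12 bytes per archive descriptor, 12 bytes per stored point).

```python
# Whisper on-disk layout: fixed header, one descriptor per archive,
# and a fixed-size slot for every point in every archive.
HEADER_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12


def whisper_file_bytes(archives):
    """Size of one whisper file.

    archives: list of (seconds_per_point, retention_seconds) tuples.
    """
    points = sum(retention // step for step, retention in archives)
    return HEADER_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


def cpu_metric_count(cpu_instances, states_per_instance):
    """Number of whisper files created for CPU metrics on one host."""
    return cpu_instances * states_per_instance


# ASSUMPTION: hypothetical retention of 10s for 1 day, then 60s for 30 days.
ARCHIVES = [(10, 86400), (60, 30 * 86400)]
# ASSUMPTION: 8 CPU states (user, system, idle, ...) reported per core.
STATES = 8
# ASSUMPTION: a 32-core host.
CORES = 32

per_file = whisper_file_bytes(ARCHIVES)
per_core_total = cpu_metric_count(CORES, STATES) * per_file
aggregated_total = cpu_metric_count(1, STATES) * per_file

print(f"one whisper file:      {per_file} bytes")
print(f"per-core CPU metrics:  {per_core_total} bytes per host")
print(f"aggregated CPU metrics: {aggregated_total} bytes per host")
```

Because whisper preallocates every file at its full size, the cost scales linearly with the number of metric names, so every EC2 host that comes and goes leaves its full set of files behind.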
Assignee: server-ops → relops
Component: Server Operations → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → arich
Jake, taras has requested that reducing the number of metrics collected, especially in ec2, take precedence over the bare metal stuff since it's greatly impacting their ability to get a good handle on metrics.  Can you please look at adjusting what's collected so that it's less and more meaningful?  We don't necessarily have to rip out collectd to accomplish this.
Assignee: relops → jwatkins
My initial thoughts...


(In reply to Taras Glek (:taras) from comment #0)
> Some of this may be due to use using collectd instead of Diamond(which seems
> to be the shiny new thing)

I like shiny new things too but only if collectd absolutely cannot do what is needed (which seems unlikely).  I also suspect vetting a new collection agent would take more time and resources to bring into production to replace collectd than just fixing collectd by either configuration and/or code.  I'm always open minded though.

> for EC2:
> * Our cpu metrics are per core. This is not helpful..it costs a lot in
> diskspace and query times and adds no new info. Diamond offers per-core,
> per-socket tweaks eg
> https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector This is
> adding a lot of pointless metrics on multisocket machines.

Any metric can be aggregated at the collection client level (collectd) or at delivery (carbon).
We could use the collectd Aggregation plugin to sum or avg them and then filter the per-core datapoints out.
https://collectd.org/wiki/index.php/Plugin:Aggregation
https://collectd.org/wiki/index.php/Plugin:Aggregation/Config
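
As a minimal sketch of what that aggregation amounts to (plain Python, not collectd configuration or collectd code): per-core values for one interval are grouped by CPU state, reduced to a sum and an average, and only the reduced values are kept. The function name and the shape of the input are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_cpu(samples):
    """Collapse per-core CPU values into one sum and average per state.

    samples: dict mapping (core_index, state) -> value for a single
             collection interval (hypothetical input shape).
    Returns: dict mapping state -> {"sum": ..., "average": ...}.
    """
    by_state = defaultdict(list)
    for (_core, state), value in samples.items():
        by_state[state].append(value)
    return {
        state: {"sum": sum(values), "average": sum(values) / len(values)}
        for state, values in by_state.items()
    }


# Two cores, two states: four per-core datapoints go in,
# and only the per-state aggregates come out.
interval = {
    (0, "user"): 10.0,
    (1, "user"): 30.0,
    (0, "idle"): 90.0,
    (1, "idle"): 70.0,
}
result = aggregate_cpu(interval)
print(result["user"])
print(result["idle"])
```

The per-core inputs are dropped after the reduction, which is the filtering step: the number of series sent to graphite no longer grows with the core count.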


> * I have no use for the vmem metric. There might be something useful in
> there, but I don't see what.

Let's rip it out if it isn't useful.  We can always enable it again at a later point if need be.

> * It might make sense to report ec2 machines of a certain class as one
> machine(eg average them out at statsd level)...graphite architecture does
> not handle hosts coming and going at all, esp whisper files

Yes, this is a problem.  This might be something for the carbon-aggregator service, which we don't currently run but could.  It would match incoming datapoints against pattern rules and aggregate them down.  We could use facter to inject the ec2 instance class type (read from meta-data) into the datapoint key so that all datapoints of that type get aggregated together when they hit carbon.  I'll need to check with :ericz, but it might also make sense to host a carbon-aggregator and carbon relay service in aws so that only the final aggregated datapoints are sent back to SCL3.
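
A minimal sketch of that idea in plain Python (it is not carbon-aggregator itself): datapoints whose key carries the instance class are folded into one series per class, so individual hostnames never become series of their own. The key layout `hosts.<class>.<hostname>.<metric>`, the `all` placeholder, and the function name are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: hypothetical key layout hosts.<class>.<hostname>.<metric...>,
# with dots in the instance class already replaced (e.g. m1_medium) so the
# class occupies exactly one path segment.
KEY_RE = re.compile(r"^hosts\.(?P<cls>[^.]+)\.(?P<host>[^.]+)\.(?P<metric>.+)$")


def aggregate_by_class(datapoints):
    """Average same-interval datapoints across all hosts of one class.

    datapoints: iterable of (key, value) pairs for a single interval.
    Returns: dict mapping the aggregated key to the averaged value.
    """
    buckets = defaultdict(list)
    for key, value in datapoints:
        match = KEY_RE.match(key)
        if match is None:
            continue  # keys that don't fit the layout are left out
        out_key = f"hosts.{match['cls']}.all.{match['metric']}"
        buckets[out_key].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


points = [
    ("hosts.m1_medium.host_a.cpu.user", 10.0),
    ("hosts.m1_medium.host_b.cpu.user", 30.0),
    ("hosts.c3_xlarge.host_c.cpu.user", 50.0),
    ("unrelated.key", 1.0),
]
aggregated = aggregate_by_class(points)
print(aggregated)
```

With this shape, hosts coming and going only change how many values feed each average; the set of series stored downstream stays fixed per instance class.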
(In reply to Jake Watkins [:dividehex] from comment #2)
> I'll need to check with :ericz, but it might also make sense to
> host a carbon-aggregator and carbon relay service in aws so that only the
> final aggregated datapoints are sent back to SCL3.

:ericz, does this make sense?  how much work would it be to get carbon-aggregator running? (either in scl3 or aws)
Flags: needinfo?(eziegenhorn)
> * It might make sense to report ec2 machines of a certain class as one
> machine(eg average them out at statsd level)...graphite architecture does
> not handle hosts coming and going at all, esp whisper files

Note this is a problem we should solve, but in a followup bug. The rest of the problems are mechanical tasks that should be solved in this bug. I don't mind if cpu/etc aggregation is done with collectd. I have no interest in Diamond if collectd is easy enough to convince to do our bidding.

Setting up an aggregation point in aws makes sense, but should probably also happen in a followup bug. We should set up one per aws region.
Flags: needinfo?(eziegenhorn)
This patch adds the aggregation plugin and configures it to aggregate the separate cpu datapoints into averaged and summed metrics.  In addition, it filters out the individual cpu datapoints so they are not sent to graphite.  The VMEM plugin has also been removed.

An example can be seen on the graphite server under /test/relabs/hp5_relabs_releng_scl3_mozilla_com
Attachment #8375921 - Flags: review?(dustin)
Attachment #8375921 - Flags: review?(dustin) → review+
(In reply to Jake Watkins [:dividehex] from comment #5)
> An example can be seen on the graphite server under
> /test/relabs/hp5_relabs_releng_scl3_mozilla_com

This looks ok. The average metric is a nice touch, might be useful.
Looks like the aggregation is broken. It shows up on the test host, but it's especially bad on aws:
https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.aggregation.cpu-sum.cpu.*))

Note how it manages to undercount on one interval and overcount on another.

This is even worse on AWS.
on hosted graphite http://goo.gl/OjUPpx
on our graphite http://goo.gl/kEI2MY
Attached image render (1).png (obsolete) —
snapshot of the buggy summation
Attachment #8381028 - Attachment is obsolete: true
Averages are buggy too.
(In reply to Taras Glek (:taras) from comment #8)
> Looks like the aggregation is broken. It shows up on the test host, but it's
> especially bad on aws:
> https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-
> 1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.
> aggregation.cpu-sum.cpu.*))
> 
> Note how it manages to undercount on one interval and overcount on another.
> 
> This is even worse on AWS.
> on hosted graphite http://goo.gl/OjUPpx
> on our graphite http://goo.gl/kEI2MY

I just came across this; definitely a known bug in collectd.
https://github.com/collectd/collectd/issues/297
> I just came across this; definitely a known bug in collectd.
> https://github.com/collectd/collectd/issues/297

:taras, Since this is dependent on a bug fix in collectd, in the meantime would you prefer to disable aggregation and return to individual cpu metrics?
(In reply to Jake Watkins [:dividehex] from comment #14)
> > I just came across this; definitely a known bug in collectd.
> > https://github.com/collectd/collectd/issues/297
> 
> :taras, Since this is dependent on a bug fix in collectd, in the meantime
> would you prefer to disable aggregation and return to individual cpu metrics?

I'd prefer something that works. Collecting this many redundant metrics is too expensive.
E.g., can we deploy Diamond?
Deploying a completely different client is significantly out of scope for IT here, but if you want to dedicate engineering resources to getting diamond or something similar compiled and packaged for all of the platforms we need to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7, and 8), we're happy to help with the final deployment.
(In reply to Amy Rich [:arich] [:arr] from comment #17)
> Deploying a completely different client is significantly out of scope for IT
> here, but if you want to dedicate engineering resources to getting diamond
> or something similar compiled and packaged for all of the platforms we need
> to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7,
> and 8), we're happy to help with the final deployment.

Ok. Please turn off sending metrics to hosted graphite in the meantime.
(In reply to Taras Glek (:taras) from comment #18)
> 
> Ok. Please turn off sending metrics to hosted graphite in meantime.

Metrics sent to hostedgraphite from collectd have been halted
https://bugzilla.mozilla.org/show_bug.cgi?id=975227#c11
Diamond was deployed for AWS, and we've reduced retention in the datacenters.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME