Closed Bug 971883 Opened 11 years ago Closed 11 years ago

Reduce number of datapoints sent to graphite

Tracking

(Not tracked)

Status:

RESOLVED WORKSFORME

People

(Reporter: taras.mozilla, Assigned: dividehex)

Details

Attachments

(3 files, 1 obsolete file)

bug971883-collectd-cpu-aggr.patch 11 years ago Jake Watkins [:dividehex] 3.14 KB, patch	dustin : review+ dividehex : checked-in+	Details \| Diff \| Splinter Review
render (1).png 11 years ago (dormant account) 17.03 KB, image/png		Details
buggy summation snapshot 11 years ago (dormant account) 80.56 KB, image/png		Details
averages missing data too 11 years ago (dormant account) 32.22 KB, image/png		Details

(dormant account)

Reporter

Description

•

11 years ago

Some of this may be due to us using collectd instead of Diamond (which seems to be the shiny new thing) for EC2:
* Our cpu metrics are per core. This is not helpful: it costs a lot in disk space and query times and adds no new info. Diamond offers per-core, per-socket tweaks, e.g. https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector
  This is adding a lot of pointless metrics on multisocket machines.
* I have no use for the vmem metric. There might be something useful in there, but I don't see what.
* It might make sense to report ec2 machines of a certain class as one machine (e.g. average them out at statsd level). The graphite architecture does not handle hosts coming and going at all, esp. whisper files.

Amy Rich [:arr] [:arich]

Updated

•

11 years ago

Assignee: server-ops → relops

Component: Server Operations → RelOps

Product: mozilla.org → Infrastructure & Operations

QA Contact: shyam → arich

Amy Rich [:arr] [:arich]

Comment 1

•

11 years ago

Jake, taras has requested that reducing the number of metrics collected, especially in ec2, take precedence over the bare metal stuff since it's greatly impacting their ability to get a good handle on metrics. Can you please look at adjusting what's collected so that it's less and more meaningful? We don't necessarily have to rip out collectd to accomplish this.

Assignee: relops → jwatkins

Jake Watkins [:dividehex]

Assignee

Comment 2

•

11 years ago

My initial thoughts...

(In reply to Taras Glek (:taras) from comment #0)
> Some of this may be due to us using collectd instead of Diamond (which seems
> to be the shiny new thing)

I like shiny new things too, but only if collectd absolutely cannot do what is needed (which seems unlikely). I also suspect that vetting a new collection agent and bringing it into production to replace collectd would take more time and resources than just fixing collectd through configuration and/or code. I'm always open-minded, though.

> for EC2:
> * Our cpu metrics are per core. This is not helpful: it costs a lot in
> disk space and query times and adds no new info. Diamond offers per-core,
> per-socket tweaks, e.g.
> https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector
> This is adding a lot of pointless metrics on multisocket machines.

Any metric can be aggregated at the collection client level (collectd) or at delivery (carbon). We could use the collectd Aggregation plugin to sum or average them and then filter the per-core datapoints out.
https://collectd.org/wiki/index.php/Plugin:Aggregation
https://collectd.org/wiki/index.php/Plugin:Aggregation/Config

> * I have no use for the vmem metric. There might be something useful in
> there, but I don't see what.

Let's rip it out if it isn't useful. We can always enable it again at a later point if need be.

> * It might make sense to report ec2 machines of a certain class as one
> machine (e.g. average them out at statsd level). The graphite architecture
> does not handle hosts coming and going at all, esp. whisper files.

Yes, this is a problem. This might be something for the carbon-aggregator service, which we don't currently run but could. It would match incoming datapoints against pattern rules and aggregate them down. We could use facter to inject the ec2 instance class type (read from meta-data) into the datapoint key so that all datapoints of that type get aggregated together when they hit carbon.

I'll need to check with :ericz, but it might also make sense to host a carbon-aggregator and carbon relay service in aws so that only the final aggregated datapoints are sent back to SCL3.
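
For illustration only, here is a minimal Python sketch of the "aggregate by instance class" idea from the comment above. It is not carbon-aggregator and does not use its rule syntax. The key layout `hosts.<instance_class>.<hostname>.<metric>`, the `aggregate.` output prefix, and the function name are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_class(datapoints):
    """Average datapoints across hosts that share an instance class.

    `datapoints` is an iterable of (metric_key, value) pairs. Keys are
    assumed (hypothetically) to look like
    "hosts.<instance_class>.<hostname>.<metric...>".
    """
    grouped = defaultdict(list)
    for key, value in datapoints:
        parts = key.split(".", 3)
        if len(parts) != 4 or parts[0] != "hosts":
            continue  # ignore keys that do not follow the assumed layout
        _prefix, instance_class, _hostname, metric = parts
        # Dropping the hostname means hosts can come and go without
        # creating a new series per host.
        grouped[f"aggregate.{instance_class}.{metric}"].append(value)
    return {key: mean(values) for key, values in grouped.items()}


points = [
    ("hosts.m3_large.host_a.cpu.user", 10.0),
    ("hosts.m3_large.host_b.cpu.user", 30.0),
    ("hosts.c3_xlarge.host_c.cpu.user", 5.0),
]
per_class = aggregate_by_class(points)
```

Here three per-host series collapse into two per-class series, which is the kind of reduction the comment is after.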

Jake Watkins [:dividehex]

Assignee

Comment 3

•

11 years ago

(In reply to Jake Watkins [:dividehex] from comment #2)
> I'll need to check with :ericz, but it might also make sense to host a
> carbon-aggregator and carbon relay service in aws so that only the final
> aggregated datapoints are sent back to SCL3.

:ericz, does this make sense? How much work would it be to get carbon-aggregator running (either in scl3 or aws)?

Jake Watkins [:dividehex]

Assignee

Updated

•

11 years ago

Flags: needinfo?(eziegenhorn)

(dormant account)

Reporter

Comment 4

•

11 years ago

> * It might make sense to report ec2 machines of a certain class as one
> machine (e.g. average them out at statsd level). The graphite architecture
> does not handle hosts coming and going at all, esp. whisper files.

Note this is a problem we should solve, but in a followup bug. The rest of the problems are mechanical tasks that should be solved in this bug.

I don't mind if cpu/etc aggregation is done with collectd. I have no interest in Diamond if collectd is easy enough to convince to do our bidding.

Setting up an aggregation point in aws makes sense, but should probably also happen in a followup bug. We should set up one per aws region.

Eric Ziegenhorn :ericz

Updated

•

11 years ago

Flags: needinfo?(eziegenhorn)

Jake Watkins [:dividehex]

Assignee

Comment 5

•

11 years ago

Attached patch bug971883-collectd-cpu-aggr.patch — Details — Splinter Review

This patch adds the aggregation plugin and configures it to aggregate separate cpu datapoints into averaged and summed metrics. In addition, it filters out the individual cpu datapoints so they are not sent to graphite. The VMEM plugin has also been removed. An example can be seen on the graphite server under /test/relabs/hp5_relabs_releng_scl3_mozilla_com
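
As a rough illustration of what the aggregation described above computes (this is a plain Python sketch, not the patch and not collectd's implementation), per-core samples can be collapsed into one sum and one average per CPU state. The input shape `(core_index, state) -> value` and the function name are assumptions for this example.

```python
from collections import defaultdict


def aggregate_cpu(samples):
    """Collapse per-core CPU samples into one sum and one average per state.

    `samples` maps (core_index, state) to a value, for example
    (0, "user") -> 12.5. The per-core values are not returned, which
    mirrors filtering the individual cpu datapoints out.
    """
    by_state = defaultdict(list)
    for (_core, state), value in samples.items():
        by_state[state].append(value)
    return {
        state: {"sum": sum(values), "average": sum(values) / len(values)}
        for state, values in by_state.items()
    }


two_core_host = {
    (0, "user"): 10.0,
    (1, "user"): 30.0,
    (0, "idle"): 90.0,
    (1, "idle"): 70.0,
}
aggregated = aggregate_cpu(two_core_host)
```

With this shape, the number of series per state stays at two (sum and average) regardless of how many cores the host has.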

Attachment #8375921 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

11 years ago

Attachment #8375921 - Flags: review?(dustin) → review+

(dormant account)

Reporter

Comment 6

•

11 years ago

(In reply to Jake Watkins [:dividehex] from comment #5)
> An example can be seen on the graphite server under
> /test/relabs/hp5_relabs_releng_scl3_mozilla_com

This looks ok. The average metric is a nice touch, might be useful.

Jake Watkins [:dividehex]

Assignee

Comment 7

•

11 years ago

Comment on attachment 8375921
bug971883-collectd-cpu-aggr.patch

http://hg.mozilla.org/build/puppet/rev/98d89d0fc773

Attachment #8375921 - Flags: checked-in+

(dormant account)

Reporter

Comment 8

•

11 years ago

Looks like the aggregation is broken. It shows up on the test host, but it's especially bad on aws:
https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.aggregation.cpu-sum.cpu.*))

Note how it manages to undercount on one interval and overcount on another.

This is even worse on AWS.
on hosted graphite http://goo.gl/OjUPpx
on our graphite http://goo.gl/kEI2MY

(dormant account)

Reporter

Comment 9

•

11 years ago

Attached image render (1).png (obsolete) — Details

snapshot of the buggy summation

(dormant account)

Reporter

Comment 10

•

11 years ago

Attached image buggy summation snapshot — Details

Attachment #8381028 - Attachment is obsolete: true

(dormant account)

Reporter

Comment 11

•

11 years ago

Attached image averages missing data too — Details

Averages are buggy too.

(dormant account)

Reporter

Comment 12

•

11 years ago

https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-2hours&target=stacked%28sum%28hosts.tst-linux64-spot-009_test_releng_use1_mozilla_com.aggregation-cpu-average.cpu-*%29%29 is the url for above link
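
For reference, a small Python sketch of how a render URL like the one above can be assembled, with the `target` expression percent-encoded. The helper name and its defaults are assumptions for this example; only the query parameters that appear in the URL above are used.

```python
from urllib.parse import quote


def render_url(base, target, frm="-2hours", until="now", width=1000, height=615):
    """Build a graphite render URL with a percent-encoded target expression."""
    # Keep "*" literal so the wildcard reads the same as in the URL above;
    # parentheses are encoded as %28 and %29.
    encoded_target = quote(target, safe="*")
    return (
        f"{base}/render?width={width}&height={height}"
        f"&until={until}&from={frm}&target={encoded_target}"
    )


url = render_url(
    "https://graphite.mozilla.org",
    "stacked(sum(hosts.tst-linux64-spot-009_test_releng_use1_mozilla_com"
    ".aggregation-cpu-average.cpu-*))",
)
```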

Jake Watkins [:dividehex]

Assignee

Comment 13

•

11 years ago

(In reply to Taras Glek (:taras) from comment #8)
> Looks like the aggregation is broken. It shows up on the test host, but it's
> especially bad on aws:
> https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.aggregation.cpu-sum.cpu.*))
>
> Note how it manages to undercount on one interval and overcount on another.
>
> This is even worse on AWS.
> on hosted graphite http://goo.gl/OjUPpx
> on our graphite http://goo.gl/kEI2MY

I just came across this; definitely a known bug in collectd.
https://github.com/collectd/collectd/issues/297

Jake Watkins [:dividehex]

Assignee

Comment 14

•

11 years ago

> I just came across this; definitely a known bug in collectd.
> https://github.com/collectd/collectd/issues/297

:taras, since this depends on a bug fix in collectd, would you prefer to disable aggregation in the meantime and return to individual cpu metrics?

(dormant account)

Reporter

Comment 15

•

11 years ago

(In reply to Jake Watkins [:dividehex] from comment #14)
> > I just came across this; definitely a known bug in collectd.
> > https://github.com/collectd/collectd/issues/297
>
> :taras, since this depends on a bug fix in collectd, would you prefer to
> disable aggregation in the meantime and return to individual cpu metrics?

I'd prefer something that works. Collecting this many redundant metrics is too expensive.

(dormant account)

Reporter

Comment 16

•

11 years ago

eg, can we deploy diamond?

Amy Rich [:arr] [:arich]

Comment 17

•

11 years ago

Deploying a completely different client is significantly out of scope for IT here, but if you want to dedicate engineering resources to getting diamond or something similar compiled and packaged for all of the platforms we need to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7, and 8), we're happy to help with the final deployment.

(dormant account)

Reporter

Comment 18

•

11 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #17)
> Deploying a completely different client is significantly out of scope for IT
> here, but if you want to dedicate engineering resources to getting diamond
> or something similar compiled and packaged for all of the platforms we need
> to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7,
> and 8), we're happy to help with the final deployment.

Ok. Please turn off sending metrics to hosted graphite in the meantime.

Jake Watkins [:dividehex]

Assignee

Comment 19

•

11 years ago

(In reply to Taras Glek (:taras) from comment #18)
> Ok. Please turn off sending metrics to hosted graphite in the meantime.

Metrics sent to hostedgraphite from collectd have been halted:
https://bugzilla.mozilla.org/show_bug.cgi?id=975227#c11

Amy Rich [:arr] [:arich]

Comment 20

•

11 years ago

diamond was deployed for aws, and we've reduced retention in the datacenters.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → WORKSFORME
