Closed
Bug 971883
Opened 10 years ago
Closed 10 years ago
Reduce number of datapoints sent to graphite
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: taras.mozilla, Assigned: dividehex)
Details
Attachments
(3 files, 1 obsolete file)
3.14 KB, patch (dustin: review+, dividehex: checked-in+)
80.56 KB, image/png
32.22 KB, image/png
Some of this may be due to us using collectd instead of Diamond (which seems to be the shiny new thing) for EC2:
* Our CPU metrics are per core. This is not helpful: it costs a lot in disk space and query time and adds no new information. Diamond offers per-core and per-socket tweaks, e.g. https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector Per-core reporting adds a lot of pointless metrics on multi-socket machines.
* I have no use for the vmem metric. There might be something useful in there, but I don't see what.
* It might make sense to report EC2 machines of a certain class as one machine (e.g. average them out at the statsd level). The graphite architecture does not handle hosts coming and going at all, especially the whisper files.
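To make the cost of per-core reporting concrete, here is a rough sketch of how the number of series per host grows with core count. The list of CPU states is an assumption (it varies by platform and collectd version), and the function names are made up for illustration.

```python
# Assumed CPU states reported for each core; the real list depends on the
# platform and the collectd version in use.
CPU_STATES = ["user", "nice", "system", "idle", "wait", "interrupt", "softirq", "steal"]


def per_core_series(cores: int, states: int = len(CPU_STATES)) -> int:
    """Series written per host when every core reports every state."""
    return cores * states


def aggregated_series(states: int = len(CPU_STATES), functions: int = 2) -> int:
    """Series written per host when cores are collapsed into `functions`
    aggregates (for example a sum and an average) per state."""
    return states * functions


if __name__ == "__main__":
    for cores in (2, 8, 32):
        print(cores, per_core_series(cores), aggregated_series())
```

Under these assumptions a 32-core host goes from 256 CPU series to 16. On a 2-core host there is no saving if both a sum and an average are kept, so the benefit is mostly on the larger machines.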
Updated•10 years ago
Assignee: server-ops → relops
Component: Server Operations → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → arich
Comment 1•10 years ago
Jake, taras has requested that reducing the number of metrics collected, especially in ec2, take precedence over the bare metal stuff since it's greatly impacting their ability to get a good handle on metrics. Can you please look at adjusting what's collected so that it's less and more meaningful? We don't necessarily have to rip out collectd to accomplish this.
Assignee: relops → jwatkins
Comment 2•10 years ago (Assignee)
My initial thoughts...
(In reply to Taras Glek (:taras) from comment #0)
> Some of this may be due to us using collectd instead of Diamond(which seems
> to be the shiny new thing)
I like shiny new things too, but only if collectd absolutely cannot do what is needed (which seems unlikely). I also suspect that vetting a new collection agent and bringing it into production to replace collectd would take more time and resources than just fixing collectd through configuration and/or code. I'm always open-minded, though.
> for EC2:
> * Our cpu metrics are per core. This is not helpful..it costs a lot in
> diskspace and query times and adds no new info. Diamond offers per-core,
> per-socket tweaks eg
> https://github.com/BrightcoveOS/Diamond/wiki/collectors-CPUCollector This is
> adding a lot of pointless metrics on multisocket machines.
Any metric can be aggregated at the collection client level (collectd) or at delivery (carbon). We could use the collectd Aggregation plugin to sum or average them and then filter the per-core datapoints out.
https://collectd.org/wiki/index.php/Plugin:Aggregation
https://collectd.org/wiki/index.php/Plugin:Aggregation/Config
> * I have no use for the vmem metric. There might be something useful in
> there, but I don't see what.
Let's rip it out if it isn't useful. We can always enable it again later if need be.
> * It might make sense to report ec2 machines of a certain class as one
> machine(eg average them out at statsd level)...graphite architecture does
> not handle hosts coming and going at all, esp whisper files
Yes, this is a problem. This might be something for the carbon-aggregator service, which we don't currently run but could. It would match incoming datapoints against pattern rules and aggregate them down. We could use facter to inject the EC2 instance class (read from the instance meta-data) into the datapoint key so that all datapoints of that class get aggregated together when they hit carbon.
I'll need to check with :ericz, but it might also make sense to host a carbon-aggregator and carbon relay service in AWS so that only the final aggregated datapoints are sent back to SCL3.
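As a rough illustration of the per-class idea, the sketch below averages datapoints from all hosts of one instance class into a single series. It only mirrors the concept in Python; carbon-aggregator expresses this through its own rule configuration. The key layout `hosts.<class>.<host>.<metric>` and the output prefix `classes.` are hypothetical, and the class name is assumed to have had its dots replaced (e.g. `m1_medium`) so that it stays a single path component.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_class(datapoints):
    """Average values across all hosts of the same instance class.

    datapoints: iterable of (metric_path, timestamp, value) tuples, where
    metric_path is assumed to look like 'hosts.<class>.<host>.<metric...>'.
    Returns {(aggregated_path, timestamp): average_value}.
    """
    buckets = defaultdict(list)
    for path, timestamp, value in datapoints:
        parts = path.split(".")
        if len(parts) < 4 or parts[0] != "hosts":
            continue  # not a per-host metric in the assumed layout
        instance_class = parts[1]
        metric = ".".join(parts[3:])  # drop the host component
        buckets[(f"classes.{instance_class}.{metric}", timestamp)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    points = [
        ("hosts.m1_medium.host-a.cpu.user", 60, 10.0),
        ("hosts.m1_medium.host-b.cpu.user", 60, 30.0),
    ]
    print(aggregate_by_class(points))
```

Because the host name is dropped from the output key, hosts coming and going no longer create new series; only the number of values averaged per interval changes.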
Comment 3•10 years ago (Assignee)
(In reply to Jake Watkins [:dividehex] from comment #2)
> I'll need to check with :ericz but it might also make sense to
> host a carbon-aggregator and carbon relay service in aws therefore only the
> final aggregated datapoints are sent back to SCL3.
:ericz, does this make sense? How much work would it be to get carbon-aggregator running (either in scl3 or aws)?
Updated•10 years ago (Assignee)
Flags: needinfo?(eziegenhorn)
Comment 4•10 years ago (Reporter)
> * It might make sense to report ec2 machines of a certain class as one
> machine(eg average them out at statsd level)...graphite architecture does
> not handle hosts coming and going at all, esp whisper files
Note this is a problem we should solve, but in a followup bug. The rest of the problems are mechanical tasks that should be solved in this bug. I don't mind if cpu/etc aggregation is done with collectd. I have no interest in Diamond if collectd is easy enough to convince to do our bidding.
Setting up an aggregation point in aws makes sense, but should probably also happen in a followup bug. We should set up one per aws region.
Updated•10 years ago
Flags: needinfo?(eziegenhorn)
Comment 5•10 years ago (Assignee)
This patch adds the aggregation plugin and configures it to aggregate the separate CPU datapoints into averaged and summed metrics. In addition, it filters out the individual CPU datapoints so they are not sent to graphite. The VMEM plugin has also been removed. An example can be seen on the graphite server under /test/relabs/hp5_relabs_releng_scl3_mozilla_com
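The intended effect of the patch can be sketched in a few lines of Python. This is not the collectd configuration itself, only an illustration of the two steps it describes: collapse per-core values into a sum and an average per CPU state, then stop forwarding the raw per-core values. The function names and the sample layout are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_cpu(samples):
    """Collapse per-core values into one sum and one average per CPU state.

    samples: dict mapping (core_index, state) -> value for a single interval.
    Returns {state: {"sum": ..., "average": ...}}.
    """
    by_state = defaultdict(list)
    for (_core, state), value in samples.items():
        by_state[state].append(value)
    return {
        state: {"sum": sum(values), "average": sum(values) / len(values)}
        for state, values in by_state.items()
    }


def should_forward(plugin: str) -> bool:
    """Hypothetical filter: drop raw 'cpu' datapoints, keep everything else
    (including the aggregated ones)."""
    return plugin != "cpu"


if __name__ == "__main__":
    interval = {
        (0, "user"): 10.0, (1, "user"): 30.0,
        (0, "idle"): 90.0, (1, "idle"): 70.0,
    }
    print(aggregate_cpu(interval))
```

Note that the aggregate is only correct if every core's value for the interval is present when the sum is taken; a core that arrives late or twice would skew both the sum and the average.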
Attachment #8375921 -
Flags: review?(dustin)
Updated•10 years ago
Attachment #8375921 -
Flags: review?(dustin) → review+
Comment 6•10 years ago (Reporter)
(In reply to Jake Watkins [:dividehex] from comment #5)
> An example can be seen on the graphite server under
> /test/relabs/hp5_relabs_releng_scl3_mozilla_com
This looks OK. The average metric is a nice touch and might be useful.
Comment 7•10 years ago (Assignee)
Comment on attachment 8375921: bug971883-collectd-cpu-aggr.patch
http://hg.mozilla.org/build/puppet/rev/98d89d0fc773
Attachment #8375921 -
Flags: checked-in+
Comment 8•10 years ago (Reporter)
Looks like the aggregation is broken. It shows up on the test host, but it's especially bad on AWS:
https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.aggregation.cpu-sum.cpu.*))
Note how it manages to undercount on one interval and overcount on another.
This is even worse on AWS:
on hosted graphite: http://goo.gl/OjUPpx
on our graphite: http://goo.gl/kEI2MY
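One way to spot this kind of undercounting and overcounting without eyeballing a graph is to compare the stacked total for each interval against what the host should add up to. The sketch below assumes that all CPU states of one core sum to roughly 100 per interval (as with jiffies per second on a typical Linux host); that figure, the tolerance, and the function name are assumptions, not something taken from the collectd configuration.

```python
def find_bad_intervals(totals, cores, per_core_total=100.0, tolerance=0.10):
    """Flag intervals whose summed CPU total deviates from the expected value.

    totals: list of (timestamp, total) pairs, where total is the sum over all
            CPU states for the interval and None marks a missing datapoint.
    cores: number of cores on the host.
    per_core_total: assumed sum of all states for one core.
    tolerance: allowed relative deviation before an interval is flagged.
    """
    expected = cores * per_core_total
    bad = []
    for timestamp, total in totals:
        if total is None:
            continue  # missing data is not counted as a miscount
        if abs(total - expected) > tolerance * expected:
            kind = "under" if total < expected else "over"
            bad.append((timestamp, total, kind))
    return bad


if __name__ == "__main__":
    series = [(0, 800.0), (60, 400.0), (120, 1200.0), (180, None)]
    print(find_bad_intervals(series, cores=8))
```

An undercount in one interval followed by an overcount in the next, as described above, would show up here as adjacent "under" and "over" entries.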
Comment 9•10 years ago (Reporter)
snapshot of the buggy summation
Comment 10•10 years ago (Reporter)
Attachment #8381028 -
Attachment is obsolete: true
Comment 11•10 years ago (Reporter)
Averages are buggy too.
Comment 12•10 years ago (Reporter)
The URL for the graph in the attachment above is:
https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-2hours&target=stacked%28sum%28hosts.tst-linux64-spot-009_test_releng_use1_mozilla_com.aggregation-cpu-average.cpu-*%29%29
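For anyone wanting to reproduce these graphs for other hosts, a render URL of this shape can be built with the standard library, which takes care of percent-encoding the parentheses in the target expression. The base URL and host name below are placeholders, and `render_url` is a helper invented for this example.

```python
from urllib.parse import urlencode


def render_url(base, target, hours, width=1000, height=615):
    """Build a Graphite render URL for `target` covering the last `hours` hours.

    safe="*" keeps the wildcard readable; parentheses are still encoded.
    """
    query = urlencode(
        {
            "width": width,
            "height": height,
            "until": "now",
            "from": f"-{hours}hours",
            "target": target,
        },
        safe="*",
    )
    return f"{base}/render?{query}"


if __name__ == "__main__":
    # Placeholder server and host; substitute real values.
    print(
        render_url(
            "https://graphite.example.org",
            "stacked(sum(hosts.example_host.aggregation-cpu-average.cpu-*))",
            hours=2,
        )
    )
```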
Comment 13•10 years ago (Assignee)
(In reply to Taras Glek (:taras) from comment #8)
> Looks like the aggregation is broken. It shows up on the test host, but it's
> especially bad on aws:
> https://graphite.mozilla.org/render?width=1000&height=615&until=now&from=-1hours&target=stacked(sum(test.relabs.hp5_relabs_releng_scl3_mozilla_com.aggregation.cpu-sum.cpu.*))
>
> Note how it manages to undercount on one interval and overcount on another.
>
> This is even worse on AWS.
> on hosted graphite http://goo.gl/OjUPpx
> on our graphite http://goo.gl/kEI2MY
I just came across this; it is definitely a known bug in collectd.
https://github.com/collectd/collectd/issues/297
Comment 14•10 years ago (Assignee)
> I just came across this; definitely a known bug in collectd.
> https://github.com/collectd/collectd/issues/297
:taras, Since this is dependent on a bug fix in collectd, in the meantime would you prefer to disable aggregation and return to individual cpu metrics?
Comment 15•10 years ago (Reporter)
(In reply to Jake Watkins [:dividehex] from comment #14)
> > I just came across this; definitely a known bug in collectd.
> > https://github.com/collectd/collectd/issues/297
>
> :taras, Since this is dependent on a bug fix in collectd, in the meantime
> would you prefer to disable aggregation and return to individual cpu metrics?
I'd prefer something that works. Collecting this many redundant metrics is too expensive.
Comment 16•10 years ago (Reporter)
E.g., can we deploy Diamond?
Comment 17•10 years ago
Deploying a completely different client is significantly out of scope for IT here, but if you want to dedicate engineering resources to getting diamond or something similar compiled and packaged for all of the platforms we need to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7, and 8), we're happy to help with the final deployment.
Comment 18•10 years ago (Reporter)
(In reply to Amy Rich [:arich] [:arr] from comment #17)
> Deploying a completely different client is significantly out of scope for IT
> here, but if you want to dedicate engineering resources to getting diamond
> or something similar compiled and packaged for all of the platforms we need
> to support (OS X 10.6 - 10.9, ubuntu, centos 6, and windows 2008r2, xp, 7,
> and 8), we're happy to help with the final deployment.
OK. Please turn off sending metrics to hosted graphite in the meantime.
Comment 19•10 years ago (Assignee)
(In reply to Taras Glek (:taras) from comment #18)
> Ok. Please turn off sending metrics to hosted graphite in meantime.
Sending metrics from collectd to hosted graphite has been halted:
https://bugzilla.mozilla.org/show_bug.cgi?id=975227#c11
Comment 20•10 years ago
Diamond was deployed for AWS, and we've reduced retention in the datacenters.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME