Closed Bug 968381 Opened 11 years ago Closed 10 years ago

Setup Diamond to send metrics to hosted graphite

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: Windows 8
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: taras.mozilla, Unassigned)

References

Details

Attachments

(7 files, 3 obsolete files)

10.80 KB, patch
dustin
: review+
jhopkins
: checked-in+
Details | Diff | Splinter Review
1.02 KB, text/plain
Details
957 bytes, patch
dustin
: review+
jhopkins
: checked-in+
Details | Diff | Splinter Review
2.39 KB, patch
dustin
: review+
jhopkins
: checked-in+
Details | Diff | Splinter Review
663 bytes, patch
dustin
: review+
jhopkins
: checked-in+
Details | Diff | Splinter Review
508 bytes, patch
rail
: review+
catlee
: checked-in+
Details | Diff | Splinter Review
1.51 KB, patch
rail
: review+
catlee
: checked-in+
Details | Diff | Splinter Review
Things to improve:

* cpu.steal is being reported at double its apparent value. Idle and some other measures were also doubled 2 months ago. Need to verify that each CPU measure adds up to ~100; see the scale(..., 0.5) workarounds in https://github.com/tarasglek/graphite_charts/ for more details.
* Disk I/O should be reported in bytes and broken down into EBS vs non-EBS (more detail here is probably not needed).
* Report 'events' like build start, stop, hg checkout, etc.
* Report cost for spot workers.
* Add the Amazon instance type to the graphite machine name (cost, etc. can be inferred from this).
* Once we start pooling, the pool name should be in the graphite id too.
* We should probably separate DNS and graphite names (we'll have to do that for docker anyway).
Here is what our 'steal times' look like over the last 90 days (as of Feb. 10)*: http://imagebin.org/292613

As you can see in the graph above, the scale of the steal-times data changed significantly on February 2nd. Philor pointed out that that was the day our ubuntu64_vm instances changed from m1.medium -> m3.medium. Taras then noticed that switching back to m1.medium changed the scale again.

* original URL: https://graphite.mozilla.org/render?from=-90days&until=now&width=1800&height=900&target=avg%28hosts.tst-linux64-spot-0*_*_releng_*_mozilla_com.cpu.*.cpu.steal%29
Depends on: 971799
Let's morph this bug; collectd is too crappy. Enable:

* NetworkCollector
* CPUCollector with percore = False
* DiskSpaceCollector

We should send the stats to hosted graphite at a 30s interval. I'd like to get this done this week because collectd is not cutting it.
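For reference, a minimal sketch of what that collector setup could look like on the Puppet side, assuming Diamond reads per-collector config files from /etc/diamond/collectors/ (the paths, class name, and file layout here are illustrative, not taken from the patches attached to this bug):

class diamond::collectors {
    # Sketch only: enable the three collectors and poll every 30 seconds.
    file {
        '/etc/diamond/collectors/NetworkCollector.conf':
            content => "enabled = True\ninterval = 30\n";
        '/etc/diamond/collectors/CPUCollector.conf':
            content => "enabled = True\npercore = False\ninterval = 30\n";
        '/etc/diamond/collectors/DiskSpaceCollector.conf':
            content => "enabled = True\ninterval = 30\n";
    }
}

The graphite handler settings (host, port, API-key prefix) would live in the main diamond.conf rather than in the per-collector files.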
Assignee: nobody → jhopkins
Summary: send better info to graphite → Setup Diamond to send metrics to hosted graphite
Out of curiosity, what's wrong with Collectd?
(In reply to Gregory Szorc [:gps] from comment #3)
> Out of curiosity, what's wrong with Collectd?

https://collectd.org/wiki/index.php/Plugin:Aggregation/Config results in numbers that don't add up to 100%. See my complaints in https://bugzilla.mozilla.org/show_bug.cgi?id=971883
Boo :( You can always patch collectd. I've contributed - it's not too difficult.

A few years ago, I used a Python tool (not Diamond) for collecting system metrics. Over time, we kept collecting more and more, and Python was having a hard time keeping up while adding a lot of probe overhead / unwanted CPU usage. We switched to Collectd because it was more efficient and didn't interfere as much with other processes on the machine. YMMV.
Diamond RPM built and uploaded here: http://people.mozilla.org/~jhopkins/bug968381/

Though I did notice a Python 2.6 dependency in the package requirements:

$ rpm -qp --requires diamond-3.4.266-0.noarch.rpm | grep python
/usr/bin/python
python(abi) = 2.6
python-configobj
(In reply to Gregory Szorc [:gps] from comment #5)
> A few years ago, I used a Python tool (not Diamond) for collecting system
> metrics. Over time, we kept collecting more and more and Python was having a
> hard time keeping up and it was adding a lot of probe overhead / unwanted
> CPU usage. We switched to Collectd because it was more efficient and didn't
> interfere as much with other processes on the machine. YMMV.

Yup. ATM we effectively have no actionable data; diamond seems like the fastest way to get there. Python is stupid for this, but it works. Optimizing collectd is an option we can consider once our problem is "collect metrics better" rather than "broken metrics".
Attached patch [puppet] wip (obsolete) — Splinter Review
Comment on attachment 8383212 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with

Review of attachment 8383212 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good with minor changes.

You might want to build in a knob to disable this, so if something goes wrong you can just

    include diamond::disable

or

    class {'diamond': enable => false; }

::: manifests/moco-nodes.pp
@@ +138,4 @@
>     "network::aws": stage => network,
>     }
>     include toplevel::slave::build::mock
> +   include diamond

We shouldn't be including anything but toplevel classes in node definitions. If this is temporary to avoid burning down the world, then it's OK.

::: modules/diamond/manifests/config.pp
@@ +8,5 @@
> +
> +   # hosted graphite settings
> +   $diamond_graphite_host = "carbon.hostedgraphite.com"
> +   $diamond_graphite_port = "2003"
> +   $diamond_graphite_path_prefix = "add664b4-7eb0-4586-93cb-11f1929e95a6" # trial api key

These should be configuration options, and the API key a secret.

::: modules/diamond/manifests/service.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class diamond::service {
> +   case $::operatingsystem {
> +       CentOS : {

This case, and the one in diamond::config, should have a default that fail()s.

@@ +4,5 @@
> +class diamond::service {
> +   case $::operatingsystem {
> +       CentOS : {
> +           package { 'diamond':
> +               ensure => present,

Fix formatting here.
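A rough sketch of the kill-switch being suggested, assuming a parameterized class (the parameter name and wiring are illustrative only; the tree also grew a separate diamond::disable class, reviewed below):

class diamond($enable = true) {
    # Sketch only: flipping $enable to false stops and disables the service,
    # so a bad rollout can be backed out with:  class { 'diamond': enable => false }
    if $enable {
        $svc_ensure = 'running'
    } else {
        $svc_ensure = 'stopped'
    }

    service { 'diamond':
        ensure => $svc_ensure,
        enable => $enable,
    }
}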
Attachment #8383212 - Flags: review?(dustin) → review+
Comment on attachment 8383757 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with (updated)

Review of attachment 8383757 [details] [diff] [review]:
-----------------------------------------------------------------

Good with the addition to base.pp.

::: manifests/moco-config.pp
@@ +229,5 @@
> +   $diamond_graphite_host = "carbon.hostedgraphite.com"
> +   $diamond_graphite_port = "2003"
> +   $diamond_graphite_path_prefix = secret('diamond_api_key')
> +   $diamond_batch_size = 1
> +   $diamond_poll_interval = 30

Looks good, but these vars also need to be in `modules/config/manifests/base.pp` with default values.

::: manifests/moco-nodes.pp
@@ +138,4 @@
>     "network::aws": stage => network,
>     }
>     include toplevel::slave::build::mock
> +   include diamond

This is fine for now, since it only applies to Mozilla nodes, but as you broaden this out and move it to toplevel classes, you'll need to add a conditional on $config::use_diamond or something like that to control whether diamond gets installed -- otherwise QA, seamonkey, and relabs will all crash when you land the changes.

::: modules/diamond/manifests/disable.pp
@@ +9,5 @@
> +       ensure => stopped,
> +       enable => false,
> +       hasstatus => true,
> +       hasrestart => true,
> +   }

Trailing whitespace.
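Concretely, the two review points above amount to something like this sketch (the $diamond_* names come from the quoted patch; $config::use_diamond and the class bodies are illustrative, not the code that landed):

# Sketch: defaults in modules/config/manifests/base.pp, so environments
# that never set these values still compile.
class config::base {
    $diamond_graphite_host        = ''
    $diamond_graphite_port        = '2003'
    $diamond_graphite_path_prefix = ''
    $diamond_batch_size           = 1
    $diamond_poll_interval        = 30
    $use_diamond                  = false   # suggested gate; name is illustrative
}

# Sketch: one possible shape of the eventual toplevel wiring, gated so
# QA, seamonkey, and relabs aren't forced to install diamond.
class toplevel::slave::build::mock {
    if $config::use_diamond {
        include diamond
    }
}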
Attachment #8383757 - Flags: review?(dustin) → review+
Comment on attachment 8383757 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with (updated)

http://hg.mozilla.org/build/puppet/rev/3674126ebf0e
Attachment #8383757 - Flags: checked-in+
merged to production
Comment on attachment 8383848 [details] [diff] [review]
[puppet] enable diamond on amazon build instances

More very-temporary-ness, I assume.
Attachment #8383848 - Flags: review?(dustin) → review+
> More very-temporary-ness, I assume.

Yes, I have your review comments in mind regarding these imports.
Comment on attachment 8383848 [details] [diff] [review]
[puppet] enable diamond on amazon build instances

Landed on default and merged to production.
http://hg.mozilla.org/build/puppet/rev/1a4f1788fdbc
https://hg.mozilla.org/build/puppet/rev/bd853e969571
Attachment #8383848 - Flags: checked-in+
Note: will require a follow-up patch to address this failure:

Fri Feb 28 12:01:40 -0800 2014 Puppet (err): Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp
Fri Feb 28 12:01:40 -0800 2014 Puppet (err): Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp
Wrapped exception:
No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717
Fri Feb 28 12:01:40 -0800 2014 /Stage[main]/Diamond::Config/File[/etc/diamond/diamond.conf]/ensure (err): change from absent to file failed: Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp

I've imported diamond::disable on the bld-* and try-* instances for now - to disable diamond - in http://hg.mozilla.org/build/puppet/rev/6c1cdf968dc3
reproduced problem on a dev host and fixed with this patch
Attachment #8383896 - Flags: review?(dustin)
Comment on attachment 8383896 [details] [diff] [review]
[puppet] ensure /etc/diamond/ directory is created

I'm reasonably confident that the diamond package creates /etc/diamond, so the real issue is that puppet is not installing the package before trying to configure it. So, a few things to change:

1. Packages should only be installed by classes in the 'packages' module, so please add a new class, packages::diamond, which installs diamond. Include this from the diamond class. (The more common Puppet idiom would be to create diamond::install and have *that* include packages::diamond, but that seems redundant here.)

2. Change the config file to depend on Class['packages::diamond'].

That should fix the dependency problem.
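In other words, roughly this shape (a sketch of the layout the review asks for; the fail() message and the template name are assumptions, not copied from the final patch):

# Sketch: package installation lives in the packages module.
class packages::diamond {
    case $::operatingsystem {
        CentOS: {
            package { 'diamond':
                ensure => present,
            }
        }
        default: {
            fail("Cannot install diamond on ${::operatingsystem}")
        }
    }
}

# Sketch: the config file depends on the package class, so /etc/diamond
# exists before puppet tries to write diamond.conf (the failure above).
class diamond::config {
    include packages::diamond

    file { '/etc/diamond/diamond.conf':
        require => Class['packages::diamond'],
        content => template('diamond/diamond.conf.erb'),   # template name assumed
    }
}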
Attachment #8383896 - Flags: review?(dustin) → review-
Comment on attachment 8384605 [details] [diff] [review]
[puppet] diamond config ensures diamond package installation first

Review of attachment 8384605 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good - it's up to you whether to pin or not; the disadvantage of not pinning is that if you drop a new version into the repos, everything will start updating without any corresponding puppet change.

::: modules/packages/manifests/diamond.pp
@@ +5,5 @@
> +   case $::operatingsystem {
> +       CentOS: {
> +           package {
> +               "diamond":
> +                   ensure => latest;

You may want to pin the version here, just to make it easier to do a controlled upgrade.
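If pinning ever becomes the better option, it is a one-line change; the version string below only mirrors the RPM built earlier in this bug and is otherwise an assumption:

package { 'diamond':
    # pin to the packaged version instead of 'latest' for controlled upgrades
    ensure => '3.4.266-0',
}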
Attachment #8384605 - Flags: review?(dustin) → review+
Comment on attachment 8384605 [details] [diff] [review]
[puppet] diamond config ensures diamond package installation first

Landed in http://hg.mozilla.org/build/puppet/rev/d10738e11c2f and merged to production. Leaving the 'diamond' package unpinned for now. We can always pin it prior to a future upgrade if that works out to be the better option.
Attachment #8384605 - Flags: checked-in+
John, I'm sorry to ask for a change in spec late in the game. Can we set normalize=True in CPUCollector?
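For context, a sketch of the requested setting as a puppet-managed per-collector config file (the path and file layout are assumed, as in the earlier sketch; the change that actually landed is attachment 8385100 below):

file { '/etc/diamond/collectors/CPUCollector.conf':
    # normalize=True is meant to divide the totals by the number of cores,
    # so the combined CPU percentages top out around 100 rather than N*100.
    content => "enabled = True\npercore = False\nnormalize = True\n",
}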
Attachment #8385100 - Flags: review?(dustin) → review+
Comment on attachment 8385100 [details] [diff] [review]
[puppet] set normalize=True for CPUCollector

Landed in https://hg.mozilla.org/build/puppet/rev/9d5dcaa882d3 and merged to production.
Attachment #8385100 - Flags: checked-in+
Note: if we need tst-* machine metrics, we will have to create and deploy a Ubuntu 'diamond' package to those systems.
(In reply to John Hopkins (:jhopkins) from comment #29)
> Note: if we need tst-* machine metrics, we will have to create and deploy a
> Ubuntu 'diamond' package to those systems.

We do, but our tests are much lower priority than builds atm. No rush on this part.
Blocks: 984660
Assignee: jhopkins → nobody
We need to change our graphite hostname to mozilla.carbon.hostedgraphite.com due to connectivity issues in us-west-2. Can someone take care of that?
Do we change it across the board, or just for hosts in us-west-2?
(In reply to Chris AtLee [:catlee] from comment #32)
> Do we change it across the board, or just for hosts in us-west-2?

Everywhere.
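The change itself should just be the graphite host variable introduced earlier (a sketch; the real edit would land in manifests/moco-config.pp and its base.pp default):

# Sketch: point every host at the dedicated hostname instead of the
# shared carbon endpoint.
$diamond_graphite_host = "mozilla.carbon.hostedgraphite.com"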
Attachment #8418149 - Flags: review?(rail)
Attachment #8418151 - Flags: review?(rail)
Attachment #8418149 - Flags: review?(rail) → review+
Attachment #8418151 - Flags: review?(rail) → review+
Comment on attachment 8418151 [details] [diff] [review]
update AWS routes

and deployed
Attachment #8418151 - Flags: checked-in+
Comment on attachment 8418149 [details] [diff] [review]
switch to new host

and merged to production
Attachment #8418149 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard