Closed
Bug 968381
Opened 11 years ago
Closed 10 years ago
Setup Diamond to send metrics to hosted graphite
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED WONTFIX
People
(Reporter: taras.mozilla, Unassigned)
References
Details
Attachments
(7 files, 3 obsolete files)
* 10.80 KB, patch (dustin: review+, jhopkins: checked-in+)
* 1.02 KB, text/plain
* 957 bytes, patch (dustin: review+, jhopkins: checked-in+)
* 2.39 KB, patch (dustin: review+, jhopkins: checked-in+)
* 663 bytes, patch (dustin: review+, jhopkins: checked-in+)
* 508 bytes, patch (rail: review+, catlee: checked-in+)
* 1.51 KB, patch (rail: review+, catlee: checked-in+)
Things to improve:
* cpu.steal seems to be reported at double its real value. Idle and some other measures were also doubled 2 months ago. Need to verify that each CPU measure adds up to ~100... See scale(..., 0.5) in https://github.com/tarasglek/graphite_charts/ for more details
* disk I/O reported in bytes and broken down into EBS vs non-EBS (more detail here is probably not needed)
* report 'events' like build start, stop, hg checkout, etc.
* report cost for spot workers
* add the Amazon instance type to the graphite machine name (cost, etc. can be inferred from this)
* once we start pooling, the pool name should go into the graphite id too
* should probably separate DNS and graphite names (we'll have to do that for Docker anyway).
Comment 1•11 years ago
Here is what our 'steal times' look like over the last 90 days (as of Feb. 10)*:
http://imagebin.org/292613
As you can see in the graph above, the scale of the steal times data changed significantly on February 2nd. Philor pointed out that that was the day our ubuntu64_vm instances changed from m1.medium -> m3.medium. Taras then noticed that switching back to m1.medium changed the scale again.
* original URL: https://graphite.mozilla.org/render?from=-90days&until=now&width=1800&height=900&target=avg%28hosts.tst-linux64-spot-0*_*_releng_*_mozilla_com.cpu.*.cpu.steal%29
Comment 2•11 years ago (Reporter)
Let's morph this bug; collectd is too crappy.
Enable:
* NetworkCollector
* CPUCollector with percore = False
* DiskSpaceCollector
We should send the stats to hosted graphite with a 30s interval.
I'd like to get this done this week because collectd is not cutting it.
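For concreteness, a rough sketch of the /etc/diamond/diamond.conf this implies, assuming Diamond's stock GraphiteHandler and its configobj-style sections (handler path, section layout, and the placeholder API key are illustrative, not the exact template we'd ship):

[server]
handlers = diamond.handler.graphite.GraphiteHandler

[handlers]
[[GraphiteHandler]]
# hosted graphite endpoint
host = carbon.hostedgraphite.com
port = 2003
batch = 1

[collectors]
[[default]]
# hosted graphite API key becomes the metric path prefix (placeholder value)
path_prefix = HOSTEDGRAPHITE-API-KEY
# poll every 30 seconds as proposed above
interval = 30

[[CPUCollector]]
enabled = True
percore = False

[[NetworkCollector]]
enabled = True

[[DiskSpaceCollector]]
enabled = True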
Assignee: nobody → jhopkins
Summary: send better info to graphite → Setup Diamond to send metrics to hosted graphite
Comment 3•11 years ago
Out of curiosity, what's wrong with Collectd?
Comment 4•11 years ago (Reporter)
(In reply to Gregory Szorc [:gps] from comment #3)
> Out of curiosity, what's wrong with Collectd?
https://collectd.org/wiki/index.php/Plugin:Aggregation/Config results in numbers that don't add up to 100%. See my complaints in https://bugzilla.mozilla.org/show_bug.cgi?id=971883
Comment 5•11 years ago
Boo :(
You can always patch collectd. I've contributed - it's not too difficult.
A few years ago, I used a Python tool (not Diamond) for collecting system metrics. Over time, we kept collecting more and more and Python was having a hard time keeping up and it was adding a lot of probe overhead / unwanted CPU usage. We switched to Collectd because it was more efficient and didn't interfere as much with other processes on the machine. YMMV.
Comment 6•11 years ago
Diamond RPM built and uploaded here:
http://people.mozilla.org/~jhopkins/bug968381/
Though I did notice a Python 2.6 dependency in the package requirements:
$ rpm -qp --requires diamond-3.4.266-0.noarch.rpm | grep python
/usr/bin/python
python(abi) = 2.6
python-configobj
Comment 7•11 years ago (Reporter)
(In reply to Gregory Szorc [:gps] from comment #5)
> A few years ago, I used a Python tool (not Diamond) for collecting system
> metrics. Over time, we kept collecting more and more and Python was having a
> hard time keeping up and it was adding a lot of probe overhead / unwanted
> CPU usage. We switched to Collectd because it was more efficient and didn't
> interfere as much with other processes on the machine. YMMV.
Yup. ATM we effectively have no actionable data; Diamond seems like the fastest way to get there. Python is stupid for this, but it works. Optimizing collectd is an option we can consider once our problem is "collect metrics better" rather than "broken metrics".
Comment 8•11 years ago
Comment 9•11 years ago
Attachment #8382996 - Attachment is obsolete: true
Attachment #8383212 - Flags: review?(dustin)
Comment 10•11 years ago
Comment on attachment 8383212 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with
Review of attachment 8383212 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good with minor changes.
You might want to build in a knob to disable this, so if something goes wrong you can just include diamond::disable or class {'diamond': enable => false; }
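For illustration, the knob could look something like this (a sketch only; the class/parameter names are not the final module layout):

# sketch: a parameterized class so class {'diamond': enable => false} can switch it off
class diamond ($enable = true) {
    if $enable {
        include diamond::config
        include diamond::service
    } else {
        include diamond::disable
    }
}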
::: manifests/moco-nodes.pp
@@ +138,4 @@
> "network::aws": stage => network,
> }
> include toplevel::slave::build::mock
> + include diamond
We shouldn't be including anything but toplevel classes in node definitions. If this is temporary to avoid burning down the world, then it's OK.
::: modules/diamond/manifests/config.pp
@@ +8,5 @@
> +
> + # hosted graphite settings
> + $diamond_graphite_host = "carbon.hostedgraphite.com"
> + $diamond_graphite_port = "2003"
> + $diamond_graphite_path_prefix = "add664b4-7eb0-4586-93cb-11f1929e95a6" # trial api key
These should be configuration options, and the API key a secret.
::: modules/diamond/manifests/service.pp
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class diamond::service {
> + case $::operatingsystem {
> + CentOS : {
This case, and the one in diamond::config, should have a default that fail()s.
@@ +4,5 @@
> +class diamond::service {
> + case $::operatingsystem {
> + CentOS : {
> + package { 'diamond':
> + ensure => present,
Fix formatting here
Updated•11 years ago
Attachment #8383212 - Flags: review?(dustin) → review+
Comment 11•11 years ago
Attachment #8383212 - Attachment is obsolete: true
Attachment #8383757 - Flags: review?(dustin)
Comment 12•11 years ago
Comment on attachment 8383757 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with (updated)
Review of attachment 8383757 [details] [diff] [review]:
-----------------------------------------------------------------
Good with the addition to base.pp.
::: manifests/moco-config.pp
@@ +229,5 @@
> + $diamond_graphite_host = "carbon.hostedgraphite.com"
> + $diamond_graphite_port = "2003"
> + $diamond_graphite_path_prefix = secret('diamond_api_key')
> + $diamond_batch_size = 1
> + $diamond_poll_interval = 30
Looks good, but these vars also need to be in `modules/config/manifests/base.pp` with default values.
::: manifests/moco-nodes.pp
@@ +138,4 @@
> "network::aws": stage => network,
> }
> include toplevel::slave::build::mock
> + include diamond
This is fine for now, since it only applies to Mozilla nodes, but as you broaden this out and move it to toplevel classes, you'll need to add a conditional on $config::use_diamond or something like that to control whether diamond gets installed -- otherwise QA, seamonkey, and relabs will all crash when you land the changes.
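For example, something along these lines in the toplevel class, so only environments that opt in get diamond (the flag name $config::use_diamond is taken from the suggestion above; the surrounding toplevel class is whichever one ends up carrying it):

# sketch: gate the include on a per-environment config flag
if $config::use_diamond {
    include diamond
}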
::: modules/diamond/manifests/disable.pp
@@ +9,5 @@
> + ensure => stopped,
> + enable => false,
> + hasstatus => true,
> + hasrestart => true,
> + }
trailing whitespace
Attachment #8383757 - Flags: review?(dustin) → review+
Comment 13•11 years ago
Comment on attachment 8383757 [details] [diff] [review]
[puppet] install and configure diamond package on CentOS dev-* machines to start with (updated)
http://hg.mozilla.org/build/puppet/rev/3674126ebf0e
Attachment #8383757 - Flags: checked-in+
Comment 14•11 years ago
merged to production
Comment 15•11 years ago
Comment 16•11 years ago
Attachment #8383848 - Flags: review?(dustin)
Comment 17•11 years ago
Comment on attachment 8383848 [details] [diff] [review]
[puppet] enable diamond on amazon build instances
More very-temporary-ness, I assume.
Attachment #8383848 - Flags: review?(dustin) → review+
Comment 18•11 years ago
> More very-temporary-ness, I assume.
Yes, I have your review comments in mind regarding these imports.
Comment 19•11 years ago
Comment on attachment 8383848 [details] [diff] [review]
[puppet] enable diamond on amazon build instances
Landed on default and merged to production.
http://hg.mozilla.org/build/puppet/rev/1a4f1788fdbc
https://hg.mozilla.org/build/puppet/rev/bd853e969571
Attachment #8383848 - Flags: checked-in+
Comment 20•11 years ago
Note: will require a follow-up patch to address this failure:
Fri Feb 28 12:01:40 -0800 2014 Puppet (err): Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp
Fri Feb 28 12:01:40 -0800 2014 Puppet (err): Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp
Wrapped exception:
No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717
Fri Feb 28 12:01:40 -0800 2014 /Stage[main]/Diamond::Config/File[/etc/diamond/diamond.conf]/ensure (err): change from absent to file failed: Could not set 'file' on ensure: No such file or directory - /etc/diamond/diamond.conf.puppettmp_6717 at 18:/etc/puppet/production/modules/diamond/manifests/config.pp
I've imported diamond::disable on the bld-* and try-* instances for now - to disable diamond - in http://hg.mozilla.org/build/puppet/rev/6c1cdf968dc3
Comment 21•11 years ago
Reproduced the problem on a dev host and fixed it with this patch.
Attachment #8383896 - Flags: review?(dustin)
Comment 22•11 years ago
Comment on attachment 8383896 [details] [diff] [review]
[puppet] ensure /etc/diamond/ directory is created
I'm reasonably confident that the diamond package creates /etc/diamond, so the real issue is that puppet is not installing the package before trying to configure it.
So, a few things to change:
1. Packages should only be installed by classes in the 'package' module, so please add a new class, packages::diamond, which installs diamond. Include this from the diamond class. (The more common Puppet idiom would be to create diamond::install and have *that* include packages::diamond, but that seems redundant here.)
2. Change the config file to depend on Class['packages::diamond']
That should fix the dependency problem.
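In other words, roughly this shape (a sketch of the suggested layout; the template name is illustrative):

# modules/packages/manifests/diamond.pp (sketch)
class packages::diamond {
    case $::operatingsystem {
        CentOS: {
            package {
                'diamond':
                    ensure => present;
            }
        }
        default: {
            fail("no diamond package known for $::operatingsystem")
        }
    }
}

# modules/diamond/manifests/config.pp (sketch): the config file now depends on the
# package, which is what creates /etc/diamond
class diamond::config {
    include packages::diamond
    file {
        '/etc/diamond/diamond.conf':
            require => Class['packages::diamond'],
            content => template('diamond/diamond.conf.erb');
    }
}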
Attachment #8383896 - Flags: review?(dustin) → review-
Comment 23•11 years ago
Attachment #8383896 - Attachment is obsolete: true
Attachment #8384605 - Flags: review?(dustin)
Comment 24•11 years ago
Comment on attachment 8384605 [details] [diff] [review]
[puppet] diamond config ensures diamond package installation first
Review of attachment 8384605 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good - it's up to you whether to pin or not; the disadvantage of not pinning is that if you drop a new version into the repos, everything will start updating without any corresponding puppet change.
::: modules/packages/manifests/diamond.pp
@@ +5,5 @@
> + case $::operatingsystem {
> + CentOS: {
> + package {
> + "diamond":
> + ensure => latest;
You may want to pin the version here, just to make it easier to do a controlled upgrade.
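If pinning is preferred later, it should just be a matter of swapping `latest` for an explicit version, e.g. (using the RPM version from comment 6):

package {
    "diamond":
        ensure => "3.4.266-0";
}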
Attachment #8384605 - Flags: review?(dustin) → review+
Comment 25•11 years ago
Comment on attachment 8384605 [details] [diff] [review]
[puppet] diamond config ensures diamond package installation first
Landed in http://hg.mozilla.org/build/puppet/rev/d10738e11c2f and merged to production.
Leaving 'diamond' package unpinned for now. We can always pin it prior to a future upgrade if that works out to be the better option.
Attachment #8384605 - Flags: checked-in+
Comment 26•11 years ago (Reporter)
John,
I'm sorry to ask for a change in spec late in the game. Can we set normalize=True in CPUCollector?
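That should just be one more key in the CPUCollector section of diamond.conf (sketch; as I understand it, normalize divides the aggregate CPU values by the core count so the totals stay on a 0-100 scale):

[collectors]
[[CPUCollector]]
enabled = True
percore = False
normalize = True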
Comment 27•11 years ago
Attachment #8385100 - Flags: review?(dustin)
Updated•11 years ago
Attachment #8385100 - Flags: review?(dustin) → review+
Comment 28•11 years ago
Comment on attachment 8385100 [details] [diff] [review]
[puppet] set normalize=True for CPUCollector
Landed in https://hg.mozilla.org/build/puppet/rev/9d5dcaa882d3 and merged to production.
Attachment #8385100 - Flags: checked-in+
Comment 29•11 years ago
Note: if we need tst-* machine metrics, we will have to create and deploy a Ubuntu 'diamond' package to those systems.
Comment 30•11 years ago (Reporter)
(In reply to John Hopkins (:jhopkins) from comment #29)
> Note: if we need tst-* machine metrics, we will have to create and deploy a
> Ubuntu 'diamond' package to those systems.
We do, but our tests are much lower priority than builds atm. No rush on this part.
Updated•11 years ago
Assignee: jhopkins → nobody
Comment 31•11 years ago (Reporter)
We need to change our graphite hostname to mozilla.carbon.hostedgraphite.com due to connectivity issues in us-west-2.
Can someone take care of that?
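Presumably this is a one-line change where the host is set in manifests/moco-config.pp (variable name from the earlier patch), plus whatever AWS route/firewall updates go with it:

$diamond_graphite_host = "mozilla.carbon.hostedgraphite.com"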
Comment 32•11 years ago
Do we change it across the board, or just for hosts in us-west-2?
Comment 33•11 years ago (Reporter)
(In reply to Chris AtLee [:catlee] from comment #32)
> Do we change it across the board, or just for hosts in us-west-2?
everywhere
Comment 34•11 years ago
Attachment #8418149 - Flags: review?(rail)
Comment 35•11 years ago
Attachment #8418151 - Flags: review?(rail)
Updated•11 years ago
Attachment #8418149 - Flags: review?(rail) → review+
Updated•11 years ago
Attachment #8418151 - Flags: review?(rail) → review+
Comment 36•11 years ago
Comment on attachment 8418151 [details] [diff] [review]
update AWS routes
and deployed
Attachment #8418151 - Flags: checked-in+
Comment 37•11 years ago
Comment on attachment 8418149 [details] [diff] [review]
switch to new host
and merged to production
Attachment #8418149 - Flags: checked-in+
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Updated•7 years ago (Assignee)
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard