Closed Bug 870853 Opened 11 years ago Closed 10 years ago

move off of using ganglia to graphite/collectd

Categories: Infrastructure & Operations :: RelOps: General (task)
Hardware/OS: x86_64, Windows 7
Type: task, Priority: Not set, Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dividehex, Assigned: dividehex
Whiteboard: [2013Q4] [tracker]

Attachments (13 files, 2 obsolete files)

10.56 KB, patch (dustin: review+, dividehex: checked-in+)
4.82 KB, patch (dustin: review+, Callek: feedback+, dividehex: checked-in+)
675 bytes, patch (dustin: review+, dividehex: checked-in+)
437 bytes, patch (rail: review+, dividehex: checked-in+)
946 bytes, patch (coop: review+, dividehex: checked-in+)
324 bytes, patch (coop: review+, dividehex: checked-in+)
1.27 KB, patch (coop: review+, dividehex: checked-in+)
21.98 KB, patch (dustin: review+, dividehex: checked-in+)
683 bytes, patch (coop: review+, dividehex: checked-in+)
1.18 KB, patch (dustin: review+, dividehex: checked-in+)
41.12 KB, image/png
5.67 KB, patch (dustin: review+, dividehex: checked-in+)
2.66 KB, patch (dustin: review+, dividehex: checked-in+)
      No description provided.
Attached patch: Base collectd puppet module (obsolete)
This is the base manifest (and config templates) for the collectd module.  It's only written for CentOS right now, but will be expanded to include other OSes as we build packages >= 5.1.

For documentation, see:
https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/collectd
Attachment #762419 - Flags: review?(dustin)
+        Port "2003"
+        Prefix "test.dividehex."
+#       Postfix ""

That prefix is for debugging only and will change to:

+        Prefix "hosts.releng."
Comment on attachment 762419 [details] [diff] [review]
Base collectd puppet module

Looks good overall, just some nits:

$graphite_cluster_fqdn = "graphite1.private.scl3.mozilla.com"

I'd love to have this ="" in the base, and instead specify it in the org configs (e.g. moco/servo) since I *suspect* you won't open this to SeaMonkey.

Even if you do open it to SeaMonkey it sounds like the sort of thing where a good default is no default.

In that regard I also suspect we want to *not* install collectd if graphite_cluster is not defined/blank. No sense in collecting stuff if we're not reporting anywhere.

Lastly, +        Prefix "test.dividehex."
should be a config param as well; I expect servo would want a different prefix than moco machines. And SeaMonkey, if we have access to the same graphite instance, will certainly want/need a different Prefix.
Attachment #762419 - Flags: feedback-
Comment on attachment 762419 [details] [diff] [review]
Base collectd puppet module

Review of attachment 762419 [details] [diff] [review]:
-----------------------------------------------------------------

..plus what Callek said.

::: modules/collectd/manifests/params.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class collectd::params {
> +    include packages::collectd

Why is this include needed?

It'd probably be good to import all of the configuration parameters callek mentioned into this class, e.g.,

  $prefix = $::config::collectd_prefix

Also, not really a big deal, but we tend to call these collection-of-variables classes whatever::settings, rather than whatever::params.    It might be nice to be consistent.
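
A minimal sketch of what such a settings class might look like, assuming the `collectd::settings` name and assuming the org-level `config` class exposes the two variables below (`collectd_prefix` is taken from the example above; `collectd_graphite_cluster_fqdn` is an assumed name on the config side):

```puppet
# Sketch only, not the checked-in class.
class collectd::settings {
    include config

    # Pull org-specific values from the config class so that the base
    # module itself carries no site-specific defaults.
    $graphite_cluster_fqdn = $::config::collectd_graphite_cluster_fqdn
    $prefix                = $::config::collectd_prefix
}
```

Templates and other classes in the module could then read `$collectd::settings::prefix` rather than a hard-coded value.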

::: modules/collectd/templates/collectd.conf.erb
@@ +1,2 @@
> +#####  This file under configuration management control
> +#####  DO NOT EDIT MANUALLY

I'm not keen on these headers.  Nothing under /etc should be edited manually, whether or not under Puppet's control (in the former case, puppet will revert it; in the latter, the system will behave unexpectedly and that behavior will not persist over a reimage).
Attachment #762419 - Flags: review?(dustin) → review-
Attachment #762419 - Attachment is obsolete: true
Attachment #764306 - Flags: review?(dustin)
Attachment #764306 - Flags: feedback?(bugspam.Callek)
Comment on attachment 764306 [details] [diff] [review]
Base collectd puppet module

Review of attachment 764306 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm with that fix

::: modules/config/manifests/base.pp
@@ +34,4 @@
>      $buildbot_configs_hg_repo = "https://hg.mozilla.org/build/buildbot-configs"
>      $buildbot_configs_branch = "production"
>      $buildbot_mail_to = "nobody@mozilla.com"
> +    $collectd_graphite_cluster_fqdn = ""

$collectd_graphite_prefix should be here too
Attachment #764306 - Flags: review?(dustin) → review+
Attachment #764306 - Flags: feedback?(bugspam.Callek) → checked-in+
:callek brought up the point that if $collectd_graphite_cluster_fqdn is undefined or an empty string, the manifest should *NOT* fail() but should skip the collectd module altogether.

I'll slip this change in on the next (ubuntu) patch.
collectd 5.3.0 ubuntu packages (amd64 and i386) have been merged into the repo.  Before this was rsync'd I backed up db/ and releng/dists/precise to /home/jwatkins/apt-repo-backup/

root@relabs07:~/data/apt# rsync .  jwatkins@releng-puppet1.srv.releng.scl3.mozilla.com:/data/repos/apt/  -avn --progress
The authenticity of host 'releng-puppet1.srv.releng.scl3.mozilla.com (10.26.48.45)' can't be established.
RSA key fingerprint is c4:e1:71:61:a6:cf:61:47:a4:07:15:82:b2:a8:5e:85.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'releng-puppet1.srv.releng.scl3.mozilla.com,10.26.48.45' (RSA) to the list of known hosts.
sending incremental file list
db/
db/checksums.db
db/packages.db
db/references.db
db/release.caches.db
db/version
releng/dists/precise/
releng/dists/precise/Release
releng/dists/precise/main/binary-amd64/
releng/dists/precise/main/binary-amd64/Packages
releng/dists/precise/main/binary-amd64/Packages.bz2
releng/dists/precise/main/binary-i386/
releng/dists/precise/main/binary-i386/Packages
releng/dists/precise/main/binary-i386/Packages.bz2
releng/dists/precise/main/source/
releng/dists/precise/main/source/Sources.gz
releng/pool/main/
releng/pool/main/c/
releng/pool/main/c/collectd/
releng/pool/main/c/collectd/collectd-core_5.3.0_amd64.deb
releng/pool/main/c/collectd/collectd-core_5.3.0_i386.deb
releng/pool/main/c/collectd/collectd-dbg_5.3.0_amd64.deb
releng/pool/main/c/collectd/collectd-dbg_5.3.0_i386.deb
releng/pool/main/c/collectd/collectd-dev_5.3.0_all.deb
releng/pool/main/c/collectd/collectd-utils_5.3.0_amd64.deb
releng/pool/main/c/collectd/collectd-utils_5.3.0_i386.deb
releng/pool/main/c/collectd/collectd_5.3.0.debian.tar.gz
releng/pool/main/c/collectd/collectd_5.3.0.dsc
releng/pool/main/c/collectd/collectd_5.3.0.orig.tar.bz2
releng/pool/main/c/collectd/collectd_5.3.0_amd64.deb
releng/pool/main/c/collectd/collectd_5.3.0_i386.deb
releng/pool/main/c/collectd/libcollectdclient-dev_5.3.0_amd64.deb
releng/pool/main/c/collectd/libcollectdclient-dev_5.3.0_i386.deb
releng/pool/main/c/collectd/libcollectdclient1_5.3.0_amd64.deb
releng/pool/main/c/collectd/libcollectdclient1_5.3.0_i386.deb

sent 4484250 bytes  received 19820 bytes  191662.55 bytes/sec
total size is 117288026202  speedup is 26040.45 (DRY RUN)
This patch adds ubuntu support and now does nothing if a graphite server isn't specified
Attachment #765720 - Flags: review?(dustin)
Attachment #765720 - Flags: feedback?(bugspam.Callek)
Comment on attachment 765720 [details] [diff] [review]
Patch for collectd ubuntu support

Review of attachment 765720 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/collectd/manifests/init.pp
@@ +5,4 @@
>      include collectd::settings
>  
> +    # do not configure unless graphite server is defined
> +    if $::config::collectd_graphite_cluster_fqdn or !$::config::collectd_graphite_cluster_fqdn == "" {

s/or/and/
Attachment #765720 - Flags: feedback?(bugspam.Callek) → feedback+
Comment on attachment 765720 [details] [diff] [review]
Patch for collectd ubuntu support

Review of attachment 765720 [details] [diff] [review]:
-----------------------------------------------------------------

with callek's change
Attachment #765720 - Flags: review?(dustin) → review+
Comment on attachment 765720 [details] [diff] [review]
Patch for collectd ubuntu support

Checked in with 's/or/and'
:callek, good catch btw
Attachment #765720 - Flags: checked-in+
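
For reference, the intent of the guard can be sketched as follows. This is an illustration rather than the checked-in code: it spells the emptiness test as an explicit `!=` comparison, and the `notice()` call is a placeholder for the real resources.

```puppet
class collectd {
    include collectd::settings

    # Configure collectd only when a graphite server is defined and non-empty.
    # The first operand rejects an undefined value; the second rejects "".
    if $::config::collectd_graphite_cluster_fqdn and $::config::collectd_graphite_cluster_fqdn != "" {
        # Placeholder for the package, config file, and service resources.
        notice("collectd will report to ${::config::collectd_graphite_cluster_fqdn}")
    }
    # No else branch: without a graphite server the module does nothing.
}
```

Writing both tests explicitly avoids relying on whether an empty string counts as false, which differs between Puppet versions.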
As discussed today in the relops meeting, we can start deploying the collectd module to select servers

We'll start with the mobile imaging servers
Attachment #770632 - Flags: review?(dustin)
Attachment #770632 - Flags: review?(dustin) → review+
Comment on attachment 770632 [details] [diff] [review]
include collectd module in mobile imaging server node defs

landed
Attachment #770632 - Flags: checked-in+
Blocks: 892003
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
BTW, I've been observing this kind of message in the logs:

Jul 23 05:02:57 buildbot-master79 /usr/sbin/gmond[1074]: Error creating multicast server mcast_join=127.0.0.1 port=8649 mcast_if=NULL family='inet4'. Exiting.#012

And puppet tries to start gmond every run.
Rail: this bug isn't for ganglia (gmond), it's for collectd.

There's no statistics gathering that works in the AWS regions because there are no ganglia servers there and collectd has not been added to them.  If there are manifests that try to install/start gmond for AWS hosts, they should be disabled.  If you would like us to add collectd to all buildbot masters (including those in AWS), please let us know here.
Flags: needinfo?(rail)
Oh, one of the patches made me think that this bug is related.

Having some stats would be great though.
Flags: needinfo?(rail)
Attachment #783921 - Flags: review?(rail)
Attachment #783921 - Flags: review?(rail) → review+
Comment on attachment 783921 [details] [diff] [review]
bug870853-buildmasters.patch

pushed to buildmasters
Attachment #783921 - Flags: checked-in+
I think we can target the rest of the silos for servers that are hosted by Mozilla: foopys, puppetmasters, signing machines

Am I missing any from that list?
At least some of the signing servers are OS X, so we'd need to put some more work into getting that rolling.  More than happy to do that if we have your okay to push it out to the OS X signing servers.  I think at that point we can declare it in the toplevel server definition.

We also have the ability to push it out to linux slaves (builders, since there's no timing tests to worry about) as well if you'd like.
This adds collectd to the puppetmasters.  Once we get collectd dmgs built for all of our osx flavors, we can pull the per node includes and just slip it into the toplevel::server module
Attachment #789154 - Flags: review?(coop)
Attachment #789154 - Flags: review?(coop) → review+
Attachment #789154 - Flags: checked-in+
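
A sketch of the interim per-node include described in the patch comment above, using a hypothetical hostname pattern and assuming `toplevel::server` can be included directly as a class:

```puppet
# Hypothetical node definition, for illustration only.
node /^puppetmaster\d+\.example\.com$/ {
    include toplevel::server
    # Interim per-node include; it can be dropped once toplevel::server
    # includes collectd itself.
    include collectd
}
```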
collectd-libvirt is failing to install on the aws puppetmasters.  There are some multilib version conflicts.
These multilib version conflicts stem from version-release mismatches between the i386 and the x86_64 versions available in the repo.

For example,

[root@releng-puppet1.srv.releng.use1.mozilla.com ~]# yum info libgcrypt
Loaded plugins: security
Installed Packages
Name        : libgcrypt
Arch        : x86_64
Version     : 1.4.5
Release     : 9.el6_2.2
Size        : 524 k
Repo        : installed
From repo   : base
Summary     : A general-purpose cryptography library
URL         : http://www.gnupg.org/
License     : LGPLv2+
Description : Libgcrypt is a general purpose crypto library based on the code used
            : in GNU Privacy Guard.  This is a development version.

Available Packages
Name        : libgcrypt
Arch        : i686
Version     : 1.4.5
Release     : 9.el6
Size        : 228 k
Repo        : base
Summary     : A general-purpose cryptography library
URL         : http://www.gnupg.org/
License     : LGPLv2+
Description : Libgcrypt is a general purpose crypto library based on the code used
            : in GNU Privacy Guard.  This is a development version.
Huh - the only instance of that release is

/data/repos/yum/mirrors/centos/6/2012-03-07/updates/i386/Packages/libgcrypt-1.4.5-9.el6_2.2.i686.rpm

Looking at /var/log/yum.log, the correct version was initially installed, and then after puppet was installed a 'yum upgrade' back in May resulted in an upgrade to 9.el6_2.2, from some repository other than ours (most likely upstream).  This is a problem we've run into before, and was particularly fun on Ubuntu which uses its upstream security repo even if you tell it not to.

I suspect that a manual 'yum downgrade libgcrypt' will fix this.
I downgraded the offending packages on all 4 aws puppetmasters.

For releng-puppet2.srv.releng.usw2.mozilla.com, releng-puppet2.srv.releng.use1.mozilla.com, releng-puppet1.srv.releng.usw2.mozilla.com
These packages were downgraded:
zlib kexec-tools python python-libs libtasn1 libgcrypt gnutls cyrus-sasl-lib cyrus-sasl-plain cyrus-sasl db4 db4-utils

For releng-puppet1.srv.releng.use1.mozilla.com these packages were downgraded:
zlib kexec-tools python python-libs libtasn1 libgcrypt gnutls

[jwatkins@releng-puppet2.srv.releng.usw2.mozilla.com ~]$ sudo yum downgrade zlib kexec-tools python python-libs libtasn1 libgcrypt gnutls cyrus-sasl-lib cyrus-sasl-plain cyrus-sasl db4 db4-utils
Loaded plugins: security
Setting up Downgrade Process
Resolving Dependencies
--> Running transaction check
---> Package cyrus-sasl.x86_64 0:2.1.23-13.el6 will be a downgrade
---> Package cyrus-sasl.x86_64 0:2.1.23-13.el6_3.1 will be erased
---> Package cyrus-sasl-lib.x86_64 0:2.1.23-13.el6 will be a downgrade
---> Package cyrus-sasl-lib.x86_64 0:2.1.23-13.el6_3.1 will be erased
---> Package cyrus-sasl-plain.x86_64 0:2.1.23-13.el6 will be a downgrade
---> Package cyrus-sasl-plain.x86_64 0:2.1.23-13.el6_3.1 will be erased
---> Package db4.x86_64 0:4.7.25-16.el6 will be a downgrade
---> Package db4.x86_64 0:4.7.25-17.el6 will be erased
---> Package db4-utils.x86_64 0:4.7.25-16.el6 will be a downgrade
---> Package db4-utils.x86_64 0:4.7.25-17.el6 will be erased
---> Package gnutls.x86_64 0:2.8.5-4.el6 will be a downgrade
---> Package gnutls.x86_64 0:2.8.5-10.el6_4.1 will be erased
---> Package kexec-tools.x86_64 0:2.0.0-209.el6 will be a downgrade
---> Package kexec-tools.x86_64 0:2.0.0-258.el6 will be erased
---> Package libgcrypt.x86_64 0:1.4.5-9.el6 will be a downgrade
---> Package libgcrypt.x86_64 0:1.4.5-9.el6_2.2 will be erased
---> Package libtasn1.x86_64 0:2.3-3.el6 will be a downgrade
---> Package libtasn1.x86_64 0:2.3-3.el6_2.1 will be erased
---> Package python.x86_64 0:2.6.6-29.el6 will be a downgrade
---> Package python.x86_64 0:2.6.6-36.el6 will be erased
---> Package python-libs.x86_64 0:2.6.6-29.el6 will be a downgrade
---> Package python-libs.x86_64 0:2.6.6-36.el6 will be erased
---> Package zlib.x86_64 0:1.2.3-27.el6 will be a downgrade
---> Package zlib.x86_64 0:1.2.3-29.el6 will be erased
--> Finished Dependency Resolution

Dependencies Resolved

===========================================================================================================================
 Package                            Arch                     Version                          Repository              Size
===========================================================================================================================
Downgrading:
 cyrus-sasl                         x86_64                   2.1.23-13.el6                    base                    78 k
 cyrus-sasl-lib                     x86_64                   2.1.23-13.el6                    base                   136 k
 cyrus-sasl-plain                   x86_64                   2.1.23-13.el6                    base                    30 k
 db4                                x86_64                   4.7.25-16.el6                    base                   565 k
 db4-utils                          x86_64                   4.7.25-16.el6                    base                   130 k
 gnutls                             x86_64                   2.8.5-4.el6                      base                   343 k
 kexec-tools                        x86_64                   2.0.0-209.el6                    base                   255 k
 libgcrypt                          x86_64                   1.4.5-9.el6                      base                   228 k
 libtasn1                           x86_64                   2.3-3.el6                        base                   238 k
 python                             x86_64                   2.6.6-29.el6                     base                   4.8 M
 python-libs                        x86_64                   2.6.6-29.el6                     base                   621 k
 zlib                               x86_64                   1.2.3-27.el6                     base                    72 k

Transaction Summary
===========================================================================================================================
Downgrade    12 Package(s)

Total download size: 7.4 M
Is this ok [y/N]: y
Downloading Packages:
(1/12): cyrus-sasl-2.1.23-13.el6.x86_64.rpm                                                         |  78 kB     00:00
(2/12): cyrus-sasl-lib-2.1.23-13.el6.x86_64.rpm                                                     | 136 kB     00:00
(3/12): cyrus-sasl-plain-2.1.23-13.el6.x86_64.rpm                                                   |  30 kB     00:00
(4/12): db4-4.7.25-16.el6.x86_64.rpm                                                                | 565 kB     00:00
(5/12): db4-utils-4.7.25-16.el6.x86_64.rpm                                                          | 130 kB     00:00
(6/12): gnutls-2.8.5-4.el6.x86_64.rpm                                                               | 343 kB     00:00
(7/12): kexec-tools-2.0.0-209.el6.x86_64.rpm                                                        | 255 kB     00:00
(8/12): libgcrypt-1.4.5-9.el6.x86_64.rpm                                                            | 228 kB     00:00
(9/12): libtasn1-2.3-3.el6.x86_64.rpm                                                               | 238 kB     00:00
(10/12): python-2.6.6-29.el6.x86_64.rpm                                                             | 4.8 MB     00:00
(11/12): python-libs-2.6.6-29.el6.x86_64.rpm                                                        | 621 kB     00:00
(12/12): zlib-1.2.3-27.el6.x86_64.rpm                                                               |  72 kB     00:00
---------------------------------------------------------------------------------------------------------------------------
Total                                                                                       32 MB/s | 7.4 MB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : db4-4.7.25-16.el6.x86_64                                                                               1/24
  Installing : zlib-1.2.3-27.el6.x86_64                                                                               2/24
  Installing : cyrus-sasl-lib-2.1.23-13.el6.x86_64                                                                    3/24
  Installing : python-libs-2.6.6-29.el6.x86_64                                                                        4/24
  Installing : python-2.6.6-29.el6.x86_64                                                                             5/24
  Installing : libgcrypt-1.4.5-9.el6.x86_64                                                                           6/24
  Installing : libtasn1-2.3-3.el6.x86_64                                                                              7/24
  Installing : gnutls-2.8.5-4.el6.x86_64                                                                              8/24
  Installing : cyrus-sasl-plain-2.1.23-13.el6.x86_64                                                                  9/24
  Installing : cyrus-sasl-2.1.23-13.el6.x86_64                                                                       10/24
  Installing : kexec-tools-2.0.0-209.el6.x86_64                                                                      11/24
  Installing : db4-utils-4.7.25-16.el6.x86_64                                                                        12/24
  Cleanup    : python-libs-2.6.6-36.el6.x86_64                                                                       13/24
  Cleanup    : python-2.6.6-36.el6.x86_64                                                                            14/24
  Cleanup    : gnutls-2.8.5-10.el6_4.1.x86_64                                                                        15/24
  Cleanup    : cyrus-sasl-2.1.23-13.el6_3.1.x86_64                                                                   16/24
  Cleanup    : kexec-tools-2.0.0-258.el6.x86_64                                                                      17/24
  Cleanup    : cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64                                                             18/24
  Cleanup    : cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64                                                               19/24
  Cleanup    : db4-utils-4.7.25-17.el6.x86_64                                                                        20/24
  Cleanup    : db4-4.7.25-17.el6.x86_64                                                                              21/24
  Cleanup    : zlib-1.2.3-29.el6.x86_64                                                                              22/24
  Cleanup    : libgcrypt-1.4.5-9.el6_2.2.x86_64                                                                      23/24
  Cleanup    : libtasn1-2.3-3.el6_2.1.x86_64                                                                         24/24
  Verifying  : db4-utils-4.7.25-16.el6.x86_64                                                                         1/24
  Verifying  : cyrus-sasl-plain-2.1.23-13.el6.x86_64                                                                  2/24
  Verifying  : zlib-1.2.3-27.el6.x86_64                                                                               3/24
  Verifying  : cyrus-sasl-2.1.23-13.el6.x86_64                                                                        4/24
  Verifying  : kexec-tools-2.0.0-209.el6.x86_64                                                                       5/24
  Verifying  : libtasn1-2.3-3.el6.x86_64                                                                              6/24
  Verifying  : db4-4.7.25-16.el6.x86_64                                                                               7/24
  Verifying  : gnutls-2.8.5-4.el6.x86_64                                                                              8/24
  Verifying  : libgcrypt-1.4.5-9.el6.x86_64                                                                           9/24
  Verifying  : python-2.6.6-29.el6.x86_64                                                                            10/24
  Verifying  : python-libs-2.6.6-29.el6.x86_64                                                                       11/24
  Verifying  : cyrus-sasl-lib-2.1.23-13.el6.x86_64                                                                   12/24
  Verifying  : cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64                                                               13/24
  Verifying  : libgcrypt-1.4.5-9.el6_2.2.x86_64                                                                      14/24
  Verifying  : db4-4.7.25-17.el6.x86_64                                                                              15/24
  Verifying  : python-2.6.6-36.el6.x86_64                                                                            16/24
  Verifying  : zlib-1.2.3-29.el6.x86_64                                                                              17/24
  Verifying  : kexec-tools-2.0.0-258.el6.x86_64                                                                      18/24
  Verifying  : python-libs-2.6.6-36.el6.x86_64                                                                       19/24
  Verifying  : libtasn1-2.3-3.el6_2.1.x86_64                                                                         20/24
  Verifying  : gnutls-2.8.5-10.el6_4.1.x86_64                                                                        21/24
  Verifying  : cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64                                                             22/24
  Verifying  : db4-utils-4.7.25-17.el6.x86_64                                                                        23/24
  Verifying  : cyrus-sasl-2.1.23-13.el6_3.1.x86_64                                                                   24/24

Removed:
  cyrus-sasl.x86_64 0:2.1.23-13.el6_3.1                         cyrus-sasl-lib.x86_64 0:2.1.23-13.el6_3.1
  cyrus-sasl-plain.x86_64 0:2.1.23-13.el6_3.1                   db4.x86_64 0:4.7.25-17.el6
  db4-utils.x86_64 0:4.7.25-17.el6                              gnutls.x86_64 0:2.8.5-10.el6_4.1
  kexec-tools.x86_64 0:2.0.0-258.el6                            libgcrypt.x86_64 0:1.4.5-9.el6_2.2
  libtasn1.x86_64 0:2.3-3.el6_2.1                               python.x86_64 0:2.6.6-36.el6
  python-libs.x86_64 0:2.6.6-36.el6                             zlib.x86_64 0:1.2.3-29.el6

Installed:
  cyrus-sasl.x86_64 0:2.1.23-13.el6     cyrus-sasl-lib.x86_64 0:2.1.23-13.el6    cyrus-sasl-plain.x86_64 0:2.1.23-13.el6
  db4.x86_64 0:4.7.25-16.el6            db4-utils.x86_64 0:4.7.25-16.el6         gnutls.x86_64 0:2.8.5-4.el6
  kexec-tools.x86_64 0:2.0.0-209.el6    libgcrypt.x86_64 0:1.4.5-9.el6           libtasn1.x86_64 0:2.3-3.el6
  python.x86_64 0:2.6.6-29.el6          python-libs.x86_64 0:2.6.6-29.el6        zlib.x86_64 0:1.2.3-27.el6

Complete!
Deploys collectd to signing[456].srv.releng.scl3.mozilla.com
Attachment #790822 - Flags: review?(coop)
Attachment #790822 - Flags: review?(coop) → review+
Attachment #790822 - Flags: checked-in+
Jake: can you please push collectd out to the linux builders (not testers)?  This probably requires some coordination with rail for AWS stuff to make sure everything works.  Let's try this out on a few nodes in each datacenter first.
Assignee: jwatkins → arich
Status: NEW → ASSIGNED
Assignee: arich → jwatkins
:coop, can you update this bug with the linux builders I can use (to test collectd) when you get a chance?  I will need at least an IX linux builder and an HP builder.  If you can also get me an aws node, that would be great too!
Flags: needinfo?(coop)
Still waiting for builds to finish on both boxes, but I've set aside the following machines for you:

bld-centos6-hp-006
bld-linux64-ix-027

I'll comment again when the builds are done.
Flags: needinfo?(coop)
bld-centos6-hp-006 is ready now, but bld-linux64-ix-027 is still building.

Jake: how do you want me to test/verify once collectd is installed?
(In reply to Chris Cooper [:coop] from comment #31)
> bld-centos6-hp-006 is ready now, but bld-linux64-ix-027 is still building.
> 
> Jake: how do you want me to test/verify once collectd is installed?

Honestly, I'm not sure there is anything to test on your end.  I'm just going to make sure collectd installs properly and ensure we don't run into any problems like we had with multilib discrepancies seen in the aws puppetmasters.
collectd is installed and running on bld-centos6-hp-006 without issue.
(In reply to Chris Cooper [:coop] from comment #30)
> bld-linux64-ix-027

This one is ready now too.
bld-linux64-ix-027 is also installed and running without issue.
To move forward here, we really need to test the deployment on a linux builder in aws.  :Rail (or :Coop), could one of you lend me one of these nodes from each aws DC (use1 & usw2)? Thanks
Flags: needinfo?(rail)
Flags: needinfo?(coop)
(In reply to Jake Watkins [:dividehex] from comment #36)
> To move forward here, we really need to test the deployment on a linux
> builder in aws.  :Rail (or :Coop), could one of you lend me one of these
> nodes from each aws DC (use1 & usw2)? Thanks

I will grab you two slaves.

Also, both bld-centos6-hp-006 and bld-linux64-ix-027 are running normally and builds (and build times) don't seem to be impacted.
Flags: needinfo?(rail)
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #37)
> (In reply to Jake Watkins [:dividehex] from comment #36)
> > To move forward here, we really need to test the deployment on a linux
> > builder in aws.  :Rail (or :Coop), could one of you lend me one of these
> > nodes from each aws DC (use1 & usw2)? Thanks
> 
> I will grab you two slaves.

bld-linux64-ec2-199.build.releng.use1.mozilla.com = 10.134.53.219
bld-linux64-ec2-300.build.releng.usw2.mozilla.com = 10.132.54.25

300 is available now, 199 is still building (ETA 1h).
(In reply to Chris Cooper [:coop] from comment #38) 
> 300 is available now, 199 is still building (ETA 1h).

bld-linux64-ec2-199 is ready now too.
:coop, Thanks! I have installed collectd on both ec2 nodes without any issues.
:coop, if you are ready, I can push this into production for linux bld slaves
Attachment #802559 - Flags: review?(coop)
Attachment #802559 - Flags: review?(coop) → review+
Attachment #802559 - Flags: checked-in+
I also landed collectd::disable in case we need to back out and disable collectd at any point.
Whiteboard: [2013Q4] [tracker]
Depends on: 917082
* Adds darwin support for collectd
* splits out the base plugins from common.conf
* common plugins are included based on os profile
* Adds modules 'logfile' and 'csv' primarily for debugging
* unixsock plugin refactored
* syslog loglevel can be adjusted
Attachment #807518 - Flags: review?(dustin)
Forgot to mention collectd-dmg.sh is also included

Collectd packages are built for 10.6 through 10.9 and have been tested on all but 10.9.  The dmgs are already in the public puppetagain dmg repo.
It looks like the 'Disk' module isn't working on any version of OSX.  It doesn't spit out any errors, but it also doesn't output any data.  This might be a bug in the module, possibly related to https://github.com/collectd/collectd/issues/245
I recompiled collectd with debugging enabled and got a little more useful info out of the logs.  This just reinforces my belief that it is a bug in the module itself.

[2013-09-19 19:02:03] [debug] plugin_read_thread: Handling `disk'.
[2013-09-19 19:02:03] [debug] disk plugin: CFDictionaryGetValue(kIOBSDNameKey) failed.
[2013-09-19 19:02:03] [debug] disk plugin: CFDictionaryGetValue(kIOBSDNameKey) failed.
[2013-09-19 19:02:03] [debug] IORegistryEntryGetChildEntry (disk) failed: 0xe00002c0
[2013-09-19 19:02:03] [debug] disk plugin: CFDictionaryGetValue(kIOBSDNameKey) failed.
[2013-09-19 19:02:03] [debug] plugin_read_thread: Effective interval of the disk plugin is 10.000 seconds.
[2013-09-19 19:02:03] [debug] plugin_read_thread: Next read of the disk plugin at 1379642533.175.
[2013-09-19 19:02:03] [debug] pid = 13; name = diskarbitrationd;
[2013-09-19 19:02:03] [debug] pid = 3615; name = diskimages-helpe;
Depends on: 918988
Comment on attachment 807518 [details] [diff] [review]
bug870853-darwin.patch

Review of attachment 807518 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great - just a few syntactic things, plus some trailing whitespace.

::: modules/collectd/manifests/plugins/csv.pp
@@ +10,5 @@
> +
> +    $plugin_name = 'csv'
> +
> +    case $::operatingsystem {
> +        /(CentOS|Ubuntu)/: {

You can do this with just a comma, too:
    CentOS, Ubuntu: {
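
For illustration, a fuller sketch of a comma-separated case list follows. The branch bodies and the `Darwin` and `default` branches are assumptions added for the example; they are not the contents of csv.pp. The values are quoted here as a style choice; the bare-word form shown above is what the review suggests.

```puppet
# Sketch only: branch bodies are placeholders, not the real csv.pp logic.
case $::operatingsystem {
    'CentOS', 'Ubuntu': {
        # One branch covers both Linux distributions.
        notice("collectd csv plugin: using the Linux branch")
    }
    'Darwin': {
        notice("collectd csv plugin: using the Darwin branch")
    }
    default: {
        # Fail loudly rather than silently skipping an unknown OS.
        fail("collectd csv plugin: unsupported operatingsystem ${::operatingsystem}")
    }
}
```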

::: modules/collectd/manifests/util.pp
@@ +4,5 @@
> +class collectd::util {
> +    include collectd
> +    include collectd::settings
> +
> +    define config_gen ($arg_array) {

Hm, I didn't even know this syntax worked.  It's certainly unusual - this would ordinarily be in `modules/collectd/manifests/util/config_gen.pp`.  Is there any strong reason to put it here?
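
As a rough sketch of the conventional layout, the define would live in its own file under its fully qualified name. The body below is a placeholder, since the real logic of `config_gen` is not shown in this review.

```puppet
# modules/collectd/manifests/util/config_gen.pp
# Sketch only: the body is a placeholder for whatever config_gen actually does.
define collectd::util::config_gen ($arg_array) {
    include collectd
    include collectd::settings

    # ... build the plugin configuration from $arg_array here ...
}
```

A define nested inside `class collectd::util` should already be addressable as `collectd::util::config_gen`, so callers would most likely not need to change if it were moved.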
Attachment #807518 - Flags: review?(dustin) → review+
Comment on attachment 807518 [details] [diff] [review]
bug870853-darwin.patch

Checked-in with recommended changes
Attachment #807518 - Flags: checked-in+
Enables collectd on OSX signing servers
Attachment #808770 - Flags: review?(coop)
Attachment #808770 - Flags: review?(coop) → review+
Attachment #808770 - Flags: checked-in+
Which OS metrics have been collected, and where can I find them?
Which systems/functions in releng are not yet instrumented for OS-level metrics?
OS metrics are available for all linux builder systems, linux servers, OS X signing servers, and (starting today) OS X builders (not test machines) being managed by puppet, including those in AWS. 

Metrics for all of the above are being stored in graphite (the same system we're using to store metrics for other IT systems): https://graphite.mozilla.org/

We are still working on a viable solution for Windows (see bug 918988) and expect to have that done in early Q4. Test machines are pending joint work with releng to make sure the software doesn't impact test numbers (or that we can at least filter the noise).
adds collectd to bld-lion and slaveapi
Attachment #809415 - Flags: review?(coop)
Attachment #809415 - Flags: review?(coop) → review?(dustin)
Attachment #809415 - Flags: review?(dustin) → review+
Attachment #809415 - Flags: checked-in+
This is an example of the data we get from graphite.  This is a puppet server which was heavily loaded in August, and became *very* heavily overloaded as September began.  The sudden dip on the right occurred when we added some CPU resources to the host this morning.
Bug 918677 comment 8 suggests some additional, higher-level host metrics that we could feed into collectd to help determine the root cause of some ongoing, difficult-to-diagnose issues.

It would also be helpful to pull data out of slavealloc, buildapi, and slaveapi, as these tools have a higher-level view of the releng automation: total number of buildslaves, number in various states, number and type of running and pending jobs, and so on.
Please add rules for collectd on b-linux64-hp* as well.  I'm installing them this week.
(In reply to Amy Rich [:arich] [:arr] from comment #55)
> Please add rules for collectd on b-linux64-hp* as well.  I'm installing them
> this week.

Collectd was included for these nodes in changeset e582c0fa64f9
Now that the collectd module supports OSX and is rolled out to all servers and build slaves, we can clean up and consolidate the includes down to toplevel::server and toplevel::slave::build
Attachment #809917 - Flags: review?(dustin)
Attachment #809917 - Flags: review?(dustin) → review+
Attachment #809917 - Flags: checked-in+
Depends on: 920626
Depends on: 920629
Depends on: 925859
Depends on: 925864
Attached patch bug870853-consolidate2.patch (obsolete) — Splinter Review
Consolidates collectd include into toplevel::slave
Attachment #832403 - Flags: review?(dustin)
Comment on attachment 832403 [details] [diff] [review]
bug870853-consolidate2.patch

We can go all the way up to toplevel::base, rather than stopping at toplevel::slave and toplevel::server.
Attachment #832403 - Flags: review?(dustin) → review-
moves include collectd to toplevel::base
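
In outline, the consolidation amounts to something like the following. This is a sketch only: the real toplevel::base contains other declarations that are not shown in this bug.

```puppet
# Sketch only: everything else in toplevel::base is omitted.
class toplevel::base {
    # Including collectd here covers every host that uses toplevel::base,
    # so the per-role includes in toplevel::server and toplevel::slave
    # are no longer needed.
    include collectd
}
```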
Attachment #832403 - Attachment is obsolete: true
Attachment #832513 - Flags: review?(dustin)
Attachment #832513 - Flags: review?(dustin) → review+
Attachment #832513 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED