Closed
Bug 920626
Opened 11 years ago
Closed 11 years ago
deploy collectd to releng POSIX test systems
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: dividehex)
References
Details
We have already deployed the collectd statistics-gathering software to the Linux and OS X build machines and to most Linux servers. We need to deploy the same software to the test machines.
These systems include:
tst-linux64*
tst-linux32*
talos-linux64*
talos-linux32*
talos-r4-snow*
talos-r4-lion*
talos-mtnlion-r5*
Assignee
Comment 1•11 years ago
It looks like we will be clear to start rolling out collectd to the testers very soon. After notifications go out and we have the green light, we will roll out in batches so that we can monitor the ramp-up of incoming metrics to graphite6 and see when the wsp db creates complete for each batch. We will start with the POSIX list and then move to the Windows systems in bug 920629.
The order will be as follows:
Batch 1
talos-linux64*
talos-linux32*
Batch 2
talos-r4-snow*
Batch 3
talos-r4-lion*
Batch 4
talos-mtnlion-r5*
Batch 5
tst-linux64*
tst-linux32*
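As an illustration only, the batch order above can be written down as a small lookup that maps a hostname to its batch. This is a sketch, not the mechanism used for the rollout (which went through puppet changesets); the `BATCHES` table, the `batch_for` helper, and the example hostnames are hypothetical.

```python
from fnmatch import fnmatchcase

# Batch order as listed above; the data structure itself is an assumption.
BATCHES = [
    (1, ["talos-linux64*", "talos-linux32*"]),
    (2, ["talos-r4-snow*"]),
    (3, ["talos-r4-lion*"]),
    (4, ["talos-mtnlion-r5*"]),
    (5, ["tst-linux64*", "tst-linux32*"]),
]


def batch_for(hostname):
    """Return the batch number whose glob matches hostname, or None."""
    for number, patterns in BATCHES:
        if any(fnmatchcase(hostname, pattern) for pattern in patterns):
            return number
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, used only to exercise the lookup.
    for host in ["talos-r4-snow-001", "tst-linux64-001", "t-xp32-001"]:
        print(host, batch_for(host))
```

Hosts that match no pattern return `None`, which makes it easy to spot machines that are not covered by any batch.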
Assignee
Comment 2•11 years ago
We will be tracking deployment in this etherpad - https://etherpad.mozilla.org/HpKyc03bLk
Assignee
Comment 3•11 years ago
Batch 1 has been pushed.
talos-linux64*
talos-linux32*
Changeset: https://hg.mozilla.org/build/puppet/rev/e06ab0edb563
Commit Time: Wed Nov 06 08:06:19 2013 -0800
Assignee
Comment 4•11 years ago
Batches 2 and 3 have been deployed, but DB creates are taking longer than expected. We will continue the rollout tomorrow, 11/7, at the same time (8am).
Comment 5•11 years ago
Some questions on this slip:
1. how many batches are there? The etherpad (in comment 2) lists 9, but comment 1 only lists 5.
2. what does this unexpected database load imply about the ability of the metrics system to handle the load of this many systems? Do we need to re-evaluate the load this many new machines will produce, as mentioned in comment 1 text?
3. if the central database can not keep up with the load, what is the impact on the monitored machines? Do the monitored machines queue the data, or is it dropped on the floor by the server?
Flags: needinfo?(jwatkins)
Comment 6•11 years ago
(In reply to Jake Watkins [:dividehex] from comment #4)
> Batch 2 and 3 have been deployed but DB creates are taking longer than
> expected. We will continue the rollout tomorrow 11/7 at the same time (8am)
This rollout has gone beyond the projected window and is now running into times when releases are scheduled. As mentioned in email, please recheck with buildduty each morning on whether it is still clear to proceed. (If you've only done 3 of 9 batches, it sounds like you could have another 2 days of work.)
Corey - needinfo'ing you so you're in the loop.
Flags: needinfo?(cshields)
Assignee
Comment 7•11 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #5)
> Some questions on this slip:
>
> 1. how many batches are there? The etherpad (in comment 2) lists 9, but
> comment 1 only lists 5.
You can ignore the list in comment 1; it was refactored early on. The etherpad (linked in comment 2) is the most current working list. I'll post the final list here once the work is complete. Sorry for the confusion.
> 2. what does this unexpected database load imply about the ability of the
> metrics system to handle the load of this many systems? Do we need to
> re-evaluate the load this many new machines will produce, as mentioned in
> comment 1 text?
The unexpected database load relates only to the initial creation of the fixed-size database files, one for each metric collected. The carbon cache daemons rate limit the creation of new db files to a fixed number per minute, and any new metrics queued for creation beyond that limit are dropped. When those metrics come in again, they are queued as new and go through the same rate limit, until all of the new db files exist. So the "unexpected load" is actually a self-imposed governor that limits disk I/O that would otherwise saturate the disks, and it is only in effect during the initial deployment of collectd. There is no need to re-evaluate the load of the new machines. The system was already stress tested and showed it could handle the expected number of new metrics.
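To make the effect of the create limit concrete, here is a minimal simulation of the behaviour described above. It is a sketch under stated assumptions, not carbon's implementation: it assumes every metric that still lacks a db file reports again each minute, and the numbers (1000 new metrics, 300 creates per minute) are hypothetical. In carbon.conf this limit is the `MAX_CREATES_PER_MINUTE` setting; the value used on graphite6 is not stated in this bug.

```python
def simulate_creates(total_new_metrics, max_creates_per_minute):
    """Simulate rate-limited db file creation, one step per minute.

    Returns a list of (minute, created, dropped) tuples. Metrics that
    arrive after the per-minute limit is used up are dropped and are
    assumed to show up again in the next minute.
    """
    if max_creates_per_minute <= 0:
        raise ValueError("max_creates_per_minute must be positive")

    existing = 0  # metrics that already have a db file
    minute = 0
    history = []
    while existing < total_new_metrics:
        pending = total_new_metrics - existing
        created = min(pending, max_creates_per_minute)
        dropped = pending - created  # retried on the next pass
        existing += created
        minute += 1
        history.append((minute, created, dropped))
    return history


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the ramp-up.
    for minute, created, dropped in simulate_creates(1000, 300):
        print(f"minute {minute}: created {created}, dropped {dropped}")
```

The time for a batch to settle grows with the number of new metrics divided by the create limit, which is why larger batches take longer to finish their creates without indicating a capacity problem in steady state.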
> 3. if the central database can not keep up with the load, what is the impact
> on the monitored machines? Do the monitored machines queue the data, or is
> it dropped on the floor by the server?
Metrics are dropped in multiple places if the flow gets backed up, on both the client and the server side.
Flags: needinfo?(jwatkins)
Comment 8•11 years ago
Thanks for the explanations -- it all makes sense now :)
Assignee
Comment 9•11 years ago
This work was completed on 11/08/2013. The final change times are posted below.
(c&p from the etherpad)
=============================================================
Batch 1 - STATUS: deployed
talos-linux64*
talos-linux32*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/b27387ed8e93
Commit Time: 2013-11-06 11:52 -0500
Batch 2 - STATUS: deployed
talos-r4-snow*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/f2b9989c04e8
Commit Time: 2013-11-06 11:11 -0800
Batch 3 - STATUS: deployed
talos-r4-lion*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/441cb8c62518
Commit Time: 2013-11-06 12:38 -0800
Batch 4 - STATUS: deployed
talos-mtnlion-r5*
Production Changeset: http://hg.mozilla.org/build/puppet/rev/d622b5831fb4
Commit Time: 2013-11-07 08:26 -0800
Batch 5 - STATUS: deployed
t-xp32*
GPO Update Time: 2013-11-07 10:33 -0800
Batch 6 - STATUS: deployed
t-w864*
GPO Update Time: 2013-11-07 13:24 -0800
Batch 7 - STATUS: deployed
t-w732*
GPO Update Time: 2013-11-07 15:31 -0800
Batch 8 - STATUS: deployed
w64-ix*
Commit Time: 2013-11-08 14:39 -0800
Batch 9 - STATUS: deployed
tst-linux64*
tst-linux32*
Production Changeset: http://hg.mozilla.org/build/puppet/rev/ce98e94579f0
Commit Time: 2013-11-07 17:23 -0800
==========================================================
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Flags: needinfo?(cshields)