Closed Bug 920626 Opened 11 years ago Closed 11 years ago

deploy collectd to releng POSIX test systems

Categories

Product: Infrastructure & Operations
Component: RelOps: Puppet
Type: task
Platform: x86
OS: macOS
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dividehex)

References

Details

We have already deployed the collectd statistics-gathering software to the Linux and OS X build machines and to most Linux servers. We need to deploy the same software to the test machines. These systems include:

tst-linux64*
tst-linux32*
talos-linux64*
talos-linux32*
talos-r4-snow*
talos-r4-lion*
talos-mtnlion-r5*
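Since the rollout is driven from the build/puppet repo, the change essentially amounts to scoping the collectd configuration onto the test node patterns above. A rough sketch only; the node regex and the bare "collectd" class name are illustrative assumptions, not the actual layout of the build/puppet tree:

  # Hypothetical sketch: include collectd on the POSIX test pools.
  # The regex and the class name are assumptions for illustration only.
  node /^(tst-linux(32|64)|talos-linux(32|64)|talos-r4-(snow|lion)|talos-mtnlion-r5)/ {
    include collectd
  }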
Depends on: 921783
Depends on: 925857
It looks like we are going to be clear to start rolling out collectd to the testers very soon. After notifications go out and we have the green light, we will roll out in batches so we can monitor the ramp-up of incoming metrics to graphite6 and watch for the whisper (wsp) db creates to complete for each batch. We will start with the POSIX list and then move to the Windows systems in bug 920629. The order will be as follows:

Batch 1: talos-linux64* talos-linux32*
Batch 2: talos-r4-snow*
Batch 3: talos-r4-lion*
Batch 4: talos-mtnlion-r5*
Batch 5: tst-linux64* tst-linux32*
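One simple way to tell when a batch's whisper databases have finished being created on graphite6 is to count the .wsp files under the storage tree and wait for the number to stop growing. A sketch only, assuming Graphite's default storage path; the actual path on graphite6 may differ:

  # Assumes Graphite's default whisper location; adjust for graphite6's layout.
  # Re-run until the count stops increasing for the batch's hosts.
  find /opt/graphite/storage/whisper -name '*.wsp' | wc -l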
We will be tracking deployment in this etherpad - https://etherpad.mozilla.org/HpKyc03bLk
Batch 1 has been pushed.

talos-linux64* talos-linux32*
Changeset: https://hg.mozilla.org/build/puppet/rev/e06ab0edb563
Commit Time: Wed Nov 06 08:06:19 2013 -0800
Batches 2 and 3 have been deployed, but the DB creates are taking longer than expected. We will continue the rollout tomorrow, 11/7, at the same time (8am).
Some questions on this slip:

1. How many batches are there? The etherpad (in comment 2) lists 9, but comment 1 only lists 5.

2. What does this unexpected database load imply about the ability of the metrics system to handle the load of this many systems? Do we need to re-evaluate the load this many new machines will produce, as mentioned in the comment 1 text?

3. If the central database cannot keep up with the load, what is the impact on the monitored machines? Do the monitored machines queue the data, or is it dropped on the floor by the server?
Flags: needinfo?(jwatkins)
(In reply to Jake Watkins [:dividehex] from comment #4)
> Batches 2 and 3 have been deployed, but the DB creates are taking longer
> than expected. We will continue the rollout tomorrow, 11/7, at the same
> time (8am)

This rollout has gone beyond the projected window, and we are now into times when releases are scheduled. As mentioned in email, please recheck with buildduty each morning on whether it is still clear to proceed. (If you've only done 3 of 9 batches, it sounds like you could have another 2 days of work.)

Corey - needinfo'ing you so you're in the loop.
Flags: needinfo?(cshields)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #5)

> 1. How many batches are there? The etherpad (in comment 2) lists 9, but
> comment 1 only lists 5.

You can ignore comment 2; that list got refactored early on. The etherpad is the most current working list. When all is said and done, I'll post the final list here after the work is complete. Sorry for the confusion.

> 2. What does this unexpected database load imply about the ability of the
> metrics system to handle the load of this many systems? Do we need to
> re-evaluate the load this many new machines will produce, as mentioned in
> the comment 1 text?

The unexpected database load only relates to the initial creation of the fixed-size database files for each metric collected. The carbon-cache daemons rate-limit the creation of new db files to a fixed number per minute, and any new metrics queued for creation beyond that limit are dropped. As future metrics come in, they are queued as new and go through the same rate limit until all of the new db files exist. So the "unexpected load" is actually a self-imposed governor that limits disk I/O which would otherwise saturate the disks, and it is only in effect during the initial deployment of collectd. There is no need to re-evaluate the load of the new machines: the system was already stress tested and showed it could handle the expected number of new metrics.

> 3. If the central database cannot keep up with the load, what is the impact
> on the monitored machines? Do the monitored machines queue the data, or is
> it dropped on the floor by the server?

Metrics are dropped in multiple places if the flow gets backed up, on both the client and server side.
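For reference, the governor described above is carbon-cache's create throttling, which is configured in carbon.conf. A minimal excerpt for illustration only; the values shown are carbon's stock examples, not necessarily what graphite6 runs:

  # carbon.conf [cache] section (illustrative values only)
  [cache]
  # Cap how many new whisper (.wsp) files are created per minute;
  # creates beyond the cap wait for a later interval.
  MAX_CREATES_PER_MINUTE = 50
  # Cap updates to existing whisper files to keep disk I/O in check.
  MAX_UPDATES_PER_SECOND = 500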
Flags: needinfo?(jwatkins)
Thanks for the explanations -- it all makes sense now :)
This work was completed on 11/08/2013. The final change times are posted below (copied from the etherpad).

=============================================================
Batch 1 - STATUS: deployed
talos-linux64* talos-linux32*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/b27387ed8e93
Commit Time: 2013-11-06 11:52 -0500

Batch 2 - STATUS: deployed
talos-r4-snow*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/f2b9989c04e8
Commit Time: 2013-11-06 11:11 -0800

Batch 3 - STATUS: deployed
talos-r4-lion*
Production Changeset: https://hg.mozilla.org/build/puppet/rev/441cb8c62518
Commit Time: 2013-11-06 12:38 -0800

Batch 4 - STATUS: deployed
talos-mtnlion-r5*
Production Changeset: http://hg.mozilla.org/build/puppet/rev/d622b5831fb4
Commit Time: 2013-11-07 08:26 -0800

Batch 5 - STATUS: deployed
t-xp32*
GPO Update Time: 2013-11-07 10:33 -0800

Batch 6 - STATUS: deployed
t-w864*
GPO Update Time: 2013-11-07 13:24 -0800

Batch 7 - STATUS: deployed
t-w732*
GPO Update Time: 2013-11-07 15:31 -0800

Batch 8 - STATUS: deployed
w64-ix*
Commit Time: 2013-11-08 14:39 -0800

Batch 9 - STATUS: deployed
tst-linux64* tst-linux32*
Production Changeset: http://hg.mozilla.org/build/puppet/rev/ce98e94579f0
Commit Time: 2013-11-07 17:23 -0800
=============================================================
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Flags: needinfo?(cshields)