Closed Bug 925857 Opened 11 years ago Closed 11 years ago

determine the impact on graphite6 when enabling metrics on remaining releng nodes

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: ericz)

References

Details

We are currently poised to enable collectd on all releng POSIX systems and metrics collective on all releng Windows systems. Before pulling the trigger, we should determine what the impact will be on graphite6 and whether any tuning should be done beforehand.
Some open questions:
Will graphite6 resources (CPU load, disk I/O, memory usage, etc.) be able to handle all of the new metrics together?
At the current MAX_CREATES_PER_MINUTE, how long will it take until all metrics are being stored?
Can we raise the MAX_CREATES_PER_MINUTE rate limit, based on known resource limits, to shorten the time until new metrics stop being dropped?
Should we roll out releng nodes in groups and wait for all creates to complete? Maybe monitor resources for a period after all creates have completed before rolling out a new group?
Thoughts, comments, other things to consider before we enable metrics on the remaining releng systems?
CC'ing atoll.
Assignee: server-ops → eziegenhorn
Is this the group of 1200 nodes that :arr originally estimated Releng would be turning on? Just want to get an estimate for the number of metrics we are looking at here.
Yes, same machines.
All of the caches and relays we'll be using for this are averaging under 20% CPU and using less than 500MB of memory. There are 7GB of free memory and 36GB of cache in memory, so we have plenty of headroom for these additional metrics. I'm assuming approximately 100 metrics per machine, which for 1200 machines is 120,000 new metrics. 120k new metrics divided across the 8 caches creating them is 15,000 metrics per cache. MAX_CREATES_PER_MINUTE is currently 60, so it would theoretically take 250 minutes, or a little over 4 hours, to create all the new metrics; it won't happen all at once. I've been experimenting with this today by sending 130k new metrics, and it looks like A) the impact on the system is minimal and B) the creates are happening at about half the theoretically possible rate. It still seems acceptable to me to wait that long for a huge group of new metrics. And if we roll them out in batches it will be even more manageable, so I like that idea.
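For reference, the estimate above can be reproduced with a short sketch. The function name and the `efficiency` parameter are made up for illustration, and it assumes new metrics are spread evenly across the caches; the 0.5 efficiency figure is only the rough "half the theoretical rate" observed in the test run.

```python
def creation_minutes(total_metrics, caches, max_creates_per_minute, efficiency=1.0):
    """Estimate minutes until every new metric has been created on disk.

    Assumes metrics are distributed evenly across the caches and that each
    cache creates at most max_creates_per_minute * efficiency metrics per minute.
    """
    metrics_per_cache = total_metrics / caches
    return metrics_per_cache / (max_creates_per_minute * efficiency)


nodes = 1200             # releng machines
metrics_per_node = 100   # approximate figure assumed in the comment above
total = nodes * metrics_per_node

ideal = creation_minutes(total, caches=8, max_creates_per_minute=60)
observed = creation_minutes(total, caches=8, max_creates_per_minute=60, efficiency=0.5)

print(f"theoretical: {ideal:.0f} min ({ideal / 60:.1f} h)")
print(f"at half rate: {observed:.0f} min ({observed / 60:.1f} h)")
```

At the observed half rate this works out to roughly twice the theoretical figure, which is the wait being judged acceptable here.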
Having determined that the expected impact is low, with a ballpark creation time that seems acceptable and a batched rollout planned for safety, I think we've covered this bug. Please reopen if there are any other concerns.
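A rough sketch of how a batched rollout could be sized, for anyone planning one. The batch size of 300 nodes is purely hypothetical (no batch size was decided in this bug), and the function assumes an even spread of metrics across the 8 caches and the roughly half-of-theoretical create rate seen in testing.

```python
def batch_plan(nodes, batch_size, metrics_per_node, caches,
               max_creates_per_minute, efficiency=1.0):
    """Return a list of (nodes_in_batch, estimated_create_minutes) tuples.

    Each batch's estimate assumes its new metrics are split evenly across
    the caches and that the previous batch's creates have already finished.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    plan = []
    remaining = nodes
    while remaining > 0:
        size = min(batch_size, remaining)
        per_cache = size * metrics_per_node / caches
        minutes = per_cache / (max_creates_per_minute * efficiency)
        plan.append((size, minutes))
        remaining -= size
    return plan


# Hypothetical rollout: 1200 nodes in batches of 300, ~100 metrics per node.
plan = batch_plan(1200, 300, 100, caches=8,
                  max_creates_per_minute=60, efficiency=0.5)
for i, (size, minutes) in enumerate(plan, start=1):
    print(f"batch {i}: {size} nodes, ~{minutes:.0f} min of creates")
```

The total create time is the same as a single rollout; batching only bounds how many pending creates are queued at once and leaves room to watch graphite6 between groups.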
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard