philor suggested a bunch of useful metrics he watches in https://bugzilla.mozilla.org/show_bug.cgi?id=1304158#c4

> The reasons I'm able to alert on them when nagios fails are:
>
> * I don't alert on total pending, I alert on the ratio of pending to the
> size of the pool
>
> * for pools with any pending at all, I alert on the ratio of the size of the
> pool to the number of slaves that have done a job in the last 4 hours
>
> * not my strongest alert, but I do alert on a single pool having a wildly
> different pending::pool ratio than the rest
>
> * by far my most useful, I have separate backlog age alerts for Try+fuzzer
> and non-Try

I think we should be able to automate these.
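The first two ratio alerts above can be sketched roughly as below. This is a minimal illustration, not the real nagios logic: the thresholds, the function name `check_pool`, and the data shape (per-pool pending count, pool size, and count of slaves active in the last 4 hours) are all assumptions.

```python
# Hypothetical thresholds, chosen only for illustration.
PENDING_RATIO_WARN = 0.5   # warn when pending exceeds 50% of the pool size
ACTIVE_RATIO_WARN = 2.0    # warn when the pool is >2x the recently-active slaves

def check_pool(name, pending, pool_size, active_last_4h):
    """Return a list of alert strings for a single slave pool."""
    alerts = []
    if pool_size == 0:
        return ["%s: empty pool" % name]
    # Alert on pending relative to pool size, not on raw pending.
    if pending / pool_size > PENDING_RATIO_WARN:
        alerts.append("%s: pending/pool ratio %.2f" % (name, pending / pool_size))
    # For pools with any pending at all, alert when only a small fraction
    # of the pool has done a job recently.
    if pending > 0 and active_last_4h > 0 and pool_size / active_last_4h > ACTIVE_RATIO_WARN:
        alerts.append("%s: pool/active ratio %.2f" % (name, pool_size / active_last_4h))
    return alerts
```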
After investigating the existing alerts and the new alerts we want to implement, I found that we may run into some problems:

- When we alert on the ratio of pending to the size of the pool, large pools are a problem. For example, tst-linux64-spot lists 2600 machines, but that is the maximum size of the pool; in most cases only some of those machines are active, so the ratio will not give an accurate picture of the pool.
- The alerts check https://secure.pub.build.mozilla.org/builddata/reports/allthethings.json, where pools are distinguished by a slavepool id, and some pools are formed from one or more platforms (for example ['av-linux64-ec2', 'av-linux64-spot', 'b-2008-ec2', 'b-2008-spot', 'bld-linux64-ec2', 'bld-linux64-spot', 'bld-lion-r5']). If such a pool has pending jobs, the pending-to-pool-size alert will not show where exactly the problem is, or we may not get an alert at all because the ratio stays under the warning threshold.

So we need to analyze which alerts should be kept and which should be changed.
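The first problem above can be made concrete with a small sketch: computing the pending ratio against the maximum pool size from allthethings.json versus against only the currently-active machines gives very different pictures. The numbers here are illustrative assumptions, not real queue data.

```python
def pending_ratio(pending, pool_size):
    """Pending jobs relative to pool size; an empty pool is treated as saturated."""
    return pending / pool_size if pool_size else float("inf")

pending = 200
max_size = 2600   # e.g. every tst-linux64-spot slave listed in allthethings.json
active = 300      # hypothetical count of instances actually spun up

ratio_vs_max = pending_ratio(pending, max_size)     # ~0.08, looks healthy
ratio_vs_active = pending_ratio(pending, active)    # ~0.67, a real backlog
```

An alert thresholded on `ratio_vs_max` would stay quiet in this scenario even though two thirds of the active capacity is backlogged, which is why counting only active machines matters for spot pools.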
I think we should start with the fixed-size hardware pools, and maybe not even worry about the others? That's where our bottleneck is most often going to be. For any platform that can expand into AWS, I'm not worried about capacity, provided we get notified when that capacity is exhausted or we are unable to spin up more (e.g. pricing issues).
Looking at the past months, I'd say the current BB infra is more than enough to handle the load, and we should be safe with the alerts we already have in place. Given that most of our automation now runs in TC, I think these extra metrics are no longer needed.
Status: NEW → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → WONTFIX