Right now check_pending_builds just sums all pending jobs, which isn't a great indicator of where a problem may lie (or whether it's just normal load). After talking with catlee, we should modify this so that we can check per platform, to help us pinpoint issues quicker and ignore "normal load." He also suggested that we set thresholds differently based on time of day. My suggestion is that we pass in some base thresholds and a key that matches ... the platform type? Something we can easily correlate to platform, anyway. This way we can break it down by platform (passed into the script via multiple nagios checks) and, if we want, add some buffer to the threshold based on time of day within the script logic.
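A rough sketch of that idea — the option names, platform keys, base thresholds, and peak-hour window below are all illustrative assumptions, not anything from the eventual patch:

```python
# Hypothetical sketch: per-platform base thresholds with a time-of-day
# buffer, as suggested in the comment above. All values are made up.
from datetime import datetime

# Base thresholds that would be passed in per nagios check.
BASE_THRESHOLDS = {
    'try': {'warn': 2000, 'crit': 4000},
    'mozilla-inbound': {'warn': 500, 'crit': 1000},
}

def effective_threshold(platform, level, now=None):
    """Return the threshold for a platform, padded during peak hours.

    The 9am-5pm window and 1.5x buffer are placeholder assumptions.
    """
    now = now or datetime.now()
    base = BASE_THRESHOLDS[platform][level]
    buffer = 1.5 if 9 <= now.hour < 17 else 1.0
    return int(base * buffer)
```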
Created attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

Attached a first draft of the script.
Below you can find the output from the Python script:

C:\Python27\python.exe C:/Users/vlad.ciobancai/PycharmProjects/untitled/get_num_pending.py
OK Pending Builds: 48 for platform b2g-inbound
OK Pending Builds: 1 for platform date
OK Pending Builds: 1 for platform Unknown
OK Pending Builds: 2 for platform mozilla-aurora
WARNING Pending Builds: 3772 for platform try
OK Pending Builds: 5 for platform try-comm-central
OK Pending Builds: 2 for platform mozilla-central
OK Pending Builds: 40 for platform comm-central
OK Pending Builds: 63 for platform fx-team
OK Pending Builds: 135 for platform mozilla-inbound
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

What does the output look like when you run this? Do the thresholds need to be adjusted? My understanding is that the threshold currently covers all types of builds, so my presumption is that they would need to be set on a per-platform basis.
The output from the script is pasted above. I first wanted to make sure the approach I'm taking is OK.
Oh okay, sorry, I didn't see the earlier output; my bugmail was sorted oddly in my inbox. In any case, I think we want to break the pending counts down by platform, the way they are broken down in this report (i.e. t-w732-ix, t-w864-ix and so on): https://secure.pub.build.mozilla.org/builddata/reports/slave_health/ That is more useful than counts by branch (mozilla-central, mozilla-inbound, etc).
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

New patches are being worked on; I will review those when they are available.
Created attachment 8679426 [details]
new_check_pending_builds.py

A sample output of the script:

CRITICAL Pending Builds: 9764
Top Builds by Platform:
win32 --> 3859 builds
win64 --> 2267 builds
linux64 --> 896 builds
linux --> 668 builds
macosx64 --> 522 builds
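For context, the per-platform totals in output like the above could be produced roughly as follows — a minimal sketch, assuming the pending jobs arrive as an iterable of dicts with a 'platform' key (the actual buildapi JSON shape may differ):

```python
# Sketch: tally pending jobs per platform. The input shape (a list of
# dicts with a 'platform' key) is an assumption for illustration.
from collections import Counter

def count_by_platform(pending_jobs):
    counts = Counter()
    for job in pending_jobs:
        # Jobs without a recognizable platform are bucketed as 'Unknown',
        # matching the 'Unknown' line seen in the earlier sample output.
        counts[job.get('platform') or 'Unknown'] += 1
    return counts
```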
Attachment #8679426 - Flags: review?(kmoir)
Comment on attachment 8679426 [details]
new_check_pending_builds.py

It looks good. Should we adjust the warning thresholds to be smaller, now that they are broken out by platform?
Attachment #8679426 - Flags: review?(kmoir) → feedback+
Created attachment 8679537 [details]
new_check_pending_builds.py

I updated the script to do the following:
- return a CRITICAL alert when the number of pending builds for a given platform is greater than or equal to 2000
- return a WARNING alert when the number of pending builds for a given platform is greater than or equal to 1200
- return OK when the number of pending builds for every platform is below 1200

In all of the cases above, it also prints the top 3 platforms by pending jobs.
For example:

CRITICAL Pending Builds: 2365
Top Builds by Platform:
win32 --> 2365
win64 --> 1714
android-api-11 --> 176
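The threshold logic described above can be sketched like this — a simplified stand-in for the attached script, not its actual code, using the standard nagios exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL):

```python
# Sketch of the 2000/1200 per-platform thresholds described above.
# counts_by_platform is assumed to be a dict such as
# {'win32': 2365, 'win64': 1714, 'android-api-11': 176}.
CRIT, WARN = 2000, 1200

def classify(counts_by_platform):
    worst = max(counts_by_platform.values())
    if worst >= CRIT:
        status, code = 'CRITICAL', 2
    elif worst >= WARN:
        status, code = 'WARNING', 1
    else:
        status, code = 'OK', 0
    # Report the top 3 platforms by pending count, as in the example.
    top3 = sorted(counts_by_platform.items(),
                  key=lambda kv: kv[1], reverse=True)[:3]
    lines = ['%s Pending Builds: %d' % (status, worst),
             'Top Builds by Platform:']
    lines += ['%s --> %d' % kv for kv in top3]
    return code, '\n'.join(lines)
```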
Created attachment 8680680 [details]
new_check_pending_builds.py

Adjusted the way the results are reported a little bit. Example:

CRITICAL Pending Builds: 5583 on 'win32'
Top Builds by Platform:
win32 --> 5583
win64 --> 3471
macosx64 --> 224
Created attachment 8682587 [details] [diff] [review]
bug1204970.patch

Diff between old and new in patch form.
Created attachment 8682592 [details] [diff] [review] bug1204970.patch
Attachment #8682587 - Attachment is obsolete: true
Comment on attachment 8682592 [details] [diff] [review]
bug1204970.patch

Reverted this because some changes are needed on the nagios alerts before it can be deployed to production.
Attachment #8682592 - Flags: checked-in+ → checked-in-
Created attachment 8689523 [details]
check_pending_builds.py

Updated the script with some optimizations.
Attachment #8680680 - Attachment is obsolete: true
Created attachment 8689525 [details] [diff] [review] bug1204970.patch the new version of the patch
The bad news is that this check doesn't appear to be managed by puppet at all. I had :aselagea copy the new file in place, and running puppet didn't overwrite it. We've seen a couple of 10s NRPE timeouts so far, so I'm not sure whether this check is taking significantly longer than the old one, or whether this machine is just crufty and needs a reboot after 266 days of uptime. For the time being I've copied the old script to my personal tmp directory on cruncher in case we need to revert.
Since we're seeing lots of NRPE timeouts, I've set it to 60s in commands.pp:

check_pending_builds => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_pending_builds',
Created attachment 8711618 [details]
check_pending_builds.py

Following Callek's suggestions, I refined the script a little bit. Attached the new version.

Sample output:
OK Pending Builds: 16 on ['t-w732-ix']
Created attachment 8711622 [details] [diff] [review] bug_1204970.patch Here's the new patch.
Created attachment 8711706 [details] [diff] [review] bug_1204970_new.patch Attached the diff between the latest version and the last reviewed version.
Attachment #8711706 - Flags: review?(bugspam.Callek)
Comment on attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch

Review of attachment 8711706 [details] [diff] [review]:
-----------------------------------------------------------------

::: check_pending_builds.py
@@ +1,1 @@
> +#!/usr/local/bin/python2.7

This slightly concerns me; perhaps #!/usr/bin/env python2.7 instead? (Please verify it works first.) Was there a specific reason for this change? (If there was no specific reason, we might as well undo it.)

@@ +65,2 @@
>  duplicate = 1
>  break

I think this duplicate check can be improved for speed, but since the basic logic was already here, it's not blocking anything in terms of my review.

@@ -118,5 @@
> - "on '%s'" % sort_count_by_platform
> - print 'Top Builds by Platform:'
> - for k in range(0, len(sort_count_by_platform)):
> -     if k < 3:
> -         print sort_count_by_platform[k], '-->', sort_count_by_platform[k]

The top-builds output was useful when running manually, imo, but it is not necessary for a script used strictly as a nagios check.
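On the duplicate-check speed point: one common pattern is to replace a nested scan with a `duplicate` flag by set membership, which is O(1) per lookup. A sketch under the assumption that each job carries some hashable identifier (the 'id' field here is hypothetical, not the script's actual field name):

```python
# Sketch: dedupe jobs by a hashable key using a set, instead of an
# inner loop with a `duplicate` flag. The 'id' field is an assumption.
def dedupe(jobs):
    seen = set()
    unique = []
    for job in jobs:
        if job['id'] not in seen:
            seen.add(job['id'])
            unique.append(job)  # first occurrence wins, order preserved
    return unique
```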
Attachment #8711706 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8711622 [details] [diff] [review]
bug_1204970.patch

Clearing review, since I didn't review this one officially. But I agree we can just land the final version of the check into braindump/, assuming all lines have had review (which, aiui, they have).
Updated the script to use "#!/usr/bin/env python" (just as it was on version 1.0) and checked that it works. Pushed the new version to braindump and also replaced the existing script on cruncher with this one.
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #24)
> Updated the script to use "#!/usr/bin/env python" (just as it was on version
> 1.0) and checked that it works. Pushed the new version to braindump and also
> replaced the existing script on cruncher with this one.

Turns out that shebang change doesn't actually work as I suggested -- well, it works manually from the command line, but not from nagios (perhaps some PATH issue). Anyway, I reverted it on the live copy, not the repo copy.
What is the state of this bug? Can it be closed? I noticed in #buildduty that we have alerts per platform for tests working.
The check now uses the new version of the script.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: General Automation → General
Product: Release Engineering → Release Engineering