Closed
Bug 1204970
Opened 9 years ago
Closed 9 years ago
modify check_pending_builds to report more granularly on pending builds/tests
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: aselagea)
References
Details
Attachments
(3 files, 8 obsolete files)
4.51 KB,
text/plain
|
Details | |
5.75 KB,
patch
|
Details | Diff | Splinter Review | |
6.74 KB,
patch
|
Callek
:
review+
|
Details | Diff | Splinter Review |
Right now check_pending_builds just sums all pending jobs and isn't a great indicator of where a problem may lie (or if it's just normal load).
In talking with catlee, we should modify this so that we can check per platform to help us pinpoint issues quicker and ignore "normal load."
He also suggested that we set thresholds differently based on time of day.
My suggestion is that we pass in some base thresholds and a key that matches ... the platform type? Something we can easily correlate to platform, anyway. This way we can break it down by platform (passed into the script on multiple nagios checks) and add some buffer to the threshold based on time of day within the script logic if we want.
Comment 1•9 years ago
|
||
Attached a first draft of the script
Attachment #8670221 -
Flags: review?(kmoir)
Updated•9 years ago
|
Assignee: nobody → vlad.ciobancai
Comment 2•9 years ago
|
||
Below you can find the ouput from the python script
C:\Python27\python.exe C:/Users/vlad.ciobancai/PycharmProjects/untitled/get_num_pending.py
OK Pending Builds: 48 for platform b2g-inbound
OK Pending Builds: 1 for platform date
OK Pending Builds: 1 for platform Unknown
OK Pending Builds: 2 for platform mozilla-aurora
WARNING Pending Builds: 3772 for platform try
OK Pending Builds: 5 for platform try-comm-central
OK Pending Builds: 2 for platform mozilla-central
OK Pending Builds: 40 for platform comm-central
OK Pending Builds: 63 for platform fx-team
OK Pending Builds: 135 for platform mozilla-inbound
Comment 3•9 years ago
|
||
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch
What does the output look when you run this? Do the thresholds need to be adjusted? My understanding is that currently the threshold is for all types of builds so my presumption is that they would need to be set on a per platform basis.
Comment 4•9 years ago
|
||
The output from the script is pasted above . I wanted first to be sure if it's ok the approach that I start
Comment 5•9 years ago
|
||
Oh okay, sorry I didn't see the earlier output, my bugmail was sorted weird in my inbox. In any case,
I think we want to break the pending counts by platform like they are broken down in this report (i.e. t-w732-ix, t-w864-ix and so on)
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/
which is more useful than those by branch (mozilla-central, mozilla-inbound etc)
Comment 6•9 years ago
|
||
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch
new patches are being worked on, will review those when they are available
Attachment #8670221 -
Flags: review?(kmoir)
Assignee | ||
Comment 7•9 years ago
|
||
A sample output of the script:
CRITICAL Pending Builds: 9764
Top Builds by Platform:
win32 --> 3859 builds
win64 --> 2267 builds
linux64 --> 896 builds
linux --> 668 builds
macosx64 --> 522 builds
Attachment #8679426 -
Flags: review?(kmoir)
Comment 8•9 years ago
|
||
Comment on attachment 8679426 [details]
new_check_pending_builds.py
I looks good,should we adjust the warning thresholds to be smaller because they are broken out by platform?
Attachment #8679426 -
Flags: review?(kmoir) → feedback+
Assignee | ||
Comment 9•9 years ago
|
||
So I updated the script to do the following:
- return a CRITICAL alert when the number of pending builds for a certain platform is equal or greater than 2000
- return a WARNING alert when the number of pending builds for a certain platform is equal or greater than 1200
- return OK if the number of pending builds for a certain platform is lower than 1200
In all of the cases above, it will also print the top 3 platforms by pending jobs.
Attachment #8679426 -
Attachment is obsolete: true
Attachment #8679537 -
Flags: review?(kmoir)
Assignee | ||
Comment 10•9 years ago
|
||
For example:
CRITICAL Pending Builds: 2365
Top Builds by Platform:
win32 --> 2365
win64 --> 1714
android-api-11 --> 176
Assignee | ||
Comment 11•9 years ago
|
||
Adjusted the way the results are reported a little bit.
Example:
CRITICAL Pending Builds: 5583 on 'win32'
Top Builds by Platform:
win32 --> 5583
win64 --> 3471
macosx64 --> 224
Attachment #8679537 -
Attachment is obsolete: true
Attachment #8679537 -
Flags: review?(kmoir)
Attachment #8680680 -
Flags: review?(kmoir)
Updated•9 years ago
|
Attachment #8680680 -
Flags: review?(kmoir) → review+
Comment 12•9 years ago
|
||
diff between old and new in patch form
Comment 13•9 years ago
|
||
Attachment #8682587 -
Attachment is obsolete: true
Updated•9 years ago
|
Attachment #8682592 -
Flags: checked-in+
Comment 14•9 years ago
|
||
Comment on attachment 8682592 [details] [diff] [review]
bug1204970.patch
reverted this because some changes are needed on that nagios alerts before it can be deployed to production
Attachment #8682592 -
Flags: checked-in+ → checked-in-
Assignee | ||
Comment 15•9 years ago
|
||
Updated the script to provide some optimization.
Attachment #8680680 -
Attachment is obsolete: true
Assignee | ||
Comment 16•9 years ago
|
||
the new version of the patch
Attachment #8682592 -
Attachment is obsolete: true
Attachment #8689525 -
Flags: review?(kmoir)
Updated•9 years ago
|
Attachment #8689525 -
Flags: review?(kmoir) → review+
Updated•9 years ago
|
Assignee: vlad.ciobancai → alin.selagea
Reporter | ||
Comment 17•9 years ago
|
||
The bad news is that this check doesn't appear to be managed by puppet at all. I had :aselagea copy the new file in place, and running puppet didn't overwrite it.
We've seen a couple 10s nrpe timeouts so far, so I'm not sure if this check is taking significantly longer than the old one, or this machine is just crufty and needs a reboot after 266 days of uptime.
For the time being I've copied the old script to my personal tmp directory on cruncher if we need to revert.
Reporter | ||
Comment 18•9 years ago
|
||
Since we're seeing lots of NRPE timeouts, I've set it to 60s in commands.pp
check_pending_builds => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_pending_builds',
Assignee | ||
Comment 19•9 years ago
|
||
Following Callek's suggestions, I refined the script a little bit. Attached the new version.
Sample output:
OK Pending Builds: 16 on ['t-w732-ix']
Attachment #8670221 -
Attachment is obsolete: true
Attachment #8689523 -
Attachment is obsolete: true
Attachment #8711618 -
Flags: review?(bugspam.Callek)
Assignee | ||
Comment 20•9 years ago
|
||
Here's the new patch.
Attachment #8689525 -
Attachment is obsolete: true
Attachment #8711622 -
Flags: review?(bugspam.Callek)
Assignee | ||
Updated•9 years ago
|
Attachment #8711618 -
Flags: review?(bugspam.Callek)
Assignee | ||
Comment 21•9 years ago
|
||
Attached the diff between the latest version and the last reviewed version.
Attachment #8711706 -
Flags: review?(bugspam.Callek)
Comment 22•9 years ago
|
||
Comment on attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch
Review of attachment 8711706 [details] [diff] [review]:
-----------------------------------------------------------------
::: check_pending_builds.py
@@ +1,1 @@
> +#!/usr/local/bin/python2.7
this slightly concerns me, perhaps
#!/usr/bin/env python2.7
instead?
(please verify it works first) -- was there a specific reason for this change? (e.g. if no specific reason might as well undo it)
@@ +65,2 @@
> duplicate = 1
> break
I think this duplicate check can be improved for speed, but since the basic logic was already here, its not blocking anything in terms of my review.
@@ -118,5 @@
> - "on '%s'" % sort_count_by_platform[0][0]
> - print 'Top Builds by Platform:'
> - for k in range(0, len(sort_count_by_platform)):
> - if k < 3:
> - print sort_count_by_platform[k][0], '-->', sort_count_by_platform[k][1]
top builds was a useful output when manually running imo, but not necessary for a script strictly used as a nagios check.
Attachment #8711706 -
Flags: review?(bugspam.Callek) → review+
Comment 23•9 years ago
|
||
Comment on attachment 8711622 [details] [diff] [review]
bug_1204970.patch
clearing review, since I didn't review this one officially. But I agree to just land the final version of the check into braindump/ assuming all lines have had review (which aiui they have)
Attachment #8711622 -
Flags: review?(bugspam.Callek)
Assignee | ||
Comment 24•9 years ago
|
||
Updated the script to use "#!/usr/bin/env python" (just as it was on version 1.0) and checked that it works. Pushed the new version to braindump and also replaced the existing script on cruncher with this one.
Comment 25•9 years ago
|
||
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #24)
> Updated the script to use "#!/usr/bin/env python" (just as it was on version
> 1.0) and checked that it works. Pushed the new version to braindump and also
> replaced the existing script on cruncher with this one.
Turns out that shebang change doesn't actually work as I suggested, - well it works manually from command line, but not from nagios (perhaps some PATH issue)
Anyway, I reverted it on the live copy, not the repo copy.
Comment 26•9 years ago
|
||
What is the state of this bug? Can it be closed? I noticed in #buildduty that we have alerts per platform for tests working.
Assignee | ||
Comment 27•9 years ago
|
||
The check now uses the new version of the script.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•