modify check_pending_builds to report more granularly on pending builds/tests

RESOLVED FIXED

Status

RESOLVED FIXED
3 years ago
7 months ago

People

(Reporter: arich, Assigned: aselagea)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 8 obsolete attachments)

(Reporter)

Description

3 years ago
Right now check_pending_builds just sums all pending jobs and isn't a great indicator of where a problem may lie (or if it's just normal load).

In talking with catlee, we should modify this so that we can check per platform to help us pinpoint issues quicker and ignore "normal load." 

He also suggested that we set thresholds differently based on time of day.

My suggestion is that we pass in some base thresholds and a key that matches ... the platform type? Something we can easily correlate to platform, anyway. This way we can break it down by platform (passed into the script on multiple nagios checks) and add some buffer to the threshold based on time of day within the script logic if we want.
Created attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

Attached a first draft of the script
Attachment #8670221 - Flags: review?(kmoir)
Assignee: nobody → vlad.ciobancai
Below you can find the ouput from the python script

C:\Python27\python.exe C:/Users/vlad.ciobancai/PycharmProjects/untitled/get_num_pending.py
OK Pending Builds: 48 for platform b2g-inbound
OK Pending Builds: 1 for platform date
OK Pending Builds: 1 for platform Unknown
OK Pending Builds: 2 for platform mozilla-aurora
WARNING Pending Builds: 3772 for platform try
OK Pending Builds: 5 for platform try-comm-central
OK Pending Builds: 2 for platform mozilla-central
OK Pending Builds: 40 for platform comm-central
OK Pending Builds: 63 for platform fx-team
OK Pending Builds: 135 for platform mozilla-inbound

Comment 3

3 years ago
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

What does the output look when you run this? Do the thresholds need to be adjusted?  My understanding is that currently the threshold is for all types of builds so my presumption is that they would need to be set on a per platform basis.
The output from the script is pasted above . I wanted first to be sure if it's ok the approach that I start

Comment 5

3 years ago
Oh okay, sorry I didn't see the earlier output, my bugmail was sorted weird in my inbox.  In any case, 
I think we want to break the pending counts by platform like they are broken down in this report (i.e. t-w732-ix, t-w864-ix and so on)

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/

which is more useful than those by branch (mozilla-central, mozilla-inbound etc)

Comment 6

3 years ago
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

new patches are being worked on, will review those when they are available
Attachment #8670221 - Flags: review?(kmoir)
(Assignee)

Comment 7

3 years ago
Created attachment 8679426 [details]
new_check_pending_builds.py

A sample output of the script:

CRITICAL Pending Builds: 9764
Top Builds by Platform:
win32 --> 3859 builds
win64 --> 2267 builds
linux64 --> 896 builds
linux --> 668 builds
macosx64 --> 522 builds
Attachment #8679426 - Flags: review?(kmoir)

Comment 8

3 years ago
Comment on attachment 8679426 [details]
new_check_pending_builds.py

I looks good,should we adjust the warning thresholds to be smaller because they are broken out by platform?
Attachment #8679426 - Flags: review?(kmoir) → feedback+
(Assignee)

Comment 9

3 years ago
Created attachment 8679537 [details]
new_check_pending_builds.py

So I updated the script to do the following: 
    - return a CRITICAL alert when the number of pending builds for a certain platform is equal or greater than 2000 
    - return a WARNING alert when the number of pending builds for a certain platform is equal or greater than 1200
    - return OK if the number of pending builds for a certain platform is lower than 1200

In all of the cases above, it will also print the top 3 platforms by pending jobs.
Attachment #8679426 - Attachment is obsolete: true
Attachment #8679537 - Flags: review?(kmoir)
(Assignee)

Comment 10

3 years ago
For example:

CRITICAL Pending Builds: 2365
Top Builds by Platform:
win32 --> 2365
win64 --> 1714
android-api-11 --> 176
(Assignee)

Comment 11

3 years ago
Created attachment 8680680 [details]
new_check_pending_builds.py

Adjusted the way the results are reported a little bit.
Example:

CRITICAL Pending Builds: 5583 on 'win32'
Top Builds by Platform:
win32 --> 5583
win64 --> 3471
macosx64 --> 224
Attachment #8679537 - Attachment is obsolete: true
Attachment #8679537 - Flags: review?(kmoir)
Attachment #8680680 - Flags: review?(kmoir)

Updated

3 years ago
Attachment #8680680 - Flags: review?(kmoir) → review+
Created attachment 8682587 [details] [diff] [review]
bug1204970.patch

diff between old and new in patch form
Created attachment 8682592 [details] [diff] [review]
bug1204970.patch
Attachment #8682587 - Attachment is obsolete: true

Updated

3 years ago
Attachment #8682592 - Flags: checked-in+
Comment on attachment 8682592 [details] [diff] [review]
bug1204970.patch

reverted this because some changes are needed on that nagios alerts before it can be deployed to production
Attachment #8682592 - Flags: checked-in+ → checked-in-
(Assignee)

Comment 15

3 years ago
Created attachment 8689523 [details]
check_pending_builds.py

Updated the script to provide some optimization.
Attachment #8680680 - Attachment is obsolete: true
(Assignee)

Comment 16

3 years ago
Created attachment 8689525 [details] [diff] [review]
bug1204970.patch

the new version of the patch
Attachment #8682592 - Attachment is obsolete: true
Attachment #8689525 - Flags: review?(kmoir)

Updated

3 years ago
Attachment #8689525 - Flags: review?(kmoir) → review+
Assignee: vlad.ciobancai → alin.selagea
(Reporter)

Comment 17

3 years ago
The bad news is that this check doesn't appear to be managed by puppet at all. I had :aselagea copy the new file in place, and running puppet didn't overwrite it.

We've seen a couple 10s nrpe timeouts so far, so I'm not sure if this check is taking significantly longer than the old one, or this machine is just crufty and needs a reboot after 266 days of uptime.

For the time being I've copied the old script to my personal tmp directory on cruncher if we need to revert.
(Reporter)

Comment 18

3 years ago
Since we're seeing lots of NRPE timeouts, I've set it to 60s in commands.pp

        check_pending_builds                     => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_pending_builds',
(Assignee)

Comment 19

3 years ago
Created attachment 8711618 [details]
check_pending_builds.py

Following Callek's suggestions, I refined the script a little bit. Attached the new version.

Sample output:
    OK Pending Builds: 16 on ['t-w732-ix']
Attachment #8670221 - Attachment is obsolete: true
Attachment #8689523 - Attachment is obsolete: true
Attachment #8711618 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 20

3 years ago
Created attachment 8711622 [details] [diff] [review]
bug_1204970.patch

Here's the new patch.
Attachment #8689525 - Attachment is obsolete: true
Attachment #8711622 - Flags: review?(bugspam.Callek)
(Assignee)

Updated

3 years ago
Attachment #8711618 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 21

3 years ago
Created attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch

Attached the diff between the latest version and the last reviewed version.
Attachment #8711706 - Flags: review?(bugspam.Callek)
Comment on attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch

Review of attachment 8711706 [details] [diff] [review]:
-----------------------------------------------------------------

::: check_pending_builds.py
@@ +1,1 @@
> +#!/usr/local/bin/python2.7

this slightly concerns me, perhaps

#!/usr/bin/env python2.7

instead?

(please verify it works first) -- was there a specific reason for this change? (e.g. if no specific reason might as well undo it)

@@ +65,2 @@
>                  duplicate = 1
>                  break

I think this duplicate check can be improved for speed, but since the basic logic was already here, its not blocking anything in terms of my review.

@@ -118,5 @@
> -            "on '%s'" % sort_count_by_platform[0][0]
> -        print 'Top Builds by Platform:'
> -        for k in range(0, len(sort_count_by_platform)):
> -            if k < 3:
> -                print sort_count_by_platform[k][0], '-->', sort_count_by_platform[k][1]

top builds was a useful output when manually running imo, but not necessary for a script strictly used as a nagios check.
Attachment #8711706 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8711622 [details] [diff] [review]
bug_1204970.patch

clearing review, since I didn't review this one officially. But I agree to just land the final version of the check into braindump/ assuming all lines have had review (which aiui they have)
Attachment #8711622 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 24

3 years ago
Updated the script to use "#!/usr/bin/env python" (just as it was on version 1.0) and checked that it works. Pushed the new version to braindump and also replaced the existing script on cruncher with this one.
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #24)
> Updated the script to use "#!/usr/bin/env python" (just as it was on version
> 1.0) and checked that it works. Pushed the new version to braindump and also
> replaced the existing script on cruncher with this one.

Turns out that shebang change doesn't actually work as I suggested, - well it works manually from command line, but not from nagios (perhaps some PATH issue)

Anyway, I reverted it on the live copy, not the repo copy.
What is the state of this bug? Can it be closed?  I noticed in #buildduty that we have alerts per platform for tests working.

Updated

3 years ago
Blocks: 1245823
(Assignee)

Comment 27

3 years ago
The check now uses the new version of the script.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: General Automation → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.