Closed Bug 697627 Opened 13 years ago Closed 11 years ago

nagios checks for pending builds

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: zeller)

References

Details

(Whiteboard: [nagios])

Attachments

(3 files, 13 obsolete files)

1.86 KB, text/plain
Details
1.62 KB, text/plain
hwine
: review+
hwine
: checked-in+
Details
1.65 KB, text/plain
hwine
: review+
hwine
: checked-in+
Details
bug 697374 tracked a problem where we had thousands of pending jobs (because 32-bit Linux test slaves weren't accepting new ones), and didn't notice for awhile. This could've been caught sooner if we had some Nagios checks in place. A couple of ideas that were thrown out:
* A check that fails when we have over X pending jobs (to account for some unavoidable load). This might legitimately fail sometimes, but we could downtime or ack it in those situations. X should probably be around 500.
* A check that fails when there are more than X pending and less than Y running. I'm not sure what X and Y should be here.

Regardless of what we do, we probably need something on buildapi that returns the # of pending and running in a simple form, so that it's easily understandable by whatever script is doing the checks.
Attached file nagios check script
We use the attached perl script to check the Thunderbird buildbot build queue.  It scrapes the /one_box_per_builder page.
Priority: -- → P3
Whiteboard: [nagios]
Found in triage. 

Sounds like this is actually two items:
1) buildapi work to make the length of pending builds queue be measurable from commandline
2) nagios work to call buildapi and then alert us when length > some-threshold-to-be-determined.

Moving to Tools for now... needinfo to hwine, zeller to see if this seems accurate, and if so, what they think are next steps here.
Component: Release Engineering → Release Engineering: Developer Tools
Flags: needinfo?(jozeller)
Flags: needinfo?(hwine)
QA Contact: hwine
This looks accurate to me. I can work on having something available for 1 today.

I need these few questions answered before I punch out this expansion:
1) When you say measurable from commandline, is json okay? Or is something else preferable?
2) Are we interested in finding out the length of pending builds by branch? Or by revision?
3) Do we also want length of running? Length of completed? (Not difficult)
4) Can/should this be a new REST URL? Perhaps:
    - If we want length by branch, then: /self-serve/{branch}/status
    - If we want length by revision, then: /self-serve/{branch}/rev/{revision}/status
Flags: needinfo?(jozeller)
Flags: needinfo?(joduinn)
Flags: needinfo?(bhearsum)
Attached patch bug697627-buildapi.patch (obsolete) — Splinter Review
I went ahead and punched out a draft of this expansion. This is not heavily tested, but was correct in the few instances I tried it in. I assumed that we were interested in searching by branch, having the result returned in json and adding a new endpoint at /self-serve/{branch}/status.

If all that matches the answers to my needinfo in comment 3, then I will go ahead and start the feedback, testing and review process on this patch to get it out asap.
(In reply to John Zeller [:zeller] from comment #3)
> This looks accurate to me. I can work on having something available for 1
> today.
You're way ahead of your support lines -- let's get the requirements clear first. ATM, there is no definition of "done".

> 
> I need these few questions answered before I punch out this expansion:
> 1) When you say measurable from commandline, is json okay? Or is something
> else preferable?
It will need to interface with some script that will package it for nagios consumption -- see the prior attachment for an example of what would call this API point.

> 2) Are we interested in finding out the length of pending builds by branch?
> Or by revision?
Neither -- see comment 0 for description of problem. We need by machine pool type.

> 3) Do we also want length of running? Length of completed? (Not difficult)
Pending first

> 4) Can/should this be a new REST URL? Perhaps:
Yes, new URL, as this is a completely new query type. That is assuming it can be done at all.

>     - If we want length by branch, then: /self-serve/{branch}/status
>     - If we want length by revision, then:
> /self-serve/{branch}/rev/{revision}/status
None of the above
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] from comment #5)
> (In reply to John Zeller [:zeller] from comment #3)
> > This looks accurate to me. I can work on having something available for 1
> > today.
> You're way ahead of your support lines -- let's get the requirements clear
> first. ATM, there is no definition of "done".
> 
> > 
> > I need these few questions answered before I punch out this expansion:
> > 1) When you say measurable from commandline, is json okay? Or is something
> > else preferable?
> It will need to interface with some script that will package it for nagios
> consumption -- see the prior attachment for an example of what would call
> this API point.
> 
> > 2) Are we interested in finding out the length of pending builds by branch?
> > Or by revision?
> Neither -- see comment 0 for description of problem. We need by machine pool
> type.
> 
> > 3) Do we also want length of running? Length of completed? (Not difficult)
> Pending first
> 
> > 4) Can/should this be a new REST URL? Perhaps:
> Yes, new URL, as this is a completely new query type. That is assuming it
> can be done at all.
> 
> >     - If we want length by branch, then: /self-serve/{branch}/status
> >     - If we want length by revision, then:
> > /self-serve/{branch}/rev/{revision}/status
> None of the above

You're right, good call. What needs to be done here? And who can we ask for some solid answers to the questions above?
(In reply to John Zeller [:zeller] from comment #3)
> This looks accurate to me. I can work on having something available for 1
> today.
> 
> I need these few questions answered before I punch out this expansion:
> 1) When you say measurable from commandline, is json okay? Or is something
> else preferable?

I don't know what you mean by this...

> 2) Are we interested in finding out the length of pending builds by branch?
> Or by revision?

These seem like nice to haves, but not requirements.

> 3) Do we also want length of running? Length of completed? (Not difficult)

Again, nice to have.

> 4) Can/should this be a new REST URL? Perhaps:
>     - If we want length by branch, then: /self-serve/{branch}/status
>     - If we want length by revision, then:
> /self-serve/{branch}/rev/{revision}/status

If we're quibbling on names, I think that /self-serve/pending and similar would be more appropriate. "status" is very generic, and would quickly turn into a mess, I think.
Flags: needinfo?(bhearsum)
Product: mozilla.org → Release Engineering
Attached patch bug697627.patch (obsolete) — Splinter Review
This is a stupid simple patch that adds functionality to the already existing route /pending with the parameter total=all. Currently all this does is return json in the form {'all': 49}.

There is a LOT of room to improve of this, but this patch is looking to start with providing the very most basic information, which is how many pending jobs are currently existing on all machines regardless of pool, branch, or anything else.

Note: I did not add this to a new REST on top of /pending (ie /pending/all) because there is already a route for /pending/{branch}/{platform} and all would be read as a branch. I decided that hacking through that would be a waste given that it undermines the already determined functionality of that URL. However, if there are better suggestions, or if someone thinks that this should be somewhere completely different, then by all means, throw down some ideas! :)
Attachment #787393 - Attachment is obsolete: true
Attachment #791112 - Flags: feedback?(nthomas)
Attachment #791112 - Flags: feedback?(bhearsum)
Assignee: nobody → jozeller
Flags: needinfo?(joduinn)
Attached patch bug697627.patch (obsolete) — Splinter Review
While writing up unittests for this, I found a bug that caused the base URL /pending not to be able to load. I fixed it in this updated patch.
Attachment #791112 - Attachment is obsolete: true
Attachment #791112 - Flags: feedback?(nthomas)
Attachment #791112 - Flags: feedback?(bhearsum)
Attachment #791442 - Flags: review?(nthomas)
Attachment #791442 - Flags: checked-in?
Attached patch bug697627-unittests.patch (obsolete) — Splinter Review
Adding unittests to document the patch bug697627.patch
Attachment #791443 - Flags: review?(nthomas)
Attachment #791443 - Flags: checked-in?
Attached file check_pending_builds.py (obsolete) —
This should take care of a nagios alert plugin, at least foundation ally. This still has a few questions that need answers.

1) Once the buildapi extension in buig697627.patch is live, it'll be found on buildapi/pending?total=all (This could change). However, I'm currently testing this on my local machine instance of buildapi, and so the url I am calling must change in this nagios file. What URL do I use for making a call to this in buildapi?
1.1) Does this call require authentication? If so, how do I go about setting that up properly?
2) Currently the critical and warning thresholds have set defaults at 5000 and 2000 respectively. Should these even have defaults? If so, what are appropriate ones? PS This already takes in thresholds as arguments.
Attachment #791553 - Flags: feedback?(catlee)
Flags: needinfo?(catlee)
Comment on attachment 791553 [details]
check_pending_builds.py

I'd like some more feedback on this nagios plugin please! :)
Attachment #791553 - Flags: feedback?(rail)
Attachment #791553 - Flags: feedback?(jhopkins)
Attachment #791553 - Flags: feedback?(hwine)
Attachment #791553 - Attachment mime type: text/x-python → text/plain
Comment on attachment 791553 [details]
check_pending_builds.py

>#!/usr/bin/python

this should be #!/usr/bin/env python
>
>import sys
>from optparse import OptionParser

we're on python 2.7 now, so should use argparse, as it's compatible with python 3.x

>import urllib2 as url
>import simplejson as json

The module simplejson isn't part of the standard library -- there's an idiom for using it when available, but falling back to the version shipped with the standard library.

>
>def get_pending_builds():
>    response = url.urlopen('http://127.0.0.1:5000/pending?total=all')

While we don't know yet what the final URL will be, it's better to have it defined once as a global, rather than a hard coded magic value. Here, it makes sense to add the unique query parameters. Hint -- if it's a parameter to this function, unit tests will be much easier to write.

>    result = json.loads(response.read())
>    return result.get('all')
>
>def pending_builds_status(num_pending, critical, warning):

"Intentional naming" is a good concept - "num_pending" is good, the other 2 argument names are not as good.

>    if num_pending > critical:
>        return 'CRITICAL - Pending Builds: %i' % num_pending
>    elif num_pending > warning:
>        return 'WARNING - Pending Builds: %i' % num_pending
>    else:
>        return 'OK - Pending Builds: %i' % num_pending

I'm guessing you don't have unit tests running yet. The above will fail.

>
>if __name__ == '__main__':
>    parser = OptionParser(usage="Usage: %prog [options]",  version="%prog 1.0")

rather than hard code a magic value, you could set the __version__ globally above, and use that value here
__doc__ & __version__ are idiomatic global variables that are used by introspection code when available.

>    parser.add_option('-c', '--critical', action='store', type='int', dest='critical', default=5000,
>                      help='Set CRITICAL level (as integer eg. 5000)')
>    parser.add_option('-w', '--WARNING', action='store', type='int', dest='warning', default=2000,
>                      help='Set WARNING level (as integer eg. 2000)')
>    (options, args) = parser.parse_args()
>
>    output = ""
why?
>    output = pending_builds_status(get_pending_builds(), options.critical, options.warning)
>
>    if len(output) < 2:
>        print 'Please use -h for options'
>        print options("-h")
>    if output[:2] == 'OK':
>        print output
>        sys.exit(0)
>    elif output[:2] == 'WA':
>        print output
>        sys.exit(1)
>    elif output[:2] == 'CR':
>        print output
>        sys.exit(2)
Okay, the above needs rework -- consider how you could avoid all that testing (hint: you've already used the technique in this script). And "magic hard coded values".

Also, as written, there is no provision for error handling. That's one major reason to use python for this, and Nagios even has support for it -- but you're not using that yet.

I'd like to see a version that passes some unit tests.
Attachment #791553 - Flags: feedback?(hwine) → feedback-
Comment on attachment 791442 [details] [diff] [review]
bug697627.patch

Review of attachment 791442 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not convinced we need this new point on the API when 
  https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js
is a snapshot updated every 60 seconds, and is already available without any http auth issues. Can't the status plugin suck that data down (it never gets that big) and the the loop over the nested dict to count the number of pending jobs ?
Attachment #791442 - Flags: review?(nthomas)
Attachment #791442 - Flags: review-
Attachment #791442 - Flags: checked-in?
Comment on attachment 791443 [details] [diff] [review]
bug697627-unittests.patch

No longer needed if my previous comment stands up to scrutiny.
Attachment #791443 - Flags: review?(nthomas)
Attachment #791443 - Flags: checked-in?
Comment on attachment 791553 [details]
check_pending_builds.py

removing f? as hwine has already given detailed feedback
Attachment #791553 - Flags: feedback?(jhopkins)
(In reply to Nick Thomas [:nthomas] from comment #14)
> Comment on attachment 791442 [details] [diff] [review]
> bug697627.patch
> 
> Review of attachment 791442 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm not convinced we need this new point on the API when 
>   https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js
> is a snapshot updated every 60 seconds, and is already available without any
> http auth issues. Can't the status plugin suck that data down (it never gets
> that big) and the the loop over the nested dict to count the number of
> pending jobs ?

Wow! Yeah I think I agree, this makes much more sense. Heine if you agree I will go ahead and rewrite that plugin for this service.
Flags: needinfo?(hwine)
Sounds like a plan -- don't forget to check "v1" into a repo prior to starting on v2, just in case....
Flags: needinfo?(hwine)
Attachment #791442 - Flags: review-
Attached file check_pending_builds.py (obsolete) —
This new nagios plugin uses the already existing service found here: https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js

hwine: I made all of the adjustments that you mentioned in comment 13, with the exception of a couple that I could use more info on.

>    if num_pending > critical:
>        return 'CRITICAL - Pending Builds: %i' % num_pending
>    elif num_pending > warning:
>        return 'WARNING - Pending Builds: %i' % num_pending
>    else:
>        return 'OK - Pending Builds: %i' % num_pending

You mentioned that this would fail... I can't seem to figure out what you may be seeing that I have missed here. I wrote up a few tests to check things like a range of num_pending, but alas I cannot seem to figure out what you were pointing to here. Could you clarify?

> Also, as written, there is no provision for error handling. That's one major 
> reason to use python for this, and Nagios even has support for it -- but you're > not using that yet.

So there is Nagios supported error handling? All I seem to be able to find so far is this http://nagios.sourceforge.net/docs/3_0/eventhandlers.html , am I on the right track? Either way, I'm gonna keep reading, but I wanted to see if there was something simple I am missing.
Attachment #791553 - Attachment is obsolete: true
Attachment #791553 - Flags: feedback?(rail)
Attachment #791553 - Flags: feedback?(catlee)
Attachment #793225 - Flags: review?(hwine)
Flags: needinfo?(hwine)
Attached file check_pending_builds.py (obsolete) —
Added error handling for urllib2.URLError if urlopen is ever to fail. I then also added the use of returning UNKNOWN value 3 when a URLError happens or when the num_pending value is NoneType.
Attachment #793225 - Attachment is obsolete: true
Attachment #793225 - Flags: review?(hwine)
Attachment #793601 - Flags: review?(hwine)
Attached file check_pending_builds.py (obsolete) —
Cleaned up this plugin quite a bit. Functionally produces the same result, but in a much cleaner fashion.

As a recap, this plugin grabs the list of pending builds globally, determines how many there are, then checks this number against a preset or passed in warning and critical threshold. The plugin then returns either a 0, 1, 2, or 3 if the number of pending jobs are either OK, WARNING, CRITICAL or UNKNOWN.

This plugin also uses argparse which is part of the python standard library and works with python 3.x.
Attachment #793601 - Attachment is obsolete: true
Attachment #793601 - Flags: review?(hwine)
Attachment #793740 - Flags: review?(hwine)
Latest plugin looks good! A few random comments:

- `import urllib2 as url` is a bit strange - why did you feel like you needed to rename urllib2?

- run through a pep8 checker to fix up some code formatting issues

- how will nagios handle the output from your top-level exception handler?

- should we add a retry around the urllib2.urlopen call, in case the network or service is flapping? nagios will re-check anyway, so we may be fine without retries.
Flags: needinfo?(catlee)
Attached file check_pending_builds.py (obsolete) —
Reformatted code to match pep8 and removed urllib2 as url... reverted back to urllib2.
Attachment #793740 - Attachment is obsolete: true
Attachment #793740 - Flags: review?(hwine)
Attachment #795688 - Flags: review?(hwine)
(In reply to Chris AtLee [:catlee] from comment #22)
> Latest plugin looks good! A few random comments:
> 
> - `import urllib2 as url` is a bit strange - why did you feel like you
> needed to rename urllib2?
> 
Yeah, this was unnecessary, so I removed it.

> - run through a pep8 checker to fix up some code formatting issues
Done

> 
> - how will nagios handle the output from your top-level exception handler?
Nagios handles the top-level exception handler as 'UNKNOWN'

> 
> - should we add a retry around the urllib2.urlopen call, in case the network
> or service is flapping? nagios will re-check anyway, so we may be fine
> without retries.
Retry should be unnecessary as nagios handles the rechecks :)
Attached file test_check_pending_builds.py (obsolete) —
These are the unittests to use for testing the nagios plugin attached above, attachment 795688 [details]
Attachment #791443 - Attachment is obsolete: true
Attachment #797528 - Flags: review?(hwine)
Attached file check_pending_builds.py (obsolete) —
I am simply resubmitting this nagios plugin again to then mark old attachments obsolete. Hoping that this cleans up the attachments are a bit :)

The old comments still apply:
Reformatted code to match pep8 and removed urllib2 as url... reverted back to urllib2.
Attachment #791442 - Attachment is obsolete: true
Attachment #795688 - Attachment is obsolete: true
Attachment #795688 - Flags: review?(hwine)
Attachment #797530 - Flags: review?(hwine)
Attachment #797528 - Attachment mime type: text/x-python → text/plain
Attachment #797530 - Attachment mime type: text/x-python → text/plain
Comment on attachment 797530 [details]
check_pending_builds.py

Good as is -- and you should understand why the nits :)

>if __name__ == '__main__':
>    parser = argparse.ArgumentParser(
>        prog="check_pending_builds.py", usage="Usage: %(prog)s [options]",  version="%(prog)s " + __version__)

nit: keyword argument 'prog' has a reasonable default, so should be overridden only when the default doesn't work.

>    parser.add_argument(
>        '-c', '--critical', action='store', type=int, dest='critical_threshold', default=5000,
>        help='Set CRITICAL level (as integer eg. 5000)')
>    parser.add_argument(
>        '-w', '--warning', action='store', type=int, dest='warning_threshold', default=2000,
>        help='Set WARNING level (as integer eg. 2000)')

nit: the help is using two different names for each variable, which is confusing. Consider using 'metavar'

>    try:
>        num_pending = get_num_pending_builds()
>        status = pending_builds_status(
>            num_pending, args.critical_threshold, args.warning_threshold)
>        print '%s Pending Builds: %i' % (status, num_pending)
>        sys.exit(status_code.get(status, 3))

nit: '3' is a magic number -- what would happen if you used: sys.exit(status_code[status])

>    except Exception as e:
>        print e
>        sys.exit(status_code.get('UNKNOWN'))
Attachment #797530 - Flags: review?(hwine) → review+
Comment on attachment 797528 [details]
test_check_pending_builds.py

Mostly good - lets get the test thresholds not equal to defaults before committing. With that change, r+

>import unittest
>from check_pending_builds import pending_builds_status
>
># NOTE: This test is to be run with unittest and NOT nosetests

nit: there's nothing with this test that precludes running via nosetests. The nosetest issue is with upstream suite of tests. So, let's remove the statement from here. Add to a top level README, if anything.

>## THINGS NOT BEING TESTED FOR ###
># 1) UNKNOWN since that is in main and only executed when an exception occurs
># 2) Ranges of warning and critical thresholds since testing boundaries for 1 set, should hold true for any set

Another thing that is not being tested is that pending_build_status honors the values being passed are tested. I.e. you're testing with the default values.

>class TestSequenceFunctions(unittest.TestCase):
>
>    warning_threshold = 2000
>    critical_threshold = 5000

So why these values? Is one set of values sufficient? I'd change these to be not the default.

>
>    def test_ok(self):
>        """Test OK from 0 to warning_threshold - 1"""
>        for x in range(self.warning_threshold):
>            assert pending_builds_status(x, self.critical_threshold, self.warning_threshold) == 'OK'

There's no real value in the extended range -- the range function allows specification of a start point as well. Same with other functions.
Attachment #797528 - Flags: review?(hwine) → review+
Attached file check_pending_builds.py (obsolete) —
Fixed nits :) And made the output less redundant.

> nit: '3' is a magic number -- what would happen if you used: sys.exit(status_code[status])

I left the default return status as 3 here because with just sys.exit(status_code[status]), if status is not as expected by something in the status_code dict, then it just gives a None, which then sys.exit(None)'s. I can change this if you think it's unnecessary to test for something other than what is in the status_code dict.
Attachment #797530 - Attachment is obsolete: true
Attachment #797949 - Flags: review?(hwine)
> >## THINGS NOT BEING TESTED FOR ###
> ># 1) UNKNOWN since that is in main and only executed when an exception occurs
> ># 2) Ranges of warning and critical thresholds since testing boundaries for 1 set, should hold true for any set
> 
> Another thing that is not being tested is that pending_build_status honors
> the values being passed are tested. I.e. you're testing with the default
> values.

What should I do instead here?

> >class TestSequenceFunctions(unittest.TestCase):
> >
> >    warning_threshold = 2000
> >    critical_threshold = 5000
> 
> So why these values? Is one set of values sufficient? I'd change these to be
> not the default.

I picked these values arbitrarily. I also removed the defaults in the nagios plugin itself. Does this still apply then?

> >    def test_ok(self):
> >        """Test OK from 0 to warning_threshold - 1"""
> >        for x in range(self.warning_threshold):
> >            assert pending_builds_status(x, self.critical_threshold, self.warning_threshold) == 'OK'
> 
> There's no real value in the extended range -- the range function allows
> specification of a start point as well. Same with other functions.

The only reason I gave a range so big here is because it's cleaner than specifying to test 0, threshold-1, threshold and threshold+1. Better to be explicit about the specific values I am really caring about?
Attached file check_pending_builds.py (obsolete) —
Fixed nits :) Should be good to go!
Attachment #797949 - Attachment is obsolete: true
Attachment #797949 - Flags: review?(hwine)
Attachment #798093 - Flags: review?(hwine)
Changes defaults and made the test cases only test boundary cases rather than a crazy large range.
Attachment #797528 - Attachment is obsolete: true
Attachment #798094 - Flags: review?(hwine)
Attachment #798093 - Attachment mime type: text/x-python → text/plain
Added sys.exit(status_code[status])
Attachment #798093 - Attachment is obsolete: true
Attachment #798093 - Flags: review?(hwine)
Attachment #798101 - Flags: review?(hwine)
Attachment #798094 - Flags: review?(hwine) → review+
Attachment #798093 - Flags: review+
Attachment #798101 - Attachment mime type: text/x-python → text/plain
Comment on attachment 798094 [details]
test_check_pending_builds.py

Ready to go! :)
Attachment #798094 - Flags: checked-in?
Attachment #798101 - Flags: review?(hwine) → review+
Attachment #798101 - Flags: checked-in?
Comment on attachment 798094 [details]
test_check_pending_builds.py

https://hg.mozilla.org/build/braindump/rev/fb1dfe7d15d7
Attachment #798094 - Flags: checked-in? → checked-in+
Comment on attachment 798101 [details]
check_pending_builds.py

https://hg.mozilla.org/build/braindump/rev/fb1dfe7d15d7
Attachment #798101 - Flags: checked-in? → checked-in+
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(hwine)
Resolution: WORKSFORME → FIXED
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: