Closed Bug 1220191 Opened 9 years ago Closed 8 years ago

create nagios checks for buildbot backlog age and master lag

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: aselagea)

References

Details

Attachments

(2 files, 6 obsolete files)

Today we noticed that our windows backlog was very long (2 days), and that our master lag was also high (indicating that we needed more masters).

If possible, we should have automated warnings of these two conditions so we can take corrective action sooner.
The age of the backlog could be done as an extension of the current pending check; the logic and code would be very similar.

For master lag, I'd like to start reporting this to statsd/graphite from the log parsing on the masters. Can we alert based on some threshold in statsd?
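For reference, a minimal sketch of pushing a lag value to statsd as a gauge over UDP; the host, port, and metric name here are assumptions for illustration, not the actual setup on the masters:

import socket

def report_master_lag(lag_seconds, host='localhost', port=8125):
    # statsd gauges use the plain-text format '<metric name>:<value>|g' over UDP
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto('buildbot.master_lag:%d|g' % lag_seconds, (host, port))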
For the master lag, do we have some metrics on this at the moment?
Flags: needinfo?(catlee)
Assignee: nobody → alin.selagea
Yes, in graphite, if you go to 'User graphs -> catlee@mozilla.com -> master_lag', you can see some.
I know that IT has set up alerts based on graphite data. Ashish might be a good person to ping to ask about how that was done.
Flags: needinfo?(catlee)
What is graphite's host?
Alin is going to work on this. Once we have the location of the lag graph, please update the pending counts page https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts to reflect the steps to take in the case of high lag. The mana page will also need to be updated to reflect what action to take on the nagios alert.
I would also need to know if we have some metrics on the age of the backlog. Do we have some sort of graph for this?

Thanks.
Flags: needinfo?(catlee)
No, we don't have metrics on age of backlog yet. This would be very similar to calculating the size of the backlog though. You should be close with the work going on in bug 1204970.
Flags: needinfo?(catlee)
Attached file check_backlog_age.py (obsolete) —
Added the python script to check the age of the backlog. Sample output:

OK Backlog Age: 1h:52m:22s

Process finished with exit code 0
Attachment #8682427 - Flags: review?(catlee)
Comment on attachment 8682427 [details]
check_backlog_age.py

Looks good overall, thanks for putting this together!

I'd like to see a few minor fixes first before we land this.

Can you add a #!/usr/bin/env python header at the top, please? And a standard MPL license header.

>import sys
>import argparse
>import urllib2
>import json
>
>from datetime import datetime
>import calendar
>
>__version__ = "1.0"
>waiting_time = []
>pending_url = 'https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js'
>status_code = {'OK': 0, 'WARNING': 1, "CRITICAL": 2, "UNKNOWN": 3}
>
>
>def get_utc_unix_time():
>    d = datetime.utcnow()
>    unix_time = calendar.timegm(d.utctimetuple())
>    return unix_time

I think time.time() does what you want?


>def get_max_waiting_time(unix_time):

does unix_time refer to 'now'?
maybe have this function return the earliest submitted_at, and then you can subtract from the current time elsewhere.

>    response = urllib2.urlopen(pending_url)
>    result = json.loads(response.read())
>    for key in result['pending'].keys():
>        for k1 in result['pending'][key].keys():
>            for k2 in range(0, (len(result['pending'][key][k1]))):
>                waiting_time.append(unix_time - result['pending'][key][k1][k2]['submitted_at'])
>    return max(waiting_time)
>
>
>def pending_builds_status(max_waiting_time, critical_threshold, warning_threshold):
>    if max_waiting_time >= critical_threshold:
>        return 'CRITICAL'
>    elif max_waiting_time >= warning_threshold:
>        return 'WARNING'
>    else:
>        return 'OK'
>
>if __name__ == '__main__':
>    parser = argparse.ArgumentParser(version="%(prog)s " + __version__)
>    parser.add_argument(
>        '-c', '--critical', action='store', type=int, dest='critical_threshold', default=36000, metavar="CRITICAL",
>        help='Set CRITICAL level as integer eg. 36000')
>    parser.add_argument(
>        '-w', '--warning', action='store', type=int, dest='warning_threshold', default=18000, metavar="WARNING",
>        help='Set WARNING level as integer eg. 18000')
>    args = parser.parse_args()
>
>    try:
>        unix_time = get_utc_unix_time()
>        max_waiting_time = get_max_waiting_time(unix_time)
>        status = pending_builds_status(max_waiting_time, args.critical_threshold, args.warning_threshold)
>        m, s = divmod(max_waiting_time, 60)
>        h, m = divmod(m, 60)
>        time = "%dh:%02dm:%02ds" % (h, m, s)

this logic would be great in a function (a sketch follows after the quoted code)

>        print '%s Backlog Age: %s' % (status, time)
>        sys.exit(status_code[status])
>    except Exception as e:
>        print e
>        sys.exit(status_code.get('UNKNOWN'))
>
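A minimal sketch of the duration-formatting helper suggested above (the function name is hypothetical):

def format_duration(seconds):
    # render a duration in seconds as 'Hh:MMm:SSs'
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return "%dh:%02dm:%02ds" % (h, m, s)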
Attachment #8682427 - Flags: review?(catlee) → review-
Since our working hours do not seem to overlap, could you please help with some tips/suggestions on how to set up alerts based on graphite data? Or maybe point me to a person who could help during my working hours (GMT+2)?

Thanks!
Flags: needinfo?(ashish)
cc'ing :pir, who is in London.
Flags: needinfo?(ashish)
Attached file check_backlog_age.py (obsolete) —
updated the script
Attachment #8682427 - Attachment is obsolete: true
Attachment #8683016 - Flags: review?(catlee)
Sorry, I know nothing about alerts based on graphite data.
Tried check_graphite_data with info in Comment 3:

> [ashish@nagios1.private.scl3 ~]$ /usr/lib64/nagios/plugins/mozilla/check_graphite_data --url="https://graphite-scl3.mozilla.org/render/?target=highestCurrent(hosts.buildbot-master*.statsd.latency.total_master_lag-percentile-95,5)" --critupper --warn 125 --crit 150
> Current value: 148.0704, warn threshold: 125.0, crit threshold: 150.0
> [ashish@nagios1.private.scl3 ~]$ echo $?
> 1

I tried with random thresholds, just to verify. Ideally I would like this set up on nagios-releng, but that'll need an ACL to graphite-scl3 (request in a separate bug). Otherwise, I'm all clear to set this up with graphite data.
Depends on: 1222474
What do we want the check thresholds to be for warn and crit? The flow is in place and ashish is ready to set up the check as soon as we give him some numbers.
Flags: needinfo?(catlee)
Flags: needinfo?(alin.selagea)
master lag warning should be 300s (5 minutes), and critical should be 600s (10 minutes).

coop may have a better idea of what backlog age thresholds should be. warning at 6h and critical at 12h?
Flags: needinfo?(catlee) → needinfo?(coop)
(In reply to Chris AtLee [:catlee] from comment #17) 
> coop may have a better idea of what backlog age thresholds should be.
> warning at 6h and critical at 12h?

That's a reasonable starting point. We can always tweak them as required.
Flags: needinfo?(coop)
Alright, I've set up the buildbot master lag check to check the metric every 2 mins and alert if the thresholds in Comment 17 are breached.
Flags: needinfo?(alin.selagea)
I've modified the check slightly so that we're not checking as often and we wait a bit before alerting (this is more of a long-term issue than an immediate "everything is on fire" check).

-            normal_check_interval => 2,
-            max_check_attempts => 1,
+            normal_check_interval => 5,
+            max_check_attempts => 5,


Committed revision 110283.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thanks :arr. This bug is still pending work to set up the buildbot backlog age check from Comment 13...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8683016 [details]
check_backlog_age.py

def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for key in result['pending'].keys():
        for k1 in result['pending'][key].keys():
            for k2 in range(0, (len(result['pending'][key][k1]))):
                submitted_at.append(result['pending'][key][k1][k2]['submitted_at'])
    return min(submitted_at)

this function is a bit confusing. would something like this work?

def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for branch in result['pending'].keys():
        for revision in result['pending'][branch].keys():
            for request in result['pending'][branch][revision]:
                submitted_at.append(request['submitted_at'])
    return min(submitted_at)
Attachment #8683016 - Flags: review?(catlee) → review-
Attached file check_backlog_age.py (obsolete) —
Thanks, :catlee. Updated the script to reflect your suggestion. I also modified the thresholds for issuing alerts in nagios:

WARNING: 21600 seconds --> 6 hours
CRITICAL: 43200 seconds --> 12 hours
Attachment #8683016 - Attachment is obsolete: true
Attachment #8688909 - Flags: review?(catlee)
Attachment #8688909 - Flags: review?(catlee) → review+
I guess it would be useful if we had the alert for the backlog age too. 
:ashish, could you set up this alert please? Or is there someone else we should ask for that?

Thanks :)
Flags: needinfo?(ashish)
Sorry, I forgot about this bug! I can help set that up too.
Flags: needinfo?(ashish)
Last request - would it be possible to remove the dependency on argparse? That would avoid installing the package on all Nagios servers (since they're all built and maintained alike). Thanks!
Where would this check run? What version of python do they have installed?
Running it from the nagios server is the simplest. Running it from a remote server will need it to be deployed via NRPE (with puppet code not in my control). FWIW:

> [ashish@nagios1.private.releng.scl3 ~]$ python --version
> Python 2.6.6
Attached file check_backlog_age.py (obsolete) —
:catlee, would something like this be enough? Or am I missing something here?

Thanks.
Attachment #8688909 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Attached patch bug_1220191.patch (obsolete) — Splinter Review
Diff file with respect to the previous version of the patch.
Attached file check_backlog_age.py (obsolete) —
Updated the script to replace 'argparse' with 'optparse' (which is available in Python 2.6).
Attachment #8698432 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Attachment #8698965 - Flags: review?(catlee)
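For context, a minimal sketch of the optparse-based option handling; the option names mirror the earlier argparse version and the defaults follow the thresholds discussed above:

from optparse import OptionParser

__version__ = "1.0"

parser = OptionParser(version="%prog " + __version__)
parser.add_option('-c', '--critical', type='int', dest='critical_threshold',
                  default=43200, metavar='CRITICAL',
                  help='Set CRITICAL level in seconds, e.g. 43200')
parser.add_option('-w', '--warning', type='int', dest='warning_threshold',
                  default=21600, metavar='WARNING',
                  help='Set WARNING level in seconds, e.g. 21600')
options, args = parser.parse_args()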
Also uploaded the diff file.
Attachment #8698932 - Attachment is obsolete: true
Blocks: 1225475
Added the patch for the new script.
Attachment #8698965 - Attachment is obsolete: true
Attachment #8698965 - Flags: review?(catlee)
Attachment #8709422 - Flags: review?(catlee)
Attachment #8709422 - Flags: review?(catlee) → review+
Attachment #8709422 - Flags: checked-in+
The script has been modified to use 'optparse' (available in Python 2.6) instead of 'argparse'.

@ashish: when possible, can we please have this nagios alert implemented too? :)

Thanks!
Flags: needinfo?(ashish)
For any checks to be implemented we'll need an example of the script's use, its options, and the target hosts to run it against, as well as the levels it should alert at, documentation for what we should do when it does alert, who to escalate to, etc.
I described the check here: https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age

NOTE: I presumed that it will be implemented the same way as check_pending_builds since the checks are very similar, so that'll mean creating the following config file: "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this also needs to be created).
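For illustration, a rough sketch of what that NRPE config entry could look like (the thresholds and exact command line are assumptions, following the pattern of the existing check_pending_builds config):

# /etc/nagios/nrpe.d/check_backlog_age.cfg
command[check_backlog_age]=/usr/lib64/nagios/plugins/custom/check_backlog_age.py -w 21600 -c 43200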

Let me know if you have further suggestions/questions.

Thanks.
To be consistent with all the others, the check should be /usr/lib64/nagios/plugins/custom/check_backlog_age with no .py extension.
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #36)
> I described the check here:
> https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age

"check SlaveHealth interface to see the number of pending jobs on each pool, restart broken slaves if not taking jobs for longer periods of time"

If that's something you want the MOC to do, then it needs a lot more specific detail on how to do those things (step-by-step instructions for someone who doesn't know what any of those things mean, where to find them, or what "check" and "longer periods of time" mean in practice).

If it's not something you want the MOC to do, it needs to be labelled as such; otherwise on-call staff will see that and spend time trying to work out what it means and what the process is.



> NOTE: I presumed that it will be implemented the same way as
> check_pending_builds since the checks are very similar, so that'll mean
> creating the following config file:
> "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script
> placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this
> also needs to be created).

That's useful, thanks.
What ages should it go to warning and critical at?
@pir: Restarting the slaves when they stop taking jobs for longer periods of time (e.g. more than 10 hours) is generally done by the Buildduty team, but other Releng folks can do that as well, sorry for the confusion. I will update the mana page with this info.
The list of steps to take when such problems occur may also vary from case to case, so a more thorough investigation is probably needed.

@catlee: I initially thought of the following thresholds for the critical and warning states (also specified as defaults in the script):
    - CRITICAL: 43200 seconds --> 12 hours 
    - WARNING: 21600 seconds --> 6 hours

Do you prefer other values for this?
Flags: needinfo?(catlee)
I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would that happen too frequently?
Flags: needinfo?(catlee)
Flags: needinfo?(ashish)
@catlee: I was wondering if there are plans to implement this nagios check in the near future. My opinion is that 12h for CRITICAL and 3h for WARNING would be fine at the moment.

Also, we have another two bugs in the buildduty queue which basically point to the same thing as this one: bug 978956 and bug 1017551. If it's OK with you, I think we can add them to this bug's list of dependencies and mark them as resolved when the check is implemented.

If you have other suggestions, please feel free to mention them :-). Thanks.
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #41)
> I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would
> that happen too frequently?

There's no practical difference between CRITICAL and WARNING for us; they both page people. 3h seems low.


This bug has been rather orphaned here: Ashish is no longer in the MOC, it is not in the MOC queue, and it holds various bits of information about different checks, so it is harder to get context from.

We're rather short-handed at the moment, so if you still want a check implemented, I would suggest the best way to make sure that happens is to open a dependent bug in Infrastructure & Operations > MOC: Service Requests with a clear description of what is still required, a pointer to the docs, etc., and someone in the team can pick it up.
:aselagea: let's go with 3 and 12 and we can tweak that later if need be.
:pir: the expectation is not that the MOC will be responding to any pages. This check should only report to #buildduty. The machine is managed by IT puppet, though, so the buildduty folks just needed the help because they don't have access to that.
Depends on: 1248589
(In reply to Amy Rich [:arr] [:arich] from comment #44)
> :pir: the expectation is not that the moc will be responding to any pages.
> This check should only report to #buildduty. The machine is managed by IT
> puppet, though, so the buildduty folks just needed the help because they
> don't have access to that.

Ah, that makes a lot more sense. I've commented to that effect on the child bug and will get someone to work on implementing the check. We're rather shorthanded at the moment though.
Both checks are implemented and working.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Flags: needinfo?(catlee)
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard