Closed
Bug 1220191
Opened 9 years ago
Closed 9 years ago
create nagios checks for buildbot backlog age and master lag
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: aselagea)
References
Details
Attachments
(2 files, 6 obsolete files)
645 bytes, patch
2.71 KB, patch (catlee: review+, aselagea: checked-in+)
Today we noticed that our Windows backlog was very long (2 days) and that our master lag was also high (indicating that we needed more masters).
If possible, we should have automated warnings for these two conditions so we can take corrective action sooner.
Comment 1•9 years ago
The age of the backlog could be done as an extension of the current pending check; the logic and code would be very similar.
For master lag, I'd like to start reporting this to statsd/graphite from the log parsing on the masters. Can we alert based on some threshold in statsd?
Assignee
Comment 2•9 years ago
For the master lag, do we have some metrics on this at the moment?
Flags: needinfo?(catlee)
Assignee
Updated•9 years ago
Assignee: nobody → alin.selagea
Comment 3•9 years ago
Yes, in graphite, if you go to 'User graphs -> catlee@mozilla.com -> master_lag', you can see some.
Reporter
Comment 4•9 years ago
I know that IT has set up alerts based on graphite data. Ashish might be a good person to ping to ask about how that was done.
Flags: needinfo?(catlee)
Comment 5•9 years ago
What is graphite's host?
Comment 6•9 years ago
Alin is going to work on this. Once we have the location of the lag graph, please update the pending counts page https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts to reflect the steps to take in the case of high lag. The Mana page will also need to be updated to reflect what action to take on the nagios alert.
Assignee
Comment 7•9 years ago
I would also need to know if we have some metrics on the age of the backlog. Do we have some sort of graph for this?
Thanks.
Flags: needinfo?(catlee)
Comment 8•9 years ago
No, we don't have metrics on age of backlog yet. This would be very similar to calculating the size of the backlog though. You should be close with the work going on in bug 1204970.
Flags: needinfo?(catlee)
Assignee
Comment 9•9 years ago
Added the python script to check the age of the backlog. Sample output:
OK Backlog Age: 1h:52m:22s
Process finished with exit code 0
Attachment #8682427 - Flags: review?(catlee)
Comment 10•9 years ago
Comment on attachment 8682427 [details]
check_backlog_age.py
Looks good overall, thanks for putting this together!
I'd like to see a few minor fixes first before we land this.
Can you add a #!/usr/bin/env python line at the top, please? And a standard MPL license header.
>import sys
>import argparse
>import urllib2
>import json
>
>from datetime import datetime
>import calendar
>
>__version__ = "1.0"
>waiting_time = []
>pending_url = 'https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js'
>status_code = {'OK': 0, 'WARNING': 1, "CRITICAL": 2, "UNKNOWN": 3}
>
>
>def get_utc_unix_time():
> d = datetime.utcnow()
> unix_time = calendar.timegm(d.utctimetuple())
> return unix_time
I think time.time() does what you want?
>def get_max_waiting_time(unix_time):
does unix_time refer to 'now'?
maybe have this function return the earliest submitted_at, and then you can subtract from the current time elsewhere.
> response = urllib2.urlopen(pending_url)
> result = json.loads(response.read())
> for key in result['pending'].keys():
> for k1 in result['pending'][key].keys():
> for k2 in range(0, (len(result['pending'][key][k1]))):
> waiting_time.append(unix_time - result['pending'][key][k1][k2]['submitted_at'])
> return max(waiting_time)
>
>
>def pending_builds_status(max_waiting_time, critical_threshold, warning_threshold):
> if max_waiting_time >= critical_threshold:
> return 'CRITICAL'
> elif max_waiting_time >= warning_threshold:
> return 'WARNING'
> else:
> return 'OK'
>
>if __name__ == '__main__':
> parser = argparse.ArgumentParser(version="%(prog)s " + __version__)
> parser.add_argument(
> '-c', '--critical', action='store', type=int, dest='critical_threshold', default=36000, metavar="CRITICAL",
> help='Set CRITICAL level as integer eg. 36000')
> parser.add_argument(
> '-w', '--warning', action='store', type=int, dest='warning_threshold', default=18000, metavar="WARNING",
> help='Set WARNING level as integer eg. 18000')
> args = parser.parse_args()
>
> try:
> unix_time = get_utc_unix_time()
> max_waiting_time = get_max_waiting_time(unix_time)
> status = pending_builds_status(max_waiting_time, args.critical_threshold, args.warning_threshold)
> m, s = divmod(max_waiting_time, 60)
> h, m = divmod(m, 60)
> time = "%dh:%02dm:%02ds" % (h, m, s)
this logic would be great in a function
> print '%s Backlog Age: %s' % (status, time)
> sys.exit(status_code[status])
> except Exception as e:
> print e
> sys.exit(status_code.get('UNKNOWN'))
>
Attachment #8682427 - Flags: review?(catlee) → review-
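The review suggestions above (use the current time directly, have the fetch helper return the earliest submitted_at, and move the h:m:s formatting into a function) can be sketched as follows. This is an illustrative rewrite, not the patch that landed; it takes the already-parsed 'pending' object as an argument so it can be exercised without hitting the network.

```python
import time

def get_min_submitted_at(pending):
    # 'pending' is the parsed 'pending' object from builds-pending.js,
    # shaped like {branch: {revision: [request, ...]}}. Returns the
    # earliest submitted_at across all pending requests.
    submitted = []
    for branch in pending.values():
        for requests in branch.values():
            for request in requests:
                submitted.append(request['submitted_at'])
    return min(submitted)

def format_age(seconds):
    # Render an age in seconds as e.g. '1h:52m:22s'.
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return '%dh:%02dm:%02ds' % (h, m, s)

# The backlog age would then be:
#   format_age(time.time() - get_min_submitted_at(result['pending']))
```

For example, format_age(6742) gives '1h:52m:22s', the same shape as the sample output in comment 9.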
Assignee
Comment 11•9 years ago
Since our working hours do not seem to overlap, could you please help with some tips/suggestions on how to set up alerts based on graphite data? Or maybe point me to a person who could help during my working hours (GMT+2)?
Thanks!
Flags: needinfo?(ashish)
Assignee
Comment 13•9 years ago
Updated the script.
Attachment #8682427 - Attachment is obsolete: true
Attachment #8683016 - Flags: review?(catlee)
Comment 14•9 years ago
Sorry, I know nothing about alerts based on graphite data.
Comment 15•9 years ago
Tried check_graphite_data with info in Comment 3:
> [ashish@nagios1.private.scl3 ~]$ /usr/lib64/nagios/plugins/mozilla/check_graphite_data --url="https://graphite-scl3.mozilla.org/render/?target=highestCurrent(hosts.buildbot-master*.statsd.latency.total_master_lag-percentile-95,5)" --critupper --warn 125 --crit 150
> Current value: 148.0704, warn threshold: 125.0, crit threshold: 150.0
> [ashish@nagios1.private.scl3 ~]$ echo $?
> 1
I tried with random thresholds, just to verify. Ideally I would like this set up on nagios-releng, but that'll need an ACL to graphite-scl3 (request in a separate bug). Otherwise I'm all clear to set this up with graphite data.
Reporter
Comment 16•9 years ago
What do we want the check thresholds to be for warn and crit? The flow is in place and ashish is ready to set up the check as soon as we give him some numbers.
Flags: needinfo?(catlee)
Flags: needinfo?(alin.selagea)
Comment 17•9 years ago
master lag warning should be 300s (5 minutes), and critical should be 600s (10 minutes).
coop may have a better idea of what backlog age thresholds should be. warning at 6h and critical at 12h?
Flags: needinfo?(catlee) → needinfo?(coop)
Comment 18•9 years ago
(In reply to Chris AtLee [:catlee] from comment #17)
> coop may have a better idea of what backlog age thresholds should be.
> warning at 6h and critical at 12h?
That's a reasonable starting point. We can always tweak them as required.
Flags: needinfo?(coop)
Comment 19•9 years ago
Alright, I've set up the buildbot master lag check to check the metric every 2 mins and alert if the thresholds in Comment 17 are breached.
Flags: needinfo?(alin.selagea)
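The alerting side is a simple threshold comparison. A hedged sketch of the equivalent logic (the function name is illustrative, and treating values at or above a threshold as breaching it is an assumption consistent with the --critupper run in comment 15):

```python
def lag_status(value, warn=300.0, crit=600.0):
    # Map a master-lag sample (in seconds) to a Nagios state name and
    # exit code. Defaults are the 5-minute warning / 10-minute critical
    # thresholds from comment 17.
    if value >= crit:
        return 'CRITICAL', 2
    if value >= warn:
        return 'WARNING', 1
    return 'OK', 0
```

With the thresholds from comment 15 (--warn 125 --crit 150), the 148.0704 sample maps to ('WARNING', 1), matching the exit status shown there.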
Reporter
Comment 20•9 years ago
I've modified the check slightly so that we're not checking as often and we wait a bit before alerting (this is more of a long-term issue than an immediate "everything is on fire" check).
- normal_check_interval => 2,
- max_check_attempts => 1,
+ normal_check_interval => 5,
+ max_check_attempts => 5,
Committed revision 110283.
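In Nagios service-definition terms the change corresponds to roughly the following (only the two changed settings come from the diff above; the service description and other fields are illustrative):

```
define service {
    service_description     buildbot master lag
    normal_check_interval   5
    max_check_attempts      5
}
```

With a 5-minute interval and 5 check attempts, the alert only fires after the lag has stayed high across several consecutive checks, rather than on the first bad sample.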
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 21•9 years ago
Thanks :arr. This bug is still pending work to set up the buildbot backlog age check from Comment 13...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 22•9 years ago
Comment on attachment 8683016 [details]
check_backlog_age.py
def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for key in result['pending'].keys():
        for k1 in result['pending'][key].keys():
            for k2 in range(0, (len(result['pending'][key][k1]))):
                submitted_at.append(result['pending'][key][k1][k2]['submitted_at'])
    return min(submitted_at)
this function is a bit confusing. would something like this work?
def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for branch in result['pending'].keys():
        for revision in result['pending'][branch].keys():
            for request in result['pending'][branch][revision]:
                submitted_at.append(request['submitted_at'])
    return min(submitted_at)
Attachment #8683016 - Flags: review?(catlee) → review-
Assignee
Comment 23•9 years ago
Thanks, :catlee. Updated the script to reflect your suggestion. I also modified the thresholds for issuing alerts in nagios:
WARNING: 21600 seconds --> 6 hours
CRITICAL: 43200 seconds --> 12 hours
Attachment #8683016 - Attachment is obsolete: true
Attachment #8688909 - Flags: review?(catlee)
Updated•9 years ago
Attachment #8688909 - Flags: review?(catlee) → review+
Assignee
Comment 24•9 years ago
I guess it would be useful if we had the alert for the backlog age too.
:ashish, could you set up this alert please? Or is there someone else we should ask for that?
Thanks :)
Flags: needinfo?(ashish)
Comment 25•9 years ago
Sorry, I forgot about this bug! I can help set that up too.
Flags: needinfo?(ashish)
Comment 26•9 years ago
Last request: would it be possible to remove the dependency on argparse? That would avoid installing the package on all Nagios servers (since they're all built and maintained alike). Thanks!
Comment 27•9 years ago
Where would this check run? What version of python do they have installed?
Comment 28•9 years ago
Running it from the nagios server is the simplest. Running it from a remote server would require deploying it via NRPE (with puppet code not in my control). FWIW:
> [ashish@nagios1.private.releng.scl3 ~]$ python --version
> Python 2.6.6
Assignee
Comment 29•9 years ago
:catlee: would something like this be enough? Or am I missing something here?
Thanks.
Attachment #8688909 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Assignee
Comment 30•9 years ago
Diff file with respect to the previous version of the patch.
Assignee
Comment 31•9 years ago
Updated the script to replace 'argparse' with 'optparse' (which is available in Python 2.6).
Attachment #8698432 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Attachment #8698965 - Flags: review?(catlee)
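A minimal sketch of what the optparse port might look like (not the landed patch; defaults are the 6h/12h thresholds from comment 23, and since optparse still ships with later Pythons this also runs on the Python 2.6 found on the Nagios servers):

```python
from optparse import OptionParser

def build_parser():
    # Mirror the argparse options from the earlier patch using only
    # optparse, so no extra package needs installing on the servers.
    parser = OptionParser(version="%prog 1.0")
    parser.add_option('-c', '--critical', type='int',
                      dest='critical_threshold', default=43200,
                      metavar='CRITICAL',
                      help='Set CRITICAL level in seconds, e.g. 43200')
    parser.add_option('-w', '--warning', type='int',
                      dest='warning_threshold', default=21600,
                      metavar='WARNING',
                      help='Set WARNING level in seconds, e.g. 18000')
    return parser

# Example: override only the warning threshold.
opts, _ = build_parser().parse_args(['-w', '18000'])
```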
Assignee
Comment 32•9 years ago
Also uploaded the diff file.
Attachment #8698932 - Attachment is obsolete: true
Assignee
Comment 33•9 years ago
Added the patch for the new script.
Attachment #8698965 - Attachment is obsolete: true
Attachment #8698965 - Flags: review?(catlee)
Attachment #8709422 - Flags: review?(catlee)
Updated•9 years ago
Attachment #8709422 - Flags: review?(catlee) → review+
Assignee
Updated•9 years ago
Attachment #8709422 - Flags: checked-in+
Assignee
Comment 34•9 years ago
The script has been modified to use 'optparse' (available in Python 2.6) instead of 'argparse'.
@ashish: when possible, can we please have this nagios alert implemented too? :)
Thanks!
Flags: needinfo?(ashish)
Comment 35•9 years ago
For any checks to be implemented we'll need an example of the script's use, its options, the target hosts to run it against, the levels it should alert at, and documentation for what we should do when it does alert, who to escalate to, etc.
Assignee
Comment 36•9 years ago
I described the check here: https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age
NOTE: I presumed that it will be implemented the same way as check_pending_builds since the checks are very similar, so that'll mean creating the following config file: "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this also needs to be created).
Let me know if you have further suggestions/questions.
Thanks.
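Assuming the check is wired up the same way as check_pending_builds, the NRPE config file described above would look roughly like this (the command name, thresholds, and exact line are assumptions, not a copy of the deployed file):

```
# /etc/nagios/nrpe.d/check_backlog_age.cfg (sketch)
command[check_backlog_age]=/usr/lib64/nagios/plugins/custom/check_backlog_age.py -w 21600 -c 43200
```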
Reporter
Comment 37•9 years ago
To be consistent with all the others, the check should be /usr/lib64/nagios/plugins/custom/check_backlog_age with no .py extension.
Comment 38•9 years ago
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #36)
> I described the check here:
> https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age
"check SlaveHealth interface to see the number of pending jobs on each pool, restart broken slaves if not taking jobs for longer periods of time"
If that's something you want the MOC to do, then it needs much more specific detail on how to do those things (step-by-step instructions for someone who doesn't know what any of those things mean or where to find them, and what "check" and "longer periods of time" mean in practice).
If it's not something you want the MOC to do it needs to be labelled as such otherwise oncall staff will see that and spend time trying to work out what it means and what the process is.
> NOTE: I presumed that it will be implemented the same way as
> check_pending_builds since the checks are very similar, so that'll mean
> creating the following config file:
> "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script
> placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this
> also needs to be created).
That's useful, thanks.
Comment 39•9 years ago
What ages should it go to warning and critical at?
Assignee
Comment 40•9 years ago
@pir: Restarting the slaves when they stop taking jobs for longer periods of time (e.g. more than 10 hours) is generally done by the Buildduty team, but other Releng folks can do that as well; sorry for the confusion. I will update the mana page with this info.
The list of steps to be done when such problems occur may also vary from case to case, so more thorough investigation is probably needed.
@catlee: I initially thought of the following thresholds for the critical and warning states (also specified as defaults in the script):
- CRITICAL: 43200 seconds --> 12 hours
- WARNING: 21600 seconds --> 6 hours
Do you prefer other values for this?
Flags: needinfo?(catlee)
Comment 41•9 years ago
I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would that happen too frequently?
Flags: needinfo?(catlee)
Updated•9 years ago
Flags: needinfo?(ashish)
Assignee
Comment 42•9 years ago
@catlee: I was wondering if there are plans to implement this nagios check in the near future. My opinion is that 12h for CRITICAL and 3h for WARNING would be fine at the moment.
Also, we have another two bugs in the buildduty queue which basically point to the same thing as this one: bug 978956 and bug 1017551. If it's OK with you, I think we can add them to this bug's list of dependencies and mark them as resolved when the check is implemented.
If you have other suggestions, please feel free to mention them :-). Thanks.
Flags: needinfo?(catlee)
Comment 43•9 years ago
(In reply to Chris AtLee [:catlee] from comment #41)
> I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would
> that happen too frequently?
There's no practical difference between CRITICAL and WARNING for us, they both page people. 3h seems low.
This bug has been rather orphaned here: Ashish is no longer in the MOC, the bug is not in the MOC queue, and it holds bits of information about several different checks, so it is harder to get context from.
We're rather short-handed at the moment, so if you still want a check implemented, I would suggest the best way to make sure that happens is to open a dependent bug in Infrastructure & Operations > MOC: Service Requests with a clear description of what is still required, a pointer to the docs, etc., and someone on the team can pick it up.
Reporter
Comment 44•9 years ago
:aselagea: let's go with 3 and 12 and we can tweak that later if need be.
:pir: the expectation is not that the MOC will be responding to any pages. This check should only report to #buildduty. The machine is managed by IT puppet, though, so the buildduty folks just needed help because they don't have access to that.
Comment 45•9 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #44)
> :pir: the expectation is not that the moc will be responding to any pages.
> This check should only report to #buildduty. The machine is managed by IT
> puppet, though, so the buildduty folks just needed the help because they
> don't have access to that.
Ah, that makes a lot more sense. I've commented to that effect on the child bug and will get someone to work on implementing the check. We're rather shorthanded at the moment though.
Assignee
Comment 47•9 years ago
Both checks are implemented and working.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Flags: needinfo?(catlee)
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard