Closed Bug 1220191 Opened 9 years ago Closed 8 years ago

create nagios checks for buildbot backlog age and master lag

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: aselagea)

References

Details

Attachments

(2 files, 6 obsolete files)

Today we noticed that our windows backlog was very long (2 days), and that our master lag was also high (indicating that we needed more masters).

If possible, we should have automated warnings of these two conditions so we can take corrective action sooner.
The age of the backlog could be done as an extension of the current pending check; the logic and code would be very similar.

For master lag, I'd like to start reporting this to statsd/graphite from the log parsing on the masters. Can we alert based on some threshold in statsd?
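For reference, a minimal sketch of pushing a lag value to statsd as a gauge over UDP; the host, port, and metric name here are assumptions for illustration, not the actual setup on the masters:

import socket

def report_master_lag(lag_seconds, host='localhost', port=8125):
    # statsd gauges use the plain-text format '<metric name>:<value>|g' over UDP
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto('buildbot.master_lag:%d|g' % lag_seconds, (host, port))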
For the master lag, do we have some metrics on this at the moment?
Flags: needinfo?(catlee)
Assignee: nobody → alin.selagea
Yes, in graphite, if you go to 'User graphs -> catlee@mozilla.com -> master_lag', you can see some.
I know that IT has set up alerts based on graphite data. Ashish might be a good person to ping to ask about how that was done.
Flags: needinfo?(catlee)
What is graphite's host?
Alin is going to work on this. Once we have the location of the lag graph, please update the pending counts page https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts to reflect the steps to take in the case of high lag. The mana page will also need to be updated to reflect what action to take on the nagios alert.
I would also need to know if we have some metrics on the age of the backlog. Do we have some sort of graph for this?

Thanks.
Flags: needinfo?(catlee)
No, we don't have metrics on age of backlog yet. This would be very similar to calculating the size of the backlog though. You should be close with the work going on in bug 1204970.
Flags: needinfo?(catlee)
Attached file check_backlog_age.py (obsolete) —
Added the python script to check the age of the backlog. Sample output:

OK Backlog Age: 1h:52m:22s

Process finished with exit code 0
Attachment #8682427 - Flags: review?(catlee)
Comment on attachment 8682427 [details]
check_backlog_age.py

Looks good overall, thanks for putting this together!

I'd like to see a few minor fixes first before we land this.

Can you add a #!/usr/bin/env python header at the top, please? And a standard MPL license header.

>import sys
>import argparse
>import urllib2
>import json
>
>from datetime import datetime
>import calendar
>
>__version__ = "1.0"
>waiting_time = []
>pending_url = 'https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js'
>status_code = {'OK': 0, 'WARNING': 1, "CRITICAL": 2, "UNKNOWN": 3}
>
>
>def get_utc_unix_time():
>    d = datetime.utcnow()
>    unix_time = calendar.timegm(d.utctimetuple())
>    return unix_time

I think time.time() does what you want?


>def get_max_waiting_time(unix_time):

does unix_time refer to 'now'?
maybe have this function return the earliest submitted_at, and then you can subtract from the current time elsewhere.

>    response = urllib2.urlopen(pending_url)
>    result = json.loads(response.read())
>    for key in result['pending'].keys():
>        for k1 in result['pending'][key].keys():
>            for k2 in range(0, (len(result['pending'][key][k1]))):
>                waiting_time.append(unix_time - result['pending'][key][k1][k2]['submitted_at'])
>    return max(waiting_time)
>
>
>def pending_builds_status(max_waiting_time, critical_threshold, warning_threshold):
>    if max_waiting_time >= critical_threshold:
>        return 'CRITICAL'
>    elif max_waiting_time >= warning_threshold:
>        return 'WARNING'
>    else:
>        return 'OK'
>
>if __name__ == '__main__':
>    parser = argparse.ArgumentParser(version="%(prog)s " + __version__)
>    parser.add_argument(
>        '-c', '--critical', action='store', type=int, dest='critical_threshold', default=36000, metavar="CRITICAL",
>        help='Set CRITICAL level as integer eg. 36000')
>    parser.add_argument(
>        '-w', '--warning', action='store', type=int, dest='warning_threshold', default=18000, metavar="WARNING",
>        help='Set WARNING level as integer eg. 18000')
>    args = parser.parse_args()
>
>    try:
>        unix_time = get_utc_unix_time()
>        max_waiting_time = get_max_waiting_time(unix_time)
>        status = pending_builds_status(max_waiting_time, args.critical_threshold, args.warning_threshold)
>        m, s = divmod(max_waiting_time, 60)
>        h, m = divmod(m, 60)
>        time = "%dh:%02dm:%02ds" % (h, m, s)

this logic would be great in a function (a sketch follows after the quoted code)

>        print '%s Backlog Age: %s' % (status, time)
>        sys.exit(status_code[status])
>    except Exception as e:
>        print e
>        sys.exit(status_code.get('UNKNOWN'))
>
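A minimal sketch of the duration-formatting helper suggested above (the function name is hypothetical):

def format_duration(seconds):
    # render a duration in seconds as 'Hh:MMm:SSs'
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return "%dh:%02dm:%02ds" % (h, m, s)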
Attachment #8682427 - Flags: review?(catlee) → review-
Since our working hours do not seem to overlap, could you please help with some tips/suggestions on how to set up alerts based on graphite data? Or maybe point me to a person who could help during my working hours (GMT+2)?

Thanks!
Flags: needinfo?(ashish)
cc'ing :pir, who is in London.
Flags: needinfo?(ashish)
Attached file check_backlog_age.py (obsolete) —
updated the script
Attachment #8682427 - Attachment is obsolete: true
Attachment #8683016 - Flags: review?(catlee)
Sorry, I know nothing about alerts based on graphite data.
Tried check_graphite_data with info in Comment 3:

> [ashish@nagios1.private.scl3 ~]$ /usr/lib64/nagios/plugins/mozilla/check_graphite_data --url="https://graphite-scl3.mozilla.org/render/?target=highestCurrent(hosts.buildbot-master*.statsd.latency.total_master_lag-percentile-95,5)" --critupper --warn 125 --crit 150
> Current value: 148.0704, warn threshold: 125.0, crit threshold: 150.0
> [ashish@nagios1.private.scl3 ~]$ echo $?
> 1

I tried with random thresholds, just to verify. Ideally I would like this set up on nagios-releng, but that'll need an ACL to graphite-scl3 (request in a separate bug). Otherwise, I'm all clear to set this up with graphite data.
Depends on: 1222474
What do we want the check thresholds to be for warn and crit? The flow is in place and ashish is ready to set up the check as soon as we give him some numbers.
Flags: needinfo?(catlee)
Flags: needinfo?(alin.selagea)
master lag warning should be 300s (5 minutes), and critical should be 600s (10 minutes).

coop may have a better idea of what backlog age thresholds should be. warning at 6h and critical at 12h?
Flags: needinfo?(catlee) → needinfo?(coop)
(In reply to Chris AtLee [:catlee] from comment #17) 
> coop may have a better idea of what backlog age thresholds should be.
> warning at 6h and critical at 12h?

That's a reasonable starting point. We can always tweak them as required.
Flags: needinfo?(coop)
Alright, I've set up the buildbot master lag check to check the metric every 2 mins and alert if the thresholds in Comment 17 are breached.
Flags: needinfo?(alin.selagea)
I've modified the check slightly so that we're not checking as often and we wait a bit before alerting (this is more of a long-term issue than an immediate "everything is on fire" check).

-            normal_check_interval => 2,
-            max_check_attempts => 1,
+            normal_check_interval => 5,
+            max_check_attempts => 5,


Committed revision 110283.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thanks :arr. This bug is still pending work to set up the buildbot backlog age check from Comment 13...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8683016 [details]
check_backlog_age.py

def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for key in result['pending'].keys():
        for k1 in result['pending'][key].keys():
            for k2 in range(0, (len(result['pending'][key][k1]))):
                submitted_at.append(result['pending'][key][k1][k2]['submitted_at'])
    return min(submitted_at)

this function is a bit confusing. would something like this work?

def get_min_submitted_at():
    response = urllib2.urlopen(pending_url)
    result = json.loads(response.read())
    for branch in result['pending'].keys():
        for revision in result['pending'][branch].keys():
            for request in result['pending'][branch][revision]:
                submitted_at.append(request['submitted_at'])
    return min(submitted_at)
Attachment #8683016 - Flags: review?(catlee) → review-
Attached file check_backlog_age.py (obsolete) —
Thanks, :catlee. Updated the script to reflect your suggestion. I also modified the thresholds for issuing alerts in nagios:

WARNING: 21600 seconds --> 6 hours
CRITICAL: 43200 seconds --> 12 hours
Attachment #8683016 - Attachment is obsolete: true
Attachment #8688909 - Flags: review?(catlee)
Attachment #8688909 - Flags: review?(catlee) → review+
I guess it would be useful if we had the alert for the backlog age too. 
:ashish, could you set up this alert please? Or is there someone else we should ask for that?

Thanks :)
Flags: needinfo?(ashish)
Sorry, I forgot about this bug! I can help set that up too.
Flags: needinfo?(ashish)
Last request - would it be possible to remove the dependency on argparse? That would avoid installing the package on all Nagios servers (since they're all built and maintained alike). Thanks!
Where would this check run? What version of python do they have installed?
Running it from the nagios server is the simplest. Running it from a remote server will need it to be deployed via NRPE (with puppet code not in my control). FWIW:

> [ashish@nagios1.private.releng.scl3 ~]$ python --version
> Python 2.6.6
Attached file check_backlog_age.py (obsolete) —
:catlee, would something like this be enough? Or am I missing something here?

Thanks.
Attachment #8688909 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Attached patch bug_1220191.patch (obsolete) — Splinter Review
Diff file with respect to the previous version of the patch.
Attached file check_backlog_age.py (obsolete) —
Updated the script to replace 'argparse' with 'optparse' (which is available in Python 2.6).
Attachment #8698432 - Attachment is obsolete: true
Attachment #8698432 - Flags: review?(catlee)
Attachment #8698965 - Flags: review?(catlee)
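For context, a minimal sketch of the optparse-based option handling; the option names mirror the earlier argparse version and the defaults follow the thresholds discussed above:

from optparse import OptionParser

__version__ = "1.0"

parser = OptionParser(version="%prog " + __version__)
parser.add_option('-c', '--critical', type='int', dest='critical_threshold',
                  default=43200, metavar='CRITICAL',
                  help='Set CRITICAL level in seconds, e.g. 43200')
parser.add_option('-w', '--warning', type='int', dest='warning_threshold',
                  default=21600, metavar='WARNING',
                  help='Set WARNING level in seconds, e.g. 21600')
options, args = parser.parse_args()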
Also uploaded the diff file.
Attachment #8698932 - Attachment is obsolete: true
Blocks: 1225475
Added the patch for the new script.
Attachment #8698965 - Attachment is obsolete: true
Attachment #8698965 - Flags: review?(catlee)
Attachment #8709422 - Flags: review?(catlee)
Attachment #8709422 - Flags: review?(catlee) → review+
Attachment #8709422 - Flags: checked-in+
The script has been modified to use 'optparse' (available in Python 2.6) instead of 'argparse'.

@ashish: when possible, can we please have this nagios alert implemented too? :)

Thanks!
Flags: needinfo?(ashish)
For any checks to be implemented we'll need an example of the script's use, its options, and the target hosts to run it against, as well as the levels it should alert at, documentation for what we should do when it does alert, who to escalate to, etc.
I described the check here: https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age

NOTE: I presumed that it will be implemented the same way as check_pending_builds since the checks are very similar, so that'll mean creating the following config file: "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this also needs to be created).
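For illustration, a rough sketch of what that NRPE config entry could look like (the thresholds and exact command line are assumptions, following the pattern of the existing check_pending_builds config):

# /etc/nagios/nrpe.d/check_backlog_age.cfg
command[check_backlog_age]=/usr/lib64/nagios/plugins/custom/check_backlog_age.py -w 21600 -c 43200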

Let me know if you have further suggestions/questions.

Thanks.
To be consistent with all the others, the check should be /usr/lib64/nagios/plugins/custom/check_backlog_age with no .py extension.
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #36)
> I described the check here:
> https://mana.mozilla.org/wiki/display/NAGIOS/check_backlog_age

"check SlaveHealth interface to see the number of pending jobs on each pool, restart broken slaves if not taking jobs for longer periods of time"

If that's something you want the MOC to do, then it needs a lot more specific detail on how to do those things (step-by-step instructions for someone who doesn't know what any of those things mean, where to find them, or what "check" and "longer periods of time" mean in practice).

If it's not something you want the MOC to do, it needs to be labelled as such; otherwise on-call staff will see that and spend time trying to work out what it means and what the process is.



> NOTE: I presumed that it will be implemented the same way as
> check_pending_builds since the checks are very similar, so that'll mean
> creating the following config file:
> "/etc/nagios/nrpe.d/check_backlog_age.cfg" which will run the above script
> placed here: "/usr/lib64/nagios/plugins/custom/check_backlog_age.py" (this
> also needs to be created).

That's useful, thanks.
What ages should it go to warning and critical at?
@pir: Restarting the slaves when they stop taking jobs for longer periods of time (e.g. more than 10 hours) is generally done by the Buildduty team, but other Releng folks can do that as well, sorry for the confusion. I will update the mana page with this info.
The list of steps to take when such problems occur may also vary from case to case, so a more thorough investigation is probably needed.

@catlee: I initially thought of the following thresholds for the critical and warning states (also specified as defaults in the script):
    - CRITICAL: 43200 seconds --> 12 hours 
    - WARNING: 21600 seconds --> 6 hours

Do you prefer other values for this?
Flags: needinfo?(catlee)
I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would that happen too frequently?
Flags: needinfo?(catlee)
Flags: needinfo?(ashish)
@catlee: I was wondering if there are plans to implement this nagios check in the near future. My opinion is that 12h for CRITICAL and 3h for WARNING would be fine at the moment.

Also, we have another two bugs in the buildduty queue which basically point to the same thing as this one: bug 978956 and bug 1017551. If it's OK with you, I think we can add them to this bug's list of dependencies and mark them as resolved when the check is implemented.

If you have other suggestions, please feel free to mention them :-). Thanks.
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #41)
> I think CRITICAL at 12h is fine. Could we try setting WARNING at 3h? Would
> that happen too frequently?

There's no practical difference between CRITICAL and WARNING for us; they both page people. 3h seems low.


This bug has been rather orphaned here: Ashish is no longer in the MOC, it is not in the MOC queue, and it holds various bits of information about different checks, so it is harder to get context from.

We're rather short-handed at the moment, so if you still want a check implemented, I would suggest the best way to make sure that happens is to open a dependent bug in Infrastructure & Operations > MOC: Service Requests with a clear description of what is still required, a pointer to the docs, etc., and someone in the team can pick it up.
:aselagea: let's go with 3 and 12 and we can tweak that later if need be.
:pir: the expectation is not that the MOC will be responding to any pages. This check should only report to #buildduty. The machine is managed by IT puppet, though, so the buildduty folks just needed the help because they don't have access to that.
Depends on: 1248589
(In reply to Amy Rich [:arr] [:arich] from comment #44)
> :pir: the expectation is not that the moc will be responding to any pages.
> This check should only report to #buildduty. The machine is managed by IT
> puppet, though, so the buildduty folks just needed the help because they
> don't have access to that.

Ah, that makes a lot more sense. I've commented to that effect on the child bug and will get someone to work on implementing the check. We're rather shorthanded at the moment though.
Both checks are implemented and working.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Flags: needinfo?(catlee)
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard