Closed Bug 1330293 Opened 7 years ago Closed 6 years ago

Prevent nagios_blocker_checker.pl from running longer than 5 minutes (and log to sentry if it does)

Categories

(bugzilla.mozilla.org :: General, defect, P3)

Production
defect

RESOLVED FIXED

People

(Reporter: gcox, Assigned: gcox)

Details

User Story

In infra bug 1329995, there was a stackup of 5 nagios_blocker_checker.pl scripts, sitting around for over an hour.  This wasted swap and pushed the box into alarm.

NRPE calls on this box use -t 60 as a timeout, but that timeout didn't propagate down to the actual Perl process, leaving the processes orphaned. That is why I think the script deserves a high-but-finite time limit of its own.

Correlation is not causation, but the timestamp on the processes coincides with an admin hopping onto the admin server and running a BMO update, in case this alters your thinking.

Attachments

(1 file)

45 bytes, text/x-github-pull-request
nagios_blocker_checker.pl: if it doesn't complete in 5 minutes, the odds are that NRPE has long since given up and abandoned it. The Perl script should have something (à la alarm(300)) to cut itself off in case it gets stuck.
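For illustration, here is a minimal sketch of the self-imposed time limit suggested above, using Perl's built-in alarm and the usual eval/die idiom. It is not the code from the eventual pull request: run_with_time_limit is a hypothetical helper, the closures stand in for the real bug query, and the limits are shortened from 300 seconds so the example finishes quickly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the BMO codebase): run $work, but give up
# after $limit seconds. Returns (1, result) on success, (0, reason) otherwise.
sub run_with_time_limit {
    my ($limit, $work) = @_;
    my $result;
    my $ok = eval {
        # Dying inside the handler unwinds the eval. The trailing newline
        # stops Perl from appending "at ... line N" to the message.
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($limit);      # request SIGALRM in $limit seconds
        $result = $work->();
        alarm(0);           # finished in time: cancel the pending alarm
        1;
    };
    return (1, $result) if $ok;

    my $reason = $@ || "unknown error\n";
    alarm(0);               # also cancel if $work died for another reason
    chomp $reason;
    return (0, $reason);
}

# Stand-ins for the real check. The slow one blocks in select() rather than
# sleep(), because sleep may itself be built on alarm on some systems.
my ($fast_ok, $fast_value)  = run_with_time_limit(2, sub { return 'checked' });
my ($slow_ok, $slow_reason) = run_with_time_limit(1, sub {
    select(undef, undef, undef, 5);
    return 'never reached';
});

print "fast: ok=$fast_ok value=$fast_value\n";
print "slow: ok=$slow_ok reason=$slow_reason\n";
```

In a real check, the failure branch is where the script would report the timeout and exit with a Nagios-appropriate status instead of lingering after NRPE has stopped waiting.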
These processes have been running for two hours so far; killing them...

[root@bugzillaadm.private.scl3 pradcliffe]# ps auxww | grep '1640[12]'
root     16401 26.5 40.1 2477232 1576324 ?     R    14:57  33:35 /usr/bin/perl /data/bugzilla/www/bugzilla.mozilla.org/scripts/nagios_blocker_checker.pl server-ops-devservices@mozilla-org.bugs
root     16402 26.5 39.3 2454016 1545352 ?     R    14:57  33:36 /usr/bin/perl /data/bugzilla/www/bugzilla.mozilla.org/scripts/nagios_blocker_checker.pl --product Infrastructure & Operations --component MOC: Projects --severity blocker
[root@bugzillaadm.private.scl3 pradcliffe]# date
Tue Jan 24 17:04:16 UTC 2017
:dkl, could you work on this? We get alerted every week when you guys push updates.

Thu 09:01:02 PST [5723] bugzillaadm.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 23% free (466 MB out of 2047 MB) (http://m.mozilla.org/Swap)
This is still going on; I got another page and had to kill another script today.
dkl: do you have any bandwidth to get a timeout in the perl script this quarter?
Flags: needinfo?(dkl)
(In reply to Keegan Ferrando [:fauweh] from comment #4)
> dkl: do you have any bandwidth to get a timeout in the perl script this
> quarter?

I am not working on BMO at the moment. Needinfo'ing dylan about this question.

dkl
Flags: needinfo?(dkl) → needinfo?(dylan)
It's on my radar, but not very high at the moment.
Flags: needinfo?(dylan)
Priority: -- → P3
Tossed a PR at this: https://github.com/mozilla-bteam/bmo/pull/327
I feel pretty good about the alarm bit, but I'm not so sure about the sentry bit.
The sentry bit will work. I was surprised we exported that sentry function, but apparently we do.
Assignee: nobody → gcox
Attached file github pull request
Merged, so marking this as fixed until proven otherwise.  The deployment of the script could trigger the issue one more time, but, oh well, can't win 'em all.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED