Closed Bug 917462 Opened 11 years ago Closed 11 years ago

please adjust nagios alert for gaia_bumper.stamp on buildbot-master66

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: ericz)

References

Details

(Whiteboard: [reit-ops])

Attachments

(1 file)

buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp

Looks like it's warning at ~574 seconds and critical at ~930 seconds?

Could we adjust this to warn at 1200 seconds and critical at 1800 seconds?
It keeps flapping with no-human-intervention-needed notifications.
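To illustrate why the higher thresholds should reduce flapping, here is a minimal sketch that counts state changes over a series of sampled file ages. The sample ages are hypothetical (not taken from real Nagios data), and the function names are made up for this example; it only models the age comparison, not the actual plugin.

```python
def classify_age(age_seconds, warn, crit):
    """Return a Nagios-style state for a file of the given age."""
    if age_seconds > crit:
        return "CRITICAL"
    if age_seconds > warn:
        return "WARNING"
    return "OK"


def count_state_changes(ages, warn, crit):
    """Count how often consecutive samples land in different states."""
    states = [classify_age(age, warn, crit) for age in ages]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical ages (seconds) seen by successive checks of a stamp file
# that is refreshed every several minutes.
sample_ages = [60, 600, 30, 590, 45, 950, 20]

old_changes = count_state_changes(sample_ages, warn=574, crit=930)
new_changes = count_state_changes(sample_ages, warn=1200, crit=1800)
print(old_changes, new_changes)
```

With these made-up samples, every check under the old thresholds flips state, while the proposed 1200/1800 thresholds keep the service in a single state.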
Whiteboard: [reit-ops]
Assignee: infra → server-ops
Component: Infrastructure: Monitoring → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: jdow → shyam
Assignee: server-ops → eziegenhorn
This is committed in rev 75199...will take a bit to get pushed out via puppet.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you!
This appeared today:

[19:39]	<nagios-releng>	Mon 19:39:55 PDT [4316] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 559 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)

Do you know how long it'll take to take effect?
Flags: needinfo?(eziegenhorn)
It's already in effect:

define service{
    use                     generic-service
    host_name               buildbot-master66.srv.releng.usw2.mozilla.com
    service_description     File Age - /builds/gaia_bumper/gaia_bumper.stamp
    check_command           check_file_age!1200!1800!/builds/gaia_bumper/gaia_bumper.stamp
}

That's odd you saw it show up :|
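For readers unfamiliar with the check_command syntax above: Nagios splits the value on "!" into the command name and positional $ARG1$, $ARG2$, ... macros. A minimal sketch of that split follows; the helper name is hypothetical, and it ignores escaped "\!" sequences for simplicity.

```python
def split_check_command(check_command):
    """Split a Nagios check_command value into its name and $ARGn$ macros."""
    name, *args = check_command.split("!")
    macros = {f"$ARG{i}$": value for i, value in enumerate(args, start=1)}
    return name, macros


name, macros = split_check_command(
    "check_file_age!1200!1800!/builds/gaia_bumper/gaia_bumper.stamp"
)
print(name)
for macro, value in macros.items():
    print(macro, "=", value)
```

How those macros are then mapped onto the plugin's own flags depends on the command definition on the Nagios server and, for NRPE checks, on the remote host, neither of which is quoted in this bug.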
:aki Yeah that has been in effect for two weeks, I have no idea how you saw this alert.  I double-checked nagios1.private.releng.scl3 and it has the correct, current config values.  I looked in the logs there and the last time this alert shows up was the end of June.  What channel did you see this alert in?
Flags: needinfo?(eziegenhorn)
This was in #buildduty.
Is nagios-releng controlled by this service?
Ok, so :ashish had some great insights into why this isn't working right (the host isn't puppetized and a bad interaction with a weirdly-defined check) and he also got me access to the box, which will be a great help.  I believe the critical threshold is working now and am still working on the warning threshold, which still seems broken.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ran a number of tests and this appears to be working reliably now.  Let me know if it false-alarms any longer.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
From today: [21:01]	<nagios-releng>	Sun 21:01:47 PDT [4742] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 501 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ok, what it was actually alerting about is that the file is 0 bytes.  For whatever reason, we specify it should be at least 1 byte big.  This is not the behavior we want for this box, but I'm not sure about others.  Additionally, we specify the -m flag to check_file_age, and it doesn't take a -m flag.  Perhaps that's from an older version.  Since fixing this affects other hosts I'm going to ask Ashish to review the patch before I put it in.
Attachment #824183 - Flags: review?(ashish)
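The interaction described in the comment above can be sketched as a combined age and size check. This is a simplified model, not the plugin's source: it assumes a threshold of 0 means "disabled" and that a file smaller than the size threshold trips the alert. The function name is hypothetical.

```python
def check_file(age_seconds, size_bytes, warn_age, crit_age,
               warn_size=0, crit_size=0):
    """Return a Nagios-style state from file age and size.

    Assumption: a threshold of 0 disables that particular comparison.
    """
    if (crit_age and age_seconds > crit_age) or \
            (crit_size and size_bytes < crit_size):
        return "CRITICAL"
    if (warn_age and age_seconds > warn_age) or \
            (warn_size and size_bytes < warn_size):
        return "WARNING"
    return "OK"


# A 0-byte stamp file that is 501 seconds old, with age thresholds 1200/1800.
with_floor = check_file(501, 0, 1200, 1800, warn_size=1)
without_floor = check_file(501, 0, 1200, 1800, warn_size=0)
print(with_floor, without_floor)
```

Under this model, a 1-byte minimum makes an empty stamp file warn no matter how fresh it is, which matches the "0 bytes" alerts seen here.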
Comment on attachment 824183 [details] [diff] [review]
checkcommands.pp patch

Review of attachment 824183 [details] [diff] [review]:
-----------------------------------------------------------------

The choice of 1 byte is quite likely historical. I can't imagine this change breaking anything, but keep an eye out after it's pushed out. The -m flag applies to an earlier version of the bundled plugin but doesn't work unless used along with a different NRPE plugin, check_file_age2...
Attachment #824183 - Flags: review?(ashish) → review+
Patch committed in r77265.  Will watch #buildduty for a few days.
13:42 nagios-releng: Wed 13:42:07 PDT [4782] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 527 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
13:58 nagios-releng: Wed 13:58:07 PDT [4783] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 1487 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
14:00 nagios-releng: Wed 14:00:08 PDT [4784] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is OK: FILE_AGE OK: /builds/gaia_bumper/gaia_bumper.stamp is 15 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
[12:55]	<nagios-releng>	Fri 12:55:48 PDT [4852] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 591 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
[05:09]	<nagios-releng>	[#buildduty] Tue 05:09:49 PST [4153] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 554 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
[05:19]	<nagios-releng>	[#buildduty] Tue 05:19:50 PST [4159] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is OK: FILE_AGE OK: /builds/gaia_bumper/gaia_bumper.stamp is 61 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
Yeah something is still busted.  To wit:

-sh-4.1$ /usr/lib64/nagios/plugins/check_nrpe -H buildbot-master66.srv.releng.usw2.mozilla.com -t 15 -c check_file_age -a "-w 677 -c 1500 -W 0 -C 0 -m -f /builds/gaia_bumper/gaia_bumper.stamp"
FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 261 seconds old and 0 bytes

That shouldn't warn with those parameters.  Once I regain access to the box I'll troubleshoot more.
Depends on: 937888
We were bumping up against differences in /etc/nagios/nrpe.cfg between releng hosts and infra hosts.  They defined the check_file_age check's arguments differently and it was causing the first argument specified for releng hosts (only buildbot-master66 uses it) to be ignored as it was garbled.  Therefore, it was using the default warning age of 240 seconds.  Dustin just landed a patch to nrpe.cfg for releng hosts to make it match infra hosts.  I will watch it a few more days.
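A rough sketch of the failure mode described above: if the first argument never reaches the plugin, the warning age falls back to its default. The 240-second warning default comes from this thread; the 600-second critical default and the way the argument is "lost" (simply dropped from the list) are assumptions for illustration, as are the function names.

```python
import argparse


def build_parser():
    """Parse a subset of check_file_age-style options (illustrative only)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-w", dest="warn_age", type=int, default=240)  # default per this bug
    parser.add_argument("-c", dest="crit_age", type=int, default=600)  # assumed default
    parser.add_argument("-W", dest="warn_size", type=int, default=0)
    parser.add_argument("-C", dest="crit_size", type=int, default=0)
    parser.add_argument("-f", dest="path")
    return parser


def state_for(argv, age_seconds):
    """Classify a file age using whichever thresholds survive parsing."""
    opts = build_parser().parse_args(argv)
    if age_seconds > opts.crit_age:
        return "CRITICAL"
    if age_seconds > opts.warn_age:
        return "WARNING"
    return "OK"


intended = ["-w", "677", "-c", "1500", "-W", "0", "-C", "0",
            "-f", "/builds/gaia_bumper/gaia_bumper.stamp"]
garbled = intended[2:]  # assumption: the first option and its value are lost

print(state_for(intended, 261))
print(state_for(garbled, 261))
```

With the intended arguments a 261-second-old file is fine; with the first argument dropped, the 240-second default applies and the same file warns, which is consistent with the check_nrpe output quoted earlier.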
This has alerted in the last three days, but they were all valid alerts.  Hesitantly going to close this again.  Thanks for your patience.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard