Closed Bug 800853 Opened 12 years ago Closed 11 years ago

Setup Nagios alert for file-age on hg repo journal.* files to help catch Bug 766810 early

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: ashish)

My theory is that we can catch Bug 766810 on repositories before devs get blocked by having Nagios check the file age of the journal.* files.

The gotcha is that the file may not even exist. The check shouldn't alert if the file is missing, only if it's older than [x]. I don't know what a good value for x is, either.
I believe we have a NRPE plugin that does file age and also returns OK if the file is missing. Ashish, can you check this out?
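For illustration, the behavior being asked for (OK when the file is missing, alert only once it is older than a threshold) can be sketched roughly as below. This is not the NRPE plugin mentioned above; the function name `check_file_age`, the message format, and the threshold parameters are all hypothetical, and only the exit codes follow the usual Nagios plugin convention.

```python
import os
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_file_age(path, warn_secs, crit_secs, now=None):
    """Return (exit_code, message) for a file-age check (hypothetical helper).

    A missing file is treated as OK, because journal.* files are only
    expected to exist while a transaction is in progress.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return OK, f"FILE_AGE OK: {path} does not exist"

    age = int(now - mtime)
    if age >= crit_secs:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {age} seconds old"
    if age >= warn_secs:
        return WARNING, f"FILE_AGE WARNING: {path} is {age} seconds old"
    return OK, f"FILE_AGE OK: {path} is {age} seconds old"
```

A real plugin would wrap this in a small command-line entry point that prints the message and exits with the returned code. The thresholds are left as parameters because a good value is exactly the open question here.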
Assignee: server-ops → ashish
(In reply to Justin Dow [:jabba] from comment #1)
> I believe we have a NRPE plugin that does file age and also returns OK if
> the file is missing. ...

If you don't have one, the GPLv2-licensed "check.files.pl - Check files age and number of files in a directory" [1] could do it for you, for example.

Example use:
Directory content:
> ls
>check_files.pl

a CRITICAL message:
> ./check_files.pl -D . -F '*.pl' -w ~ -c 0
>CRITICAL - *.pl is 1 (more than 0)

an OK message:
> ./check_files.pl -D . -F '*.js' -w ~ -c 0
>OK - 0 *.js files found


It also supports ageing, but I couldn't manage to produce a test case
with correct results. :(




[1] http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/check-2Efiles-2Epl--2D-Check-files-age-and-number-of-files-in-a-directory/details
QA Contact: jdow → shyam
This is complete. I've set a check for journal.bookmarks on try and integration that will alert if the file has been present for ~5 mins.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
:fox2mike mentioned something about making this alert on IRC only, as it's been flapping all day.
I had to manually run the following, as per the mana doc:
[root@hgssh1.dmz.scl3 ~]# cd /repo/hg/mozilla/integration/mozilla-inbound/.hg/
[root@hgssh1.dmz.scl3 .hg]# ls -aFl journal.*
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  0 Apr  2 14:35 journal.bookmarks
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  7 Apr  2 14:35 journal.branch
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3 39 Apr  2 14:35 journal.desc
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  0 Apr  2 14:35 journal.dirstate
[root@hgssh1.dmz.scl3 .hg]# mkdir -v ~/`date +%F-%R`
mkdir: created directory `/root/2013-04-02-14:57'
[root@hgssh1.dmz.scl3 .hg]# mv -v journal.bookmarks journal.branch journal.desc journal.dirstate !$
mv -v journal.bookmarks journal.branch journal.desc journal.dirstate ~/`date +%F-%R`
`journal.bookmarks' -> `/root/2013-04-02-14:57/journal.bookmarks'
removed `journal.bookmarks'
`journal.branch' -> `/root/2013-04-02-14:57/journal.branch'
removed `journal.branch'
`journal.desc' -> `/root/2013-04-02-14:57/journal.desc'
removed `journal.desc'
`journal.dirstate' -> `/root/2013-04-02-14:57/journal.dirstate'
removed `journal.dirstate'

Not sure if we need to tweak the settings or just make it alert on IRC...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bumped this to 600s/720s. I waited for the next alert to recover, but after 40 mins it hadn't:

[root@hgssh1.dmz.scl3 ~]# (date; ls -lh /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.*)
Tue Apr  2 20:53:48 PDT 2013
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  0 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  7 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.branch
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3 39 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.desc
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  0 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.dirstate

At which point I moved the files out to clear the alert. Calling this good to go. If this alert flaps, then we should probably revisit the alert itself rather than bump thresholds...
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Moved out the next one in inbound after 30 mins:

[root@hgssh1.dmz.scl3 .hg]# date; ls -aFl journal.*
Tue Apr  2 21:34:19 PDT 2013
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  0 Apr  2 21:04 journal.bookmarks
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  7 Apr  2 21:04 journal.branch
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3 39 Apr  2 21:04 journal.desc
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  0 Apr  2 21:04 journal.dirstate
I think we need to lower the thresholds :| And act every time this pages until we upgrade Mercurial. Sigh.
And I made the alert page oncall only, since this does need action, as opposed to IRC-only pages.
Thresholds are low enough (10 mins/12 mins). I was waiting to see whether hg would clear up the files after a significant period of time but turns out it doesn't... This is paging the oncalls, so no change on that. I think you meant you removed team notifications, which is fine :)
hg is supposed to remove those files on a successful commit. This check is needed because we're running a Mercurial version that's affected by this bug: http://mercurial.808500.n3.nabble.com/issue3317-journal-bookmarks-transaction-recovery-error-on-multi-committer-repos-td3813780.html

Fix is to upgrade mercurial. Until then, we'll keep seeing issues on and off :|

Docs can be changed to make sure that user isn't still connected to the system (ps aux | grep ssh should show usernames) and then just delete the journal.* files from the affected repo. Moving them is probably safer...for now.
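As a rough sketch of the "move them, don't delete them" step, something like the following could stand in for the manual mkdir/mv sequence shown earlier. The function name `move_journals_aside` and the timestamp format are hypothetical, and the sketch deliberately does not check whether the owning user is still connected. That check (e.g. via ps) would still have to happen first.

```python
import glob
import os
import shutil
import time


def move_journals_aside(hg_dir, backup_root, now=None):
    """Move hg_dir/journal.* into a new timestamped directory under backup_root.

    Returns the list of destination paths. Nothing is deleted, so the
    files can be put back if moving them turns out to be a mistake.
    """
    journals = sorted(glob.glob(os.path.join(hg_dir, "journal.*")))
    if not journals:
        return []

    # Timestamped directory, similar in spirit to `mkdir ~/$(date +%F-%R)`.
    stamp = time.strftime("%Y-%m-%d-%H%M%S", time.localtime(now))
    dest_dir = os.path.join(backup_root, stamp)
    os.makedirs(dest_dir, exist_ok=True)

    moved = []
    for src in journals:
        dest = os.path.join(dest_dir, os.path.basename(src))
        shutil.move(src, dest)
        moved.append(dest)
    return moved
```

Called with the repo's .hg directory and a backup location, it returns an empty list when there is nothing to move, so it is harmless to run against a healthy repo.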
(In reply to Shyam Mani [:fox2mike] from comment #10)
> Docs can be changed to make sure that user isn't still connected to the
> system (ps aux | grep ssh should show usernames) and then just delete the
> journal.* files from the affected repo. Moving them is probably safer...for
> now.

Docs updated.
I acked some of the alerts yesterday to see how long it would take for them to clear on their own and they do. However, if i recall it took from 40-60 minutes.

If this is a known issue, then I don't think alerting is the correct method, a cron should be made to do what the documentation requests for one to do when it alerts.
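A cron-driven approach along these lines might start with a dry-run scan like the sketch below, which only reports stale journal files and leaves the decision to move them to a human or a later step. The name `find_stale_journals`, the directory layout, and the age threshold are assumptions for illustration. Acting automatically is riskier than it looks, since a journal file can also belong to a push that is still in flight.

```python
import os
import time


def find_stale_journals(repo_root, max_age_secs, now=None):
    """Yield (path, age_secs) for .hg/journal.* files older than max_age_secs."""
    now = time.time() if now is None else now
    for dirpath, dirnames, filenames in os.walk(repo_root):
        if os.path.basename(dirpath) != ".hg":
            continue
        dirnames[:] = []  # do not descend further into the .hg directory
        for name in sorted(filenames):
            if not name.startswith("journal."):
                continue
            path = os.path.join(dirpath, name)
            try:
                age = now - os.stat(path).st_mtime
            except FileNotFoundError:
                continue  # the file went away between listing and stat
            if age > max_age_secs:
                yield path, int(age)
```

Run from cron against the repository root, the output could be logged or mailed, which would also leave a record of how often the problem occurs.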
(In reply to Adrian Fernandez [:Aj] from comment #12)
> I acked some of the alerts yesterday to see how long it would take for them
> to clear on their own and they do. However, if i recall it took from 40-60
> minutes.

For those 40-60 mins, people can't commit to those repos.

> If this is a known issue, then I don't think alerting is the correct method,
> a cron should be made to do what the documentation requests for one to do
> when it alerts.

Probably true. I want this documented to show the negative side effects of not upgrading our hg install in a timely manner.
Just for reference, these alerts have been flapping today but have been recovering on their own.
Ok, so reopening to turn into an IRC-only notice, per previous chat this week.

Information:

It flapped today at [04:23:33]	nagios-scl3	Sat 01:23:19 PDT [553] hgssh1.dmz.scl3.mozilla.com:File Age - /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks is CRITICAL: FILE_AGE CRITICAL: /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks is 13m 39s seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks)

I found a checkin-needed in the wings, and applied it locally after updating my own checkout and then pushed.

The journal.bookmarks file did not prevent me from pushing, and in fact my own push cleared the file entirely.

It was previously owned by jseward@mozilla (who, interestingly enough, does not have a current inbound patch on the recent stack).

Within seconds of my push going through, we got the all-clear from Nagios.

I caught jseward on IRC just now as well, and got a log of what he was doing on the attempted inbound push. For the record, he has L3 perms:

===================
sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg out
comparing with http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
changeset:   127889:c35d11135966
tag:         tip
parent:      126672:5a7aaa967ad3
user:        Julian Seward <jseward@acm.org>
date:        Sat Apr 06 10:07:36 2013 +0200
summary:     Bug 857242 - Make profiler verbosity on desktop be runtime-selectable.  r=bgirard



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg push -f ssh://jseward@mozilla.com@hg.mozilla.org/integration/mozilla-inbound
pushing to ssh://jseward%40mozilla.com@hg.mozilla.org/integration/mozilla-inbound
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 3 changes to 3 files (+1 heads)
remote: Two heads detected on branch 'default'
remote: Only one head per branch is allowed!
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.b_singlehead hook failed



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg pull --rebase
pulling from http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
no changes found



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg out
comparing with http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
changeset:   127889:c35d11135966
tag:         tip
parent:      126672:5a7aaa967ad3
user:        Julian Seward <jseward@acm.org>
date:        Sat Apr 06 10:07:36 2013 +0200
summary:     Bug 857242 - Make profiler verbosity on desktop be runtime-selectable.  r=bgirard



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg push -f ssh://jseward@mozilla.com@hg.mozilla.org/integration/mozilla-inbound
pushing to ssh://jseward%40mozilla.com@hg.mozilla.org/integration/mozilla-inbound
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 3 changes to 3 files (+1 heads)
remote: Two heads detected on branch 'default'
remote: Only one head per branch is allowed!
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.b_singlehead hook failed
===========

So it looks like this file gets left behind at least when the single-head hook fails. I also suspect that the try-can't-push issue happens when an L3-perm'd person gets a similar rollback failure on try, and L1-perm'd people then can't remove the higher-group-owned journal.bookmarks file.

Either way, I believe IRC-only for this notification is OK here: if someone can't actually push, they are pretty quick to tell us, and my successful push shows it's not an actual problem by itself (at least not always).

[/information-overload]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
:Callek - thanks a lot for your debugging. I've set this check to not page oncall but just alert over IRC.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Further, the IRC-only alert was still noisy. Set this not to alert at all. However the check is still in place and its status can be ascertained over IRC or the Nagios web UI.
Product: mozilla.org → mozilla.org Graveyard