800853 - Setup Nagios alert for file-age on hg repo journal.* files to help catch Bug 766810 early

Reporter

Description

•

12 years ago

My theory is that we can catch Bug 766810 on a repository before devs get blocked by having Nagios check the file age of the journal.* files.

The gotcha is that the file may not even exist. The check shouldn't alert if the file is missing, only if it's older than [x]. I don't know what a good value for x would be, either.
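A minimal sketch of the behaviour described above, assuming GNU coreutils (`stat -c %Y`). The function name `check_journal_age` and its arguments are hypothetical; this is not the NRPE plugin discussed in this bug, only an illustration of "OK when missing, alert when older than x" using the usual Nagios exit codes.

```shell
#!/bin/sh
# Hypothetical helper, not the plugin actually deployed for this bug.
# Usage: check_journal_age FILE WARN_SECS CRIT_SECS
# Exit codes follow the Nagios plugin convention:
#   0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
check_journal_age() {
    file=$1 warn=$2 crit=$3

    # A missing journal file is the healthy state, so report OK.
    if [ ! -e "$file" ]; then
        echo "OK - $file does not exist"
        return 0
    fi

    # stat -c %Y prints the mtime as epoch seconds (GNU coreutils assumed).
    mtime=$(stat -c %Y "$file") || { echo "UNKNOWN - cannot stat $file"; return 3; }
    age=$(( $(date +%s) - mtime ))

    if [ "$age" -ge "$crit" ]; then
        echo "CRITICAL - $file is $age seconds old"
        return 2
    elif [ "$age" -ge "$warn" ]; then
        echo "WARNING - $file is $age seconds old"
        return 1
    fi
    echo "OK - $file is $age seconds old"
    return 0
}
```

The warning and critical thresholds are left as parameters because a good value for x still has to be found by observation.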

Justin Dow [:jabba]

Comment 1

•

12 years ago

I believe we have a NRPE plugin that does file age and also returns OK if the file is missing. Ashish, can you check this out?

Assignee: server-ops → ashish

Szabolcs Hubai (:xabolcs)

Comment 2

•

12 years ago

(In reply to Justin Dow [:jabba] from comment #1)
> I believe we have a NRPE plugin that does file age and also returns OK if
> the file is missing. ...

If you don't have one, the GPLv2-licensed "check.files.pl - Check files age and number of files in a directory" [1] could do it for you, for example.

Example use:
Directory content:
> ls
>check_files.pl

a CRITICAL message:
> ./check_files.pl -D . -F '*.pl' -w ~ -c 0
>CRITICAL - *.pl is 1 (more than 0)

an OK message:
> ./check_files.pl -D . -F '*.js' -w ~ -c 0
>OK - 0 *.js files found


It also supports ageing, but I couldn't manage to produce a test case
with correct results. :(




[1] http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/check-2Efiles-2Epl--2D-Check-files-age-and-number-of-files-in-a-directory/details

Justin Dow [:jabba]

Updated

•

11 years ago

QA Contact: jdow → shyam

Ashish Vijayaram [:ashish]

Assignee

Comment 3

•

11 years ago

This is complete. I've set a check for journal.bookmarks on try and integration that will alert if present for ~5 mins.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Adrian J Fernandez [:Aj]

Comment 4

•

11 years ago

:fox2mike mentioned something about only making this alert on IRC as it's been flapping all day.
Had to manually run the following, as per the mana doc:
[root@hgssh1.dmz.scl3 ~]# cd /repo/hg/mozilla/integration/mozilla-inbound/.hg/
[root@hgssh1.dmz.scl3 .hg]# ls -aFl journal.*
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  0 Apr  2 14:35 journal.bookmarks
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  7 Apr  2 14:35 journal.branch
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3 39 Apr  2 14:35 journal.desc
-rw-rw-r-- 1 tvyas@mozilla.com scm_level_3  0 Apr  2 14:35 journal.dirstate
[root@hgssh1.dmz.scl3 .hg]# mkdir -v ~/`date +%F-%R`
mkdir: created directory `/root/2013-04-02-14:57'
[root@hgssh1.dmz.scl3 .hg]# mv -v journal.bookmarks journal.branch journal.desc journal.dirstate !$
mv -v journal.bookmarks journal.branch journal.desc journal.dirstate ~/`date +%F-%R`
`journal.bookmarks' -> `/root/2013-04-02-14:57/journal.bookmarks'
removed `journal.bookmarks'
`journal.branch' -> `/root/2013-04-02-14:57/journal.branch'
removed `journal.branch'
`journal.desc' -> `/root/2013-04-02-14:57/journal.desc'
removed `journal.desc'
`journal.dirstate' -> `/root/2013-04-02-14:57/journal.dirstate'
removed `journal.dirstate'

Not sure if we need to tweak the settings or just make it alert on IRC...

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Ashish Vijayaram [:ashish]

Assignee

Comment 5

•

11 years ago

Bumped this to 600s/720s. I waited for the next alert to recover but after 40 mins, it hasn't:

[root@hgssh1.dmz.scl3 ~]# (date; ls -lh /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.*)
Tue Apr  2 20:53:48 PDT 2013
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  0 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  7 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.branch
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3 39 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.desc
-rw-rw-r-- 1 mdas@mozilla.com scm_level_3  0 Apr  2 20:13 /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.dirstate

At which point I moved the files out to clear the alert. Calling this good to go. If this alert flaps, then we should probably revisit the alert itself rather than bump the thresholds...

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Ashish Vijayaram [:ashish]

Assignee

Comment 6

•

11 years ago

Moved out the next one in inbound after 30 mins:

[root@hgssh1.dmz.scl3 .hg]# date; ls -aFl journal.*
Tue Apr  2 21:34:19 PDT 2013
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  0 Apr  2 21:04 journal.bookmarks
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  7 Apr  2 21:04 journal.branch
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3 39 Apr  2 21:04 journal.desc
-rw-rw-r-- 1 gsharp@mozilla.com scm_level_3  0 Apr  2 21:04 journal.dirstate

Shyam Mani [:fox2mike]

Comment 7

•

11 years ago

I think we need to lower the thresholds :| And act every time this pages until we upgrade mercurial. Sigh.

Shyam Mani [:fox2mike]

Comment 8

•

11 years ago

And I made the alert page oncall only, since this does need action vs IRC only pages..

Ashish Vijayaram [:ashish]

Assignee

Comment 9

•

11 years ago

Thresholds are low enough (10 mins/12 mins). I was waiting to see whether hg would clear up the files after a significant period of time but turns out it doesn't... This is paging the oncalls, so no change on that. I think you meant you removed team notifications, which is fine :)

Shyam Mani [:fox2mike]

Comment 10

•

11 years ago

hg is supposed to remove those files on successful commit. The reason this check is needed is that we're running a mercurial version that's affected by this bug: http://mercurial.808500.n3.nabble.com/issue3317-journal-bookmarks-transaction-recovery-error-on-multi-committer-repos-td3813780.html

Fix is to upgrade mercurial. Until then, we'll keep seeing issues on and off :|

Docs can be changed to make sure that user isn't still connected to the system (ps aux | grep ssh should show usernames) and then just delete the journal.* files from the affected repo. Moving them is probably safer...for now.
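The "move rather than delete" step can be sketched as a small shell function. The name `quarantine_journal` and its arguments are hypothetical and this is not the text of the mana doc. The check that the owning user is no longer connected (`ps aux | grep ssh`) is deliberately left as a manual step to do beforehand. One detail worth noting: the transcript in comment 4 evaluates `date +%F-%R` twice (once for `mkdir`, once again through `!$` for `mv`), so the two could disagree if the minute rolls over in between. The sketch computes the timestamp once.

```shell
# Hypothetical helper: move leftover journal.* files out of a repo's .hg
# directory into a timestamped directory instead of deleting them.
# Usage: quarantine_journal HG_DIR DEST_ROOT
quarantine_journal() {
    hg_dir=$1 dest_root=$2

    # Compute the timestamp once so mkdir and mv agree on the directory name.
    stamp=$(date +%F-%R)
    dest="$dest_root/$stamp"

    # Expand the glob into the positional parameters; if nothing matched,
    # "$1" is the literal pattern, which does not exist as a file.
    set -- "$hg_dir"/journal.*
    [ -e "$1" ] || { echo "no journal files in $hg_dir"; return 0; }

    mkdir -p "$dest" || return 1
    mv -v "$@" "$dest"/
}
```

A call such as `quarantine_journal /repo/hg/mozilla/integration/mozilla-inbound/.hg ~` would mirror the manual session from comment 4, with the files kept around in case they are needed later.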

Ashish Vijayaram [:ashish]

Assignee

Comment 11

•

11 years ago

(In reply to Shyam Mani [:fox2mike] from comment #10)
> Docs can be changed to make sure that user isn't still connected to the
> system (ps aux | grep ssh should show usernames) and then just delete the
> journal.* files from the affected repo. Moving them is probably safer...for
> now.

Docs updated.

Adrian J Fernandez [:Aj]

Comment 12

•

11 years ago

I acked some of the alerts yesterday to see how long it would take for them to clear on their own, and they do. However, if I recall, it took 40-60 minutes.

If this is a known issue, then I don't think alerting is the correct method. A cron job should be set up to do what the documentation asks the responder to do when it alerts.
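As a rough illustration of what such a cron job might start from, here is a sketch that only lists stale journal files. The function name `list_stale_journals` is hypothetical, and `-mmin` is assumed to be available (it is in GNU and BSD find, but it is not POSIX). What to do with the result is left open.

```shell
# Hypothetical sweep: print journal.* files that sit directly in a .hg
# directory under ROOT and were last modified more than MINUTES minutes ago.
# Usage: list_stale_journals ROOT MINUTES
list_stale_journals() {
    root=$1 minutes=$2
    # -mmin +N matches files whose mtime is more than N minutes in the past.
    find "$root" -type f -path '*/.hg/journal.*' -mmin +"$minutes"
}
```

Something like `list_stale_journals /repo/hg 10` could run from cron, with its output logged or handed to whatever cleanup step the documentation prescribes.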

Shyam Mani [:fox2mike]

Comment 13

•

11 years ago

(In reply to Adrian Fernandez [:Aj] from comment #12)
> I acked some of the alerts yesterday to see how long it would take for them
> to clear on their own and they do. However, if i recall it took from 40-60
> minutes.

For those 40-60 mins, people can't commit to those repos.

> If this is a known issue, then I don't think alerting is the correct method,
> a cron should be made to do what the documentation requests for one to do
> when it alerts.

Probably true. I want this documented to show the negative side effects of not upgrading our hg install in a timely manner.

Adrian J Fernandez [:Aj]

Comment 15

•

11 years ago

Just for reference, these alerts have been flapping today but have been recovering on their own.

Justin Wood (:Callek)

Reporter

Comment 16

•

11 years ago

Ok, so reopening to turn into an IRC-only notice, per previous chat this week.

Information:

It flapped today at [04:23:33]	nagios-scl3	Sat 01:23:19 PDT [553] hgssh1.dmz.scl3.mozilla.com:File Age - /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks is CRITICAL: FILE_AGE CRITICAL: /repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks is 13m 39s seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/repo/hg/mozilla/integration/mozilla-inbound/.hg/journal.bookmarks)

I found a checkin-needed in the wings, applied it locally after updating my own checkout, and then pushed.

The journal.bookmarks file did not prevent me from pushing, and in fact my own push cleared the file entirely.

It was previously owned by jseward@mozilla (who, interestingly enough, does not have a current inbound patch on the recent stack).

Within seconds of my push going through, we got the all-clear from nagios.

I caught jseward on IRC just now as well, and got a log of what he was doing on the attempted inbound push. For the record, he has L3 perms:

===================
sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg out
comparing with http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
changeset:   127889:c35d11135966
tag:         tip
parent:      126672:5a7aaa967ad3
user:        Julian Seward <jseward@acm.org>
date:        Sat Apr 06 10:07:36 2013 +0200
summary:     Bug 857242 - Make profiler verbosity on desktop be runtime-selectable.  r=bgirard



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg push -f ssh://jseward@mozilla.com@hg.mozilla.org/integration/mozilla-inbound
pushing to ssh://jseward%40mozilla.com@hg.mozilla.org/integration/mozilla-inbound
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 3 changes to 3 files (+1 heads)
remote: Two heads detected on branch 'default'
remote: Only one head per branch is allowed!
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.b_singlehead hook failed



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg pull --rebase
pulling from http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
no changes found



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg out
comparing with http://hg.mozilla.org/integration/mozilla-inbound
searching for changes
changeset:   127889:c35d11135966
tag:         tip
parent:      126672:5a7aaa967ad3
user:        Julian Seward <jseward@acm.org>
date:        Sat Apr 06 10:07:36 2013 +0200
summary:     Bug 857242 - Make profiler verbosity on desktop be runtime-selectable.  r=bgirard



sewardj@ahania[462]:~/MOZ/M_INBOUND_Outgoing$ hg push -f ssh://jseward@mozilla.com@hg.mozilla.org/integration/mozilla-inbound
pushing to ssh://jseward%40mozilla.com@hg.mozilla.org/integration/mozilla-inbound
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 3 changes to 3 files (+1 heads)
remote: Two heads detected on branch 'default'
remote: Only one head per branch is allowed!
remote: transaction abort!
remote: rollback completed
remote: abort: pretxnchangegroup.b_singlehead hook failed
===========

So it looks like this file gets left behind by at least the single-head hook. I also wonder whether the try-can't-push issue happens when an L3-perm'd person hits a similar rollback failure on try, after which L1-perm'd people can't remove the higher-group-owned .bookmarks file.

Either way, I believe IRC-ONLY for this notification is OK here, since if someone can't actually push, they are pretty quick to tell us, and my successful push shows it's not an actual problem by itself (at least not always).

[/information-overload]

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Ashish Vijayaram [:ashish]

Assignee

Comment 17

•

11 years ago

:Callek - thanks a lot for your debugging. I've set this check to not page oncall but just alert over IRC.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

Ashish Vijayaram [:ashish]

Assignee

Comment 18

•

11 years ago

Further, the IRC-only alert was still noisy. Set this not to alert at all. However the check is still in place and its status can be ascertained over IRC or the Nagios web UI.

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard


Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

People

(Reporter: Callek, Assigned: ashish)
