Closed Bug 1274465 Opened 8 years ago Closed 6 years ago

Check monitoring for bundle service

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: gps)

References

Details

Attachments

(2 files)

It died for a week and we didn't notice (AFAIK!) until bundles started expiring, see bug 1274456.
It wasn't failing: it wasn't running.

The systemd timer service didn't start after the host was rebooted last week.

hg-bundle-generate.timer - Schedules periodic generation of hg bundles
   Loaded: loaded (/etc/systemd/system/hg-bundle-generate.timer; enabled; vendor preset: disabled)
   Active: inactive (dead)
   Assert: start assertion failed at Thu 2016-05-12 18:15:54 UTC; 1 weeks 0 days ago

There is a startup assertion verifying that /repo/hg/master.%H exists. This refers to a path on the NFS mount, and it prevents single-homed systemd services from being active on multiple servers. My guess is that the NFS mount wasn't mounted when systemd attempted to start the timer unit, so the assertion failed and the unit never started. Adding the missing dependency in the systemd unit file should fix this.
I'll do the first part of this.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/79beedbc2fac
scripts/generate-hg-s3-bundles: write local index.html and bundles.json files
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
We now generate /repo/hg/bundles/{index.html,bundles.json} files at the end of the bundle generation job.

We should now be able to install a Nagios check verifying the mtime of one of these files isn't too old.

The bundles are generated 24h apart. However, execution time for generation could vary by several hours. Plus, we've seen bundle generation fail randomly on some days. I think we should alert after 2 consecutive failures (so 2 days apart). Let's say when /repo/hg/bundles/index.html has an mtime more than 56 hours old.
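
One way to read the 56-hour figure is as two 24-hour run intervals plus some slack for run-time variation. The sketch below is only an illustration of that arithmetic: the 8-hour slack and the function name `max_age_for_missed_runs` are assumptions, not something stated in this bug.

```python
from datetime import timedelta

# From the comment above: bundles are generated 24 hours apart.
RUN_INTERVAL = timedelta(hours=24)
# Assumption: allowance for generation time varying "by several hours".
SLACK = timedelta(hours=8)


def max_age_for_missed_runs(missed_runs, interval=RUN_INTERVAL, slack=SLACK):
    """Return the file age beyond which `missed_runs` consecutive runs
    have most likely failed to complete."""
    if missed_runs < 1:
        raise ValueError("missed_runs must be at least 1")
    return missed_runs * interval + slack


# Two consecutive missed runs with 8 hours of slack gives the 56 hour figure.
threshold = max_age_for_missed_runs(2)
print(threshold.total_seconds() / 3600)
```

With a different slack the same function yields a different alert threshold, so the slack value is the knob to tune if the alert turns out to be too noisy or too slow.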
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
QA Contact: hwine → klibby
The topic of monitoring the bundle generation process came up again and I found this old bug.

I think comment #4 still applies. Let's get some kind of file age check installed.

I'll get a patch up for review to define a custom check on the machine. Then it will be off to fubar/MOC to get the check running in Nagios.
Status: REOPENED → ASSIGNED
This will allow us to monitor a file's age to determine when bundle
generation last completed.
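
A minimal sketch of the idea, not the code in the actual commit: after a successful generation run, update the mtime of a marker file so a monitoring check can read its age. The helper name `mark_run_complete` is hypothetical; the marker path in the comment is the /repo/hg/bundles/lastrun file named in this bug.

```python
from pathlib import Path


def mark_run_complete(marker):
    """Create the marker file if it is missing, otherwise bump its mtime
    to the current time."""
    marker = Path(marker)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()


# Hypothetical usage at the very end of a successful generation job:
# mark_run_complete("/repo/hg/bundles/lastrun")
```

Calling this only after all bundles are written means a crashed or never-started job leaves the old mtime in place, which is what the age check relies on.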
This commit defines a custom Nagios check that monitors the
age of the /repo/hg/bundles/lastrun file. It warns after 2+
failures and goes critical after 4+ failures.

Documentation for the alert has been added.
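
For illustration, a file-age check of this kind could look roughly like the sketch below. It is not the check from the commit: the function name `check_file_age` and the threshold values in the usage comment are assumptions. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how old the file's mtime is.

    `now` can be passed in for testing; it defaults to the current time.
    """
    try:
        mtime = os.stat(path).st_mtime
    except OSError as exc:
        # A missing or unreadable file is reported as UNKNOWN, not OK.
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc.strerror}"

    if now is None:
        now = time.time()
    age = now - mtime
    hours = age / 3600.0

    if age >= crit_seconds:
        return CRITICAL, f"CRITICAL - {path} is {hours:.1f} hours old"
    if age >= warn_seconds:
        return WARNING, f"WARNING - {path} is {hours:.1f} hours old"
    return OK, f"OK - {path} is {hours:.1f} hours old"


# Hypothetical usage in a plugin script (threshold values are assumptions):
# code, message = check_file_age("/repo/hg/bundles/lastrun", 56 * 3600, 104 * 3600)
# print(message)
# raise SystemExit(code)
```

Returning the code and message instead of exiting inside the function keeps the logic testable; the plugin wrapper only has to print the message and exit with the code.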
Comment on attachment 8992778 [details]
hgserver: touch a file when bundle generation completes (bug 1274465); r?sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D2203
Attachment #8992778 - Flags: review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7a2a2b7a9cac
hgserver: touch a file when bundle generation completes ; r=sheehan
Status: ASSIGNED → RESOLVED
Closed: 8 years ago → 6 years ago
Resolution: --- → FIXED
Let's keep this open until the check is deployed.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Comment on attachment 8992779 [details]
ansible/hg-ssh: add nagios check for bundle generation age (bug 1274465); r?fubar

Kendall Libby [:fubar] has approved the revision.

https://phabricator.services.mozilla.com/D2204
Attachment #8992779 - Flags: review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/14387c1d60e8
ansible/hg-ssh: add nagios check for bundle generation age ; r=fubar
fubar: did this check get deployed? If you need my help, just ping me. If we need Mana docs, you can refer to / copy https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html#check-hg-bundle-generate-age.
Flags: needinfo?(klibby)
Added to puppet, running on nagios, and mana link added.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Flags: needinfo?(klibby)
Resolution: --- → FIXED