Closed
Bug 1274465
Opened 8 years ago
Closed 6 years ago
Check monitoring for bundle service
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: gps)
Attachments
(2 files)
The bundle service died for a week and we didn't notice (AFAIK!) until bundles started expiring; see bug 1274456.
Comment 1 (Assignee) • 8 years ago
It wasn't failing: it wasn't running. The systemd timer didn't start after the host was rebooted last week.

  hg-bundle-generate.timer - Schedules periodic generation of hg bundles
     Loaded: loaded (/etc/systemd/system/hg-bundle-generate.timer; enabled; vendor preset: disabled)
     Active: inactive (dead)
     Assert: start assertion failed at Thu 2016-05-12 18:15:54 UTC; 1 weeks 0 days ago

There is a startup assertion verifying that /repo/hg/master.%H exists. This refers to a path on the NFS mount, and it prevents single-homed systemd services from being active on multiple servers. My guess is that the NFS mount wasn't mounted yet when systemd attempted to start the timer unit, so the assertion failed and the unit never started. Adding the missing dependency on the mount to the systemd unit file should fix this.
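To illustrate why a path assertion on /repo/hg/master.%H keeps the unit single-homed, and why it also fails when NFS is not mounted, here is a minimal Python sketch of the logic. This is not how systemd implements it: the function names and the injectable `exists` callback are hypothetical, and only the `%H` (host name) specifier is modelled.

```python
import os
import socket


def expand_host_specifier(template, hostname=None):
    """Replace the %H specifier with the host name (illustrative only)."""
    if hostname is None:
        hostname = socket.gethostname()
    return template.replace("%H", hostname)


def assertion_passes(template, hostname=None, exists=os.path.exists):
    """Return True if the expanded path exists.

    The path lives on NFS, so the result is False both on hosts that are
    not the designated master and on any host where NFS is not mounted yet.
    """
    return exists(expand_host_specifier(template, hostname))
```

With the mount missing, `exists` returns False for every path, which matches the "start assertion failed" state shown in the status output.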
Comment 2 (Assignee) • 8 years ago
I'll do the first part of this.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/79beedbc2fac
scripts/generate-hg-s3-bundles: write local index.html and bundles.json files
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 4 (Assignee) • 8 years ago
We now generate /repo/hg/bundles/{index.html,bundles.json} files at the end of the bundle generation job. We should now be able to install a Nagios check verifying the mtime of one of these files isn't too old.

The bundles are generated 24h apart. However, execution time for generation could vary by several hours. Plus, we've seen bundle generation fail randomly on some days. I think we should alert after 2 consecutive failures (so 2 days apart). Let's say when /repo/hg/bundles/index.html has an mtime more than 56 hours old.
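The 56 hour figure works out as two 24 hour cycles plus 8 hours of slack for slow runs. A minimal sketch of such an mtime check in Python follows; the function name, the constant, and the injectable `now` parameter are illustrative assumptions, not the check that was eventually deployed.

```python
import os
import time

# Two daily runs plus 8 hours of slack for slow generation = 56 hours.
MAX_AGE_HOURS = 2 * 24 + 8


def is_stale(path, max_age_hours=MAX_AGE_HOURS, now=None):
    """Return True if path's mtime is more than max_age_hours in the past."""
    if now is None:
        now = time.time()
    age_seconds = now - os.stat(path).st_mtime
    return age_seconds > max_age_hours * 3600
```

Calling `is_stale("/repo/hg/bundles/index.html")` would then report whether two consecutive runs appear to have been missed.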
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•7 years ago
QA Contact: hwine → klibby
Comment 5 (Assignee) • 6 years ago
The topic of monitoring the bundle generation process came up again and I found this old bug. I think comment #4 still applies. Let's get some kind of file age check installed. I'll get a patch up for review to define a custom check on the machine. Then it will be off to fubar/MOC to get the check running in Nagios.
Status: REOPENED → ASSIGNED
Comment 6 (Assignee) • 6 years ago
This will allow us to monitor a file's age to determine when bundle generation last completed.
Comment 7 (Assignee) • 6 years ago
This commit defines a custom Nagios check that monitors the age of the /repo/hg/bundles/lastrun file. It warns after 2+ failures and goes critical after 4+ failures. Documentation for the alert has been added.
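As a rough illustration of how a check like this maps a file's age to a Nagios state, here is a Python sketch. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention. The threshold values and all function names are assumptions for illustration: 56 hours comes from comment 4, while 104 hours (four daily runs plus the same slack) is a guess at what "4+ failures" could mean. The actual check in version-control-tools may differ.

```python
import os
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_age(age_hours, warn_hours=56, crit_hours=104):
    """Map a file age to a Nagios state; both thresholds are assumed values."""
    if age_hours >= crit_hours:
        return CRITICAL
    if age_hours >= warn_hours:
        return WARNING
    return OK


def check_file_age(path, now=None):
    """Return (status, message) for the marker file at path."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError as e:
        # A missing or unreadable file is reported as UNKNOWN in this sketch.
        return UNKNOWN, "UNKNOWN - cannot stat %s: %s" % (path, e)
    if now is None:
        now = time.time()
    age_hours = (now - mtime) / 3600.0
    status = classify_age(age_hours)
    return status, "%s - %s is %.1f hours old" % (LABELS[status], path, age_hours)


def main(path="/repo/hg/bundles/lastrun"):
    """Print the status line and return the exit code for Nagios."""
    status, message = check_file_age(path)
    print(message)
    return status
```

A wrapper script would end with `sys.exit(main())` so that Nagios sees the state as the process exit code. Reporting a missing file as UNKNOWN rather than CRITICAL is a design choice in this sketch: it separates "generation is late" from "the check itself cannot run", for example when NFS is not mounted.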
Comment 8 • 6 years ago
Comment on attachment 8992778
hgserver: touch a file when bundle generation completes (bug 1274465); r?sheehan

Connor Sheehan [:sheehan] has approved the revision.
https://phabricator.services.mozilla.com/D2203
Attachment #8992778 - Flags: review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7a2a2b7a9cac
hgserver: touch a file when bundle generation completes; r=sheehan
Status: ASSIGNED → RESOLVED
Closed: 8 years ago → 6 years ago
Resolution: --- → FIXED
Comment 10 (Assignee) • 6 years ago
Let's keep this open until the check is deployed.
Comment 11 • 6 years ago
Comment on attachment 8992779
ansible/hg-ssh: add nagios check for bundle generation age (bug 1274465); r?fubar

Kendall Libby [:fubar] has approved the revision.
https://phabricator.services.mozilla.com/D2204
Attachment #8992779 - Flags: review+
Comment 12 • 6 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/14387c1d60e8
ansible/hg-ssh: add nagios check for bundle generation age; r=fubar
Comment 13 (Assignee) • 6 years ago
fubar: did this check get deployed? If you need my help, just ping me. If we need Mana docs, you can refer to / copy https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html#check-hg-bundle-generate-age.
Flags: needinfo?(klibby)
Comment 14 • 6 years ago
Added to puppet, running on nagios, and mana link added.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Flags: needinfo?(klibby)
Resolution: --- → FIXED
Updated•6 years ago
Keywords: leave-open