Bug 818736: crontabber errors should be reported by nagios
Status: RESOLVED FIXED
Opened: 12 years ago
Closed: 11 years ago
Categories: mozilla.org Graveyard :: Server Operations, task
Tracking: Not tracked
Reporter: peterbe
Assigned: ericz
When crontabber runs, it keeps its state in a .json file. The default location, at least on dev, is:

    /home/socorro/persistent/crontabbers.json

We need to monitor the content of these files. A healthy job looks something like this::

    "slow-three": {
      "next_run": "2012-11-12 19:40:39.543379",
      "first_run": "2012-11-05 23:27:22.355076",
      "last_error": {},
      "last_run": "2012-11-12 18:40:39.543379",
      "last_success": "2012-11-12 18:27:22.355093",
      "error_count": 0
    },

A failing job looks like this::

    "slow-one": {
      "next_run": "2012-12-06 00:46:13.929268",
      "first_run": "2012-11-05 23:27:07.316347",
      "last_error": {
        "traceback": " File \"socorro/cron/crontabber.py\", line 580, in _run_one\n for last_success in self._run_job(job_class, config, info):\n File \"/U$
        "type": "<type 'exceptions.Exception'>",
        "value": "Oh no! something went wrong"
      },
      "last_run": "2012-12-05 23:46:13.929268",
      "last_success": "2012-11-13 00:27:07.316893",
      "error_count": 1
    },

So, just looking for the 'last_error' key isn't enough: it is present (as an empty object) even for a healthy job.
Comment 1•12 years ago (Reporter)
CC'ed a bunch of people I suspect are good at monitoring.
Updated•12 years ago (Reporter)
Assignee: nobody → server-ops
Component: Backend → Server Operations
Product: Socorro → mozilla.org
QA Contact: shyam
Version: unspecified → other
Comment 2•12 years ago
We could look at error_count too, but I don't see the value of anything else...because by your example as soon as there is an actual error, last_error is set...and well, that tells you there is an issue, right? (which I assume is the objective here). Do let me know what else you were thinking about. Also, I'm not sure who wrote the monitoring stuff for socorro, so I'm going to toss this over to webops to take a call on and we'll help put it into production if that's the final call.
Assignee: server-ops → server-ops-webops
Component: Server Operations → Server Operations: Web Operations
QA Contact: shyam → nmaul
Comment 3•12 years ago (Reporter)
Right you are. You can instead use `error_count != 0`. However, if it's an option to include a message in the nagios alert itself, the message should be `last_error->type + last_error->value`
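
For illustration only, a minimal Python sketch of that check (this is not the code attached to or deployed from this bug). It assumes the state file is a JSON object keyed by job name with the fields shown in the description. The function name `check_crontabber_state`, the abbreviated `SAMPLE` data, and the choice to report every failing job as CRITICAL are all assumptions.

```python
import json

# Hypothetical sample, abbreviated from the two jobs shown in the description.
SAMPLE = json.loads("""
{
  "slow-three": {"last_error": {}, "error_count": 0},
  "slow-one": {
    "last_error": {
      "type": "<type 'exceptions.Exception'>",
      "value": "Oh no! something went wrong"
    },
    "error_count": 1
  }
}
""")


def check_crontabber_state(state):
    """Return (exit_code, message) following the Nagios plugin convention."""
    failing = []
    for name, info in sorted(state.items()):
        # A healthy job still carries "last_error": {}, so test error_count.
        if info.get("error_count", 0) != 0:
            err = info.get("last_error") or {}
            # Message is last_error->type + last_error->value, per this comment.
            failing.append(
                "%s: %s %s" % (name, err.get("type", ""), err.get("value", ""))
            )
    if failing:
        return 2, "CRITICAL - " + "; ".join(failing)
    return 0, "OK - no crontabber job reports errors"


if __name__ == "__main__":
    code, message = check_crontabber_state(SAMPLE)
    print(message)
```

In a real check the state would be loaded from crontabbers.json with `json.load` and the script would exit with the returned code.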
Comment 4•12 years ago
Hate to play ping-pong on this, but Webops doesn't make Nagios alerts; bouncing back to Server Operations so that they can create a Nagios alert as per their established processes.
Assignee: server-ops-webops → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
Comment 5•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #4)
> Hate to play ping-pong on this, but Webops doesn't make Nagios alerts;
> bouncing back to Server Operations so that they can create a Nagios alert as
> per their established processes.

It was sent to webops to approve that this is indeed needed. IMHO, we don't need to modify what we have. Peter?
Comment 6•12 years ago (Reporter)
We need this. It's not deadly crucial because we could instead just scan the syslog. (How we do that I don't know, but that's a different question.)

crontabber runs all (or soon will run all) cron jobs in one single program that is aware of dependencies and is self-healing in that it'll re-attempt things that raise errors. If it fails, it's most likely a serious problem, and DB people and others might need to manually intervene to do what the cron job failed to do.

Does that answer the question? :rhelmer, anything to add?
Updated•11 years ago (Assignee)
Assignee: server-ops → eziegenhorn
Comment 7•11 years ago
I suppose this bug can be merged with Bug 778792?
Comment 9•11 years ago (Assignee)
So what I gather from here, bug 778792, and mana is that we need to check these two files:

sp-admin01.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json
socorroadm.stage.private.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json

I have peterbe's example code to parse it (our existing json check can't handle nested fields like the ones we are looking for here). And that's all good. Does this file get completely regenerated once a day? I need to know how to make sure I'm not seeing old errors.
Comment 10•11 years ago (Reporter)
That is correct. The file gets re-written much more than once a day. Every time a job is run it updates the file.
Comment 11•11 years ago (Reporter)
One option (this is Lonnen's idea) is to bake email reporting into crontabber itself. We can set up config options for:

* list of email addresses to alert
* threshold for number of repeated errors
* (optional) SMTP configuration

It can either be baked in as a core feature or it can simply be yet another app just like any of the other postgres stored procedure jobs. Feature parity would be to simply email the same set of people every XX minutes if the crontabbers.json file contains errors.
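
A rough sketch of how those three options might fit together, in Python. Nothing here is crontabber's actual API: the function names, the `threshold` semantics (alert once `error_count` reaches it), and the SMTP defaults are assumptions made for illustration.

```python
import smtplib
from email.message import EmailMessage


def jobs_over_threshold(state, threshold):
    """Select jobs whose error_count has reached the (hypothetical) threshold."""
    return {
        name: info
        for name, info in state.items()
        if info.get("error_count", 0) >= threshold
    }


def build_alert(failing, sender, recipients):
    """Build one email summarizing every failing job."""
    msg = EmailMessage()
    msg["Subject"] = "crontabber: %d job(s) failing" % len(failing)
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    lines = []
    for name, info in sorted(failing.items()):
        err = info.get("last_error") or {}
        lines.append(
            "%s: error_count=%d, %s"
            % (name, info["error_count"], err.get("value", ""))
        )
    msg.set_content("\n".join(lines))
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send via SMTP; host and port stand in for the optional SMTP config."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Keeping selection, message building, and sending separate means the first two can be exercised without a mail server.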
Comment 12•11 years ago
I don't want to block anything on more advanced monitoring. Feature parity with cron means sending an email containing the job name, the exit code, and the error message when the job fails. Only one email needs to be sent; there's an alias that will forward it to the team.
Comment 13•11 years ago
(In reply to Chris Lonnen :lonnen from comment #12)
> I don't want to block anything on more advanced monitoring. Feature parity
> with cron means sending an email containing the job name, the exit code, and
> the error message when the job fails. Only one email needs to be sent;
> there's an alias that will forward it to the team.

BTW if crontabber just prints the error message to stdout/stderr, it should get emailed to us with the current mechanism.
Comment 14•11 years ago (Reporter)
After some discussion we've decided to put the business logic of errors into crontabber instead. See https://bugzilla.mozilla.org/show_bug.cgi?id=836425 What you'll then do is run crontabber the same way crontab runs it but with an extra parameter like ``--has-errors`` (or ``--nagios``) or something.
Depends on: 836425
Comment 15•11 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #14)
> After some discussion we've decided to put the business logic of errors into
> crontabber instead.
>
> See https://bugzilla.mozilla.org/show_bug.cgi?id=836425
>
> What you'll then do is run crontabber the same way crontab runs it but with
> an extra parameter like ``--has-errors`` (or ``--nagios``) or something.

We were thinking this could be run from NRPE (looks like it's running on the admin box). Does this sound like a reasonable way to monitor this?
Comment 16•11 years ago (Reporter)
:ericz had a chance to look at rhelmer's last comment, which led to this: https://bugzilla.mozilla.org/show_bug.cgi?id=836425 I'm eager to start working on this, but it would be nice to hear that Nagios will be able to work with that.
Comment 17•11 years ago (Assignee)
:peterbe, that sounds like it might work. I'll comment on bug 836425.
Comment 18•11 years ago (Reporter)
Over to you ericz. See how crontabber.py is invoked in crontab for the right parameters and stuff. In fact, it's wrapped in a bash script I think but I'm sure you'll know what you need to do puppet-wise. All you need to do is add ``--nagios`` and it'll exit 0, 1 or 2 accordingly.
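
As a sketch of what the monitoring side could do with those exit codes: the standard Nagios plugin convention is 0 = OK, 1 = WARNING, 2 = CRITICAL, and anything else is treated as UNKNOWN. That crontabber's ``--nagios`` codes map this way is an assumption here, and `run_check` is a hypothetical helper name.

```python
import subprocess
import sys

# Standard Nagios plugin states; any other exit code counts as UNKNOWN.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command):
    """Run a check command and return (exit_code, state_label, output)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    state = NAGIOS_STATES.get(result.returncode, "UNKNOWN")
    return result.returncode, state, result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for the real crontabber invocation: a child that exits 2.
    fake_check = [sys.executable, "-c", "import sys; print('boom'); sys.exit(2)"]
    print(run_check(fake_check))
```

In practice NRPE runs the command directly and reads its exit code, so a wrapper like this is only useful for testing the mapping by hand.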
Comment 19•11 years ago (Assignee)
:peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1 or socorroadm.stage.private.phx1. Can you point me in the right direction?
Updated•11 years ago
Summary: crontabber errors should go to some sort of monitoring → crontabber errors should be reported by nagios
Comment 20•11 years ago
(In reply to Eric Ziegenhorn :ericz from comment #19)
> :peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1
> or socorroadm.stage.private.phx1. Can you point me in the right direction?

This should work now on both prod and stage:

PYTHONPATH=/data/socorro/application:/data/socorro/thirdparty/ /data/socorro/application/socorro/cron/crontabber.py --admin.conf=/etc/socorro/crontabber.ini --nagios
Comment 21•11 years ago
We've had to disable crontabber.py in production, so this should be held up for a while.
Comment 22•11 years ago
crontabber.py is re-enabled in production. Comment 20 is relevant again.
Comment 23•11 years ago (Reporter)
:ericz Has this been enabled on stage and prod yet?
Comment 24•11 years ago (Assignee)
No. I've been swamped with Graphite but will try and squeeze this in soon.
Comment 25•11 years ago (Assignee)
As per IRC these alerts should go to cron-socorro@mozilla.com and #socorro-alerts.
Comment 26•11 years ago (Assignee)
This is done in stage and prod, and I've seen the alerts fire. When this alerts there will be a documentation link to https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab. If you could fill that out, that'd be great.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 27•11 years ago (Reporter)
(In reply to Eric Ziegenhorn :ericz from comment #26)
> This is done in stage and prod, and I've seen the alerts fire. When this
> alerts there will be a documentation link to
> https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab. If
> you could fill that out, that'd be great.

Link --> Page Not Found.
Comment 28•11 years ago (Assignee)
It works for me. Do you have access to mana.mozilla.org?
Comment 29•11 years ago (Assignee)
Apparently this is restricted, so never mind about filling that out.
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard