Closed Bug 818736 Opened 12 years ago Closed 11 years ago

crontabber errors should be reported by nagios

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: ericz)

When crontabber runs it keeps the state in a .json file. 

The default location, at least on dev, is /home/socorro/persistent/crontabbers.json

We need to monitor the contents of this file.

A healthy job looks something like this::

   "slow-three": {
    "next_run": "2012-11-12 19:40:39.543379",
    "first_run": "2012-11-05 23:27:22.355076",
    "last_error": {},
    "last_run": "2012-11-12 18:40:39.543379",
    "last_success": "2012-11-12 18:27:22.355093",
    "error_count": 0
  },

A failing job looks like this::

   "slow-one": {
    "next_run": "2012-12-06 00:46:13.929268",
    "first_run": "2012-11-05 23:27:07.316347",
    "last_error": {
      "traceback": "  File \"socorro/cron/crontabber.py\", line 580, in _run_one\n    for last_success in self._run_job(job_class, config, info):\n  File \"/U$
      "type": "<type 'exceptions.Exception'>",
      "value": "Oh no! something went wrong"
    },
    "last_run": "2012-12-05 23:46:13.929268",
    "last_success": "2012-11-13 00:27:07.316893",
    "error_count": 1
  },

So, just looking for 'last_error' isn't enough.
CC'ed a bunch of people I suspect are good at monitoring.
Assignee: nobody → server-ops
Component: Backend → Server Operations
Product: Socorro → mozilla.org
QA Contact: shyam
Version: unspecified → other
We could look at error_count too, but I don't see the value of anything else...because by your example as soon as there is an actual error, last_error is set...and well, that tells you there is an issue, right? (which I assume is the objective here).

Do let me know what else you were thinking about. Also, I'm not sure who wrote the monitoring stuff for socorro, so I'm going to toss this over to webops to take a call on and we'll help put it into production if that's the final call.
Assignee: server-ops → server-ops-webops
Component: Server Operations → Server Operations: Web Operations
QA Contact: shyam → nmaul
Right you are. You can instead use `error_count != 0`. However, if it's an option to include a message in the nagios alert itself, the message should be `last_error->type + last_error->value`
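
A minimal sketch of that check, assuming the state file has the shape shown in the examples above. The function name ``failing_jobs`` and the trimmed ``SAMPLE`` data are hypothetical and only for illustration; this is not the code crontabber or Nagios actually uses.

```python
import json


def failing_jobs(state):
    """Return {job_name: message} for every job whose error_count is non-zero.

    The message is last_error's type followed by its value, as suggested above.
    """
    problems = {}
    for name, info in state.items():
        if info.get("error_count", 0) != 0:
            err = info.get("last_error") or {}
            problems[name] = f"{err.get('type', '')} {err.get('value', '')}"
    return problems


# Trimmed copy of the two example jobs from this bug (assumed structure).
SAMPLE = json.loads("""
{
  "slow-three": {"last_error": {}, "error_count": 0},
  "slow-one": {
    "last_error": {
      "type": "<type 'exceptions.Exception'>",
      "value": "Oh no! something went wrong"
    },
    "error_count": 1
  }
}
""")

if __name__ == "__main__":
    for job_name, message in sorted(failing_jobs(SAMPLE).items()):
        print(f"{job_name}: {message}")
```

Against a real file, the same function could be fed ``json.load(open(path))`` instead of ``SAMPLE``.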
Hate to play ping-pong on this, but Webops doesn't make Nagios alerts; bouncing back to Server Operations so that they can create a Nagios alert as per their established processes.
Assignee: server-ops-webops → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
(In reply to Daniel Maher [:phrawzty] from comment #4)
> Hate to play ping-pong on this, but Webops doesn't make Nagios alerts;
> bouncing back to Server Operations so that they can create a Nagios alert as
> per their established processes.

It was sent to webops to approve that this is indeed needed. IMHO, we don't need to modify what we have.

Peter?
We need this. It's not deadly crucial because we could instead just scan the syslog. (how we do that I don't know but that's a different question)

crontabber runs all (or soon will be all) cron jobs in one single program that is aware of dependencies and is self-healing in that it'll re-attempt things that raise errors. 

If it fails, it's most likely a serious problem, and DB people and others might need to intervene manually to do what the cron job failed to do.

Does that answer the question?

:rhelmer anything to add?
Assignee: server-ops → eziegenhorn
I suppose this bug can be merged with Bug 778792?
So what I gather from here, bug 778792 and mana is that we need to check these two files:

sp-admin01.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json
socorroadm.stage.private.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json

I have peterbe's example code to parse it (our existing json check can't handle nested fields like we are looking for here).  And that's all good.  Does this file get completely regenerated once a day?  I need to know how to make sure I'm not seeing old errors.
That is correct. 
The file gets re-written much more than once a day. Every time a job is run it updates the file.
One option (this is Lonnen's idea) is to bake email reporting into crontabber itself. 

We can set up config options for:

* list of email addresses to alert
* threshold for number of repeated errors
* (optional) SMTP configuration

It can either be baked in as a core feature or it can simply be yet another app just like any of the other postgres stored procedure jobs. 

Feature parity would be to simply email the same set of people every XX minutes if the crontabbers.json file contains errors.
I don't want to block anything on more advanced monitoring. Feature parity with cron means sending an email containing the job name, the exit code, and the error message when the job fails. Only one email needs to be sent; there's an alias that will forward it to the team.
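
For illustration only, a rough sketch of what such a single failure email might contain, assuming Python's standard ``email.message`` module. The function name ``build_failure_email``, the sender address, and the recipient address are all hypothetical placeholders, and actually sending the message (SMTP settings) is left out.

```python
from email.message import EmailMessage


def build_failure_email(job_name, exit_code, error_message,
                        recipient, sender="crontabber@localhost"):
    """Build (but do not send) one email describing a failed job.

    Carries the three things cron parity needs: the job name,
    the exit code, and the error message.
    """
    msg = EmailMessage()
    msg["Subject"] = f"crontabber job {job_name} failed (exit code {exit_code})"
    msg["From"] = sender  # placeholder address, an assumption
    msg["To"] = recipient
    msg.set_content(error_message)
    return msg


if __name__ == "__main__":
    # Placeholder recipient; the real team alias is not assumed here.
    email = build_failure_email(
        "slow-one", 1, "Oh no! something went wrong", "team-alias@example.com"
    )
    print(email["Subject"])
```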
(In reply to Chris Lonnen :lonnen from comment #12)
> I don't want to block anything on more advanced monitoring. Feature parity
> with cron means sending an email containing the job name, the exit code, and
> the error message when the job fails. Only one email needs to be sent;
> there's an alias that will forward it to the team.

BTW if crontabber just prints the error message to stdout/stderr, it should get emailed to us with the current mechanism.
After some discussion we've decided to put the business logic of errors into crontabber instead. 

See https://bugzilla.mozilla.org/show_bug.cgi?id=836425

What you'll then do is run crontabber the same way crontab runs it but with an extra parameter like ``--has-errors`` (or ``--nagios``) or something.
Depends on: 836425
(In reply to Peter Bengtsson [:peterbe] from comment #14)
> After some discussion we've decided to put the business logic of errors into
> crontabber instead. 
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=836425
> 
> What you'll then do is run crontabber the same way crontab runs it but with
> an extra parameter like ``--has-errors`` (or ``--nagios``) or something.

We were thinking this could be run from NRPE (looks like it's running on the admin box).

Does this sound like a reasonable way to monitor this?
:ericz had a chance to look at rhelmer's last comment, which led to this: https://bugzilla.mozilla.org/show_bug.cgi?id=836425

I'm eager to start working on this but it would be nice to hear that Nagios will be able to work with that.
:peterbe, that sounds like it might work.  I'll comment on bug 836425.
Over to you ericz. See how crontabber.py is invoked in crontab for the right parameters and stuff. In fact, it's wrapped in a bash script I think but I'm sure you'll know what you need to do puppet-wise.

All you need to do is add ``--nagios`` and it'll exit 0, 1 or 2 accordingly.
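
A hedged sketch of how a check wrapper could interpret those exit codes. The standard Nagios plugin convention is 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN; that crontabber's 1 and 2 line up with WARNING and CRITICAL is an assumption here, and the names ``nagios_status`` and ``run_check`` are hypothetical. The demo at the bottom runs a stand-in command rather than crontabber itself.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes. Mapping crontabber's own 1 and 2
# onto WARNING and CRITICAL is an assumption, not confirmed in this bug.
STATUS_LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL"}
UNKNOWN = 3


def nagios_status(returncode):
    """Translate a process exit code into a Nagios status label."""
    return STATUS_LABELS.get(returncode, "UNKNOWN")


def run_check(command):
    """Run a command and return (nagios_exit_code, one_line_message)."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else "no output"
    code = result.returncode if result.returncode in STATUS_LABELS else UNKNOWN
    return code, f"{nagios_status(result.returncode)} - {first_line}"


if __name__ == "__main__":
    # Stand-in command that mimics a failing check; replace with the real
    # crontabber invocation plus --nagios on the admin box.
    demo = [sys.executable, "-c", "import sys; print('demo failure'); sys.exit(2)"]
    exit_code, message = run_check(demo)
    print(exit_code, message)
```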
:peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1 or socorroadm.stage.private.phx1.  Can you point me in the right direction?
Summary: crontabber errors should go to some sort of monitoring → crontabber errors should be reported by nagios
(In reply to Eric Ziegenhorn :ericz from comment #19)
> :peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1
> or socorroadm.stage.private.phx1.  Can you point me in the right direction?

This should work now on both prod and stage:

PYTHONPATH=/data/socorro/application:/data/socorro/thirdparty/ /data/socorro/application/socorro/cron/crontabber.py --admin.conf=/etc/socorro/crontabber.ini --nagios
We've had to disable crontabber.py in production, so this should be held up for a while.
crontabber.py is re-enabled in production. Comment 20 is relevant again.
:ericz Has this been enabled on stage and prod yet?
No.  I've been swamped with Graphite but will try and squeeze this in soon.
As per IRC these alerts should go to cron-socorro@mozilla.com and #socorro-alerts.
This is done in stage and prod, and I've seen the alerts fire.  When this alert fires, it will include a documentation link to https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab.  If you could fill that out, that'd be great.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Eric Ziegenhorn :ericz from comment #26)
> This is done in stage and prod, and I've seen the alerts fire.  When this
> alerts there will be a documentation link to
> https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab.  If
> you could fill that out, that'd be great.

Link --> Page Not Found.
It works for me.  Do you have access to mana.mozilla.org?
Apparently this is restricted, so never mind about filling that out.
Product: mozilla.org → mozilla.org Graveyard