818736 - crontabber errors should be reported by nagios

Reporter

Description

•

12 years ago

When crontabber runs it keeps the state in a .json file. 

The default location for this is: /home/socorro/persistent/crontabbers.json
At least on dev. 

We need to monitor the content of these. 

A healthy job looks something like this::

  "slow-three": {
    "next_run": "2012-11-12 19:40:39.543379",
    "first_run": "2012-11-05 23:27:22.355076",
    "last_error": {},
    "last_run": "2012-11-12 18:40:39.543379",
    "last_success": "2012-11-12 18:27:22.355093",
    "error_count": 0
  },

A failing job looks like this::

  "slow-one": {
    "next_run": "2012-12-06 00:46:13.929268",
    "first_run": "2012-11-05 23:27:07.316347",
    "last_error": {
      "traceback": "  File \"socorro/cron/crontabber.py\", line 580, in _run_one\n    for last_success in self._run_job(job_class, config, info):\n  File \"/U$
      "type": "<type 'exceptions.Exception'>",
      "value": "Oh no! something went wrong"
    },
    "last_run": "2012-12-05 23:46:13.929268",
    "last_success": "2012-11-13 00:27:07.316893",
    "error_count": 1
  },

So, just looking for 'last_error' isn't enough.

Peter Bengtsson [:peterbe]

Reporter

Comment 1

•

12 years ago

CC'ed a bunch of people I suspect are good at monitoring.

Peter Bengtsson [:peterbe]

Reporter

Updated

•

12 years ago

Assignee: nobody → server-ops

Component: Backend → Server Operations

Product: Socorro → mozilla.org

QA Contact: shyam

Version: unspecified → other

Shyam Mani [:fox2mike]

Comment 2

•

12 years ago

We could look at error_count too, but I don't see the value of anything else...because by your example as soon as there is an actual error, last_error is set...and well, that tells you there is an issue, right? (which I assume is the objective here).

Do let me know what else you were thinking about. Also, I'm not sure who wrote the monitoring stuff for socorro, so I'm going to toss this over to webops to take a call on and we'll help put it into production if that's the final call.

Assignee: server-ops → server-ops-webops

Component: Server Operations → Server Operations: Web Operations

QA Contact: shyam → nmaul

Peter Bengtsson [:peterbe]

Reporter

Comment 3

•

12 years ago

Right you are. You can instead use `error_count != 0`. However, if it's an option to include a message in the nagios alert itself, the message should be `last_error->type + last_error->value`
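
A minimal sketch of that check, assuming the state file parses to a single JSON object that maps job names to the fields shown in the description. The function name `check_state` and the message layout are illustrative only and are not taken from crontabber or from the existing Nagios checks.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, CRITICAL = 0, 2


def check_state(state):
    """Return (exit_code, message) for a parsed crontabbers.json dict."""
    failing = []
    for name, info in sorted(state.items()):
        # A job counts as failing when error_count != 0.
        if info.get("error_count", 0) != 0:
            err = info.get("last_error") or {}
            # Alert text is last_error->type + last_error->value.
            failing.append(
                "%s: %s %s" % (name, err.get("type", ""), err.get("value", ""))
            )
    if failing:
        return CRITICAL, "CRITICAL - " + "; ".join(failing)
    return OK, "OK - no crontabber jobs with errors"


# Trimmed-down copies of the two jobs from the description.
sample = {
    "slow-three": {"last_error": {}, "error_count": 0},
    "slow-one": {
        "last_error": {
            "type": "<type 'exceptions.Exception'>",
            "value": "Oh no! something went wrong",
        },
        "error_count": 1,
    },
}
code, message = check_state(sample)
print(code, message)
```

In a real check the dict would come from `json.load()` on the state file, and the script would print the message and exit with the returned code.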

Daniel Maher [:phrawzty]

Comment 4

•

12 years ago

Hate to play ping-pong on this, but Webops doesn't make Nagios alerts; bouncing back to Server Operations so that they can create a Nagios alert as per their established processes.

Assignee: server-ops-webops → server-ops

Component: Server Operations: Web Operations → Server Operations

QA Contact: nmaul → shyam

Shyam Mani [:fox2mike]

Comment 5

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #4)
> Hate to play ping-pong on this, but Webops doesn't make Nagios alerts;
> bouncing back to Server Operations so that they can create a Nagios alert as
> per their established processes.

It was sent to webops to approve that this is indeed needed. IMHO, we don't need to modify what we have.

Peter?

Peter Bengtsson [:peterbe]

Reporter

Comment 6

•

12 years ago

We need this. It's not deadly crucial because we could instead just scan the syslog. (how we do that I don't know but that's a different question)

crontabber runs all (or soon will be all) cron jobs in one single program that is aware of dependencies and is self-healing in that it'll re-attempt things that raise errors. 

If it fails, it's most likely a serious problem, and DB people and others might need to intervene manually to do what the cron job failed to do. 

Does that answer the question?

:rhelmer anything to add?

Eric Ziegenhorn :ericz

Assignee

Updated

•

11 years ago

Assignee: server-ops → eziegenhorn

Ashish Vijayaram [:ashish]

Comment 7

•

11 years ago

I suppose this bug can be merged with Bug 778792?

Eric Ziegenhorn :ericz

Assignee

Comment 9

•

11 years ago

So what I gather from here, bug 778792 and mana is that we need to check these two files:

sp-admin01.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json
socorroadm.stage.private.phx1.mozilla.com:/home/socorro/persistent/crontabbers.json

I have peterbe's example code to parse it (our existing json check can't handle nested fields like we are looking for here).  And that's all good.  Does this file get completely regenerated once a day?  I need to know how to make sure I'm not seeing old errors.

Peter Bengtsson [:peterbe]

Reporter

Comment 10

•

11 years ago

That is correct. 
The file gets re-written much more than once a day. Every time a job is run it updates the file.

Peter Bengtsson [:peterbe]

Reporter

Comment 11

•

11 years ago

One option (this is Lonnen's idea) is to bake email reporting into crontabber itself. 

We can set up config options for:

* list of email addresses to alert
* threshold for number of repeated errors
* (optional) SMTP configuration

It can either be baked in as a core feature or it can simply be yet another app just like any of the other postgres stored procedure jobs. 

Feature parity would be to simply email the same set of people every XX minutes if the crontabbers.json file contains errors.
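
A rough sketch of how the options above could fit together, assuming the same state-file layout as in the description. The function name `build_alert`, its parameters and the default sender address are hypothetical; nothing like this exists in crontabber at this point.

```python
from email.message import EmailMessage


def build_alert(state, recipients, threshold=1, sender="crontabber@localhost"):
    """Return an EmailMessage for jobs at or above the threshold, else None."""
    # Keep only jobs whose repeated-error count reached the threshold.
    failing = {
        name: info
        for name, info in state.items()
        if info.get("error_count", 0) >= threshold
    }
    if not failing:
        return None
    msg = EmailMessage()
    msg["Subject"] = "crontabber: %d job(s) failing" % len(failing)
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    lines = []
    for name, info in sorted(failing.items()):
        err = info.get("last_error") or {}
        lines.append(
            "%s (errors: %d): %s %s"
            % (name, info["error_count"], err.get("type", ""), err.get("value", ""))
        )
    msg.set_content("\n".join(lines))
    return msg
```

The optional SMTP configuration would only matter at send time, for example by passing the message to `smtplib.SMTP(host).send_message(msg)`.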

Lonnen :lonnen

Comment 12

•

11 years ago

I don't want to block anything on more advanced monitoring. Feature parity with cron means sending an email containing the job name, the exit code, and the error message when the job fails. Only one email needs to be sent; there's an alias that will forward it to the team.

Robert Helmer [:rhelmer]

Comment 13

•

11 years ago

(In reply to Chris Lonnen :lonnen from comment #12)
> I don't want to block anything on more advanced monitoring. Feature parity
> with cron means sending an email containing the job name, the exit code, and
> the error message when the job fails. Only one email needs to be sent;
> there's an alias that will forward it to the team.

BTW if crontabber just prints the error message to stdout/stderr, it should get emailed to us with the current mechanism.

Peter Bengtsson [:peterbe]

Reporter

Comment 14

•

11 years ago

After some discussion we've decided to put the business logic of errors into crontabber instead. 

See https://bugzilla.mozilla.org/show_bug.cgi?id=836425

What you'll then do is run crontabber the same way crontab runs it but with an extra parameter like ``--has-errors`` (or ``--nagios``) or something.

Depends on: 836425

Robert Helmer [:rhelmer]

Comment 15

•

11 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #14)
> After some discussion we've decided to put the business logic of errors into
> crontabber instead. 
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=836425
> 
> What you'll then do is run crontabber the same way crontab runs it but with
> an extra parameter like ``--has-errors`` (or ``--nagios``) or something.

We were thinking this could be run from NRPE (looks like it's running on the admin box).

Does this sound like a reasonable way to monitor this?

Peter Bengtsson [:peterbe]

Reporter

Comment 16

•

11 years ago

:ericz had a chance to look at rhelmer's last comment, which led to this: https://bugzilla.mozilla.org/show_bug.cgi?id=836425

I'm eager to start working on this but it would be nice to hear that Nagios will be able to work with that.

Eric Ziegenhorn :ericz

Assignee

Comment 17

•

11 years ago

:peterbe, that sounds like it might work.  I'll comment on bug 836425.

Peter Bengtsson [:peterbe]

Reporter

Comment 18

•

11 years ago

Over to you ericz. See how crontabber.py is invoked in crontab for the right parameters and stuff. In fact, it's wrapped in a bash script I think but I'm sure you'll know what you need to do puppet-wise.

All you need to do is add ``--nagios`` and it'll exit 0, 1 or 2 accordingly.
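
For reference, a small sketch of how a wrapper could interpret those exit codes. The 0/1/2 meanings below are the standard Nagios plugin convention (OK, WARNING, CRITICAL); which crontabber conditions map to 1 versus 2 is not spelled out in this bug. The helper name `run_check` is made up, and the command used here is a stand-in, not the real crontabber invocation.

```python
import subprocess
import sys

# Standard Nagios plugin return codes; anything else is treated as UNKNOWN.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command):
    """Run a check command; return (state_name, first_line_of_output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    state = NAGIOS_STATES.get(result.returncode, "UNKNOWN")
    output = result.stdout.strip().splitlines()
    return state, (output[0] if output else "")


# Stand-in command that prints one line and exits 1, in place of
# crontabber.py --nagios (its actual output format is not assumed here).
fake_check = [
    sys.executable,
    "-c",
    "import sys; print('1 job failing'); sys.exit(1)",
]
print(run_check(fake_check))
```

When run from NRPE no wrapper is needed, since Nagios reads the exit code and the first line of output directly.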

Eric Ziegenhorn :ericz

Assignee

Comment 19

•

11 years ago

:peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1 or socorroadm.stage.private.phx1.  Can you point me in the right direction?

Robert Helmer [:rhelmer]

Updated

•

11 years ago

Summary: crontabber errors should go to some sort of monitoring → crontabber errors should be reported by nagios

Robert Helmer [:rhelmer]

Comment 20

•

11 years ago

(In reply to Eric Ziegenhorn :ericz from comment #19)
> :peterbe, I'm not seeing crontabber.py in cron anywhere on sp-admin01.phx1
> or socorroadm.stage.private.phx1.  Can you point me in the right direction?

This should work now on both prod and stage:

PYTHONPATH=/data/socorro/application:/data/socorro/thirdparty/ /data/socorro/application/socorro/cron/crontabber.py --admin.conf=/etc/socorro/crontabber.ini --nagios

Lonnen :lonnen

Comment 21

•

11 years ago

We've had to disable crontabber.py in production, so this should be held up for a while.

Lonnen :lonnen

Comment 22

•

11 years ago

crontabber.py is re-enabled in production. Comment 20 is relevant again.

Peter Bengtsson [:peterbe]

Reporter

Comment 23

•

11 years ago

:ericz Has this been enabled on stage and prod yet?

Eric Ziegenhorn :ericz

Assignee

Comment 24

•

11 years ago

No.  I've been swamped with Graphite but will try and squeeze this in soon.

Eric Ziegenhorn :ericz

Assignee

Comment 25

•

11 years ago

As per IRC these alerts should go to cron-socorro@mozilla.com and #socorro-alerts.

Eric Ziegenhorn :ericz

Assignee

Comment 26

•

11 years ago

This is done in stage and prod, and I've seen the alerts fire.  When this alert fires, it will include a documentation link to https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab.  If you could fill that out, that'd be great.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Peter Bengtsson [:peterbe]

Reporter

Comment 27

•

11 years ago

(In reply to Eric Ziegenhorn :ericz from comment #26)
> This is done in stage and prod, and I've seen the alerts fire.  When this
> alerts there will be a documentation link to
> https://mana.mozilla.org/wiki/display/NAGIOS/Socorro+Admin+-+crontab.  If
> you could fill that out, that'd be great.

Link --> Page Not Found.

Eric Ziegenhorn :ericz

Assignee

Comment 28

•

11 years ago

It works for me.  Do you have access to mana.mozilla.org?

Eric Ziegenhorn :ericz

Assignee

Comment 29

•

11 years ago

Apparently this is restricted, so never mind about filling that out.

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard