Closed Bug 651267 Opened 13 years ago Closed 13 years ago

Need cronjobs table

Categories: Socorro :: General (task)
Platform: x86, macOS
Priority: Not set
Severity: normal

RESOLVED FIXED

People

(Reporter: jberkus, Assigned: jberkus)

We have a number of cronjobs which need to "backfill" if they fail or skip for several hours.  Currently our way of keeping track of which ones need to backfill is by querying the tables affected by the cronjob.  However, since many of these tables don't fill on a current-timestamp basis, this process is fragile.  Particularly, signature_updates doesn't work in testing because of it.

As such, I'd like to add a table called cronjobs, which would have the status and last run time of all cron jobs which run on PostgreSQL.  Schema:

table cronjobs (
   job_name text not null,
   last_success_time timestamptz,
   last_success_target timestamptz,
   last_failure_time timestamptz,
   last_failure_target timestamptz
);

I would add stored procedures to help with populating and using this table, to wit:

SELECT get_last_cronjob_run($job_name)
SELECT update_cronjob_run($job_name, $target_time, $success)

... with some refinement from Rob, of course.
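
A minimal sketch of the backfill calculation that a recorded last_success_target would make possible. The function name backfill_targets and the fixed hourly frequency are assumptions for illustration only; neither appears in the proposal above.

```python
from datetime import datetime, timedelta, timezone


def backfill_targets(last_success_target, now, frequency):
    """Return the target times that still need a run, oldest first.

    Hypothetical helper: assumes the job runs at a fixed frequency and that
    last_success_target is the last target time that completed successfully.
    """
    targets = []
    next_target = last_success_target + frequency
    while next_target <= now:
        targets.append(next_target)
        next_target += frequency
    return targets


# Example: the last good run covered 09:00 and it is now 12:30.
last_ok = datetime(2011, 4, 19, 9, 0, tzinfo=timezone.utc)
now = datetime(2011, 4, 19, 12, 30, tzinfo=timezone.utc)
missed = backfill_targets(last_ok, now, timedelta(hours=1))
```

With these inputs, missed holds the 10:00, 11:00 and 12:00 targets, which is the list a backfilling job would have to work through without querying the affected tables.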
Can we just use a pickle?  That's how the others work.
We could, but that means that the information about what cron jobs have been run and when is located somewhere non-central and not monitorable by nagios (and irrecoverable if spadmin dies).  Given the number of these we run, it seems like something central is called for at this point.

Anyway, I think that Brandon has a suggestion for fixing this in a less ad-hoc way.
Rob, Jabba, and I have been having discussions about doing this a different way for a while.  We'd like to use some kind of distributed scheduler and process control system (perhaps involving supervisord).

Let's hold off on implementing this in postgres until we finish that conversation, because we may jump in a different direction.
The other issue having either a cron table, or comprehensive pickle files, would solve is Rob's current issue of bootstrapping an empty database.
(In reply to comment #4)
> The other issue having either a cron table, or comprehensive pickle files,
> would solve is Rob's current issue of bootstrapping an empty database.

Tracking the above in bug 653362 btw.
Rob, Brandon,

Please check the following data structure to see that it'll work for you:

table cronjobs (
   cronjob  text not null primary key,
   enabled boolean not null default true,
   frequency interval,
   last_success timestamptz,
   last_failure timestamptz,
   failure_message text
);

This seems like the simplest construction.  Then I can implement a couple of functions, or you can handle cycling stuff in the python.   Let's chat about it.
Target Milestone: 1.7.8 → 2.0
Assignee: josh → mpressman
Assignee: mpressman → josh
Target Milestone: 2.0 → 2.1
Rob says that he can't get python code done by Thursday.  As such, bumping this feature to 2.1.
(In reply to comment #6)
> Rob, Brandon,
> 
> Please check the following data structure to see that it'll work for you:
> 
> table cronjobs (
>    cronjob  text not null primary key,
>    enabled boolean not null default true,
>    frequency interval,
>    last_success timestamptz,
>    last_failure timestamptz,
>    failure_message text
> );
> 
> This seems like the simplest construction.  Then I can implement a couple of
> functions, or you can handle cycling stuff in the python.   Let's chat about
> it.

This seems fine. I am not sure about "frequency", though: that only exists in the crontab right now, and the job generally doesn't have access to it.

Currently, different jobs implement this in an ad-hoc manner, such as:
http://code.google.com/p/socorro/source/browse/trunk/socorro/cron/bugzilla.py#115

Other scripts might look at the last time a table was updated.

A good place to start might be to implement new get_last_run_date/save_last_run_date functions which use this new table in a utility class, or perhaps a "cronjob" class that all python cron jobs inherit from. Then we can move the jobs over piecemeal, and be able to easily back it out if anything goes wrong.
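
One possible shape for the "cronjob" base class idea, as a sketch only. The names CronJob, InMemoryRunStore and BugzillaJob are hypothetical, and the in-memory store stands in for the new table so that the example runs on its own.

```python
from datetime import datetime, timezone


class InMemoryRunStore:
    """Hypothetical stand-in for the cronjobs table."""

    def __init__(self):
        self._rows = {}

    def get_last_run_date(self, cronjob):
        return self._rows.get(cronjob)

    def save_last_run_date(self, cronjob, when):
        self._rows[cronjob] = when


class CronJob:
    """Hypothetical base class; subclasses set `name` and implement `run`."""

    name = None

    def __init__(self, store):
        self.store = store

    def run(self, since, now):
        raise NotImplementedError

    def main(self, now=None):
        now = now or datetime.now(timezone.utc)
        since = self.store.get_last_run_date(self.name)
        self.run(since, now)
        # Only reached when run() did not raise, so a failed run keeps the old date.
        self.store.save_last_run_date(self.name, now)


class BugzillaJob(CronJob):
    name = "bugzilla"

    def __init__(self, store):
        super().__init__(store)
        self.calls = []

    def run(self, since, now):
        # A real job would fetch everything changed between `since` and `now`.
        self.calls.append((since, now))


store = InMemoryRunStore()
job = BugzillaJob(store)
first = datetime(2011, 5, 1, 0, 0, tzinfo=timezone.utc)
second = datetime(2011, 5, 1, 1, 0, tzinfo=timezone.utc)
job.main(now=first)
job.main(now=second)
```

Because the store is passed in, a job could be moved from its pickle file to the table by swapping the store object, and moved back the same way if anything goes wrong.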
The reason for "frequency" is for monitoring.  That is, nagios/ganglia needs to check if the cron job was run when it was supposed to be run, and in order to do that, it needs to know how often it was supposed to be run.

In theory, you could also design a generic cronjob which ran every 1/2 hour and ran the various cronjobs based on the frequency in the table.  Not sure if we want to go that far.
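
A rough sketch of the check a monitoring script could make once frequency is stored. The function name is_overdue and the five-minute grace period are assumptions, not part of any existing nagios plugin.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_success, frequency, now, grace=timedelta(minutes=5)):
    """Return True if the job should have succeeded again by `now`.

    Hypothetical check: `grace` is an assumed allowance for job run time.
    """
    if frequency is None:
        return False  # no schedule recorded, nothing to check against
    if last_success is None:
        return True  # scheduled, but has never succeeded
    return now - last_success > frequency + grace


now = datetime(2011, 5, 1, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)
late = is_overdue(now - timedelta(minutes=70), hourly, now)
on_time = is_overdue(now - timedelta(minutes=60), hourly, now)
```

The same comparison, without the grace period, is what a generic dispatcher would use to decide which jobs are due on each half-hourly pass.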

The other thing to keep in mind (and one of the sources of issues with missed cronjobs) is that there's a difference between the "run timestamp" and the "target timestamp".  That is, there's the time the job was run at and the time it was run for.  Take TCBS, for example; TCBS is run at NOW, but would be run for a time 3 hours ago.   TCBS is not the only cronjob we have with this kind of lag.

So, revised table:

table cronjobs (
   cronjob  text not null primary key,
   enabled boolean not null default true,
   frequency interval,
   lag interval,
   last_success timestamptz,
   last_target_time timestamptz,
   last_failure timestamptz,
   failure_message text
);
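
To make the run time versus target time distinction concrete, here is a small sketch of how the lag column could be applied. The helper name target_time is hypothetical; the three-hour lag is the TCBS figure mentioned above.

```python
from datetime import datetime, timedelta, timezone


def target_time(run_time, lag=None):
    """Return the time a job is run *for*, given the time it is run *at*.

    Hypothetical helper: a missing lag means the two times are identical.
    """
    return run_time - lag if lag is not None else run_time


run_at = datetime(2011, 5, 1, 12, 0, tzinfo=timezone.utc)
tcbs_target = target_time(run_at, timedelta(hours=3))
no_lag_target = target_time(run_at)
```

Here a run at 12:00 with a three-hour lag has a target of 09:00, which is the value that would be stored in last_target_time.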
OK, I've loaded the following schema:

create table cronjobs (
   cronjob  text not null primary key,
   enabled boolean not null default true,
   frequency interval,
   lag interval,
   last_success timestamptz,
   last_target_time timestamptz,
   last_failure timestamptz,
   failure_message text,
   description text
);

I'll be working on some python classes next.  No promises that they'll get done by Thursday.
Here are my proposed python classes:

get_last_run_time
  input properties: cronjob_name
  output properties:
      time = target_time of last successful run (or runtime, if identical)
      fail_time = runtime of last failure
      run_time = runtime of last success

set_cronjob_run
  input properties: 
      cronjob_name = name of the job
      runtime = now
      target_time = time job was run *for*; use runtime if not set
      success = boolean
      failure_message = text of error message if success = false

good to start?
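
A sketch of how the two proposed calls might behave, written as plain functions. It uses an in-memory sqlite3 database purely as a stand-in for PostgreSQL, stores timestamps as ISO-8601 text, and omits the enabled, frequency, lag and description columns; all of those choices are assumptions made so the example is self-contained.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def make_table(conn):
    # Simplified stand-in for the real table; timestamps are ISO-8601 text.
    conn.execute(
        """
        CREATE TABLE cronjobs (
            cronjob TEXT NOT NULL PRIMARY KEY,
            last_success TEXT,
            last_target_time TEXT,
            last_failure TEXT,
            failure_message TEXT
        )
        """
    )


def set_cronjob_run(conn, cronjob_name, runtime, target_time=None,
                    success=True, failure_message=None):
    target_time = target_time or runtime  # use runtime if no target is given
    conn.execute("INSERT OR IGNORE INTO cronjobs (cronjob) VALUES (?)",
                 (cronjob_name,))
    if success:
        conn.execute(
            "UPDATE cronjobs SET last_success = ?, last_target_time = ? "
            "WHERE cronjob = ?",
            (runtime.isoformat(), target_time.isoformat(), cronjob_name),
        )
    else:
        conn.execute(
            "UPDATE cronjobs SET last_failure = ?, failure_message = ? "
            "WHERE cronjob = ?",
            (runtime.isoformat(), failure_message, cronjob_name),
        )
    conn.commit()


def get_last_run_time(conn, cronjob_name):
    row = conn.execute(
        "SELECT last_target_time, last_failure, last_success "
        "FROM cronjobs WHERE cronjob = ?",
        (cronjob_name,),
    ).fetchone()
    if row is None:
        return None  # job has never been recorded
    parse = lambda s: datetime.fromisoformat(s) if s else None
    return {"time": parse(row[0]), "fail_time": parse(row[1]),
            "run_time": parse(row[2])}


conn = sqlite3.connect(":memory:")
make_table(conn)
run_at = datetime(2011, 5, 1, 12, 0, tzinfo=timezone.utc)
later = run_at + timedelta(hours=1)
set_cronjob_run(conn, "tcbs", run_at, target_time=run_at - timedelta(hours=3))
set_cronjob_run(conn, "tcbs", later, success=False, failure_message="timeout")
last = get_last_run_time(conn, "tcbs")
```

In this sketch a failed run leaves the last successful target untouched, so a backfilling job can still read "time" to find where to resume.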
Table created, waiting for Thursday to create on Stage.
Blocks: 664496
Now part of general cronjob tracking bug.  This bug now covers the table creation *only*.
Target Milestone: 2.1 → 2.0
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro