crontabber should be able to retry more often than the job's frequency

RESOLVED DUPLICATE of bug 854012

Status

RESOLVED DUPLICATE of bug 854012
7 years ago
6 years ago

People

(Reporter: rhelmer, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

7 years ago
We hit a problem with crontabber not running the daily matviews jobs on time after deploying it with Socorro 16 (see bug 775532).

The problem seems to be that crontabber tried to run shortly after we deployed, and correctly detected that everything had already run (which is an error).

However, it set "next_run" for all jobs to 24h in the future (from cronta:

    "last_run": "2012-07-18 21:53:10.871443" 
    "first_run": "2012-07-18 21:53:10.871443"
    "next_run": "2012-07-19 21:53:10.840229"

I think that this is way too long to wait in the event of a failure; crontabber should probably retry hourly.
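
A rough sketch of the retry behaviour proposed here, in Python. The function name `next_run_after`, the `succeeded` flag, and the one-hour `RETRY_INTERVAL` are illustrative assumptions, not crontabber's actual code or configuration.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration; not taken from crontabber itself.
DAILY = timedelta(days=1)
RETRY_INTERVAL = timedelta(hours=1)


def next_run_after(last_run, succeeded, frequency=DAILY, retry=RETRY_INTERVAL):
    """Return when a job should run next.

    On success, wait the full frequency. On failure, wait only the
    (shorter) retry interval, never longer than the frequency itself.
    """
    delay = frequency if succeeded else min(retry, frequency)
    return last_run + delay


last_run = datetime(2012, 7, 18, 21, 53, 10)
print(next_run_after(last_run, succeeded=True))   # 2012-07-19 21:53:10
print(next_run_after(last_run, succeeded=False))  # 2012-07-18 22:53:10
```

With something like this, a failed daily job would be picked up again within the hour instead of a day later.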

A related problem is that for time-sensitive jobs like this, the start time could begin to "drift" if it's always based on the last_run time, even with the above fixed.

We should probably have a way to specify the start time as well as the frequency (unless someone has a better idea for a solution to this).
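
To illustrate the drift concern, here is a minimal Python sketch that anchors the next run to a fixed time of day rather than to last_run. The helper name `next_scheduled_run` is hypothetical, and naive datetimes are assumed to be in UTC.

```python
from datetime import datetime, time, timedelta


def next_scheduled_run(now, start_time):
    """Return the next occurrence of start_time (UTC) strictly after now.

    Because the result depends only on the wall clock and the fixed
    start time, a late or slow run does not push later runs back.
    """
    candidate = datetime.combine(now.date(), start_time)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2012, 7, 18, 21, 53, 10)
print(next_scheduled_run(now, time(22, 30)))  # 2012-07-18 22:30:00
print(next_scheduled_run(now, time(8, 0)))    # 2012-07-19 08:00:00
```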

Comment 1

7 years ago
We do have the option of setting which hour a job should run. E.g. 22:00. Like this:

"socorro.cron.jobs.matviews.DailyCrashesCronApp|1d|22:30"

The hour you set will be in UTC so if you want it to be in the middle of the night here on the west coast, make it something like "08:00".
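
For readers unfamiliar with the format, the job line above could be read roughly as follows. This is only a sketch: `parse_job_spec` is a hypothetical helper, not crontabber's own parser, and the "h" and "m" unit suffixes are an assumption (only "1d" appears in this bug).

```python
from datetime import time, timedelta

# Unit suffixes are an assumption for this sketch; the thread only shows "1d".
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_job_spec(spec):
    """Split 'dotted.path.ClassName|frequency|HH:MM' into its parts.

    The start time is optional and interpreted as UTC.
    """
    parts = spec.split("|")
    if len(parts) not in (2, 3):
        raise ValueError("expected 'class|frequency[|HH:MM]': %r" % spec)
    class_path, frequency = parts[0], parts[1]
    number, unit = frequency[:-1], frequency[-1]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError("unsupported frequency: %r" % frequency)
    interval = timedelta(**{_UNITS[unit]: int(number)})
    start = None
    if len(parts) == 3:
        hour, minute = parts[2].split(":")
        start = time(int(hour), int(minute))
    return class_path, interval, start


print(parse_job_spec("socorro.cron.jobs.matviews.DailyCrashesCronApp|1d|22:30"))
```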

Comment 2

7 years ago
As a note, we are probably able to run the crash aggregations any time after midnight UTC, as we're aggregating for the last UTC day. We may want to delay somewhat due to system loads and possibly metrics pushing ADU data only at a certain point.
I would prefer us to get nearer to UTC midnight than what we have now, but given that we're ending up deep in the off-work/night hours in the US anyhow, I'd just prefer it to not be too late in the European day.
(Reporter)

Comment 3

7 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> We do have the option of setting which hour a job should run. E.g. 22:00.
> Like this:
> 
> "socorro.cron.jobs.matviews.DailyCrashesCronApp|1d|22:30"
> 
> The hour you set will be in UTC so if you want it to be in the middle of the
> night here on the west coast, make it something like "08:00".

OK cool so that part of it we don't have to worry about.

What do you think about retrying more often after failure? If we were to run the daily matviews jobs a little after midnight and they failed, then I think we'd want to retry pretty soon and not wait 1d before retry.

Comment 4

7 years ago
(In reply to Robert Helmer [:rhelmer] from comment #3)
> 
> What do you think about retrying more often after failure? If we were to run
> the daily matviews jobs a little after midnight and they failed, then I
> think we'd want to retry pretty soon and not wait 1d before retry.

That's not a trivial change. Especially for the backfillers. 
I would really need to think about what this means before I say much more.
(Reporter)

Comment 5

7 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> We do have the option of setting which hour a job should run. E.g. 22:00.
> Like this:
> 
> "socorro.cron.jobs.matviews.DailyCrashesCronApp|1d|22:30"
> 
> The hour you set will be in UTC so if you want it to be in the middle of the
> night here on the west coast, make it something like "08:00".

Filed bug 777010 to get this into tomorrow's release.

(In reply to Peter Bengtsson [:peterbe] from comment #4)
> (In reply to Robert Helmer [:rhelmer] from comment #3)
> > 
> > What do you think about retrying more often after failure? If we were to run
> > the daily matviews jobs a little after midnight and they failed, then I
> > think we'd want to retry pretty soon and not wait 1d before retry.
> 
> That's not a trivial change. Especially for the backfillers. 
> I would really need to think about what this means before I say much more.

OK well as long as we solve the first problem (specifying a start time), we're not any worse off than the current situation. I think being able to retry more often would be nice-to-have.
(Reporter)

Updated

7 years ago
Assignee: nobody → rhelmer
(Reporter)

Comment 7

6 years ago
Not working on this right now, but it would still be a very useful feature. As it stands, manual intervention is required if a once-per-day job fails (which is no worse than the old situation, per comment 5).
Assignee: rhelmer → nobody
(Reporter)

Comment 8

6 years ago
Fixed in bug 854012
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 854012