Closed Bug 853649 Opened 11 years ago Closed 11 years ago

[crontabber] support asynchronous (and blocking) execution of jobs

Categories

(Socorro :: General, task)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: Future

People

(Reporter: peterbe, Assigned: peterbe)

Details

Crontabber is executed every 5 minutes (by crontab), and a crude bash wrapper locks it so that if executing a bunch of jobs takes longer than 5 minutes, a second process isn't started. This is good because we don't want to accidentally start working on a "child job" if its "parent job" hasn't finished yet. 

This is quite impractical, since most jobs take a long time to finish and all other jobs (which might be totally independent) have to wait patiently. It is particularly impractical when you mix the postgres stored procedures with the pig/hadoop jobs; both can take many minutes to complete. 

The solution is to remove the bash locking and take it "in house", so that crontab can fire up multiple python processes. This means you can have multiple crontabbers running simultaneously. In particular, it means we need to make the writing to the crontabbers.json file (and its backup to postgres) "thread safe" (in quotation marks because we're not running multiple threads, we're running multiple processes). I.e. we open crontabber.json, load it into memory, run a slow job, and then write back to the file several minutes later. 

The inner locking needs to be aware of dependencies. "Trees of dependency" need to run serially. For example, we don't want to run the nightly matview while the adu matview is still running. What we'll do is serialize each dependency tree and make that the lock: extract a job's parents, grandparents, children and grandchildren into a sorted list and hash it. 
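As an illustration only (the job names, the dependencies mapping and the helper functions below are made up, not actual crontabber code), the lock key could be derived roughly like this: walk the dependency graph in both directions, sort the resulting set of app names, and hash it, so every job in the same tree contends for one lock while unrelated trees run in parallel.

    import hashlib

    def related_jobs(app_name, dependencies):
        # Collect app_name plus all of its ancestors and descendants.
        # `dependencies` maps app_name -> list of parent app_names.
        seen = set()
        stack = [app_name]
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            # parents (and, by iteration, grand-parents)
            stack.extend(dependencies.get(name, []))
            # children (and, by iteration, grand-children)
            stack.extend(
                child for child, parents in dependencies.items()
                if name in parents
            )
        return seen

    def lock_key(app_name, dependencies):
        # Every job in the same dependency tree yields the same hash, so
        # related jobs serialize on one lock while unrelated trees run freely.
        tree = sorted(related_jobs(app_name, dependencies))
        return hashlib.md5(','.join(tree).encode('utf-8')).hexdigest()

    # made-up example graph
    dependencies = {
        'adu-matview': [],
        'nightly-matview': ['adu-matview'],
        'pig-job': [],
    }
    assert lock_key('nightly-matview', dependencies) == lock_key('adu-matview', dependencies)
    assert lock_key('pig-job', dependencies) != lock_key('adu-matview', dependencies)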

The first thing to solve is to make the reading and writing of crontabbers.json safe across processes. When we change the state, we first need to re-read the file and merge in any state that another process has written since we loaded it.
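A minimal sketch of that read-merge-write cycle, assuming the state file is a flat JSON dict keyed by app name and that POSIX file locking is available (the file path and function name are illustrative, not the actual crontabber implementation):

    import fcntl
    import json
    import os

    STATE_FILE = 'crontabbers.json'  # illustrative path

    def save_job_state(app_name, job_state):
        # Hold an exclusive lock while we re-read, merge and write back, so a
        # concurrent crontabber process can't clobber our change (POSIX only).
        fd = os.open(STATE_FILE, os.O_RDWR | os.O_CREAT, 0o644)
        with os.fdopen(fd, 'r+') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                content = f.read()
                state = json.loads(content) if content.strip() else {}
                state[app_name] = job_state  # merge: only touch our own key
                f.seek(0)
                f.truncate()
                json.dump(state, f, indent=2)
                f.flush()
                os.fsync(f.fileno())
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

The point is re-reading the file under the lock right before writing: the copy loaded into memory several minutes earlier is thrown away, and only the finished job's own entry is merged back.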
(In reply to Peter Bengtsson [:peterbe] from comment #0)
> The solution is to remove the bash locking and taking it "in house" so that
> crontab can fire up multiple python processes. This means that you can have
> multiple crontabbers running simultaneously. In particular this means that
> we need to make the writing to the crontabbers.json file (and its backup to
> postgres) "thread safe" (put in quotation marks because we're not running
> multiple threads, we're running multiple processors). I.e. we open
> crontabber.json, load it into memory, run a slow job, and then writing back
> to the file several minutes later. 

This might be tricky to get right: for example, you need to detect that the file has been written to since you read it, and merge the changes (if the locking is done right there should at least be no merge conflicts to resolve).

What is the benefit of having the JSON file anyway? Just doing inserts into a table, and resolving anything necessary at read time, would let the DB do this for us. It would also give us a history of changes (append-only instead of overwriting all the data on each run), for example with a table along the lines of the sketch below.
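To make that concrete, here is a rough sketch of the append-only idea; the crontabber_log table, its columns and the use of psycopg2 are all assumptions for illustration, not anything that exists in Socorro today.

    import json
    import psycopg2

    # hypothetical append-only table:
    #   CREATE TABLE crontabber_log (
    #       app_name TEXT NOT NULL,
    #       state TEXT NOT NULL,          -- JSON-encoded blob
    #       logged_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    #   );

    def record_job_state(conn, app_name, state):
        # Append a new row instead of overwriting; history comes for free.
        cursor = conn.cursor()
        cursor.execute(
            "INSERT INTO crontabber_log (app_name, state) VALUES (%s, %s)",
            (app_name, json.dumps(state)),
        )
        conn.commit()

    def current_state(conn):
        # Resolve the current state at read time: the latest row per app.
        cursor = conn.cursor()
        cursor.execute("""
            SELECT DISTINCT ON (app_name) app_name, state
            FROM crontabber_log
            ORDER BY app_name, logged_at DESC
        """)
        return dict((name, json.loads(blob)) for name, blob in cursor.fetchall())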

I could see the argument that blocking jobs from running just because the DB is down seems wrong, but I can't think of any jobs that don't need input from or deliver output to the database.
You make a very interesting case. The reason we're using crontabbers.json is that we initially wanted this to be "pure", i.e. to have as few dependencies as possible. 
Perhaps the best way is to make state.load() and state.save() go straight to postgres, with no .json file at all. In fact, we could ditch state.save() entirely and override state.__setattr__ to write straight to postgres immediately. Or some other simplification along those lines.
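A loose sketch of that __setattr__ idea (the class name, the DSN and the crontabber_log table from the previous comment are all made up for illustration; psycopg2 is assumed):

    import json
    import psycopg2

    class JobStateTracker(object):
        # Illustrative only: every attribute assignment is persisted to
        # postgres immediately, so there is no separate save() step to
        # forget or to race on.

        def __init__(self, dsn):
            # use object.__setattr__ so we don't trigger our own override
            object.__setattr__(self, '_conn', psycopg2.connect(dsn))

        def __setattr__(self, app_name, state):
            object.__setattr__(self, app_name, state)
            cursor = self._conn.cursor()
            cursor.execute(
                "INSERT INTO crontabber_log (app_name, state) VALUES (%s, %s)",
                (app_name, json.dumps(state)),
            )
            self._conn.commit()

    # hypothetical usage:
    #   state = JobStateTracker('dbname=breakpad user=postgres')
    #   state.adu_matview = {'last_success': '2013-03-22', 'error_count': 0}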
Assignee: nobody → peterbe
Target Milestone: --- → 48
Target Milestone: 48 → Future
Running parallel crontabbers is now possible, thanks to no longer using crontabber.json.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED