Closed Bug 777670 Opened 8 years ago Closed 7 years ago

crontabber not running daily matviews since 2012-07-25

Categories

(Socorro :: Infra, task)

task
Not set

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: rhelmer, Assigned: peterbe)

Details

(Whiteboard: [qa-])

Attachments

(1 file)

I'm not sure what to make of the log, looks like it tried to run things?

2012-07-26 03:00:02,069 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.weekly_reports_partitions.WeeklyReportsPartitionsCronApp'> because it's not time to run
2012-07-26 03:00:02,069 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.nightly_builds.NightlyBuildsCronApp'>
2012-07-26 03:00:02,092 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'>
2012-07-26 03:00:02,628 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'> on 2012-07-26 10:00:02.092426+00:00
2012-07-26 03:00:02,652 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.SignaturesCronApp'>
2012-07-26 03:00:02,675 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.OSVersionsCronApp'>
2012-07-26 03:00:02,699 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.TCBSCronApp'>
2012-07-26 03:00:02,721 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ADUCronApp'>
2012-07-26 03:00:02,745 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.DailyCrashesCronApp'>
2012-07-26 03:00:02,771 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.HangReportCronApp'>
2012-07-26 03:00:02,795 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.RankCompareCronApp'>
2012-07-26 03:00:02,821 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.NightlyBuildsCronApp'>
2012-07-26 03:00:02,853 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.duplicates.DuplicatesCronApp'> because it's not time to run
2012-07-26 03:00:02,854 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.reports_clean.ReportsCleanCronApp'> because it's not time to run
2012-07-26 03:00:02,854 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.bugzilla.BugzillaCronApp'> because it's not time to run

According to crontabbers.json there were no errors and it ran:

  "tcbs-matview": {
    "next_run": "2012-07-27 10:00:00.000000", 
    "last_error": {}, 
    "first_run": "2012-07-25 23:19:56.679210", 
    "last_run": "2012-07-26 10:00:02.699569", 
    "error_count": 0
  }, 

However there is no data for 2012-07-25 in TCBS etc. I checked and we do have data in raw_adu (if the problem was something like this I'd expect an error).

It looks like crontabber finished in under 5m which is also suspicious.
Running backfill as breakpad_rw:

breakpad=> select backfill_matviews('2012-01-25', '2012-01-25');
(In reply to Robert Helmer [:rhelmer] from comment #1)
> Running backfill as breakpad_rw:
> 
> breakpad=> select backfill_matviews('2012-01-25', '2012-01-25');

Oops of course I actually meant:

breakpad=> select backfill_matviews('2012-07-25', '2012-07-25');

(restarted the psql connection etc)
Backfill completed:

breakpad=> select backfill_matviews('2012-07-25', '2012-07-25');
INFO:  backfilling reports_clean
INFO:  backfilling 06:00:00 of reports_clean starting at 2012-07-25 00:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling 06:00:00 of reports_clean starting at 2012-07-25 06:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling 06:00:00 of reports_clean starting at 2012-07-25 12:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling 06:00:00 of reports_clean starting at 2012-07-25 18:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling 06:00:00 of reports_clean starting at 2012-07-26 00:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling 02:00:00 of reports_clean starting at 2012-07-26 06:00:00+00
CONTEXT:  SQL statement "SELECT backfill_reports_clean( first_rc, last_rc )"
PL/pgSQL function "backfill_matviews" line 48 at PERFORM
INFO:  backfilling other matviews for 2012-07-25
INFO:  adu
INFO:  tcbs
NOTICE:  table "new_tcbs" does not exist, skipping
CONTEXT:  SQL statement "DROP TABLE IF EXISTS new_tcbs"
PL/pgSQL function "backfill_matviews" line 58 at SQL statement
INFO:  daily crashes
INFO:  signatures
INFO:  hang report
INFO:  nightly builds
INFO:  rank_compare
INFO:  correlations
 backfill_matviews 
-------------------
 t
(1 row)
peterbe, could you please take a look at why this didn't work? See comment 0, logs are on sp-admin01
Assignee: nobody → peterbe
Summary: crontabber did not run TCBS etc. for 2012-07-25 → crontabber did not run daily matviews for 2012-07-25
Josh: Can you double-check that the code is right:

https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/matviews.py#L51

In particular:

1. Is the name of the stored procedure spelled correctly? 'update_tcbs'

2. Note that the code doesn't do any "external" transaction commits. 

3. Note that the jobs based on the 'MatViewBackfillBase' class take the "now date" (which would have been "2012-07-26 10:00:02") minus 1 day and with the hour part trimmed off, e.g. "2012-07-25".

3b. Because this ran earlier in the release-blur but at a different hour it would have already been fed the "2012-07-25" date.
Josh: Ignore my previous comment. 

We've figure out what the problem was. Most of the matviews were backfill based and they don't just look at the "next_run" attribute. Because when they run, the also look at the last_run and based on that it takes last_run + frequency till it exceeds "now". So for those it doesn't just run once, it runs as many times as it has missed. 

Because we manually fudged the "next_run" to correct for the misunderstand of UTC vs. Pacific time, we forgot to also fudge the last_run. 
And because last_run + frequency < 24 hours, it wasn't time to run them.
I'm now pretty confident the tcbs-matview job will run the next time. In fact, all backfilling jobs that specify a time will run. 

I've added some tests which hopefully prove it:
https://github.com/mozilla/socorro/pull/743

If a job doesn't have a `last_success` to base it's backfilling calculation on, it will instead use the `first_run` date.
Looks like it tried to run, but for the wrong day:

InternalError: nightly_builds has already been run for 2012-07-25.
InternalError: TCBS has already been run for the day 2012-07-25.

etc.

Running backfill now:
breakpad=> select backfill_matviews('2012-07-26', '2012-07-26');
Summary: crontabber did not run daily matviews for 2012-07-25 → crontabber did not run daily matviews for 2012-07-25 or 26
No TCBS or graphs update, although it does look like everything ran, or at least logged that it did:

2012-07-28 03:00:01,961 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.nightly_builds.NightlyBuildsCronApp'>
2012-07-28 03:00:03,984 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.nightly_builds.NightlyBuildsCronApp'> on 201
2-07-28 10:00:01.467704+00:00
2012-07-28 03:00:04,267 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'>
2012-07-28 03:00:04,507 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'> on 2012-07
-28 10:00:04.267924+00:00
2012-07-28 03:00:04,709 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.SignaturesCronApp'>
2012-07-28 03:03:37,224 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.SignaturesCronApp'> on 2012-07-28 1
0:00:01.881395+00:00
2012-07-28 03:03:37,249 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.OSVersionsCronApp'>
2012-07-28 03:03:38,375 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.OSVersionsCronApp'> on 2012-07-28 1
0:02:56.304722+00:00
2012-07-28 03:03:38,397 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.TCBSCronApp'>
2012-07-28 03:07:44,977 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.TCBSCronApp'> on 2012-07-28 10:02:5
7.743661+00:00
2012-07-28 03:07:45,003 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ADUCronApp'>
2012-07-28 03:07:45,042 DEBUG - MainThread - error when running <class 'socorro.cron.jobs.matviews.ADUCronApp'> on None
Traceback (most recent call last):
  File "/data/socorro/application/socorro/cron/crontabber.py", line 675, in _run_one
    for last_success in self._run_job(job_class, config, info):
  File "/data/socorro/application/socorro/cron/crontabber.py", line 89, in main
    function(when)
  File "/data/socorro/application/socorro/cron/crontabber.py", line 128, in _run_proxy
    self.run(connection, date)
  File "/data/socorro/application/socorro/cron/jobs/matviews.py", line 31, in run
    cursor.callproc(self.get_proc_name(), [target_date])
InternalError: update_adu has already been run for 2012-07-27

2012-07-28 03:07:45,067 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.DailyCrashesCronApp'>
2012-07-28 03:07:53,173 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.DailyCrashesCronApp'> on 2012-07-28
 10:02:57.869434+00:00
2012-07-28 03:07:53,773 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.HangReportCronApp'>
2012-07-28 03:38:14,925 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.HangReportCronApp'> on 2012-07-28 1
0:02:57.909317+00:00
2012-07-28 03:38:14,983 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.RankCompareCronApp'>
2012-07-28 03:38:17,150 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.RankCompareCronApp'> on 2012-07-28 
10:02:58.012266+00:00
2012-07-28 03:38:17,176 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.NightlyBuildsCronApp'>
2012-07-28 03:38:19,663 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.NightlyBuildsCronApp'> on 2012-07-2
8 10:03:00.275062+00:00
2012-07-28 03:38:19,801 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.duplicates.DuplicatesCronApp'>
2012-07-28 03:42:06,553 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.duplicates.DuplicatesCronApp'> on 2012-07-28 10:23:49.277860+00:00
Summary: crontabber did not run daily matviews for 2012-07-25 or 26 → crontabber did not run daily matviews for 2012-07-[25,26,27]
Started backfill for yesterday:
breakpad=> select backfill_matviews('2012-07-27', '2012-07-27');
peterbe, I don't see a cursor.commit() in either of the run() methods of the matview base classes:

https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/matviews.py#L19

Is this in autocommit mode or done elsewhere? If not then could this be it, since psycopg2 uses transactions by default, right?
(In reply to Robert Helmer [:rhelmer] from comment #11)
> peterbe, I don't see a cursor.commit() in either of the run() methods of the
> matview base classes:
> 
> https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/matviews.
> py#L19
> 
> Is this in autocommit mode or done elsewhere? If not then could this be it,
> since psycopg2 uses transactions by default, right?

It does NOT do any transaction commits in the python code. However, I asked Josh about this twice and he said it wasn't needed. Perhaps he misinterpreted the question. Perhaps we need to quickly add a connection.commit() to all the matviews.py jobs.
Did not work again tonight, looks like a scheduling problem again maybe?

2012-07-29 03:00:01,465 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.weekly_reports_partitions.WeeklyReportsPartitionsCronApp'> because it's not time to run
2012-07-29 03:00:01,465 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.nightly_builds.NightlyBuildsCronApp'>
2012-07-29 03:00:01,490 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'>
2012-07-29 03:00:01,661 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.matviews.ProductVersionsCronApp'> on 2012-07-29 10:00:01.491261+00:00
2012-07-29 03:00:01,682 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.SignaturesCronApp'>
2012-07-29 03:00:01,712 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.OSVersionsCronApp'>
2012-07-29 03:00:01,736 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.TCBSCronApp'>
2012-07-29 03:00:01,760 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.ADUCronApp'>
2012-07-29 03:00:01,804 DEBUG - MainThread - error when running <class 'socorro.cron.jobs.matviews.ADUCronApp'> on None
Traceback (most recent call last):
  File "/data/socorro/application/socorro/cron/crontabber.py", line 675, in _run_one
    for last_success in self._run_job(job_class, config, info):
  File "/data/socorro/application/socorro/cron/crontabber.py", line 89, in main
    function(when)
  File "/data/socorro/application/socorro/cron/crontabber.py", line 128, in _run_proxy
    self.run(connection, date)
  File "/data/socorro/application/socorro/cron/jobs/matviews.py", line 32, in run
    cursor.callproc(self.get_proc_name(), [target_date])
InternalError: update_adu has already been run for 2012-07-27

2012-07-29 03:00:01,829 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.DailyCrashesCronApp'>
2012-07-29 03:00:01,852 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.HangReportCronApp'>
2012-07-29 03:00:01,879 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.RankCompareCronApp'>
2012-07-29 03:00:01,901 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.matviews.NightlyBuildsCronApp'>
2012-07-29 03:00:01,924 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.duplicates.DuplicatesCronApp'> because it's not time to run
2012-07-29 03:00:01,925 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.reports_clean.ReportsCleanCronApp'> because it's not time to run
2012-07-29 03:00:01,925 DEBUG - MainThread - skipping <class 'socorro.cron.jobs.bugzilla.BugzillaCronApp'> because it's not time to run
Summary: crontabber did not run daily matviews for 2012-07-[25,26,27] → crontabber not running daily matviews since 2012-07-25
Since the hourly jobs seem to be running fine, instead of doing a full backfill I am just running the old dailyMatviews job. It only takes around 10min for the most important things (tcbs, signature, etc) and then quite a bit longer for the hang report.

If crontabber doesn't work again tomorrow, I'll run it this way in the morning.
Did not work tonight, likely due to previous failures (currently there is no way to distinguish "already ran" from actual failures, which will be fixed separately in the matviews)

I have the Socorro v16 daily_matviews scripts checked out, and am running them now since this takes way less time than backfilling.
Peter and I talked in IRC and agreed that we should roll back to the "old" crontabber until staging is in a good state.

Let's get that done today (7/30) so nobody has to be hand-running stuff at 3am tomorrow morning.  

Thanks for the diligence and hard work, guys.
(In reply to Laura Thomson :laura from comment #17)
> Peter and I talked in IRC and agreed that we should roll back to the "old"
> crontabber until staging is in a good state.
> 
> Let's get that done today (7/30) so nobody has to be hand-running stuff at
> 3am tomorrow morning.  
> 
> Thanks for the diligence and hard work, guys.

OK, why don't we leave the new code in place and just bring the old code back? That way we can switch or fallback just by changing the system crontab, and it'll be less disruptive than the way we've tried the past few times.
(In reply to Robert Helmer [:rhelmer] from comment #18)
> (In reply to Laura Thomson :laura from comment #17)
> > Peter and I talked in IRC and agreed that we should roll back to the "old"
> > crontabber until staging is in a good state.
> > 
> > Let's get that done today (7/30) so nobody has to be hand-running stuff at
> > 3am tomorrow morning.  
> > 
> > Thanks for the diligence and hard work, guys.
> 
> OK, why don't we leave the new code in place and just bring the old code
> back? That way we can switch or fallback just by changing the system
> crontab, and it'll be less disruptive than the way we've tried the past few
> times.

I suggest we comment out the matviews once from crontabber.ini and leave the code in place. Also, we'll need to re-surrect the old daily_matviews.py code plus scripts.
(In reply to Peter Bengtsson [:peterbe] from comment #19)
> (In reply to Robert Helmer [:rhelmer] from comment #18)
> > (In reply to Laura Thomson :laura from comment #17)
> > > Peter and I talked in IRC and agreed that we should roll back to the "old"
> > > crontabber until staging is in a good state.
> > > 
> > > Let's get that done today (7/30) so nobody has to be hand-running stuff at
> > > 3am tomorrow morning.  
> > > 
> > > Thanks for the diligence and hard work, guys.
> > 
> > OK, why don't we leave the new code in place and just bring the old code
> > back? That way we can switch or fallback just by changing the system
> > crontab, and it'll be less disruptive than the way we've tried the past few
> > times.
> 
> I suggest we comment out the matviews once from crontabber.ini and leave the
> code in place. Also, we'll need to re-surrect the old daily_matviews.py code
> plus scripts.

Makes sense. I'll do a PR.
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/80075bf2c97fc1fb280a44df6e51bc23566fd023
bug 777670 - resurrect old daily_matviews until crontabber is ready

https://github.com/mozilla/socorro/commit/5bdea1bf5dfa97994887fafb09ea01c85a07cde8
Merge pull request #751 from rhelmer/bug777670-resurrect-daily_matviews

bug 777670 - resurrect old daily_matviews until crontabber is ready
Whiteboard: [qa-]
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/43fc29564690c352f376ae7bc208ab17a3378986
bug777670 - backfills slow jobs with time

https://github.com/mozilla/socorro/commit/9da90ff79a5994b9fae717267b4506ea54b66923
Merge pull request #743 from peterbe/bug777670-backfills-slow-with-time

bug777670 - backfills slow jobs with time
This is a bug from the days when we tried to deploy crontabber. 
...which we later retracted. crontabber is a different beast these days and will hopefully soon be reattempted.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.