Closed Bug 1024580 Opened 10 years ago Closed 10 years ago

Socorro didn't (fully) update to current data today (2014-06-12)

Categories

(Socorro :: Backend, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Assigned: selenamarie)

References

Details

I noticed today that while some underlying data was present (my custom reports ran fine and those need e.g. product_adu to be updated, AFAIK by the adu-matview job), the Socorro UI was not displaying today's data correctly. The ADU push came late today (10am UTC instead of usually ~3:20am), so data could have been somewhat delayed anyhow, but looking at things, this wasn't even the issue. Something went down in a strange way in Crontabber land. What I first noticed was that in https://crash-stats.mozilla.com/crontabber-state/, all daily jobs have a next_run time of "in X hours", which should mean they ran fine and are scheduled again for tomorrow - but some have a last_success of "Y hours ago" (where Y is 24-X, as expected when they have been run fine today) but a lot of them have "a day ago" which suggests they did not run today. Together with the next_run being in the future (would be in the past if today's run was scheduled but not run yet, i.e. overdue), it looks like they were not even scheduled to run today!
Two things that can play into this are that due to bug 1023867, we had almost all daily jobs blocked for hours yesterday and run very late, and bug 1010609 has applied config changes to schedule the non-correlations jobs at 5am instead of 10am UTC (but didn't reset them yet). I focused together with peterbe on two example jobs that are fairly similar but one did run and one didn't: tcbs-build-matview (not run) and tcbs-matview (run correctly today). Looking at the times they actually ran last, I came to suspect that those jobs that did perform their very late run yesterday before midnight UTC were correctly scheduled today but those that happened to be run after midnight, i.e. already today, were not scheduled for their regular run today (but for tomorrow right away). This query from peterbe seems to confirm that with the two examples mentioned: breakpad=# select app_name,log_time, success, duration from crontabber_log where app_name like '%tcbs%' order by log_time desc limit 10; app_name | log_time | success | duration -------------------+-------------------------------+------------------------+---------------- tcbs-build-matview | 2014-06-12 10:22:26.090643+00 | 2014-06-12 05:00:00+00 | 00:00:36.58112 tcbs-matview | 2014-06-12 00:38:32.627713+00 | 2014-06-11 05:00:00+00 | 00:06:08.47249 tcbs-build-matview | 2014-06-11 23:45:49.365399+00 | 2014-06-11 10:00:00+00 | 00:00:37.30777 tcbs-matview | 2014-06-10 18:42:07.702854+00 | 2014-06-10 10:00:00+00 | 00:05:14.33292 tcbs-build-matview | 2014-06-10 10:22:04.266779+00 | 2014-06-10 10:00:00+00 | 00:00:38.205 tcbs-matview | 2014-06-09 11:02:11.661655+00 | 2014-06-09 10:00:00+00 | 00:06:26.26315 tcbs-build-matview | 2014-06-09 10:21:23.024723+00 | 2014-06-09 10:00:00+00 | 00:00:30.6378 tcbs-matview | 2014-06-08 11:04:24.315397+00 | 2014-06-08 10:00:00+00 | 00:06:09.19385 tcbs-build-matview | 2014-06-08 10:23:07.009557+00 | 2014-06-08 10:00:00+00 | 00:00:31.00978 tcbs-matview | 2014-06-07 17:43:44.773219+00 | 2014-06-07 10:00:00+00 | 00:05:05.78041
Note that interestingly, all those jobs run since midnight seem to say they ran for the new 5am timing from bug 1010609 while those ran before midnight have the old 10am time. Also, the /crontabber-state and /api/CrontabberState pages say that the runs for tomorrow are all scheduled for the new 5am time.
Selena said on IRC that it probably was the push of the changed config that created the issue, but bug 1024172 says the push was done at 23:15 UTC (and the comment with that log is from 16:26, 11 minutes later): [2014-06-11 23:15:12] Finished commit_www But looking at the crontabber log, it looks like the issues did not start there, they only started at midnight UTC: breakpad=> select app_name,log_time, success, duration from crontabber_log where log_time > '2014-06-11 00:00:00' and app_name like '%matview' order by log_time desc; app_name | log_time | success | duration -----------------------------------------+-------------------------------+-------------------------------+---------------- correlations-core-matview | 2014-06-12 18:34:04.46661+00 | 2014-06-12 07:30:00+00 | 00:25:12.4421 correlations-module-matview | 2014-06-12 17:40:36.632819+00 | 2014-06-12 08:00:00+00 | 03:45:22.14739 correlations-addon-matview | 2014-06-12 13:38:39.205578+00 | 2014-06-12 07:00:00+00 | 02:38:28.14928 adu-matview | 2014-06-12 10:35:20.451389+00 | 2014-06-12 05:00:00+00 | 00:00:00.74012 nightly-builds-matview | 2014-06-12 10:25:49.191775+00 | 2014-06-12 05:00:00+00 | 00:00:03.53962 tcbs-build-matview | 2014-06-12 10:22:26.090643+00 | 2014-06-12 05:00:00+00 | 00:00:36.58112 crashes-by-user-build-matview | 2014-06-12 10:21:26.238424+00 | 2014-06-12 05:00:00+00 | 00:00:44.84509 build-adu-matview | 2014-06-12 10:20:41.18487+00 | 2014-06-12 05:00:00+00 | 00:00:01.27609 signatures-matview | 2014-06-12 10:20:39.729457+00 | 2014-06-12 05:00:00+00 | 00:11:39.44403 signature-summary-products-matview | 2014-06-12 10:09:00.021468+00 | 2014-06-12 05:00:00+00 | 00:01:59.69654 signature-summary-installations-matview | 2014-06-12 10:03:51.180705+00 | 2014-06-12 05:00:00+00 | 00:02:15.01764 product-versions-matview | 2014-06-12 10:01:35.703209+00 | 2014-06-12 10:01:35.312672+00 | 00:00:00.3807 adu-matview | 2014-06-12 10:01:35.148604+00 | | 00:00:00.03901 raw-update-channel-matview | 2014-06-12 10:01:30.561645+00 | 2014-06-12 05:00:00+00 | 00:00:00.11114 android-devices-matview | 2014-06-12 10:01:30.284566+00 | 2014-06-12 05:00:00+00 | 00:00:54.71838 graphics-device-matview | 2014-06-12 10:00:35.248821+00 | 2014-06-12 05:00:00+00 | 00:00:28.33271 explosiveness-matview | 2014-06-12 00:40:24.667944+00 | 2014-06-11 05:00:00+00 | 00:01:15.3627 exploitability-matview | 2014-06-12 00:38:56.210562+00 | 2014-06-11 05:00:00+00 | 00:00:10.39105 crashes-by-user-matview | 2014-06-12 00:38:45.559467+00 | 2014-06-11 05:00:00+00 | 00:00:07.12553 tcbs-matview | 2014-06-12 00:38:32.627713+00 | 2014-06-11 05:00:00+00 | 00:06:08.47249 signature-summary-uptime-matview | 2014-06-12 00:21:53.954498+00 | 2014-06-11 05:00:00+00 | 00:04:08.72477 home-page-graph-matview | 2014-06-12 00:07:39.35401+00 | 2014-06-11 05:00:00+00 | 00:00:04.00544 nightly-builds-matview | 2014-06-11 23:49:17.93129+00 | 2014-06-11 10:00:00+00 | 00:00:03.55446 tcbs-build-matview | 2014-06-11 23:45:49.365399+00 | 2014-06-11 10:00:00+00 | 00:00:37.30777 crashes-by-user-build-matview | 2014-06-11 23:44:44.372092+00 | 2014-06-11 10:00:00+00 | 00:00:42.27451 build-adu-matview | 2014-06-11 23:44:01.892927+00 | 2014-06-11 10:00:00+00 | 00:00:01.59691 signatures-matview | 2014-06-11 23:44:00.114891+00 | 2014-06-11 10:00:00+00 | 00:11:16.14642 correlations-core-matview | 2014-06-11 23:32:43.79662+00 | 2014-06-11 11:30:00+00 | 00:29:33.33036 signature-summary-products-matview | 2014-06-11 23:03:10.264574+00 | 2014-06-11 10:00:00+00 | 00:02:33.93785 correlations-addon-matview | 2014-06-11 22:36:46.355177+00 | 2014-06-11 11:00:00+00 | 02:36:44.61951 signature-summary-installations-matview | 2014-06-11 20:00:01.287253+00 | 2014-06-11 10:00:00+00 | 00:02:20.57461 product-versions-matview | 2014-06-11 19:57:40.501262+00 | 2014-06-11 19:57:39.968419+00 | 00:00:00.52633 adu-matview | 2014-06-11 19:57:39.806435+00 | 2014-06-11 10:00:00+00 | 00:00:01.42956 correlations-module-matview | 2014-06-11 19:26:36.442943+00 | 2014-06-11 12:00:00+00 | 03:53:28.37117 raw-update-channel-matview | 2014-06-11 10:01:32.215334+00 | 2014-06-11 10:00:00+00 | 00:00:00.09673 android-devices-matview | 2014-06-11 10:01:31.944959+00 | 2014-06-11 10:00:00+00 | 00:00:54.79721 graphics-device-matview | 2014-06-11 10:00:35.645732+00 | 2014-06-11 10:00:00+00 | 00:00:28.34705 (37 rows) Interestingly, up to midnight, all jobs said they completed for the 10am job, afterwards they say they completed for the 5am one. So it looks like the changed config somehow magically started taking effect exactly at midnight.
Backfilling for 6/11/04. breakpad=# select backfill_matviews('2014-06-11', '2014-06-11', False); -- 'False' is to avoid redoing all the reports clean, which went fine yesterday.
Assignee: nobody → sdeckelmann
Status: NEW → ASSIGNED
So, this is from IRC and I think sheds some light on this: <KaiRo> selenamarie: ah, ok. so we just happened to start the next run after the config change around midnight. and that confused the hell out of crontabber as the next run would be scheduled less than 1d (24h) after the previous one now <KaiRo> do I get that right? <selenamarie> KaiRo: yes, i think so. there's some nuance in there - the "confusion" was really ours. the scheduler did what we told it to do and at least one of the dependencies for "make this thing run TODAY" was met <selenamarie> KaiRo: the underlying issue is that we can't think of jobs as "queued up" -- they are scheduled for Day X, and if we change a config, it resets everything sort of like groundhog day (the movie) <KaiRo> hmm, ok <selenamarie> also complicating things is that crontabber runs on UTC, but we reconfigure things in PT <selenamarie> so... the day changeover isn't always apparent to us <peterbe> selenamarie: yeah, messing with the backfill is scary business. That's why `--job=my-backfill-app --force` gets ignored. <KaiRo> selenamarie: ok, so we understand why this happened. that's a good step. <peterbe> selenamarie: Now we didn't use PST time. <peterbe> selenamarie: What lonnen did was he changed from 10amUTC to 05amUTC. <KaiRo> selenamarie: now, when tcbs-matview (and others) run tomorrow, will that fill in the data for 2014-06-11 as well or only the data for -12? <selenamarie> peterbe: i just mean that we did a push of a config change, near midnight UTC, without considering how it would affect ongoing backfill <peterbe> KaiRo: it will *not* fill in 11. <selenamarie> peterbe: i shall refer to this as "the Gremlins problem" from here on out <peterbe> 11th* <KaiRo> peterbe: I think selenamarie meant those people doing the prod pushes are living in Pacific Time <peterbe> ah <KaiRo> peterbe: ok, so we need to backfill everything for the 11th? <peterbe> KaiRo: not everything. Only those matviews that didn't get 11th. Right selenamarie ? <KaiRo> if we do have that granular a backfill, I guess :) <selenamarie> peterbe: yeah <selenamarie> we don't have that granular a backfill :) <selenamarie> but i can at least prevent it from redoing all the reports clean stuff
Backfill complete for 6/11.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1037959
You need to log in before you can comment on or make changes to this bug.