Closed Bug 1010257 Opened 10 years ago Closed 7 years ago

product_crash_ratio and /daily data coming very late every now and then since 2014-05-13

Categories

(Socorro :: Database, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: kairo, Unassigned)

Details

My daily jobs pull data out of Socorro with a query similar to this:

SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('29.0.1','29.0','30.0b','30.0b3','31.0a2','32.0a1') AND adu_date BETWEEN '2014-05-10' AND '2014-05-14' ORDER BY adu_date DESC, version DESC;

This doesn't include data for yesterday, 20154-05-13, even though it should.

https://crash-stats.mozilla.com/daily?p=Firefox also misses the data for this day (might pull from the same table or set of tables in the background).
Here's the views and tables: 

breakpad=# \d+ product_crash_ratio
                   View "public.product_crash_ratio"
       Column       |     Type     | Modifiers | Storage  | Description
--------------------+--------------+-----------+----------+-------------
 product_version_id | integer      |           | plain    |
 product_name       | citext       |           | extended |
 version_string     | citext       |           | extended |
 adu_date           | date         |           | plain    |
 crashes            | bigint       |           | plain    |
 adu_count          | bigint       |           | plain    |
 throttle           | numeric(5,2) |           | main     |
 adjusted_crashes   | integer      |           | plain    |
 crash_ratio        | numeric      |           | main     |
View definition:
 SELECT crcounts.product_version_id, product_versions.product_name,
    product_versions.version_string, crcounts.report_date AS adu_date,
    sum(crcounts.report_count)::bigint AS crashes,
    sum(crcounts.adu) AS adu_count,
    product_release_channels.throttle::numeric(5,2) AS throttle,
    sum(crcounts.report_count::numeric / product_release_channels.throttle)::integer AS adjusted_crashes,
    crash_hadu(sum(crcounts.report_count)::bigint, sum(crcounts.adu), product_release_channels.throttle) AS crash_ratio
   FROM crashes_by_user_rollup crcounts
   JOIN product_versions ON crcounts.product_version_id = product_versions.product_version_id
   JOIN product_release_channels ON product_versions.product_name = product_release_channels.product_name AND product_versions.build_type = product_release_channels.release_channel
  GROUP BY crcounts.product_version_id, product_versions.product_name, product_versions.version_string, crcounts.report_date, product_release_channels.throttle;

breakpad=# \d+ crashes_by_user_rollup
               View "public.crashes_by_user_rollup"
       Column       |  Type   | Modifiers | Storage  | Description
--------------------+---------+-----------+----------+-------------
 product_version_id | integer |           | plain    |
 report_date        | date    |           | plain    |
 os_short_name      | citext  |           | extended |
 report_count       | bigint  |           | plain    |
 adu                | integer |           | plain    |
View definition:
 SELECT crashes_by_user.product_version_id, crashes_by_user.report_date,
    crashes_by_user.os_short_name,
    sum(crashes_by_user.report_count) AS report_count,
    min(crashes_by_user.adu) AS adu
   FROM crashes_by_user
   JOIN crash_types USING (crash_type_id)
  WHERE crash_types.include_agg
  GROUP BY crashes_by_user.product_version_id, crashes_by_user.report_date, crashes_by_user.os_short_name;

breakpad=# \d+ crashes_by_user
                          Table "public.crashes_by_user"
       Column       |  Type   | Modifiers | Storage  | Stats target | Description
--------------------+---------+-----------+----------+--------------+-------------
 product_version_id | integer | not null  | plain    |              |
 os_short_name      | citext  | not null  | extended |              |
 crash_type_id      | integer | not null  | plain    |              |
 report_date        | date    | not null  | plain    |              |
 report_count       | integer | not null  | plain    |              |
 adu                | integer | not null  | plain    |              |
Indexes:
    "crashes_by_user_key" PRIMARY KEY, btree (product_version_id, report_date, os_short_name, crash_type_id)
Foreign-key constraints:
    "crashes_by_user_crash_type_id_fkey" FOREIGN KEY (crash_type_id) REFERENCES crash_types(crash_type_id)
Has OIDs: no
Note that this was updated fine every day so far, but seems to have not been today for some reason.
My first guess was that the matview update process died for some reason before finishing this but Selena said on IRC that she saw no errors.
Investigated this - the way we're doing dependencies is delaying this unnecessarily (every nightly job that depends on reports_clean is depending on the very latest hour of reports_clean to be up-to-date, when it only really matters if reports_clean is up-to-date as of yesterday UTC).

Also, I notice that crontabber is running a job due at 11am (CorrelationsAddonCronApp) before a job due at 10am (CrashesByUserCronApp) - when I added the Correlations*CronApp jobs I purposely set the time higher because I assumed that crontabber would finish up all the other matviews first, since the correlations ones can be quite slow:

  socorro.cron.jobs.matviews.CorrelationsAddonCronApp|1d|11:00
  socorro.cron.jobs.matviews.CrashesByUserCronApp|1d|10:00
Flags: needinfo?(peterbe)
(In reply to Robert Helmer [:rhelmer] from comment #3)
> Investigated this - the way we're doing dependencies is delaying this
> unnecessarily (every nightly job that depends on reports_clean is depending
> on the very latest hour of reports_clean to be up-to-date, when it only
> really matters if reports_clean is up-to-date as of yesterday UTC).
> 

I filed bug 1010362 to improve this situation.
I see the data for yesterday now.
1) Is that because this ran to completion by itself?
2) Do I need to expect getting this so very late every day until bug 1010362 is fixed?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #5)
> I see the data for yesterday now.
> 1) Is that because this ran to completion by itself?

Yes

> 2) Do I need to expect getting this so very late every day until bug 1010362
> is fixed?

There are two bugs here - that one, and crontabber isn't honoring the start time (or I misunderstand what it's for).

Fixing either of these will help - bug 1010362 I could get done pretty quickly I'll take it.
FWIW, today's data came in as early as usual and my jobs could pull it at the normal time it runs, which is pretty early.
(In reply to Robert Helmer [:rhelmer] from comment #3)
> Investigated this - the way we're doing dependencies is delaying this
> unnecessarily (every nightly job that depends on reports_clean is depending
> on the very latest hour of reports_clean to be up-to-date, when it only
> really matters if reports_clean is up-to-date as of yesterday UTC).
> 
> Also, I notice that crontabber is running a job due at 11am
> (CorrelationsAddonCronApp) before a job due at 10am (CrashesByUserCronApp) -
> when I added the Correlations*CronApp jobs I purposely set the time higher
> because I assumed that crontabber would finish up all the other matviews
> first, since the correlations ones can be quite slow:
> 
>   socorro.cron.jobs.matviews.CorrelationsAddonCronApp|1d|11:00
>   socorro.cron.jobs.matviews.CrashesByUserCronApp|1d|10:00

So this is sadly quite non-trivial. I think. 

I dug into crontabber and made it print the jobs instead of running them. So, first the code gets the list of jobs out from the configuration (the massive JOBS_LIST list). Also, I made it only list the ones we're interested in and you get::

 ReportsCleanCronApp
 CrashesByUserCronApp
 CorrelationsAddonCronApp
 CorrelationsCoreCronApp
 CorrelationsModuleCronApp


Then it calls the function _reorder_class_list() whose responsibility is to rearrange the order so that we run "parents" before we run "children". If it didn't do this, there's a risk that you try to run a child before a parent and if that happens the child will have to wait one whole cycle till it gets run. 

Here's the output of that function:

 ReportsCleanCronApp
 CorrelationsModuleCronApp
 CorrelationsAddonCronApp
 CorrelationsCoreCronApp
 CrashesByUserCronApp

It clearly demonstrates the problem you explained. 

The code is here: https://github.com/mozilla/crontabber/blob/master/crontabber/base.py#L24-L84
Lonnen and I wrote that. 
I don't remember exactly how it works but I can look into it. The thing is, the CrashesByUserCronApp depends on two other apps and CrashesBy*CronApp only depends on 1. Perhaps this whole thing can be solved by using an ordereddict for the sub-list of dependencies. 

So, for that I've filed https://github.com/mozilla/crontabber/issues/21
Flags: needinfo?(peterbe)
Come to think of it, this is only a problem the first time. The next day, the crashesperuser will kick in at 10 and the correlations will kick in at 11. I.e. when the heartbeat beats at, say, 10:05 it'll see that `reports-clean` passed and `adu` passed because `reports-clean` passed so the next jobs to consider are:
1. Correlations*CronApp
2. CrashesByUserCronApp
That's the wrong order but when it sees that the clock is 10:05 and that Correlations*CronApp isn't supposed to run until 11:00 it'll skip those and process CrashesByUserCronApp. 

Right?
Sometimes it feels like this bug is mocking me. Almost like any day I'm eager to see the data, it runs into the bug and comes really late - and on the weekend, when I don't care, it comes right away (i.e. jobs done in the right order).

And if we do bug 1010609, this just goes away, right? Could we just go there?
Summary: product_crash_ratio and /daily missing data for 2014-05-13 → product_crash_ratio and /daily data coming very late every now and then since 2014-05-13
Assignee: selenamarie → nobody
I'm not sure why this bug was assigned to me -- the details of it went in a non-Postgres directly. :)

:rhelmer, :peterbe is this something either of you can tackle?
FWIW, today is the third day in a row that we have run into this bug.
(In reply to Selena Deckelmann :selenamarie :selena from comment #11)
> I'm not sure why this bug was assigned to me -- the details of it went in a
> non-Postgres directly. :)
> 
> :rhelmer, :peterbe is this something either of you can tackle?

This one is a crontabber bug, I think I'd rather leave it to :peterbe :)

I have bug 1010362 assigned which should help the problem by making the dependencies more sane.
Oh also bug 1010609 (start matviews earlier in the morning) could be a quick-fix here, too.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #12)
> FWIW, today is the third day in a row that we have run into this bug.

And today is the 6th in a row. Looks like we run into this issue every day now. :(
The Postgres based daily report is gone and replaced with Crashes Per Day which is based on SuperSearch.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.