Closed Bug 1275146 Opened 8 years ago Closed 8 years ago

Crash aggregator uses the wrong thing for activity date

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: mdoglio)

Attachments

(1 file)

The crash aggregator is using the ping timestamp and calling that the "activity date": "activity_date is the date pings were generated on the client for a particular aggregate."

This is incorrect/misleading/bad. The activity date is the day when the activity being recorded actually took place. For main pings, this should be payload.info.subsessionStartDate. For crash pings, this should be payload.crashDate.
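
A minimal sketch of that selection rule, for illustration only. It assumes each ping is available as a Python dict with a top-level `type` field and that both date fields begin with `YYYY-MM-DD`; the helper name `activity_date` and the sample ping are hypothetical and are not taken from the aggregator's code.

```python
from datetime import date


def activity_date(ping: dict) -> date:
    """Return the day the recorded activity took place,
    rather than the day the ping was generated."""
    payload = ping["payload"]
    if ping["type"] == "main":
        raw = payload["info"]["subsessionStartDate"]
    elif ping["type"] == "crash":
        raw = payload["crashDate"]
    else:
        raise ValueError(f"unsupported ping type: {ping['type']}")
    # Keep only the leading YYYY-MM-DD part; any time or offset suffix is ignored.
    return date.fromisoformat(raw[:10])


# Hypothetical main ping: the subsession started on May 19th,
# but the ping itself was generated shortly after midnight on May 20th.
ping = {
    "type": "main",
    "creationDate": "2016-05-20T00:03:12.000Z",
    "payload": {"info": {"subsessionStartDate": "2016-05-19T00:00:00.0+02:00"}},
}
print(activity_date(ping))  # 2016-05-19
```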

This is important because people who leave their browser running over midnight will record the activity for date N shortly after midnight on date N+1. This makes dividing rates very error-prone and is leading to some of the misalignment of crash rates early in each build.

I clearly think we need to fix this. I'm not sure what to do about correcting/backfilling the old data, though.
Component: Metrics: Product Metrics → Metrics: Pipeline
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> I clearly think we need to fix this. I'm not sure what to do about
> correcting/backfilling the old data, though.

We might avoid backfilling if it can be shown that the lag, in number of days, between the real activity date and the date the pings were created is 0 for the vast majority of pings.
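
One way to run that check, sketched under the assumption that the activity date and the ping creation date have already been extracted as `datetime.date` pairs. The function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def lag_histogram(pairs):
    """Count pings by lag, in days, between activity date and ping creation date."""
    return Counter((created - active).days for active, created in pairs)


def share_with_zero_lag(pairs):
    """Fraction of pings whose creation date equals their activity date."""
    hist = lag_histogram(pairs)
    total = sum(hist.values())
    return hist[0] / total if total else 0.0


# Hypothetical (activity_date, creation_date) pairs.
sample = [
    (date(2016, 5, 19), date(2016, 5, 19)),
    (date(2016, 5, 19), date(2016, 5, 20)),
    (date(2016, 5, 18), date(2016, 5, 18)),
    (date(2016, 5, 17), date(2016, 5, 19)),
]
print(share_with_zero_lag(sample))  # 0.5
```

If the zero-lag share is close to 1 for real data, a backfill would change very little.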
Points: --- → 2
I'm working on a patch to fix the aggregator. It should be ready soon.
P1 as Mauro is actively working on it
Priority: -- → P1
I've tested this and shown that we are misattributing 35+% of main-ping activity, but only 10% of crash ping activity: https://gist.github.com/bsmedberg/0f84d540eca797bfe3edde432006317c

Mauro, is the new (scala-based) aggregator still using this repo: https://github.com/mozilla/moz-crash-rate-aggregates at all, or is it only committed within https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/views/CrashAggregateView.scala ?

If the old repo is obsolete, please close it out and move the docs.
The old repo (moz-crash-rate-aggregates) is obsolete. I'll close it now. AFAIK the documentation on telemetry-batch-view is up to date. Are there references to the old repo that I need to update anywhere?
On second thought, the only thing that we are still using from the old repo is the watchdog job. I opened bug 1275346 to move it to the new repo. Also, I don't have rights to modify the settings on the old repo. Roberto, can you please add me to the admin list?
Attachment #8755968 - Flags: review?(rvitillo)
Status: NEW → ASSIGNED
Attachment #8755968 - Flags: review?(rvitillo) → review+
This is now on master. I tested the new code and it takes roughly half the time thanks to another patch (bug 1275025) that landed today as well. Processing one day of data requires about 24 CPU hours on a c3.4xlarge instance (~0.84cent/h).
Benjamin, do you want me to schedule a backfill? If so, how many days do you want it to cover?
Flags: needinfo?(benjamin)
In an IRC conversation, :bsmedberg proposed backfilling only pings from the beta channel for the last 60 days. This is a bit problematic because if we did that we would lose all the data from non-beta pings. Given the recent performance improvements to the telemetry-batch-view library, I think the best thing to do is to backfill all the channels. That should be doable in 2.5 days on a 20-CPU cluster. I'll start scheduling a backfill of the last month of data.
Flags: needinfo?(benjamin)
I see the crash rate on beta showing up differently now, since around May 19th: https://www.dropbox.com/s/j0i2xua3xc742af/Screenshot%202016-06-20%2015.35.28.PNG?dl=0

The spike after release is now gone, and we quickly reach a crash rate that is more stable, with higher usage hours (a larger denominator). Is that what we expected?
Flags: needinfo?(mdoglio)
Last week I updated the data with submission_date > May 19th, so it's normal that you see a change starting on that day. I have now updated the data starting from May 5th, so you should see that change happening 2 weeks earlier.
Given that the impact of this bug was to undercount usage hours right after release, the change you notice makes sense to me.
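
A rough illustration of why an undercounted denominator produces a spike. The crash and usage-hour totals are hypothetical, the rate is assumed to be expressed as crashes per 1000 usage hours, and the misattribution shares are the approximate figures measured earlier in this bug (35% of main-ping activity, 10% of crash-ping activity). It also assumes, for simplicity, that on the first day of a build all misattributed activity moves to a later day and none flows in from an earlier one.

```python
def crash_rate(crashes, usage_hours):
    """Crashes per 1000 usage hours (unit assumed for this sketch)."""
    return 1000.0 * crashes / usage_hours


# Hypothetical true totals for the first day of a build.
true_crashes, true_hours = 100, 10_000

# Approximate shares of activity attributed to the wrong day.
main_misattributed = 0.35   # usage hours come from main pings
crash_misattributed = 0.10  # crash counts come from crash pings

observed_hours = true_hours * (1 - main_misattributed)
observed_crashes = true_crashes * (1 - crash_misattributed)

print(crash_rate(true_crashes, true_hours))  # 10.0
print(round(crash_rate(observed_crashes, observed_hours), 2))
```

Because the denominator shrinks much more than the numerator, the observed first-day rate comes out well above the true one, which is consistent with the spike disappearing once the activity date is fixed.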
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mdoglio)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard