Crash aggregator uses the wrong thing for activity date

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
P1
normal
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: Benjamin Smedberg, Assigned: mdoglio)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
The crash aggregator is using the ping timestamp and calling that the "activity date": "activity_date is the date pings were generated on the client for a particular aggregate."

This is incorrect/misleading/bad. The activity date is the day when the activity being recorded actually took place. For main pings, this should be payload.info.subsessionStartDate. For crash pings, this should be payload.crashDate.
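To make the proposal concrete, here is a minimal sketch of picking the activity date per ping type. It is not the aggregator's actual code: the function name `activity_date` and the plain-dict ping layout are hypothetical, and it assumes the main-ping field is an ISO 8601 timestamp and the crash-ping field is a "YYYY-MM-DD" string, so only the leading date portion is used.

```python
from datetime import date


def activity_date(ping: dict) -> date:
    """Return the day the recorded activity took place (hypothetical helper).

    Assumes `ping` is a dict with a top-level "type" and a "payload",
    and that both date fields start with "YYYY-MM-DD".
    """
    payload = ping["payload"]
    if ping["type"] == "main":
        # Main pings: the day the subsession started.
        raw = payload["info"]["subsessionStartDate"]
    elif ping["type"] == "crash":
        # Crash pings: the day the crash happened.
        raw = payload["crashDate"]
    else:
        raise ValueError(f"unsupported ping type: {ping['type']!r}")
    # Keep only the date part; any time and UTC offset are ignored.
    return date.fromisoformat(raw[:10])
```

With this, a main ping generated shortly after midnight on date N+1 is still attributed to date N, as long as its subsession started on date N.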

This matters because people who leave their browser running over midnight will record the activity for date N shortly after midnight on date N+1. That makes computing rates by division very error-prone, and it is contributing to some of the misalignment of crash rates early in each build.
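A small illustration of the effect, using made-up numbers: the same three records are grouped once by activity date and once by ping date. The record layout, the function name `crashes_per_1000_hours`, and the "crashes per 1000 usage hours" unit are assumptions for the sketch, not taken from the aggregator.

```python
from collections import defaultdict


def crashes_per_1000_hours(records, date_key):
    """Group records by `date_key` and compute crashes per 1000 usage hours."""
    hours = defaultdict(float)
    crashes = defaultdict(int)
    for record in records:
        day = record[date_key]
        hours[day] += record["usage_hours"]
        crashes[day] += record["crashes"]
    # Days with no recorded usage hours are skipped to avoid dividing by zero.
    return {day: 1000 * crashes[day] / hours[day] for day in hours if hours[day] > 0}


# Illustrative data only: all activity happened on 2016-05-19, but one
# main ping was generated after midnight because the browser kept running.
records = [
    {"activity_date": "2016-05-19", "ping_date": "2016-05-19", "usage_hours": 2.0, "crashes": 0},
    {"activity_date": "2016-05-19", "ping_date": "2016-05-20", "usage_hours": 8.0, "crashes": 0},
    {"activity_date": "2016-05-19", "ping_date": "2016-05-19", "usage_hours": 0.0, "crashes": 1},
]

by_activity = crashes_per_1000_hours(records, "activity_date")
by_ping = crashes_per_1000_hours(records, "ping_date")
```

Grouping by ping date moves 8 of the 10 usage hours to the following day, so the denominator for 2016-05-19 shrinks while the crash count stays put, and the rate for that day comes out five times higher than when grouping by activity date.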

I clearly think we need to fix this. I'm not sure what to do about correcting/backfilling the old data, though.

Updated

2 years ago
Component: Metrics: Product Metrics → Metrics: Pipeline
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> I clearly think we need to fix this. I'm not sure what to do about
> correcting/backfilling the old data, though.

We might avoid backfilling if it can be shown that the lag, in number of days, between the real activity date and the date the pings were created is 0 for the vast majority of pings.
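One way to check this could look like the sketch below: count the lag in days between each ping's activity date and its creation date, then look at the share with lag 0. The function name `lag_distribution` and the input shape (pairs of `datetime.date` values) are assumptions, and the sample pairs are invented for illustration.

```python
from collections import Counter
from datetime import date


def lag_distribution(pairs):
    """Return {lag_in_days: fraction_of_pings} for (activity, created) date pairs."""
    counts = Counter((created - activity).days for activity, created in pairs)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {lag: n / total for lag, n in sorted(counts.items())}


# Invented sample: three pings created on the activity date, one a day later.
sample = [
    (date(2016, 5, 19), date(2016, 5, 19)),
    (date(2016, 5, 19), date(2016, 5, 19)),
    (date(2016, 5, 19), date(2016, 5, 20)),
    (date(2016, 5, 20), date(2016, 5, 20)),
]
distribution = lag_distribution(sample)
```

If the fraction at lag 0 were close to 1 on real data, backfilling would matter little; a sizeable share at lag 1 or more would argue for it.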
Points: --- → 2
(Assignee)

Comment 2

2 years ago
I'm working on a patch to fix the aggregator. It should be ready soon.

Comment 3

2 years ago
P1 as Mauro is actively working on it
Priority: -- → P1
(Reporter)

Comment 4

2 years ago
I've tested this and shown that we are misattributing 35+% of main-ping activity, but only 10% of crash ping activity: https://gist.github.com/bsmedberg/0f84d540eca797bfe3edde432006317c

Mauro, is the new (scala-based) aggregator still using this repo: https://github.com/mozilla/moz-crash-rate-aggregates at all, or is it only committed within https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/views/CrashAggregateView.scala ?

If the old repo is obsolete, please close it out and move the docs.
(Assignee)

Comment 5

2 years ago
The old repo (moz-crash-rate-aggregates) is obsolete. I'll close it now. AFAIK the documentation on telemetry-batch-view is up to date. Are there references to the old repo that I need to update anywhere?
(Assignee)

Comment 6

2 years ago
On second thought, the only thing that we are still using from the old repo is the watchdog job. I opened bug 1275346 to move it to the new repo. Also, I don't have rights to modify the settings on the old repo. Roberto, can you please add me to the admin list?
(Assignee)

Comment 7

2 years ago
Created attachment 8755968
PR 69 on telemetry-batch-view
Attachment #8755968 - Flags: review?(rvitillo)
(Assignee)

Updated

2 years ago
Status: NEW → ASSIGNED
Attachment #8755968 - Flags: review?(rvitillo) → review+
(Assignee)

Comment 8

2 years ago
This is now on master. I tested the new code and it takes roughly half the time thanks to another patch (bug 1275025) that landed today as well. Processing one day of data requires about 24 cpu hours on a c3.4xlarge instance (~$0.84/h).
Benjamin, do you want me to schedule a backfill? If so, how many days do you want it to cover?
Flags: needinfo?(benjamin)
(Assignee)

Comment 9

2 years ago
In an IRC conversation, :bsmedberg proposed backfilling only pings from the beta channel for the last 60 days. This is a bit problematic because if we did that we would lose all the data from non-beta pings. Given the recent performance improvements to the telemetry-batch-view library, I think the best thing to do is to backfill all the channels. That should be doable in 2.5 days on a 20 cpu cluster. I'll start scheduling a backfill of the last month of data.
Flags: needinfo?(benjamin)
I see the crash rate on beta showing up differently now, since around May 19th: https://www.dropbox.com/s/j0i2xua3xc742af/Screenshot%202016-06-20%2015.35.28.PNG?dl=0

The spike after release is now gone, and the crash rate settles quickly, becoming steadier as usage hours (the denominator) grow. Is that what we expected?
Flags: needinfo?(mdoglio)
(Assignee)

Comment 11

a year ago
Last week I updated the data with submission_date > May 19th so it's normal that you see a change starting on that day. I have now updated the data starting from May 5th so you should see that change happening 2 weeks earlier.
Given that the impact of this bug was to undercount usage hours right after release, the change you notice makes sense to me.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Flags: needinfo?(mdoglio)
Resolution: --- → FIXED