Closed
Bug 1275146
Opened 8 years ago
Closed 8 years ago
Crash aggregator uses the wrong thing for activity date
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: benjamin, Assigned: mdoglio)
Details
Attachments
(1 file)
The crash aggregator is using the ping timestamp and calling that the "activity date": "activity_date is the date pings were generated on the client for a particular aggregate." This is incorrect/misleading/bad.

The activity date is the day when the activity being recorded actually took place. For main pings, this should be payload.info.subsessionStartDate. For crash pings, this should be payload/crashDate.

This is important because people who leave their browser running over midnight will record the activity for date N shortly after midnight on date N+1. This makes dividing rates very error-prone and is leading to some of the misalignment of crash rates early in each build.

I clearly think we need to fix this. I'm not sure what to do about correcting/backfilling the old data, though.
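The fix described above amounts to keying aggregates on a per-ping activity date rather than the submission timestamp. A minimal Python sketch of that selection, assuming simplified dict-shaped pings for illustration (the real aggregator is the Scala CrashAggregateView in telemetry-batch-view; note subsessionStartDate carries a full ISO timestamp that must be truncated to a date):

```python
from datetime import date

def activity_date(ping: dict) -> date:
    """Pick the date the recorded activity actually happened,
    not the date the ping was submitted."""
    if ping["type"] == "crash":
        # Crash pings carry the crash day directly (YYYY-MM-DD).
        raw = ping["payload"]["crashDate"]
    else:
        # Main pings: use the subsession start, an ISO timestamp;
        # keep only the YYYY-MM-DD prefix.
        raw = ping["payload"]["info"]["subsessionStartDate"][:10]
    year, month, day = map(int, raw.split("-"))
    return date(year, month, day)

# A browser left running over midnight submits on day N+1,
# but the activity belongs to day N:
main_ping = {
    "type": "main",
    "payload": {"info": {"subsessionStartDate": "2016-05-20T23:10:00.0+02:00"}},
}
print(activity_date(main_ping))  # 2016-05-20
```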
Updated•8 years ago
Component: Metrics: Product Metrics → Metrics: Pipeline
Comment 1•8 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #0)
> I clearly think we need to fix this. I'm not sure what to do about
> correcting/backfilling the old data, though.

We might avoid backfilling if it can be shown that the lag, in number of days, between the real activity date and the date the pings were created is 0 for the vast majority of pings.
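The lag check suggested here can be sketched as a small frequency count. The function and the dict-shaped pings below are hypothetical, for illustration only:

```python
from collections import Counter
from datetime import date

def lag_distribution(pings):
    """Fraction of pings at each lag, where lag is the number of
    days between the real activity date and the submission date."""
    counts = Counter()
    for p in pings:
        counts[(p["submission_date"] - p["activity_date"]).days] += 1
    total = sum(counts.values())
    return {lag: n / total for lag, n in sorted(counts.items())}

pings = [
    {"activity_date": date(2016, 5, 20), "submission_date": date(2016, 5, 20)},
    {"activity_date": date(2016, 5, 20), "submission_date": date(2016, 5, 21)},
    {"activity_date": date(2016, 5, 21), "submission_date": date(2016, 5, 21)},
    {"activity_date": date(2016, 5, 21), "submission_date": date(2016, 5, 21)},
]
print(lag_distribution(pings))  # {0: 0.75, 1: 0.25}
```

If the mass at lag 0 were close to 1, a backfill could be skipped; comment 4 below shows it is not.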
Updated•8 years ago
Points: --- → 2
Assignee
Comment 2•8 years ago
I'm working on a patch to fix the aggregator. It should be ready soon.
Reporter
Comment 4•8 years ago
I've tested this and shown that we are misattributing 35+% of main-ping activity, but only 10% of crash-ping activity: https://gist.github.com/bsmedberg/0f84d540eca797bfe3edde432006317c

Mauro, is the new (Scala-based) aggregator still using this repo, https://github.com/mozilla/moz-crash-rate-aggregates, at all, or does it live only in https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/views/CrashAggregateView.scala ? If the old repo is obsolete, please close it out and move the docs.
Assignee
Comment 5•8 years ago
The old repo (moz-crash-rate-aggregates) is obsolete; I'll close it now. AFAIK the documentation in telemetry-batch-view is up to date. Are there references to the old repo anywhere that I need to update?
Assignee
Comment 6•8 years ago
On second thought, the only thing we are still using from the old repo is the watchdog job. I opened bug 1275346 to move it to the new repo. Also, I don't have rights to modify the settings on the old repo. Roberto, can you please add me to the admin list?
Assignee
Comment 7•8 years ago
Attachment #8755968 - Flags: review?(rvitillo)
Assignee
Updated•8 years ago
Status: NEW → ASSIGNED
Updated•8 years ago
Attachment #8755968 - Flags: review?(rvitillo) → review+
Assignee
Comment 8•8 years ago
This is now on master. I tested the new code and it takes roughly half the time, thanks to another patch (bug 1275025) that landed today as well. Processing one day of data requires about 24 cpu-hours on a c3.4xlarge instance (~0.84cent/h). Benjamin, do you want me to schedule a backfill? If so, how many days should it cover?
Flags: needinfo?(benjamin)
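A back-of-envelope reading of the figures above (assumptions not stated in the comment: the c3.4xlarge's 16 vCPUs, and "~0.84cent/h" taken as roughly $0.84 per instance-hour):

```python
CPU_HOURS_PER_DAY = 24           # from the comment above
VCPUS_PER_INSTANCE = 16          # c3.4xlarge (assumption)
PRICE_PER_INSTANCE_HOUR = 0.84   # USD, reading "~0.84cent/h" as $/h (assumption)

# 24 cpu-hours spread over 16 vCPUs is 1.5 instance-hours per day of data.
instance_hours_per_day = CPU_HOURS_PER_DAY / VCPUS_PER_INSTANCE
cost_per_day = instance_hours_per_day * PRICE_PER_INSTANCE_HOUR
print(f"~{instance_hours_per_day:.1f} instance-hours, "
      f"${cost_per_day:.2f} per day of data")

days_to_backfill = 30
print(f"~${days_to_backfill * cost_per_day:.2f} "
      f"for a {days_to_backfill}-day backfill")
```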
Assignee
Comment 9•8 years ago
In an IRC conversation with :bsmedberg, he proposed backfilling only pings from the beta channel for the last 60 days. This is a bit problematic because if we did that we would lose all the data from non-beta pings. Given the recent performance improvements to the telemetry-batch-view library, I think the best thing to do is to backfill all the channels. That should be doable in 2.5 days on a 20-cpu cluster. I'll start scheduling a backfill of the last month of data.
Flags: needinfo?(benjamin)
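Cross-checking the 2.5-day figure against comment 8's 24 cpu-hours per day of data (a sketch, treating "the last month" as 30 days; the quoted 2.5 days would then include scheduling and I/O overhead on top of the pure compute):

```python
CPU_HOURS_PER_DAY = 24   # per day of data, from comment 8
CLUSTER_CPUS = 20        # from the comment above
DAYS_OF_DATA = 30        # "last month" treated as 30 days (assumption)

total_cpu_hours = DAYS_OF_DATA * CPU_HOURS_PER_DAY   # 720 cpu-hours
wall_clock_hours = total_cpu_hours / CLUSTER_CPUS    # 36.0 hours
print(f"{wall_clock_hours / 24:.1f} days of pure compute")  # 1.5 days
```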
Comment 10•8 years ago
I see the crash rate on beta showing up differently now, since around May 19th: https://www.dropbox.com/s/j0i2xua3xc742af/Screenshot%202016-06-20%2015.35.28.PNG?dl=0

The spike after release is now gone, and we quickly reach a crash rate that is more stable, with higher usage hours (a larger denominator). Is that what we expected?
Flags: needinfo?(mdoglio)
Assignee
Comment 11•8 years ago
Last week I updated the data with submission_date > May 19th, so it's normal that you see a change starting on that day. I have now updated the data starting from May 5th, so you should see that change happening two weeks earlier. Given that the impact of this bug was to undercount usage hours right after release, the change you notice makes sense to me.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mdoglio)
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard