Closed
Bug 1275146
Opened 8 years ago
Closed 8 years ago
Crash aggregator uses the wrong thing for activity date
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: benjamin, Assigned: mdoglio)
Details
Attachments
(1 file)
The crash aggregator is using the ping timestamp and calling that the "activity date": "activity_date is the date pings were generated on the client for a particular aggregate." This is incorrect/misleading/bad.

The activity date is the day when the activity being recorded actually took place. For main pings, this should be payload.info.subsessionStartDate. For crash pings, this should be payload/crashDate.

This is important because people who leave their browser running over midnight will record the activity for date N shortly after midnight on date N+1. This makes dividing rates very error-prone and is leading to some of the misalignment of crash rates early in each build.

I clearly think we need to fix this. I'm not sure what to do about correcting/backfilling the old data, though.
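The fix described above amounts to keying aggregates on a per-ping activity date rather than the submission timestamp. A minimal Python sketch of that selection, assuming simplified dict-shaped pings for illustration (the real aggregator is the Scala CrashAggregateView in telemetry-batch-view; note subsessionStartDate carries a full ISO timestamp that must be truncated to a date):

```python
from datetime import date

def activity_date(ping: dict) -> date:
    """Pick the date the recorded activity actually happened,
    not the date the ping was submitted."""
    if ping["type"] == "crash":
        # Crash pings carry the crash day directly (YYYY-MM-DD).
        raw = ping["payload"]["crashDate"]
    else:
        # Main pings: use the subsession start, an ISO timestamp;
        # keep only the YYYY-MM-DD prefix.
        raw = ping["payload"]["info"]["subsessionStartDate"][:10]
    year, month, day = map(int, raw.split("-"))
    return date(year, month, day)

# A browser left running over midnight submits on day N+1,
# but the activity belongs to day N:
main_ping = {
    "type": "main",
    "payload": {"info": {"subsessionStartDate": "2016-05-20T23:10:00.0+02:00"}},
}
print(activity_date(main_ping))  # 2016-05-20
```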
Updated•8 years ago
Component: Metrics: Product Metrics → Metrics: Pipeline
Comment 1•8 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #0)
> I clearly think we need to fix this. I'm not sure what to do about
> correcting/backfilling the old data, though.

We might avoid backfilling if it can be shown that the lag, in number of days, between the real activity date and the date the pings were created is 0 for the vast majority of pings.
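The lag check suggested here can be sketched as a small frequency count. The function and the dict-shaped pings below are hypothetical, for illustration only:

```python
from collections import Counter
from datetime import date

def lag_distribution(pings):
    """Fraction of pings at each lag, where lag is the number of
    days between the real activity date and the submission date."""
    counts = Counter()
    for p in pings:
        counts[(p["submission_date"] - p["activity_date"]).days] += 1
    total = sum(counts.values())
    return {lag: n / total for lag, n in sorted(counts.items())}

pings = [
    {"activity_date": date(2016, 5, 20), "submission_date": date(2016, 5, 20)},
    {"activity_date": date(2016, 5, 20), "submission_date": date(2016, 5, 21)},
    {"activity_date": date(2016, 5, 21), "submission_date": date(2016, 5, 21)},
    {"activity_date": date(2016, 5, 21), "submission_date": date(2016, 5, 21)},
]
print(lag_distribution(pings))  # {0: 0.75, 1: 0.25}
```

If the mass at lag 0 were close to 1, a backfill could be skipped; comment 4 below shows it is not.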
Updated•8 years ago
Points: --- → 2
Assignee
Comment 2•8 years ago
I'm working on a patch to fix the aggregator. It should be ready soon.
Reporter
Comment 4•8 years ago
I've tested this and shown that we are misattributing 35+% of main-ping activity, but only 10% of crash-ping activity: https://gist.github.com/bsmedberg/0f84d540eca797bfe3edde432006317c

Mauro, is the new (Scala-based) aggregator still using this repo, https://github.com/mozilla/moz-crash-rate-aggregates, at all, or does it live only in https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/views/CrashAggregateView.scala ? If the old repo is obsolete, please close it out and move the docs.
Assignee
Comment 5•8 years ago
The old repo (moz-crash-rate-aggregates) is obsolete; I'll close it now. AFAIK the documentation in telemetry-batch-view is up to date. Are there references to the old repo anywhere that I need to update?
Assignee
Comment 6•8 years ago
On second thought, the only thing we are still using from the old repo is the watchdog job. I opened bug 1275346 to move it to the new repo. Also, I don't have rights to modify the settings on the old repo. Roberto, can you please add me to the admin list?
Assignee
Comment 7•8 years ago
Attachment #8755968 - Flags: review?(rvitillo)
Assignee
Updated•8 years ago
Status: NEW → ASSIGNED
Updated•8 years ago
Attachment #8755968 - Flags: review?(rvitillo) → review+
Assignee
Comment 8•8 years ago
This is now on master. I tested the new code and it takes roughly half the time, thanks to another patch (bug 1275025) that landed today as well. Processing one day of data requires about 24 cpu-hours on a c3.4xlarge instance (~0.84cent/h). Benjamin, do you want me to schedule a backfill? If so, how many days should it cover?
Flags: needinfo?(benjamin)
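A back-of-envelope reading of the figures above (assumptions not stated in the comment: the c3.4xlarge's 16 vCPUs, and "~0.84cent/h" taken as roughly $0.84 per instance-hour):

```python
CPU_HOURS_PER_DAY = 24           # from the comment above
VCPUS_PER_INSTANCE = 16          # c3.4xlarge (assumption)
PRICE_PER_INSTANCE_HOUR = 0.84   # USD, reading "~0.84cent/h" as $/h (assumption)

# 24 cpu-hours spread over 16 vCPUs is 1.5 instance-hours per day of data.
instance_hours_per_day = CPU_HOURS_PER_DAY / VCPUS_PER_INSTANCE
cost_per_day = instance_hours_per_day * PRICE_PER_INSTANCE_HOUR
print(f"~{instance_hours_per_day:.1f} instance-hours, "
      f"${cost_per_day:.2f} per day of data")

days_to_backfill = 30
print(f"~${days_to_backfill * cost_per_day:.2f} "
      f"for a {days_to_backfill}-day backfill")
```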
Assignee
Comment 9•8 years ago
In an IRC conversation with :bsmedberg, he proposed backfilling only pings from the beta channel for the last 60 days. This is a bit problematic because if we did that we would lose all the data from non-beta pings. Given the recent performance improvements to the telemetry-batch-view library, I think the best thing to do is to backfill all the channels. That should be doable in 2.5 days on a 20-cpu cluster. I'll start scheduling a backfill of the last month of data.
Flags: needinfo?(benjamin)
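Cross-checking the 2.5-day figure against comment 8's 24 cpu-hours per day of data (a sketch, treating "the last month" as 30 days; the quoted 2.5 days would then include scheduling and I/O overhead on top of the pure compute):

```python
CPU_HOURS_PER_DAY = 24   # per day of data, from comment 8
CLUSTER_CPUS = 20        # from the comment above
DAYS_OF_DATA = 30        # "last month" treated as 30 days (assumption)

total_cpu_hours = DAYS_OF_DATA * CPU_HOURS_PER_DAY   # 720 cpu-hours
wall_clock_hours = total_cpu_hours / CLUSTER_CPUS    # 36.0 hours
print(f"{wall_clock_hours / 24:.1f} days of pure compute")  # 1.5 days
```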
Comment 10•8 years ago
I see the crash rate on beta showing up differently now, since around May 19th: https://www.dropbox.com/s/j0i2xua3xc742af/Screenshot%202016-06-20%2015.35.28.PNG?dl=0

The spike after release is now gone, and we quickly reach a crash rate that is more stable, with higher usage hours (a larger denominator). Is that what we expected?
Flags: needinfo?(mdoglio)
Assignee
Comment 11•8 years ago
Last week I updated the data with submission_date > May 19th, so it's normal that you see a change starting on that day. I have now updated the data starting from May 5th, so you should see that change happening two weeks earlier. Given that the impact of this bug was to undercount usage hours right after release, the change you notice makes sense to me.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mdoglio)
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard