Closed Bug 1382336 Opened 7 years ago Closed 7 years ago

Backfill focus-event to when we released Android Focus

Categories

(Data Platform and Tools :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Unassigned)

References

Details

Attachments

(1 file)

      No description provided.
whd, we have a new schema in https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/schemas/telemetry/focus-event. Can you update the schemas in prod and backfill the data to 2017-06-01?
Flags: needinfo?(whd)
Summary: Backfill focus-event to when we release Android Focus → Backfill focus-event to when we released Android Focus
I've updated production, so newer data should be making it through. Backfilling this data will be expensive in terms of compute cost and engineering time, because we only keep 30 days of decode errors and thus this requires backfilling from landfill, which we haven't done in over a year and which is not automated. Judging from the change to both the JSON and Parquet schemas, I am assuming that this data was dropped from both Parquet and the main store.

The assumption was that we would catch decode errors within 30 days of their introduction. Perhaps we should update the errors stream retention to reflect the fact that we won't necessarily notice schema issues like this for smaller data sets.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #2)
> The assumption was that we would catch decode errors within 30 days of their
> introduction. Perhaps we should update the errors stream retention to
> reflect the fact that we won't necessarily notice schema issues like this
> for smaller data sets.

I'm not sure that's necessary. The end result of this should be that we require a notification email and automatically alert on validation errors above some threshold. Focus-event pings were being discarded at a rate of ~70%, which would surely surpass any threshold we decide on.
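(For illustration only, a threshold check like the one below is roughly what I have in mind. The notify hook, the example counts, and the 10% cutoff are made-up placeholders, not anything that exists in the pipeline today.)

```python
# Hypothetical sketch of a per-docType validation-error alert. This is not the
# production monitor; the notify hook and the example counts are invented.

ERROR_RATE_THRESHOLD = 0.10  # alert once more than 10% of pings fail validation

def check_doctype(doctype, accepted, rejected, notify):
    """Call notify() when the validation-error rate for doctype crosses the threshold."""
    total = accepted + rejected
    if total == 0:
        return
    error_rate = rejected / total
    if error_rate > ERROR_RATE_THRESHOLD:
        notify(
            subject=f"[pipeline] {doctype} validation errors at {error_rate:.0%}",
            body=f"{rejected} of {total} pings failed schema validation.",
        )

# focus-event was being discarded at ~70%, far beyond any reasonable cutoff.
check_doctype("focus-event", accepted=300, rejected=700,
              notify=lambda subject, body: print(subject, body, sep="\n"))
```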

whd: We didn't release until 20170620. If we can backfill back to that date, that will be satisfactory. We may want to jump on it ASAP so we don't lose any more data.
Flags: needinfo?(whd)
(In reply to Frank Bertsch [:frank] from comment #3)
> I'm not sure that's necessary. The end result of this should be that we
> require a notification email and automatically alert on validation errors
> above some threshold.

We should probably require adding a production monitor à la https://github.com/mozilla-services/puppet-config/blob/master/pipeline/modules/pipeline/templates/hindsight/analysis/moz_telemetry_doctype_monitor_main.cfg.erb for every new doc type.
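(Purely a sketch of the per-docType idea, not the actual template: something that stamps out one monitor config per docType. The docType list, config keys, and output path below are placeholders.)

```python
# Hypothetical generator that writes one monitor config per docType. The real
# monitors are Hindsight analysis plugins driven by the cfg.erb template above;
# the docType list, config keys, and output path here are placeholders.
from pathlib import Path

DOC_TYPES = ["main", "crash", "core", "focus-event"]  # illustrative list

TEMPLATE = """\
filename = "moz_telemetry_doctype_monitor.lua"
message_matcher = "Fields[docType] == '{doc_type}'"
ticker_interval = 60
"""

def write_monitor_configs(output_dir):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_type in DOC_TYPES:
        cfg = out / f"moz_telemetry_doctype_monitor_{doc_type.replace('-', '_')}.cfg"
        cfg.write_text(TEMPLATE.format(doc_type=doc_type))

write_monitor_configs("analysis")
```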

> whd: We didn't release until 20170620. If we can backfill back to that
> date, that will be satisfactory. We may want to jump on it ASAP so we
> don't lose any more data.

I'm not going to have time to work on this ASAP unless it is critically urgent, so I've extended the retention for errors to 60 days for now. Since this data should be available in the errors stream it doesn't require special access to process, but again there is no automation for performing this kind of backfill.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #4)
> I've updated production, so newer data should be making it through.

Are you sure? Ingestion error rates haven't changed: https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.frank.frank_moz_focus_event_monitor.ingestion_error.html

> We should probably require adding a production monitor à la
> https://github.com/mozilla-services/puppet-config/blob/master/pipeline/
> modules/pipeline/templates/hindsight/analysis/
> moz_telemetry_doctype_monitor_main.cfg.erb for every new doc type.

I wholeheartedly agree. I was thinking we could put these alerts in "mozilla-pipeline-schemas", and maybe add a build check that every schema has an associated alert. The best case would be that we automatically deploy those alerts, but I'm not sure how difficult that would be.
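(Roughly what I mean by a build check, sketched with a made-up alerts/ directory convention; none of this exists in the repo today.)

```python
# Hypothetical CI check: fail the build if a schema directory has no matching
# alert definition. The schemas/telemetry layout is real, but the alerts/
# convention is invented for illustration.
import sys
from pathlib import Path

def missing_alerts(schemas_dir="schemas/telemetry", alerts_dir="alerts"):
    schemas = {p.name for p in Path(schemas_dir).iterdir() if p.is_dir()}
    alerts = {p.stem for p in Path(alerts_dir).glob("*.cfg")}
    return sorted(schemas - alerts)

if __name__ == "__main__":
    missing = missing_alerts()
    if missing:
        print("doc types without an alert:", ", ".join(missing))
        sys.exit(1)
```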

> I'm not going to have time to work on this ASAP unless it is critically
> urgent, so I've extended the retention for errors to 60 days for now. Since
> this data should be available in the errors stream it doesn't require
> special access to process, but again there is no automation for performing
> this kind of backfill.

I would say this is critically urgent. They have been using these numbers to decide on next steps for the Focus platform, and right now those numbers are completely skewed.
Putting in a NI for the above comment.
Flags: needinfo?(whd)
Depends on: 1382673
(In reply to Frank Bertsch [:frank] from comment #5)
> Are you sure? Ingestion error rates haven't changed:
> https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.frank.
> frank_moz_focus_event_monitor.ingestion_error.html

I was sure that I rebuilt production, but was not sure that the fix took, which is why I used the phrasing I did. As it happens we've got a process issue with the schema repo, which is that we allow updates to the templates that do not update the actual schemas. This is the case with https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/schemas/telemetry/focus-event. I've worked around this in the build logic by explicitly rebuilding the repo, but we should enforce that the actual schema tree be updated in PRs as per https://github.com/mozilla-services/mozilla-pipeline-schemas#notes.
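(Sketching the kind of check I mean, assuming the build can render the templates into a scratch directory and diff them against the checked-in tree; the render command and directory names below are placeholders, not the repo's actual build setup.)

```python
# Hypothetical PR check: re-render the schema templates and fail if the result
# differs from the schemas/ tree committed to the repo. The render command and
# directory names are placeholders, not the repo's actual build setup.
import filecmp
import subprocess
import sys

def dirs_differ(cmp):
    """Recursively check a filecmp.dircmp result for any mismatch."""
    if cmp.left_only or cmp.right_only or cmp.diff_files:
        return True
    return any(dirs_differ(sub) for sub in cmp.subdirs.values())

def schemas_in_sync(render_cmd=("make", "render"), rendered="build/schemas", committed="schemas"):
    subprocess.run(render_cmd, check=True)
    return not dirs_differ(filecmp.dircmp(rendered, committed))

if __name__ == "__main__":
    if not schemas_in_sync():
        print("templates and schemas/ are out of sync; re-render and commit the schemas")
        sys.exit(1)
```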

I'll redeploy production again today.

> I would say this is critically urgent. They have been using these numbers
> to decide on next steps for the Focus platform, and right now those numbers
> are completely skewed.

I need :kparlante or :jason to buy in before I drop everything and work on this. Alternatively it might make sense to show someone else how to do this kind of thing so I'm not necessarily a bottleneck. Additionally it looks like you've added a blocker that requires a code change, which also needs to be prioritized.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #7)
> I was sure that I rebuilt production, but was not sure that the fix took,
> which is why I used the phrasing I did. As it happens we've got a process
> issue with the schema repo, which is that we allow updates to the templates
> that do not update the actual schemas. This is the case with
> https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/
> schemas/telemetry/focus-event. I've worked around this in the build logic by
> explicitly rebuilding the repo, but we should enforce that the actual schema
> tree be updated in PRs as per
> https://github.com/mozilla-services/mozilla-pipeline-schemas#notes.
> 
> I'll redeploy production again today.

Great, thank you!

> I need :kparlante or :jason to buy in before I drop everything and work on
> this. Alternatively it might make sense to show someone else how to do this
> kind of thing so I'm not necessarily a bottleneck. 

I was actually going to mention that. I'm more than willing to do this myself; do you just want to show me what you're doing when we do this backfill?

> Additionally it looks
> like you've added a blocker that requires a code change, which also needs to
> be prioritized.

Yes, that came up after I put in the other comment. The backfill will have to wait until we fix the null values.
Yes, these two bugs are urgent/high priority. (That is, "prioritize over other work" urgent, not "work through the weekend" urgent.) Working on getting someone assigned to the Parquet Writer bug.
Attached file hindsight.tar.gz
Fortunately, backfilling from errors is much easier in the new infra than the old, since our error format is compatible with our landfill format and therefore also compatible with our standard decoding logic. With a couple of tweaks I threw together a hindsight config to reprocess the data to telemetry-backfill (attached), and then copied the resultant objects into the prod store. This logic could be used with a few tweaks for backfilling any schema issues we have in the future.
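(The attached tarball is the real config; purely to outline the flow, with the bucket names, prefixes, and decode hook as placeholders, the backfill amounts to something like this.)

```python
# Illustrative outline of the errors-stream backfill, not the attached Hindsight
# config. The bucket names, prefixes, and decode hook are placeholders.
import boto3

s3 = boto3.client("s3")

def reprocess(errors_bucket, errors_prefix, backfill_bucket, decode):
    """Re-run the standard decoding logic over preserved error records and
    write the now-valid pings to the backfill location."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=errors_bucket, Prefix=errors_prefix):
        for obj in page.get("Contents", []):
            raw = s3.get_object(Bucket=errors_bucket, Key=obj["Key"])["Body"].read()
            decoded = decode(raw)  # same decoder used in normal ingestion
            if decoded is not None:
                s3.put_object(Bucket=backfill_bucket, Key=obj["Key"], Body=decoded)
```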

Included in the backfill is the focus-event-parquet output, even though it is presumably affected by bug 1382673. Once the parquet writer bug is fixed I can re-run the parquet output backfill, which will be a more traditional backfill since it will use the main store.
I'm attempting a backfill from the main store with "json_decode_null = true". If it is successful I will deploy the change to production next week and fix up the intervening few days.
I backfilled from 20170618 to 20170803 with what I believe was the proper configuration and updated code. :frank, can you verify that the output for those days looks good?
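(If it helps, one way to spot-check the range, assuming the backfilled output is readable as Parquet with a submission_date column; both the path and the column names are assumptions on my part.)

```python
# Hypothetical spot-check of the backfilled days with PySpark; the dataset path
# and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://telemetry-parquet/focus_event/")  # placeholder path
(df.filter(F.col("submission_date").between("20170618", "20170803"))
   .groupBy("submission_date")
   .count()
   .orderBy("submission_date")
   .show(50))
```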
Flags: needinfo?(fbertsch)
Looks great - now NULL is the highest reported search engine, as expected.
Flags: needinfo?(fbertsch)
This backfill is complete.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Mobile → General