Closed Bug 1382336 Opened 7 years ago Closed 7 years ago

Backfill focus-event to when we released Android Focus

Categories

(Data Platform and Tools :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Unassigned)

References

Details

Attachments

(1 file)

      No description provided.
whd, we have a new schema in https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/schemas/telemetry/focus-event. Can you update the schemas in prod and backfill the data to 2017-06-01?
Flags: needinfo?(whd)
Summary: Backfill focus-event to when we release Android Focus → Backfill focus-event to when we released Android Focus
I've updated production, so newer data should be making it through. Backfilling this data will be expensive in terms of compute cost and engineering time, because we only keep 30 days of decode errors and thus this requires backfilling from landfill, which we haven't done in over a year and which is not automated. Judging from the change to both the JSON and Parquet schemas, I am assuming that this data was dropped from both Parquet and the main store.

The assumption was that we would catch decode errors within 30 days of their introduction. Perhaps we should update the errors stream retention to reflect the fact that we won't necessarily notice schema issues like this for smaller data sets.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #2)
> The assumption was that we would catch decode errors within 30 days of their
> introduction. Perhaps we should update the errors stream retention to
> reflect the fact that we won't necessarily notice schema issues like this
> for smaller data sets.

I'm not sure that's necessary. The end result of this should be that we require a notification email and automatically alert on validation errors above some threshold. Focus-event pings were being discarded at a rate of ~70%, which would surely surpass any threshold we decide on.
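(For illustration only, a threshold check like the one below is roughly what I have in mind. The notify hook, the example counts, and the 10% cutoff are made-up placeholders, not anything that exists in the pipeline today.)

```python
# Hypothetical sketch of a per-docType validation-error alert. This is not the
# production monitor; the notify hook and the example counts are invented.

ERROR_RATE_THRESHOLD = 0.10  # alert once more than 10% of pings fail validation

def check_doctype(doctype, accepted, rejected, notify):
    """Call notify() when the validation-error rate for doctype crosses the threshold."""
    total = accepted + rejected
    if total == 0:
        return
    error_rate = rejected / total
    if error_rate > ERROR_RATE_THRESHOLD:
        notify(
            subject=f"[pipeline] {doctype} validation errors at {error_rate:.0%}",
            body=f"{rejected} of {total} pings failed schema validation.",
        )

# focus-event was being discarded at ~70%, far beyond any reasonable cutoff.
check_doctype("focus-event", accepted=300, rejected=700,
              notify=lambda subject, body: print(subject, body, sep="\n"))
```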

whd: We didn't release until 20170620. If we can backfill back to that date, that will be satisfactory. We may want to jump on it ASAP so we don't lose any more data.
Flags: needinfo?(whd)
(In reply to Frank Bertsch [:frank] from comment #3)
> I'm not sure that's necessary. The end result of this should be that we
> require a notification email and automatically alert on validation errors
> above some threshold.

We should probably require adding a production monitor à la https://github.com/mozilla-services/puppet-config/blob/master/pipeline/modules/pipeline/templates/hindsight/analysis/moz_telemetry_doctype_monitor_main.cfg.erb for every new doc type.
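(Purely a sketch of the per-docType idea, not the actual template: something that stamps out one monitor config per docType. The docType list, config keys, and output path below are placeholders.)

```python
# Hypothetical generator that writes one monitor config per docType. The real
# monitors are Hindsight analysis plugins driven by the cfg.erb template above;
# the docType list, config keys, and output path here are placeholders.
from pathlib import Path

DOC_TYPES = ["main", "crash", "core", "focus-event"]  # illustrative list

TEMPLATE = """\
filename = "moz_telemetry_doctype_monitor.lua"
message_matcher = "Fields[docType] == '{doc_type}'"
ticker_interval = 60
"""

def write_monitor_configs(output_dir):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_type in DOC_TYPES:
        cfg = out / f"moz_telemetry_doctype_monitor_{doc_type.replace('-', '_')}.cfg"
        cfg.write_text(TEMPLATE.format(doc_type=doc_type))

write_monitor_configs("analysis")
```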

> whd: We didn't release until 20170620. If we can backfill back to that
> date, that will be satisfactory. We may want to jump on it ASAP so we
> don't lose any more data.

I'm not going to have time to work on this ASAP unless it is critically urgent, so I've extended the retention for errors to 60 days for now. Since this data should be available in the errors stream it doesn't require special access to process, but again there is no automation for performing this kind of backfill.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #4)
> I've updated production, so newer data should be making it through.

Are you sure? Ingestion error rates haven't changed: https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.frank.frank_moz_focus_event_monitor.ingestion_error.html

> We should probably require adding a production monitor à la
> https://github.com/mozilla-services/puppet-config/blob/master/pipeline/
> modules/pipeline/templates/hindsight/analysis/
> moz_telemetry_doctype_monitor_main.cfg.erb for every new doc type.

I wholeheartedly agree. I was thinking we could put these alerts in "mozilla-pipeline-schemas", and maybe add a build check that every schema has an associated alert. The best case would be that we automatically deploy those alerts, but I'm not sure how difficult that would be.
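(Roughly what I mean by a build check, sketched with a made-up alerts/ directory convention; none of this exists in the repo today.)

```python
# Hypothetical CI check: fail the build if a schema directory has no matching
# alert definition. The schemas/telemetry layout is real, but the alerts/
# convention is invented for illustration.
import sys
from pathlib import Path

def missing_alerts(schemas_dir="schemas/telemetry", alerts_dir="alerts"):
    schemas = {p.name for p in Path(schemas_dir).iterdir() if p.is_dir()}
    alerts = {p.stem for p in Path(alerts_dir).glob("*.cfg")}
    return sorted(schemas - alerts)

if __name__ == "__main__":
    missing = missing_alerts()
    if missing:
        print("doc types without an alert:", ", ".join(missing))
        sys.exit(1)
```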

> I'm not going to have time to work on this ASAP unless it is critically
> urgent, so I've extended the retention for errors to 60 days for now. Since
> this data should be available in the errors stream it doesn't require
> special access to process, but again there is no automation for performing
> this kind of backfill.

I would say this is critically urgent. They have been using these numbers to decide on next steps for the Focus platform, and right now those numbers are completely skewed.
Putting in a NI for the above comment.
Flags: needinfo?(whd)
Depends on: 1382673
(In reply to Frank Bertsch [:frank] from comment #5)
> Are you sure? Ingestion error rates haven't changed:
> https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.frank.
> frank_moz_focus_event_monitor.ingestion_error.html

I was sure that I rebuilt production, but was not sure that the fix took, which is why I used the phrasing I did. As it happens we've got a process issue with the schema repo, which is that we allow updates to the templates that do not update the actual schemas. This is the case with https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/schemas/telemetry/focus-event. I've worked around this in the build logic by explicitly rebuilding the repo, but we should enforce that the actual schema tree be updated in PRs as per https://github.com/mozilla-services/mozilla-pipeline-schemas#notes.
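(Sketching the kind of check I mean, assuming the build can render the templates into a scratch directory and diff them against the checked-in tree; the render command and directory names below are placeholders, not the repo's actual build setup.)

```python
# Hypothetical PR check: re-render the schema templates and fail if the result
# differs from the schemas/ tree committed to the repo. The render command and
# directory names are placeholders, not the repo's actual build setup.
import filecmp
import subprocess
import sys

def dirs_differ(cmp):
    """Recursively check a filecmp.dircmp result for any mismatch."""
    if cmp.left_only or cmp.right_only or cmp.diff_files:
        return True
    return any(dirs_differ(sub) for sub in cmp.subdirs.values())

def schemas_in_sync(render_cmd=("make", "render"), rendered="build/schemas", committed="schemas"):
    subprocess.run(render_cmd, check=True)
    return not dirs_differ(filecmp.dircmp(rendered, committed))

if __name__ == "__main__":
    if not schemas_in_sync():
        print("templates and schemas/ are out of sync; re-render and commit the schemas")
        sys.exit(1)
```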

I'll redeploy production again today.

> I would say this is critically urgent. They have been using these numbers
> to decide on next steps for the Focus platform, and right now those numbers
> are completely skewed.

I need :kparlante or :jason to buy in before I drop everything and work on this. Alternatively it might make sense to show someone else how to do this kind of thing so I'm not necessarily a bottleneck. Additionally it looks like you've added a blocker that requires a code change, which also needs to be prioritized.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #7)
> I was sure that I rebuilt production, but was not sure that the fix took,
> which is why I used the phrasing I did. As it happens we've got a process
> issue with the schema repo, which is that we allow updates to the templates
> that do not update the actual schemas. This is the case with
> https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/
> schemas/telemetry/focus-event. I've worked around this in the build logic by
> explicitly rebuilding the repo, but we should enforce that the actual schema
> tree be updated in PRs as per
> https://github.com/mozilla-services/mozilla-pipeline-schemas#notes.
> 
> I'll redeploy production again today.

Great, thank you!

> I need :kparlante or :jason to buy in before I drop everything and work on
> this. Alternatively it might make sense to show someone else how to do this
> kind of thing so I'm not necessarily a bottleneck. 

I was actually going to mention that. I'm more than willing to do this myself; do you just want to show me what you're doing when we do this backfill?

> Additionally it looks
> like you've added a blocker that requires a code change, which also needs to
> be prioritized.

Yes, that came up after I put in the other comment. The backfill will have to wait until we fix the null values.
Yes, these two bugs are urgent/high priority. (That is, "prioritize over other work" urgent, not "work through the weekend" urgent.) Working on getting someone assigned to the Parquet Writer bug.
Attached file hindsight.tar.gz
Fortunately, backfilling from errors is much easier in the new infra than the old, since our error format is compatible with our landfill format and therefore also compatible with our standard decoding logic. With a couple of tweaks I threw together a hindsight config to reprocess the data to telemetry-backfill (attached), and then copied the resultant objects into the prod store. This logic could be used with a few tweaks for backfilling any schema issues we have in the future.
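(The attached tarball is the real config; purely to outline the flow, with the bucket names, prefixes, and decode hook as placeholders, the backfill amounts to something like this.)

```python
# Illustrative outline of the errors-stream backfill, not the attached Hindsight
# config. The bucket names, prefixes, and decode hook are placeholders.
import boto3

s3 = boto3.client("s3")

def reprocess(errors_bucket, errors_prefix, backfill_bucket, decode):
    """Re-run the standard decoding logic over preserved error records and
    write the now-valid pings to the backfill location."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=errors_bucket, Prefix=errors_prefix):
        for obj in page.get("Contents", []):
            raw = s3.get_object(Bucket=errors_bucket, Key=obj["Key"])["Body"].read()
            decoded = decode(raw)  # same decoder used in normal ingestion
            if decoded is not None:
                s3.put_object(Bucket=backfill_bucket, Key=obj["Key"], Body=decoded)
```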

Included in the backfill is the focus-event-parquet output, even though it is presumably affected by bug 1382673. Once the parquet writer bug is fixed I can re-run the parquet output backfill, which will be a more traditional backfill since it will use the main store.
I'm attempting a backfill from the main store with "json_decode_null = true". If it is successful I will deploy the change to production next week and fix up the intervening few days.
I backfilled from 20170618 to 20170803 with what I believe was the proper configuration and updated code. :frank, can you verify that the output for those days looks good?
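(If it helps, one way to spot-check the range, assuming the backfilled output is readable as Parquet with a submission_date column; both the path and the column names are assumptions on my part.)

```python
# Hypothetical spot-check of the backfilled days with PySpark; the dataset path
# and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://telemetry-parquet/focus_event/")  # placeholder path
(df.filter(F.col("submission_date").between("20170618", "20170803"))
   .groupBy("submission_date")
   .count()
   .orderBy("submission_date")
   .show(50))
```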
Flags: needinfo?(fbertsch)
Looks great - now NULL is the highest reported search engine, as expected.
Flags: needinfo?(fbertsch)
This backfill is complete.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Mobile → General