Pocket telemetry tile_id type change from int to string
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: thecount, Assigned: klukas)
References
Details
Attachments
(2 files)
Pocket uses the tile_id to track engagement metrics and figure out what Firefox users want to read on newtab.
Right now those metrics are incomplete because the tile_id doesn't map to all the information we need to figure out which stories people are reading.
In the future we'll send a tile_id that contains the info we need to track engagement for each item. That ID should be a string, so we'll need to store it as a string in the telemetry database.
Assignee | ||
Comment 1•3 years ago
|
||
Do you know specifically which bits of telemetry would be affected by this change?
I suspect it may just be the following:
If that's the case, then we have a few options. One option would be to make the existing id field optional and introduce an alternate ID field with a different name (perhaps full_id) that is of type string; we could update Firefox to send the new field rather than the old.
Another option would be to bump the version of the impression-stats schema, which would allow us to make the type change without changing the name of the field. That would appear as a new impression-stats.2.schema.json definition in the above repository and would flow to a separate table in BigQuery.
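As a rough illustration of how the two options differ, here is a minimal Python sketch. It is not taken from mozilla-pipeline-schemas: the dictionaries are simplified stand-ins for the tile portion of the schema, the checker is a stand-in for real JSON Schema validation, and full_id is only the name floated above.

```python
from typing import Any

# Hypothetical tile schemas, reduced to the one field under discussion.
# Option A: stay on v1, make "id" optional, and add a string field
# ("full_id" is only the name suggested in this comment).
OPTION_A_TILE = {
    "properties": {"id": int, "full_id": str},
    "required": [],
}

# Option B: a new v2 schema that keeps the field name but changes its type.
OPTION_B_TILE_V2 = {
    "properties": {"id": str},
    "required": ["id"],
}


def tile_matches(tile: dict[str, Any], schema: dict[str, Any]) -> bool:
    """Minimal type check; a stand-in for real JSON Schema validation."""
    if any(name not in tile for name in schema["required"]):
        return False
    for name, value in tile.items():
        expected = schema["properties"].get(name)
        if expected is None:
            continue  # unknown fields are ignored in this sketch
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True
```

Under option A, old clients sending an integer id and new clients sending a string full_id both validate against the same schema. Under option B, string IDs validate only against the v2 definition, which is why they would land in a separate table.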
Assignee | ||
Comment 2•3 years ago
|
||
cc Kirill, who has been involved with transformation of impression-stats data in the past.
Also ?ni :nanj - To your knowledge, does the tile ID show up anywhere other than impression-stats? I'm assuming that the tile_id that appears in contextual-services pings isn't relevant here since it refers to topsites tiles rather than pocket tiles.
Comment 3•3 years ago
|
||
does the tile ID show up anywhere other than impression-stats? I'm assuming that the tile_id that appears in contextual-services pings isn't relevant here since it refers to topsites tiles rather than pocket tiles.
I think activity_stream.tile_id_types also has a tile_id field. Other than that, no other pings do. Also, Pocket tiles are independent of Contextual Services tiles, so any change here won't affect the other.
Comment 4•3 years ago
|
||
Assignee | ||
Comment 5•3 years ago
|
||
I've drafted a PR for what it would look like to transition to a v2 schema: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730
Assuming we've correctly scoped the problem, I think the v2 schema is the best option. If we move forward with this, we'd need to plan for a future where we have some clients sending v1 telemetry and some sending v2, meaning there will be a few follow-ups to be handled:
- Pocket will need to provide a new variant of activity_stream.tile_id_types that accounts for the new format
- We'll need to update some of the derived table definitions in https://github.com/mozilla/bigquery-etl to union the v1 and v2 tables (I assume we'd simply be casting the old integer-style IDs to strings, but more discovery may be needed there)
- Pocket may need to adjust streaming consumers; the telemetry infrastructure provides a derived Pub/Sub topic containing all the impression-stats pings that Pocket consumes. The v2 telemetry will automatically flow into this topic, but Pocket may need to update processing logic to account for the new ID format in v2 documents
There may be other use cases I'm not aware of that could require adjusting as well, so the above does not necessarily represent the full scope of work here.
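For the "cast the old integer-style IDs to strings" idea, a minimal Python sketch of what a consumer handling both v1 and v2 documents might do is below. The ping shape (a tiles list whose entries carry an id) and the sample values are assumptions made for illustration, not something confirmed in this bug.

```python
from typing import Any


def normalize_tile_ids(ping: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the ping in which every tile id is a string.

    Assumes a hypothetical shape: {"tiles": [{"id": ..., ...}, ...]}.
    v1 integer IDs are cast with str(); v2 string IDs pass through unchanged.
    """
    tiles = [{**tile, "id": str(tile["id"])} for tile in ping.get("tiles", [])]
    return {**ping, "tiles": tiles}


# Made-up sample documents, one per schema version.
v1_ping = {"tiles": [{"id": 1234, "pos": 0}]}
v2_ping = {"tiles": [{"id": "hypothetical-string-id", "pos": 1}]}

combined = [normalize_tile_ids(p) for p in (v1_ping, v2_ping)]
```

After normalization, downstream logic only has to deal with string IDs, whichever schema version the client sent. Whether a plain cast is enough depends on the discovery mentioned above.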
Comment 6•3 years ago
|
||
Thanks for the ping, Jeff! Yes, to my knowledge this should only affect the impression-stats ping.
I think your last message about what we need to take care of in moving to a v2 schema is comprehensive. We could provide a new variant of the activity_stream.tile_id_types table.
We have some hourly Airflow jobs that query the moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1 table. They aren't streaming consumers, but just frequent batch jobs. Would there be a derived table combining v1 and v2 pings that would update hourly?
Assignee | ||
Comment 7•3 years ago
|
||
We have some hourly Airflow jobs that query the moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1 table. They aren't streaming consumers, but just frequent batch jobs. Would there be a derived table combining v1 and v2 pings that would update hourly?
No, there wouldn't be. But you could update your logic to union the two tables, or we could provision a virtual view within mozdata to provide that interface.
Comment 8•3 years ago
|
||
Assignee | ||
Comment 9•3 years ago
|
||
It sounds like we're aligned, then, on moving forward with impression-stats v2. If so, I'll go ahead and get review on https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730 and get that merged. That will automatically trigger creation of moz-fx-data-shared-prod.activity_stream_live.impression_stats_v2 and moz-fx-data-shared-prod.activity_stream_stable.impression_stats_v2, and it will allow the pipeline to start accepting documents for that schema.
Before that, we'll actually need to merge https://github.com/mozilla/bigquery-etl/pull/2843 to ensure stability of the current user-facing view.
That will allow implementation of the client changes to happen whenever you're ready. I'll file a separate bug for further ETL changes to adapt to the new schema.
And then we should make sure we understand who's in charge of coordination for this project moving forward, so that we properly sequence the client-side changes, the addition of the tile_id_types table, and the ETL improvements, and don't break use cases.
Assignee | ||
Comment 10•3 years ago
|
||
https://github.com/mozilla/bigquery-etl/pull/2843 is merged and I'm now getting review on https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730 so the new schema will be available when Pocket folks are ready to make the ID change.
Kirill - Are you in a good position to be in charge of coordination as this gets implemented?
Comment 11•3 years ago
|
||
I am probably the best person! When would these changes land in release?
Comment 12•3 years ago
•
|
||
Assignee | ||
Comment 13•3 years ago
|
||
The new v2 schema is now merged and will be deployed within the next 24 hours. Once that happens, the pipeline will be able to validate documents sent with the v2 schema, and the destination tables will be created in BigQuery.
Comment 14•3 years ago
|
||
Heads up that Chelsea, the data engineer at Pocket that originally needed this change, has left Pocket and this project has since been de-prioritized on the Pocket side. Jeshua Irving would be able to provide more details, but for now I don't think we need to do any further work.
Thank you for merging the schema changes, Jeff! They will be useful in the future when this work gets picked up again!
Assignee | ||
Comment 15•3 years ago
|
||
I'll go ahead and close this bug for now, and we can reopen or file a follow-up in the future if this gets picked up.