Closed Bug 1761790 Opened 3 years ago Closed 3 years ago

Pocket telemetry tile_id type change from int to string

Categories

(Data Platform and Tools :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: thecount, Assigned: klukas)

Attachments

(2 files)

Pocket uses the tile_id to track engagement metrics to figure out what Firefox users want to read on newtab.

Right now those metrics are incomplete because the tile_id doesn't map to all the information we need to figure out which stories people are reading.

In the future we'll send a tile_id that contains the info we need to track engagement for each item. That ID should be a string, so we'll need to store it as a string in the telemetry database.

Assignee: nobody → jklukas
Priority: -- → P2

Do you know specifically which bits of telemetry would be affected by this change?

I suspect it may just be the following:

If that's the case, then we have a few options. One option would be to make the existing id field optional and introduce an alternate ID field with a different name (perhaps full_id) that is of type string; we could update Firefox to send the new field rather than the old.
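To make the first option concrete, here is a minimal sketch of how a consumer might read a tile that carries either field. It assumes each tile is a plain mapping with an optional integer `id` and an optional string `full_id`; the name `full_id` is only the tentative name floated above, and `extract_tile_id` is a hypothetical helper, not existing pipeline code.

```python
from typing import Any, Mapping, Optional


def extract_tile_id(tile: Mapping[str, Any]) -> Optional[str]:
    """Return the tile identifier as a string, or None if neither field is usable.

    Assumption: new clients send a string `full_id`; older clients send an
    integer `id`. Both field shapes are illustrative.
    """
    full_id = tile.get("full_id")
    if isinstance(full_id, str) and full_id:
        return full_id  # prefer the new string field when it is present

    legacy_id = tile.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(legacy_id, int) and not isinstance(legacy_id, bool):
        return str(legacy_id)  # fall back to the old integer field

    return None
```

The trade-off with this option is that every downstream reader has to know about both fields for as long as old clients are still sending data.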

Another option would be to bump the version of the impression-stats schema, which would allow us to make the type change without changing the name of the field. That would appear as a new impression-stats.2.schema.json definition in the above repository and would flow to a separate table in BigQuery.

See Also: → 1743493

cc Kirill who has been involved with transformation of impression-stats data in the past

also ?ni :nanj - To your knowledge, does the tile ID show up anywhere other than impression-stats? I'm assuming that the tile_id that appears in contextual-services pings isn't relevant here since it refers to topsites tiles rather than pocket tiles.

Flags: needinfo?(najiang)

> does the tile ID show up anywhere other than impression-stats? I'm assuming that the tile_id that appears in contextual-services pings isn't relevant here since it refers to topsites tiles rather than pocket tiles.

I think activity_stream.tile_id_types also has a tile_id field. Other than that, no other pings do. Also, Pocket tiles are independent of Contextual Services tiles, so any change here won't affect the other.

Flags: needinfo?(najiang)

I've drafted a PR for what it would look like to transition to a v2 schema: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730

Assuming we've correctly scoped the problem, I think the v2 schema is the best option. If we move forward with this, we'd need to plan for a future where we have some clients sending v1 telemetry and some sending v2, meaning there will be a few follow-ups to be handled:

  • Pocket will need to provide a new variant of activity_stream.tile_id_types that accounts for the new format
  • We'll need to update some of the derived table definitions in https://github.com/mozilla/bigquery-etl to union the v1 and v2 tables (I assume we'd simply be casting the old integer-style IDs to strings, but more discovery may be needed there)
  • Pocket may need to adjust streaming consumers; the telemetry infrastructure provides a derived Pub/Sub topic containing all the impression-stats pings that Pocket consumes. The v2 telemetry will automatically flow into this topic, but Pocket may need to update processing logic to account for the new ID format in v2 documents
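As a rough illustration of the consumer-side follow-up, the sketch below normalizes tile IDs from a mixed stream of v1 and v2 documents into strings. It assumes each ping carries a `tiles` list whose entries have an `id` field (an integer in v1, a string in v2); that document shape and the helper name `tile_ids_as_strings` are assumptions for illustration, not Pocket's actual processing logic.

```python
from typing import Any, List, Mapping


def tile_ids_as_strings(ping: Mapping[str, Any]) -> List[str]:
    """Return every tile ID in the ping as a string, whatever the schema version.

    Assumption: the ping has a `tiles` list of objects with an `id` field.
    """
    ids: List[str] = []
    for tile in ping.get("tiles", []):
        raw = tile.get("id")
        if isinstance(raw, str):
            ids.append(raw)  # v2-style ID, already a string
        elif isinstance(raw, int) and not isinstance(raw, bool):
            ids.append(str(raw))  # v1-style integer ID, cast to a string
        else:
            raise ValueError(f"unexpected tile id: {raw!r}")
    return ids
```

Casting on read like this keeps a single code path for both versions, though it does not by itself say how old integer IDs relate to the new string IDs; that mapping would still need the discovery mentioned above.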

There may be other use cases I'm not aware of that could require adjusting as well, so the above does not necessarily represent the full scope of work here.

Thanks for the ping, Jeff! Yes, to my knowledge this should only affect the impression-stats ping.

I think your last message about what we need to take care of in moving to a v2 schema is comprehensive. We could provide a new variant of the activity_stream.tile_id_types table.

We have some hourly Airflow jobs that query the moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1 table. They aren't streaming consumers, but just frequent batch jobs. Would there be a derived table combining v1 and v2 pings that would update hourly?

> We have some hourly Airflow jobs that query the moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1 table. They aren't streaming consumers, but just frequent batch jobs. Would there be a derived table combining v1 and v2 pings that would update hourly?

No, there wouldn't be. But you could update your logic to union the two tables, or we could provision a virtual view within mozdata to provide that interface.
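For the "union the two tables" route, a batch job could build its query along these lines. This is only a sketch: the table names come from this thread, but the selected columns (`submission_timestamp`, and a repeated `tiles` record with an `id` field) are assumptions about the table layout, and `build_union_query` is a hypothetical helper.

```python
V1_TABLE = "moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1"
V2_TABLE = "moz-fx-data-shared-prod.activity_stream_live.impression_stats_v2"


def build_union_query(v1_table: str, v2_table: str) -> str:
    """Build a BigQuery query that unions v1 and v2 rows with string tile IDs.

    Assumption: both tables expose `submission_timestamp` and a repeated
    `tiles` record; v1 stores `tiles.id` as an integer, v2 as a string.
    """
    return (
        "SELECT submission_timestamp, CAST(tile.id AS STRING) AS tile_id\n"
        f"FROM `{v1_table}` CROSS JOIN UNNEST(tiles) AS tile\n"
        "UNION ALL\n"
        "SELECT submission_timestamp, tile.id AS tile_id\n"
        f"FROM `{v2_table}` CROSS JOIN UNNEST(tiles) AS tile\n"
    )


if __name__ == "__main__":
    print(build_union_query(V1_TABLE, V2_TABLE))
```

A view in mozdata would hold essentially the same SQL, with the difference that consumers would not each need to maintain their own copy of it.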

It sounds like we're aligned, then, on moving forward with impression-stats v2. If so, I'll go ahead and get review on https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730 and get that merged. That will automatically trigger creation of moz-fx-data-shared-prod.activity_stream_live.impression_stats_v2 and moz-fx-data-shared-prod.activity_stream_stable.impression_stats_v2 and it will allow the pipeline to start accepting documents for that schema.
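The table names above follow a visible pattern, which can be sketched as below for anyone scripting against both versions. This only mirrors the naming as it appears in this bug; it is not the pipeline's actual provisioning logic, and `table_names` and its default `namespace` argument are hypothetical.

```python
from typing import Dict

PROJECT = "moz-fx-data-shared-prod"


def table_names(doc_type: str, version: int, namespace: str = "activity_stream") -> Dict[str, str]:
    """Return the live and stable table names for a document type and version.

    Assumption: names follow `<project>.<namespace>_<kind>.<doc_type>_v<version>`,
    with hyphens in the document type replaced by underscores.
    """
    table = f"{doc_type.replace('-', '_')}_v{version}"
    return {
        kind: f"{PROJECT}.{namespace}_{kind}.{table}"
        for kind in ("live", "stable")
    }
```

For example, `table_names("impression-stats", 2)["stable"]` yields the stable v2 table named in the comment above.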

Before that, we'll actually need to merge https://github.com/mozilla/bigquery-etl/pull/2843 to ensure stability of the current user-facing view.

That will allow implementation of the client changes to happen whenever you're ready. I'll file a separate bug for further ETL changes to adapt to the new schema.

And then we should make sure we understand who's in charge of coordination for this project moving forward so that we get proper sequencing of client-side changes, adding the tile_id_types table, and ETL improvements such that we don't break use cases.

https://github.com/mozilla/bigquery-etl/pull/2843 is merged and I'm now getting review on https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/730 so the new schema will be available when Pocket folks are ready to make the ID change.

Kirill - Are you in a good position to be in charge of coordination as this gets implemented?

I am probably the best person! When would these changes land in release?

:thecount - FYI, now that the impression-stats schema version has been bumped to "2", we also need to update that in Firefox here, here, and here.

The new v2 schema is now merged, so it will be deployed within the next 24 hours, meaning the pipeline will be able to validate documents sent with the v2 schema and destination tables will be created in BigQuery.

Heads up that Chelsea, the data engineer at Pocket that originally needed this change, has left Pocket and this project has since been de-prioritized on the Pocket side. Jeshua Irving would be able to provide more details, but for now I don't think we need to do any further work.

Thank you for merging the schema changes, Jeff! They will be useful in the future when this work gets picked up again!

I'll go ahead and close this bug for now, and we can reopen or file a follow-up in the future if this gets picked up.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX