Closed Bug 1854245 Opened 1 year ago Closed 1 year ago

[Meta] Add content identifiers to Pocket Glean pings

Categories

(Firefox :: New Tab Page, task, P1)

Tracking

()

RESOLVED FIXED
121 Branch
Tracking Status
firefox120 --- fixed
firefox121 --- fixed

People

(Reporter: mmiermans, Assigned: chutten)

Details

(Keywords: meta)

Attachments

(4 files)

Given that all pingcentre telemetry (including Activity Stream) will be removed by EoY, we will need to rely on Glean pings instead.

My understanding is that Firefox Desktop already emits the following Glean pings:

Currently these pings do not include a content identifier.

  1. Add new recommendation id field to Glean, and stop using the deprecated tileId field. (Changed based on feedback in #63. Could you add both the tileId and the (not-yet-existing) recommendId fields from the Pocket recommendation to each of the corresponding Glean pings?)
  2. Stop Pocket New Tab Activity Stream events from being emitted, to prevent double-counting engagement. Because the newtab Glean ping has been around for a while (as chutten mentioned), the new/additional Prefect data pulls (using Glean data) will need to filter to Firefox >= 121.
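As a rough illustration of the version filter mentioned in item 2, the sketch below keeps only rows whose Firefox major version is at least 121. The row shape and the `app_display_version` field name are assumptions made for this example; they are not taken from the actual Glean dataset schema.

```python
MIN_MAJOR_VERSION = 121  # threshold named in item 2 above


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '121.0.1' or '120.0b5'."""
    return int(version.split(".", 1)[0])


def filter_rows(rows, min_major=MIN_MAJOR_VERSION):
    """Keep only rows reported by Firefox builds at or above min_major."""
    return [
        row for row in rows
        if major_version(row["app_display_version"]) >= min_major
    ]


# Hypothetical rows, for illustration only.
rows = [
    {"app_display_version": "119.0", "event": "click"},
    {"app_display_version": "121.0", "event": "impression"},
    {"app_display_version": "122.0a1", "event": "click"},
]

kept = filter_rows(rows)
```

Here `kept` holds the two rows from versions 121 and 122; the row from 119 is dropped.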

I tried to fill in details as best as I could. Pocket Firefox Integration will probably split this up into multiple bugs.

Summary: Add content identifiers to Pocket Glean pings → [Meta] Add content identifiers to Pocket Glean pings
Keywords: meta

Is this information present in Firefox Desktop? Is it, for instance, the tile's id (as communicated here and received here)?

Assignee: sdowne → chutten
Severity: -- → N/A
Status: NEW → ASSIGNED
Flags: needinfo?(mmiermans)
Priority: -- → P1
Blocks: 1857324

Is this information present in Firefox Desktop? Is it, for instance, the tile's id (as communicated here and received here)?

Yes, it's present, at least in the sense that Firefox receives it from our API. With this bug, we will stop using tileId and start using the id attribute included in the same API response. Here's an example:

{
   "data":[
      {
         "__typename":"Recommendation",
         "id":"887db13d-5036-4327-8630-dfc4288fd8fd",
         "tileId":1661483798510704,
         "url":"https://www.courrierinternational.com/article/rugby-no-dupont-no-problem-pour-la-presse-etrangere-il-faudra-quelque-chose-de-monumental-pour-arreter-les-bleus?utm_source=pocket-newtab-fr-fr",
         "title":"Rugby. “No Dupont, no problem” : pour la presse étrangère, “il faudra quelque chose de monumental pour arrêter les Bleus”",
         "excerpt":"Il n’y a finalement pas eu de surprise. La France a battu l’Italie “à plate couture” vendredi 6 octobre et s’est qualifiée pour les quarts de finale de la Coupe du monde de rugby. Pour la presse internationale, en passant huit essais aux Transalpins, les Bleus ont surtout montré “qu’ils seraient dangereux, avec ou sans le retour de son capitaine”.",
         "publisher":"Courrier international",
         "imageUrl":"https://s3.us-east-1.amazonaws.com/pocket-curatedcorpusapi-prod-images/6d2e4e41-9f43-45a8-81c1-be204def566d.jpeg",
         "timeToRead":3
      }
   ]
}
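To make the two identifiers concrete, here is a minimal sketch that reads both from a response shaped like the example above. The sample is trimmed to the fields that matter here, and the output key names `recommendation_id` and `tile_id` are only illustrative; this is not the code Firefox uses.

```python
import json

# Trimmed copy of the example response above (only the identifier fields).
SAMPLE_RESPONSE = """
{
   "data":[
      {
         "__typename":"Recommendation",
         "id":"887db13d-5036-4327-8630-dfc4288fd8fd",
         "tileId":1661483798510704
      }
   ]
}
"""


def identifiers(item: dict) -> dict:
    """Pick out the two identifiers carried by one recommendation."""
    return {
        "recommendation_id": item["id"],  # UUID string
        "tile_id": item["tileId"],        # integer
    }


result = [identifiers(item) for item in json.loads(SAMPLE_RESPONSE)["data"]]
```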
Flags: needinfo?(mmiermans)

Mathijs, what datasets is the recommendation_id identifier permitted to join to? Its inclusion in the "newtab" ping gives it access to client_id.

Also: I don't know the answers to the data review request questions for this collection. Please take this attached template and fill in the necessary questions.

Flags: needinfo?(mmiermans)

I'm proposing to rename 'id' to 'recommendationId' in the API. We independently came to the same conclusion as Scott that 'id' is too generic:

It is my understanding that we can at some point migrate away from "id: item.tileId" and just use "recommendation_id: item.id"? Is that correct? Otherwise, I think it's kinda confusing to have all these different ids with names that are not super descriptive.

I'll sync with Chutten before merging this, to ensure there's no unintended impact to Nightly.

Flags: needinfo?(mmiermans)

Mathijs, what datasets is the recommendation_id identifier permitted to join to? Its inclusion in the "newtab" ping gives it access to client_id.

Does this answer from the review request answer this sufficiently, or should I add some more detail?

  1. Please provide a general description of how you will analyze this data.

We will join the above Firefox events with metadata for the content and server response.

As stated above, large groups of users receive the same recommendation_id, and therefore it cannot be associated with any personal information.

Examples of content metadata are: the curator's name, topic, curator tags.

The main server metadata is the response time.

These backend events are emitted using Snowplow. Eventually they will be migrated to Glean as well, but we don't have a concrete timeline for that yet.
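The join described in this answer could look roughly like the sketch below, which counts clicks per content topic by looking up each event's recommendation_id in a metadata table. All event and metadata records here are invented for illustration, and the field names `name` and `topic` are assumptions rather than the real Glean or Snowplow schemas.

```python
from collections import Counter


def click_counts_by_topic(events, metadata_by_id):
    """Count click events per topic, joining on recommendation_id."""
    counts = Counter()
    for event in events:
        if event["name"] != "click":
            continue
        meta = metadata_by_id.get(event["recommendation_id"])
        if meta is None:
            continue  # no content metadata recorded for this id
        counts[meta["topic"]] += 1
    return counts


# Hypothetical client events and content metadata.
events = [
    {"name": "click", "recommendation_id": "rec-a"},
    {"name": "impression", "recommendation_id": "rec-a"},
    {"name": "click", "recommendation_id": "rec-b"},
    {"name": "click", "recommendation_id": "rec-a"},
    {"name": "click", "recommendation_id": "rec-unknown"},
]
metadata_by_id = {
    "rec-a": {"topic": "sports"},
    "rec-b": {"topic": "science"},
}

counts = click_counts_by_topic(events, metadata_by_id)
```

Because the join key is shared by everyone who received the same cached response, the result is an aggregate per piece of content rather than per user.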

Please review the above data review request.

Flags: needinfo?(klong)

Comment on attachment 9358646 [details]
data collection review request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?

This collection uses Glean, so it is (or will be) documented in the Glean Dictionary.

Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism) Provide details as to the control mechanism available.

Yes. These collections are Glean. The opt-out can be found in the product's preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, chutten@mozilla.com, najiang@mozilla.com, mmccorquodale@mozilla.com, lina@mozilla.com, anicholson@mozilla.com will be responsible for the permanent collections.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

This bug falls under Category 2, "interaction data". It aligns with the Cat 2 example from the data collection category doc: "For example, selections of add-ons or tiles to determine potential interest categories etc."

Is the data collection request for default-on or default-off?

Default on for all channels, locales, and countries that are eligible for New Tab recommendations:
https://mozilla-hub.atlassian.net/wiki/spaces/FPS/pages/80448805/Regional+Differences

Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

Yes, it includes a new recommendation_id. This replaces the previously used tile_id, which may act as precedent.

Is the data collection covered by the existing Firefox privacy notice?

Unknown, because of the introduction of the new recommendation_id. Escalating to Legal.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

Does the data collection use a third-party collection tool?

No.

Attachment #9358646 - Flags: data-review-

@mathijs - Because of the introduction of the new recommendation_id, this needs to be escalated to legal. You can follow this process to do that:

https://wiki.mozilla.org/Data_Collection#Step_3:_Sensitive_Data_Collection_Review_Process

Flags: needinfo?(klong)

Sent an email to data-review yesterday:

We would like to change the identifier for New Tab recommendations from one that only identifies the content ('tileId') to 'recommendation_id', which identifies the content and also changes each time recommendations are re-ranked.

We do not get any additional information about individual users by making this change, because New Tab recommendations are delivered through a cached CDN. Each cached response is sent to 3,000 users on average. The goal is to have better observability and reliability, by being able to trace click and impression events back to the server-side event that generated the recommendations.

Forgot to mention: we had a chat about this and we're not aiming to remove the old pocket AS events/pings until data validation's been done on the base set of Activity Stream reinstrumentation. They've already been instrumenting side-by-side since the New Tab Telemetry Project landed the first batch of instrumentation in Fx106 via bug 1786612.

The removal of the PingCentre-sent pocket AS instrumentation will happen as part of the broader removal of all PingCentre-sent AS instrumentation towards the end of the PingCentre Replacement Project (Phase 11, to be precise). (There's no bug for it yet, but when there is it will be tied to the meta bug 1820548).

Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ead72c63f0b1
Add pocket recommendation id to pocket newtab events r=thecount,mmiermans
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 121 Branch

The change from tile_id to recommendation_id was approved by legal. Email response:

I've reviewed with product legal and determined that this use case is covered under our existing privacy notice. Good to go.

Depends on D190858

Original Revision: https://phabricator.services.mozilla.com/D190980

Depends on D192238

Attachment #9361016 - Flags: approval-mozilla-beta?

Comment on attachment 9361016 [details]
Bug 1854245 - Add pocket recommendation id to pocket newtab events

Approved for 120.0b5

Attachment #9361016 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: in-testsuite+

After further review, we determined that we actually do need the tile_id field in this ping. Specifically, while adding recommendation_id alone is sufficient for understanding which pieces of recommended content are impressed/clicked in New Tab, sponsored content does not have a recommendation_id.

Instead, sponsored content receives a tile_id which is equal to its ad_id in Kevel. We use this tile ID to tie individual ads to specific advertisers, in order to monitor overall New Tab sponsored content performance, as well as to provide click rate client reporting to each advertiser.
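A small sketch of why both fields are needed: when grouping events by content, sponsored items have to be keyed by tile_id and recommended items by recommendation_id. The `is_sponsored` flag and the event dictionaries below are hypothetical and only serve to show the branching.

```python
def content_key(event: dict) -> tuple:
    """Return (identifier name, identifier value) for one newtab event."""
    if event.get("is_sponsored"):
        # Sponsored content has no recommendation_id; its tile_id is the ad id.
        return ("tile_id", event["tile_id"])
    return ("recommendation_id", event["recommendation_id"])


# Hypothetical events, for illustration only.
organic = {
    "is_sponsored": False,
    "recommendation_id": "887db13d-5036-4327-8630-dfc4288fd8fd",
}
sponsored = {"is_sponsored": True, "tile_id": 12345}

keys = [content_key(organic), content_key(sponsored)]
```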

I corresponded with Nneka Soyinka over Slack asking if it was OK to add this request to the existing data review. She mentioned that it was, and that we did not need a separate data review for tile_id:

I think the main thing would be understanding why two identifiers are needed. I don't think the review needs to be separate per se. To me it's a copy/paste of everything you've already shared in the past review with the update that there is one additional data point that wasn't previously included, but here's why we need it

@chutten @Kenny Long can you please acknowledge that you also agree we can add tile_id to the newtab ping so that we can proceed with implementation? Thank you!

Flags: needinfo?(klong)
Flags: needinfo?(chutten)

Confirmed: tile_id can be added to "newtab". shim will get its own thing. See bug 1862670 for implementation.

Flags: needinfo?(chutten)
Flags: needinfo?(klong)