[Meta] Add content identifiers to Pocket Glean pings
Categories
(Firefox :: New Tab Page, task, P1)
People
(Reporter: mmiermans, Assigned: chutten)
Details
(Keywords: meta)
Attachments
(4 files)
48 bytes, text/x-phabricator-request
2.80 KB, text/plain
3.94 KB, text/plain (klong: data-review-)
48 bytes, text/x-phabricator-request (diannaS: approval-mozilla-beta+)
Given that all pingcentre telemetry (including Activity Stream) will be removed by EoY, we will need to rely on Glean pings instead.
My understanding is that Firefox Desktop already emits the following Glean pings:
Currently these pings do not include a content identifier.
- Add a new recommendation id field to Glean, and stop using the deprecated tileId field. (Changed based on feedback in #63. Could you add both the tileId and the (not-yet-existing) recommendId fields from the Pocket recommendation to each of the corresponding Glean pings?)
- Stop Pocket New Tab Activity Stream events from being emitted, to prevent double-counting engagement.
Because the newtab Glean ping has been around for a while (as chutten mentioned), the new/additional Prefect data pulls (using Glean data) will need to filter to Firefox >= 121.
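The version floor mentioned above could be applied roughly as follows. This is only a sketch: the column name app_version and the row layout are assumptions made for illustration, not the actual schema of the Glean tables or of the Prefect flows.

```python
MIN_MAJOR_VERSION = 121  # threshold taken from the description above


def major_version(version: str) -> int:
    """Return the leading major version, e.g. "121.0a1" -> 121."""
    head = version.split(".", 1)[0]
    if not head.isdigit():
        raise ValueError(f"unparseable version string: {version!r}")
    return int(head)


def meets_version_floor(row: dict) -> bool:
    """True if the row comes from a Firefox version at or above the floor."""
    # "app_version" is a hypothetical column name, not a confirmed schema field.
    return major_version(row["app_version"]) >= MIN_MAJOR_VERSION


# Made-up rows, for illustration only.
rows = [
    {"app_version": "119.0.1", "event": "click"},
    {"app_version": "121.0", "event": "impression"},
    {"app_version": "122.0a1", "event": "click"},
]
filtered = [row for row in rows if meets_version_floor(row)]
```

Comparing the parsed major version as an integer avoids the string-comparison pitfall where "99.0" would sort after "121.0".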
Reporter
Comment 1•1 year ago
I tried to fill in details as best as I could. Pocket Firefox Integration will probably split this up into multiple bugs.
Assignee
Comment 2•1 year ago
Is this information present in Firefox Desktop? Is it, for instance, the tile's id (as communicated here and received here)?
Reporter
Comment 3•1 year ago
> Is this information present in Firefox Desktop? Is it, for instance, the tile's id (as communicated here and received here)?
Yes, it's present, at least in the sense that Firefox receives it from our API. With this bug, we will stop using tileId and start using the id attribute included in the same API response. Here's an example:
{
  "data": [
    {
      "__typename": "Recommendation",
      "id": "887db13d-5036-4327-8630-dfc4288fd8fd",
      "tileId": 1661483798510704,
      "url": "https://www.courrierinternational.com/article/rugby-no-dupont-no-problem-pour-la-presse-etrangere-il-faudra-quelque-chose-de-monumental-pour-arreter-les-bleus?utm_source=pocket-newtab-fr-fr",
      "title": "Rugby. “No Dupont, no problem” : pour la presse étrangère, “il faudra quelque chose de monumental pour arrêter les Bleus”",
      "excerpt": "Il n’y a finalement pas eu de surprise. La France a battu l’Italie “à plate couture” vendredi 6 octobre et s’est qualifiée pour les quarts de finale de la Coupe du monde de rugby. Pour la presse internationale, en passant huit essais aux Transalpins, les Bleus ont surtout montré “qu’ils seraient dangereux, avec ou sans le retour de son capitaine”.",
      "publisher": "Courrier international",
      "imageUrl": "https://s3.us-east-1.amazonaws.com/pocket-curatedcorpusapi-prod-images/6d2e4e41-9f43-45a8-81c1-be204def566d.jpeg",
      "timeToRead": 3
    }
  ]
}
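To make the proposed switch concrete, here is a minimal Python sketch of reading the id attribute from a response shaped like the example above and carrying it forward under the name recommendation_id. It is not the Firefox implementation; the function name event_extras and the trimmed sample payload are assumptions for illustration.

```python
import json

# Trimmed copy of the example response above (only the fields used here).
SAMPLE_RESPONSE = """
{
  "data": [
    {
      "__typename": "Recommendation",
      "id": "887db13d-5036-4327-8630-dfc4288fd8fd",
      "tileId": 1661483798510704,
      "timeToRead": 3
    }
  ]
}
"""


def event_extras(item: dict) -> dict:
    """Build hypothetical event extras for one recommendation.

    Uses the API's "id" attribute as recommendation_id and deliberately
    ignores "tileId", as proposed in this comment.
    """
    return {"recommendation_id": item["id"]}


items = json.loads(SAMPLE_RESPONSE)["data"]
extras = [event_extras(item) for item in items]
```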
Assignee
Comment 4•1 year ago
Depends on D190858
Assignee
Comment 5•1 year ago
Mathijs, what datasets is the recommendation_id identifier permitted to join to? Its inclusion in the "newtab" ping gives it access to client_id.
Also: I don't know the answers to the data review request questions for this collection. Please take this attached template and fill in the necessary questions.
Reporter
Comment 6•1 year ago
Reporter
Comment 7•1 year ago
I'm proposing to rename 'id' to 'recommendationId' in the API. We independently came to the same conclusion as Scott that 'id' is too generic:
> It is my understanding that we can at some point migrate away from id: item.tileId, and just use recommendation_id: item.id? Is that correct? Otherwise, I think it's kinda confusing to have all these different ids with names that are not super descriptive.
I'll sync with Chutten before merging this, to ensure there's no unintended impact to Nightly.
Reporter
Comment 8•1 year ago
> Mathijs, what datasets is the recommendation_id identifier permitted to join to? Its inclusion in the "newtab" ping gives it access to client_id.
Does this answer from the review request answer this sufficiently, or should I add some more detail?
- Please provide a general description of how you will analyze this data.
We will join the above Firefox events with metadata for the content and server response.
As stated above, large groups of users receive the same recommendation_id, and therefore it cannot be associated with any personal information.
Examples of content metadata are: the curator's name, topic, curator tags.
The main server metadata is the response time.
These backend events are emitted using Snowplow. Eventually they will be migrated to Glean as well, but we don't have a concrete timeline for that yet.
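The join described in this answer might look roughly like the sketch below. All record shapes, field names other than recommendation_id, and sample values are made up for illustration; the real Glean and Snowplow tables will differ.

```python
# Hypothetical Firefox-side events keyed by recommendation_id.
events = [
    {"event": "impression", "recommendation_id": "rec-a"},
    {"event": "click", "recommendation_id": "rec-a"},
    {"event": "impression", "recommendation_id": "rec-b"},
]

# Hypothetical content/server metadata keyed by the same identifier.
metadata = {
    "rec-a": {"topic": "sports", "response_time_ms": 42},
}


def join_events(events: list, metadata: dict) -> list:
    """Inner-join events to metadata on recommendation_id."""
    joined = []
    for event in events:
        meta = metadata.get(event["recommendation_id"])
        if meta is None:
            continue  # no matching metadata: drop the event (inner join)
        joined.append({**event, **meta})
    return joined


joined = join_events(events, metadata)
```

Because many users receive the same recommendation_id, the join key attaches content and server attributes to an event without adding anything user-specific.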
Reporter
Comment 9•1 year ago
Please review the above data review request.
Comment 10•1 year ago
Comment on attachment 9358646 [details]
data collection review request
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?
This collection is Glean so is / will be documented in the Glean Dictionary.
Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism) Provide details as to the control mechanism available.
Yes. These collections are Glean. The opt-out can be found in the product's preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, chutten@mozilla.com, najiang@mozilla.com, mmccorquodale@mozilla.com, lina@mozilla.com, anicholson@mozilla.com will be responsible for the permanent collections.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
This bug falls under Category 2, "interaction data". It aligns with the Cat 2 example from the data collection category doc: "For example, selections of add-ons or tiles to determine potential interest categories etc."
Is the data collection request for default-on or default-off?
Default on for all channels, locales, and countries that are eligible for New Tab recommendations:
https://mozilla-hub.atlassian.net/wiki/spaces/FPS/pages/80448805/Regional+Differences
Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
Yes, it includes a new recommendation_id. This replaces the previously used tile_id, which may act as precedent.
Is the data collection covered by the existing Firefox privacy notice?
Unknown, because of the introduction of the new recommendation_id. Escalating to Legal.
Does there need to be a check-in in the future to determine whether to renew the data?
No. This collection is permanent.
Does the data collection use a third-party collection tool?
No.
Comment 11•1 year ago
@mathijs - Because of the introduction of the new recommendation_id, this needs to be escalated to legal. You can follow this process to do that:
https://wiki.mozilla.org/Data_Collection#Step_3:_Sensitive_Data_Collection_Review_Process
Reporter
Comment 12•1 year ago
Sent an email to data-review yesterday:
We would like to change the identifier for New Tab recommendations from one that only identifies the content ('tileId') to 'recommendation_id', which is unique to the content and also changes each time recommendations are re-ranked.
We do not get any additional information about individual users by making this change, because New Tab recommendations are delivered through a cached CDN. Each cached response is sent to 3,000 users on average. The goal is to have better observability and reliability, by being able to trace click and impression events back to the server-side event that generated the recommendations.
Assignee
Comment 13•1 year ago
Forgot to mention: we had a chat about this and we're not aiming to remove the old pocket AS events/pings until data validation's been done on the base set of Activity Stream reinstrumentation. They've already been instrumenting side-by-side since the New Tab Telemetry Project landed the first batch of instrumentation in Fx106 via bug 1786612.
The removal of the PingCentre-sent pocket AS instrumentation will happen as part of the broader removal of all PingCentre-sent AS instrumentation towards the end of the PingCentre Replacement Project (Phase 11, to be precise). (There's no bug for it yet, but when there is it will be tied to the meta bug 1820548).
Comment 14•1 year ago
Comment 15•1 year ago
bugherder
Reporter
Comment 16•1 year ago
The change from tile_id to recommendation_id was approved by legal. Email response:
I've reviewed with product legal and determined that this use case is covered under our existing privacy notice. Good to go.
Assignee
Comment 17•1 year ago
Depends on D190858
Original Revision: https://phabricator.services.mozilla.com/D190980
Depends on D192238
Comment 18•1 year ago
Comment on attachment 9361016 [details]
Bug 1854245 - Add pocket recommendation id to pocket newtab events
Approved for 120.0b5
Comment 19•1 year ago
uplift
Comment 20•1 year ago
After further review, we determined that we actually do need the tile_id field in this ping. Specifically, while adding recommendation_id alone is sufficient for understanding which pieces of recommended content are impressed/clicked in New Tab, sponsored content does not have a recommendation_id.
Instead, sponsored content receives a tile_id which is equal to its ad_id in Kevel. We use this tile ID to tie individual ads to specific advertisers, in order to monitor overall New Tab sponsored content performance, as well as to provide click rate client reporting to each advertiser.
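A small sketch of the resulting rule, assuming an event may carry either identifier: recommended content is identified by recommendation_id, sponsored content by tile_id. The function name and the choice to reject events with neither identifier are illustrative assumptions, not the behaviour of the newtab ping.

```python
from typing import Optional


def identifier_extras(recommendation_id: Optional[str], tile_id: Optional[int]) -> dict:
    """Return the identifier fields that are present for one item.

    Recommended content is expected to have a recommendation_id;
    sponsored content has only a tile_id (equal to its Kevel ad_id).
    """
    extras = {}
    if recommendation_id is not None:
        extras["recommendation_id"] = recommendation_id
    if tile_id is not None:
        extras["tile_id"] = tile_id
    if not extras:
        # Assumption for this sketch: an item with no identifier is an error.
        raise ValueError("item has neither recommendation_id nor tile_id")
    return extras


organic = identifier_extras("887db13d-5036-4327-8630-dfc4288fd8fd", None)
sponsored = identifier_extras(None, 12345)  # 12345 is a made-up ad id
```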
I corresponded with Nneka Soyinka over Slack asking if it was OK to add this request to the existing data review. She mentioned that it was, and that we did not need a separate data review for tile_id:
I think the main thing would be understanding why two identifiers are needed. I don't think the review needs to be separate per se. To me it's a copy/paste of everything you've already shared in the past review with the update that there is one additional data point that wasn't previously included, but here's why we need it
@chutten @Kenny Long can you please acknowledge that you also agree we can add tile_id to the newtab ping so that we can proceed with implementation? Thank you!
Assignee
Comment 21•1 year ago
Confirmed: tile_id can be added to "newtab". shim will get its own thing. See bug 1862670 for implementation.