Closed Bug 1544598 Opened 5 months ago Closed 5 months ago

Content blocking origin Telemetry uses the wrong origins when recording Telemetry

Categories

(Core :: Privacy: Anti-Tracking, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: englehardt, Assigned: xeonchen)

References

Details

Attachments

(2 files, 3 obsolete files)

Content blocking is currently passing the exact origins from the content blocking log into the recordOrigin API of Origin Telemetry [0]. This causes several problems:

  1. The recordOrigin API expects the passed origin to exactly match the list of origins [1]. This list is generated directly from the tracking protection list, meaning that includes the bare hostname that is often, but not always, an eTLD+1.

  2. The tracking protection list contains a few records with paths [2], which we won't have in originEntry.mOrigin.

Hacky suggestion for issue (1):
Since we don't know which rule caused the origin to be blocked, we have to figure out which slide of the origin to report at the time recordOrigin is called. recordOrigin ignores incorrect values [3], so we either have to add in origin slicing to recordOrigin or we have to call recordOrigin with multiple slices and allow Telemetry to silently drop the slices that aren't on the list and only record data for the matching slice.

No block rule on the list has more than two components. Thus, if we decided to take the approach of calling recordOrigin multiple times, we only need to call it with the eTLD+1 and eTLD+2 of the origin contained in the content blocking log. This will cause double counting for yandex.ru, since there are a number of yandex subdomains on the list as well (e.g., adfox.yandex.ru, an.yandex.ru). We could correct for this post-collection by subtracting the sums for *.yandex.ru from yandex.ru. There are not other origins that have both domains from the eTLD+1 and eTLD+2 levela on the list.

This isn't a great long-term solution, but it should be enough for the MVP. I'm comfortable ignoring issue (2) for the MVP, since it's only a couple records.

[0] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/dom/base/ContentBlockingLog.cpp#132

[1] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/toolkit/components/telemetry/core/TelemetryOriginData.inc

[2] https://github.com/mozilla-services/shavar-prod-lists/blob/8f59ebb7d2d8f9a57d6d0001d6e7bb7609da0e3e/disconnect-blacklist.json#L6731-L6735

[3] https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/toolkit/components/telemetry/core/TelemetryOrigin.cpp#312-318

Chris, is this something you're willing to solve for in recordOrigin? You'd just need to check all slices of a passed hostname down to its eTLD+1. That would be better than the hack solution above.

Flags: needinfo?(chutten)
Status: NEW → ASSIGNED
Priority: -- → P1

To my knowledge the analyses that this collection was to support were based on the rules being exercised during a pageload, not the hosts of the resources being blocked.

Does this mean a page blocking a resource from static.accounts.google.com and a resource from dynamic.accounts.google.com will call RecordOrigin(blocked, "accounts.google.com") twice? That second call will spawn a complete second batch of prio buffers, inflating the ping size and the counts for that rule.

Could we instead call RecordOrigin with the rules that were used? Then it will be deduplicated and match exactly the origins in the list.

Flags: needinfo?(chutten)
Flags: needinfo?(xeonchen)
Flags: needinfo?(senglehardt)
Attachment #9058462 - Attachment description: Bug 1544598 - record eTLD+1 & eTLD+2 origin telemetry; → Bug 1544598 - Part 1: record eTLD+1 & eTLD+2 origin telemetry;
Attachment #9058462 - Attachment is obsolete: true
Attachment #9058462 - Attachment is obsolete: false
Depends on: 1545033

Thanks Chris, we're exploring the feasibility of this in Bug 1545033.

Flags: needinfo?(senglehardt)
Attachment #9058635 - Attachment is obsolete: true
Attachment #9058462 - Attachment is obsolete: true
Depends on: 1545604
Attached patch WIP (obsolete) — Splinter Review
Flags: needinfo?(xeonchen) → needinfo?(dlee)

(In reply to Gary Chen [:xeonchen] from comment #7)

See Bug 1545033 Comment 5

Flags: needinfo?(dlee)
Attachment #9060025 - Attachment is obsolete: true
Depends on: 1548517
Attachment #9061648 - Attachment description: Bug 1544598 - use hash to record origin telemetry; → Bug 1544598 - Part 1: use hash to record origin telemetry;
Blocks: 1543712
Pushed by xeonchen@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/9a2bdd58f6c6
Part 0: fix assertion when EncodedSnapshot with an unknown origin; r=chutten
https://hg.mozilla.org/integration/autoland/rev/ce18611e1dd4
Part 1: use hash to record origin telemetry; r=Ehsan
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
You need to log in before you can comment on or make changes to this bug.