Bug 1364611 (Open) · Opened 7 years ago · Updated 2 years ago

Add telemetry to track complete matches per page load

Categories

(Toolkit :: Safe Browsing, enhancement, P3)



People

(Reporter: francois, Unassigned)

Details

(Whiteboard: #sbv4-m10)

Attachments

(1 file, 2 obsolete files)

To make it easy to compare how our implementation performs relative to Chrome's (and potentially discover bugs/regressions), we should add telemetry probes that track:

- percentage of page loads that match a prefix
- percentage of page loads that match a full hash

Both of these would be keyed on the V4 lists:

- goog-phish-proto
- goog-malware-proto
- goog-unwanted-proto
Assignee: nobody → tnguyen
Thomas, I'm going to hold on to this bug while I confirm the exact nature of the numbers that Chrome collects, to make sure I got the description right.
Assignee: tnguyen → francois
I did not look deeply into the Chrome code, but I doubt that Chrome blocks a page when an unsafe content resource loads.
In current Firefox, we simply ignore content loads (img, css, ...) and even nested iframes (we show the warning UI in the iframe but still allow the top-level page to load).
That would be a pretty big difference, which may have an impact on the number of "page load matches". If Chrome collects all URLs (content + top-level), I think we already have that telemetry probe.
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #2)
> I did not look deeply into the Chrome code, but I doubt that Chrome blocks a
> page when an unsafe content resource loads.

I'm not sure what they do for images, but they block the whole page in the case of bad iframes: https://bugzilla.mozilla.org/show_bug.cgi?id=1195242#c21
I just ran a small test, and yes, they block img and stylesheet links to unsafe URLs, so I think any unsafe load will cause the top-level page to be blocked.
Whiteboard: #sbv4-m7 → #sbv4-m8
Summary: Add telemetry to track prefix and complete matches per page load → Add telemetry to track complete matches per page load
I discussed options with the Chrome team and the following would match fairly closely the data they are able to track.

A single enumerated histogram keyed by threat type (phish, malware, unwanted):

0 = no match on the page
1 = warning page shown due to a match on the top-level URL
2 = warning page shown due to a match on a sub-resource (bug 1195242)

We should record this after every page load. Note that determining when a page has finished loading is not trivial.
Assignee: francois → nobody
(In reply to François Marier [:francois] from comment #5)
> I discussed options with the Chrome team and the following would match
> fairly closely the data they are able to track.
> 
> A single enumerated histogram keyed by threat type (phish, malware,
> unwanted):
> 
> 0 = no match on the page
> 1 = warning page shown due to a match on the top-level URL
> 2 = warning page shown due to a match on a sub-resource (bug 1195242)
> 
> We should record this after every page load. Note that determining when a
> page has finished loading is not trivial.

That may not cover the case where multiple lists are hit or multiple sub-resource loads match a list (for example, a page load that ends up with 3 blocked sub-resource loads of different types). So we may have to use bit flags to record the telemetry, something like:
    enum PageLoadMatchFlag {
      NO_MATCH = 0x00,
      TOP_LEVEL_MATCH = 0x01,
      SUB_RESOURCE_MATCH = 0x02,
      MALWARE_LIST_MATCH = 0x04,
      PHISHING_LIST_MATCH = 0x08,
      UNWANTED_LIST_MATCH = 0x10,
    };
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #6)
> That may not cover the case where multiple lists are hit or multiple
> sub-resource loads match a list (for example, a page load that ends up with
> 3 blocked sub-resource loads of different types).

I will clarify these points with Google. Hopefully we can get to something that's simple to measure and a useful comparison.
Whiteboard: #sbv4-m8 → #sbv4-m9
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #6)
> That may not cover the case where multiple lists are hit or multiple
> sub-resource loads match a list (for example, a page load that ends up with
> 3 blocked sub-resource loads of different types).

We will be counting the percentage of top-level page views that trigger at least one Safe Browsing warning.

If a URL matches more than one list type, we can simply take the "most severe" one. The order is defined here: https://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#67-87

e.g. if a single top-level page has 3 subresources:

- 1 that matches the malware list
- 1 unwanted, and
- 1 phishing

then this will count only as 1 malware page view.
Assignee: nobody → ettseng
Status: NEW → ASSIGNED
Attached patch bug-1364611.patch (obsolete) — Splinter Review
A WIP patch.
It only adds a new telemetry entry to Histograms.json.

Francois,
According to comment 8, we don't need to key on the V4 lists as you said in comment 0, do we?
Could you take a look to see if the telemetry format is sane?
Flags: needinfo?(francois)
Whiteboard: #sbv4-m9 → #sbv4-m10
Comment on attachment 8907710 [details] [diff] [review]
bug-1364611.patch

Review of attachment 8907710 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Ethan Tseng [:ethan] from comment #9)
> According to comment 8, we don't need to key on the V4 lists as you said in
> comment 0, do we?
> Could you take a look to see if the telemetry format is sane?

Looks good.

We should check with the telemetry folks to see whether this is the right approach, or whether a ping on each page load would cost too much in storage/processing.

::: toolkit/components/telemetry/Histograms.json
@@ +5019,5 @@
> +    "alert_emails": ["francois@mozilla.com", "safebrowsing-telemetry@mozilla.org"],
> +    "expires_in_version": "never",
> +    "releaseChannelCollection": "opt-out",
> +    "kind": "enumerated",
> +    "n_values": 16,

We probably only need 8 here.

@@ +5021,5 @@
> +    "releaseChannelCollection": "opt-out",
> +    "kind": "enumerated",
> +    "n_values": 16,
> +    "bug_numbers": [1364611],
> +    "description": "Whether Safe Browsing detected bad URLs in a page load (0 = none, 1 = malware, 2 = phishing, 3 = unwanted, 4 = others)."

I would rephrase as "Whether or not a Safe Browsing warning page was shown, either because of the top-level URL or the URL of a sub-resource, on a page load (0=none, 1=malware, 2=phishing, 3=unwanted, 4=other)"
Refreshed the patch based on Francois' comments.
Attachment #8907710 - Attachment is obsolete: true
Fixed typos in the last patch.
Attachment #8913968 - Attachment is obsolete: true
Sorry, this bug has been pending on me for a long time.
It turned out I am the best person to work on this.  Hand it over to Dimi.
Assignee: ettseng → dlee
(In reply to Ethan Tseng [:ethan] - 57 Regression Engineering Support from comment #13)
> It turned out I am the best person to work on this.  Hand it over to Dimi.
               I am "not"
No longer blocks: SafeBrowsingv4Telemetry
Assignee: dimidiana → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
Severity: normal → S3
