Closed Bug 1340021 Opened 3 years ago Closed 3 years ago

Better internet health monitoring through security telemetry

Categories

(Core :: DOM: Security, defect, P2)

51 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox53 --- fixed
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: kmckinley, Assigned: kmckinley)

References

(Blocks 1 open bug)

Details

(Whiteboard: [domsecurity-active])

Attachments

(1 file)

In order to better monitor general internet health, we should move certain security-related telemetry to enable release collection as opt-out. We care deeply about these metrics, and they are often disseminated widely as measurements of, for example, what percentage of page-loads are over HTTPS.
First pass:
HTTPS-Related Telemetry
HTTP_TRANSACTION_IS_SSL
HTTP_PAGELOAD_IS_SSL
HTTP_SCHEME_UPGRADE
SSL_HANDSHAKE_VERSION
SSL_HANDSHAKE_RESULT
SSL_TIME_UNTIL_READY
SSL_TIME_UNTIL_HANDSHAKE_FINISHED
SSL_RESUMED_SESSION
SSL_OBSERVED_END_ENTITY_CERTIFICATE_LIFETIME

Password Manager Telemetry
PWMGR_FORM_AUTOFILL_RESULT
PWMGR_LOGIN_PAGE_SAFETY
PWMGR_NUM_PASSWORDS_PER_HOSTNAME
PWMGR_PASSWORD_INPUT_IN_FORM
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

This looks like a reasonable set to me.  Chris, Benjamin, what do you think?

> Requirements for opt-out data collection (Firefox Health Report)
> The data must provide user value.

The user value here is that we are better able to track the use of security technologies in the Web. That means we're better able to know how aggressive we can be in restricting new features to HTTPS, warning users for non-secure sites, etc.
Attachment #8837934 - Flags: review?(chutten)
Attachment #8837934 - Flags: review?(benjamin)
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

https://reviewboard.mozilla.org/r/112928/#review114894

The code changes are well formed.
Attachment #8837934 - Flags: review?(chutten) → review+
I'm a little concerned at the size of some of these probes suddenly being added to our most populous channel. Some of these probes have hundreds of buckets and will start coming from nearly 100% of our release population all day, every day. This could result in us blowing our data budget[1].

+mreid for impact. tl;dr - 13 probes changing to opt-out, total of 1279 buckets.

If this ends up being a problem, but we still want this data from release, we can always write new, lower-impact (lower-resolution?) probes that cover similar territory.

...this is clearly beyond the scope of this bug and the timelines we're interested in... but it seems as though what we're after is a representative sample of internet use, not a complete survey. We're not counting things, we just want to know proportions. Maybe we need some mechanism to write probes that we want to measure from some representative sub-populations over time. Maybe hash(clientId+probename) % 100 < some_defined_percentage.

+gfritzsche for client-side ideas. Maybe a new field on Histograms.json "opt_out_sample"?

[1]: https://metrics.services.mozilla.com/telemetry-budget-dashboard/
Flags: needinfo?(mreid)
Flags: needinfo?(gfritzsche)
Is the cost here bandwidth or just storage. If storage, then let's run this for a bit and see if it's chewing up too much and if so, do the sampling on the server side.
The concern would be about storage costs. This will probably not be an issue, but i'll leave it to mreid to comment.
Lets take sampling discussions to a separate bug.
Flags: needinfo?(gfritzsche)
Kate, where are we with this?

Separately, I have filed for: https://bugzilla.mozilla.org/show_bug.cgi?id=1342090 to just fix SSL_HANDSHAKE_RESULT
Flags: needinfo?(kmckinley)
It's ready to land, but needs data review from bsmedberg.
Flags: needinfo?(kmckinley) → needinfo?(benjamin)
(In reply to Chris H-C :chutten from comment #5)
> I'm a little concerned at the size of some of these probes suddenly being
> added to our most populous channel. Some of these probes have hundreds of
> buckets and will start coming from nearly 100% of our release population all
> day, every day. This could result in us blowing our data budget[1].
> 
> +mreid for impact. tl;dr - 13 probes changing to opt-out, total of 1279
> buckets.

For comparison, we currently have a grand total of 223 probes enabled on release, of which 40 contain histograms with bucket counts, totaling up to 1663 buckets.

This could end up being significant, and would likely push the per-ping size over the budget estimates from the early days of unified telemetry (the estimate was 20kb per ping, we're at ~84% of that in the wild). That said, we currently have a fair bit of breathing room in the overall storage we expected to use.

Redirecting to Travis for a yea or nay.
Flags: needinfo?(mreid) → needinfo?(tblow)
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

https://reviewboard.mozilla.org/r/112928/#review117238

I know that these all predate you, but as these are becoming more important and we're going to opt-out it's important that our data docs are correct. I'd like to see this again. And we agreed that the team will identify a particular person who's responsible for periodic monitoring of this data going forward.

::: toolkit/components/telemetry/Histograms.json:1713
(Diff revision 1)
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1340021],
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "boolean",
>      "description": "Whether a HTTP transaction was over SSL or not."

Based on in-person discussion, I'm going to request improved histogram descriptions for many of these (often very old) metrics.

Related, I'd like to see followup bugs to make sure that any metric marked as expires_in_version: never has an automated test for the telemetry. It's easy and rather common to accidentally break telemetry probes during refactoring.

For this particular one: please clarify that a value is recorded for every HTTP request from Firefox ("transaction" was a bit confusing because I wasn't sure whether keepalive meant we recorded new values or not).

::: toolkit/components/telemetry/Histograms.json:1721
(Diff revision 1)
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1340021],
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "boolean",
>      "description": "Whether a HTTP base page load was over SSL or not."

Need to clarify whether this is *toplevel* page loads in a browser tab, or any page load including subframes.

::: toolkit/components/telemetry/Histograms.json:1940
(Diff revision 1)
> -    "bug_numbers": [1250568],
> +    "bug_numbers": [1250568,1340021],
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "enumerated",
>      "n_values": 16,
>      "description": "SSL Version (1=tls1, 2=tls1.1, 3=tls1.2, 4=tls1.3)"

I believe that this is the TLS version negotiated as part of the handshake (and is therefore primarily dependent on the server)?

This could use clarity and also specify the relationship between this metric and SSL_HANDSHAKE_RESULT: is this metric only recorded for a *successful* handshake (when SSL_HANDSHAKE_RESULT is 0), or also in some error cases?

::: toolkit/components/telemetry/Histograms.json:1992
(Diff revision 1)
>      "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1340021],
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "boolean",
>      "description": "complete TLS connect that used TLS Sesison Resumption"

nit: spelling error in the description.

I presume that when this is recorded we still also record the other metrics such as SSL_HANDSHAKE_RESULT and SSL_TIME_UNTIL_\*  But it would probably be good to document and test that explicitly.

::: toolkit/components/telemetry/Histograms.json:2046
(Diff revision 1)
>      "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1340021],
>      "kind": "enumerated",
>      "n_values": 125,
>      "releaseChannelCollection": "opt-out",
>      "description": "The lifetime of accepted HTTPS server certificates, in weeks, up to 2 years. Bucket 105 is all end-entity HTTPS server certificates with a lifetime > 2 years."

As discussed, to reduce risk it would be best to bucket this into the largest possible buckets that still answer the questions you care about (that would also reduce storage costs). Could this be refactored to bucket by month or something larger?

::: toolkit/components/telemetry/Histograms.json:9511
(Diff revision 1)
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "linear",
>      "high": 21,
>      "n_buckets" : 20,
>      "description": "The number of passwords per hostname"

This needs to specify *when* the metric is recorded. Is this at startup? Or when you visit a site with a potentially filled form?

Other related questions to answer and test: if you visit the same site/form multiple times, do we record a value each time? Does that skew the problem you're trying to answer? Renaming probes is a heavy stick, but saying PWMGR_NUM_PASSWORDS_PER_FORM might be a better name if that's in fact what this is recording.

::: toolkit/components/telemetry/Histograms.json:9533
(Diff revision 1)
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1340021],
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "boolean",
>      "description": "Whether an <input type=password> is associated with a <form> when it is added to a document"

This seems like an implementation metric that we should collect temporarily but not permanently. If you agree, mark it for six months/FF60 ?
Attachment #8837934 - Flags: review?(benjamin) → review-
Without answering point by point, these interact with SSL_HANDSHAKE_RESULT in a sort of unfortunate way.

Specifically:
SSL_HANDSHAKE_VERSION and some others are reported in the NSS handshake callback which captures every successful handshake. However, SSL_HANDSHAKE_RESULT only captures the result when we try to read or write on the socket, with the impact being that connections which are cancelled (even if they complete the handshake) may not be recorded in SSL_HANDSHAKE_RESULT but will be captured in SSL_HANDSHAKE_VERSION
Flags: needinfo?(benjamin)
Priority: -- → P2
Whiteboard: [domsecurity-active]
Going to reorder some of this reply to put more interesting stuff at the top.


> Based on in-person discussion, I'm going to request improved histogram
> descriptions for many of these (often very old) metrics.

> ::: toolkit/components/telemetry/Histograms.json:1713
> (Diff revision 1)
> > +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> > +    "bug_numbers": [1340021],
> > +    "releaseChannelCollection": "opt-out",
> >      "expires_in_version": "never",
> >      "kind": "boolean",
> >      "description": "Whether a HTTP transaction was over SSL or not."

> For this particular one: please clarify that a value is recorded for every
> HTTP request from Firefox ("transaction" was a bit confusing because I
> wasn't sure whether keepalive meant we recorded new values or not).
> 

Updated, this is all HTTP requests that completed (not keepalives or similar).

> ::: toolkit/components/telemetry/Histograms.json:1721
> (Diff revision 1)
> > +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> > +    "bug_numbers": [1340021],
> > +    "releaseChannelCollection": "opt-out",
> >      "expires_in_version": "never",
> >      "kind": "boolean",
> >      "description": "Whether a HTTP base page load was over SSL or not."
> 
> Need to clarify whether this is *toplevel* page loads in a browser tab, or
> any page load including subframes.

Updated. This is for any first-party loads, so top-level loads, clicks, history, etc, not for iframes.

SSL_HANDSHAKE_VERSION
SSL_HANDSHAKE_RESULT
I took a look at the code and updated the description as it matches ekr's explanation above.

> 
> ::: toolkit/components/telemetry/Histograms.json:1992
> (Diff revision 1)
> >      "alert_emails": ["seceng-telemetry@mozilla.com"],
> > +    "bug_numbers": [1340021],
> > +    "releaseChannelCollection": "opt-out",
> >      "expires_in_version": "never",
> >      "kind": "boolean",
> >      "description": "complete TLS connect that used TLS Sesison Resumption"
> 
> nit: spelling error in the description.
> 
> I presume that when this is recorded we still also record the other metrics
> such as SSL_HANDSHAKE_RESULT and SSL_TIME_UNTIL_\*  But it would probably be
> good to document and test that explicitly.

Updated explanation. This metric is updated at the same time as SSL_TIME_UNTIL_HANDSHAKE_FINISHED.
 
> ::: toolkit/components/telemetry/Histograms.json:2046
> (Diff revision 1)
> >      "alert_emails": ["seceng-telemetry@mozilla.com"],
> > +    "bug_numbers": [1340021],
> >      "kind": "enumerated",
> >      "n_values": 125,
> >      "releaseChannelCollection": "opt-out",
> >      "description": "The lifetime of accepted HTTPS server certificates, in weeks, up to 2 years. Bucket 105 is all end-entity HTTPS server certificates with a lifetime > 2 years."
> 
> As discussed, to reduce risk it would be best to bucket this into the
> largest possible buckets that still answer the questions you care about
> (that would also reduce storage costs). Could this be refactored to bucket
> by month or something larger?

Moving this to a separate bug as this one was changing the expiration. It is already opt-out in release. We primarily need the resolution at the low-end for this metric.

> 
> ::: toolkit/components/telemetry/Histograms.json:9511
> (Diff revision 1)
> > +    "releaseChannelCollection": "opt-out",
> >      "expires_in_version": "never",
> >      "kind": "linear",
> >      "high": 21,
> >      "n_buckets" : 20,
> >      "description": "The number of passwords per hostname"
> 
> This needs to specify *when* the metric is recorded. Is this at startup? Or
> when you visit a site with a potentially filled form?
> 
> Other related questions to answer and test: if you visit the same site/form
> multiple times, do we record a value each time? Does that skew the problem
> you're trying to answer? Renaming probes is a heavy stick, but saying
> PWMGR_NUM_PASSWORDS_PER_FORM might be a better name if that's in fact what
> this is recording.

This metric is only calculated when the gather-telemetry topic is observed.

> 
> ::: toolkit/components/telemetry/Histograms.json:9533
> (Diff revision 1)
> > +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> > +    "bug_numbers": [1340021],
> > +    "releaseChannelCollection": "opt-out",
> >      "expires_in_version": "never",
> >      "kind": "boolean",
> >      "description": "Whether an <input type=password> is associated with a <form> when it is added to a document"
> 
> This seems like an implementation metric that we should collect temporarily
> but not permanently. If you agree, mark it for six months/FF60 ?

Good catch, this is if-def'd to only fire in early beta or earlier, removing from this list.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #11)
> Comment on attachment 8837934 [details]
> Bug 1340021 - Collect better data regarding internet health on Release
> 
> https://reviewboard.mozilla.org/r/112928/#review117238
> 
> I know that these all predate you, but as these are becoming more important
> and we're going to opt-out it's important that our data docs are correct.
> I'd like to see this again. And we agreed that the team will identify a
> particular person who's responsible for periodic monitoring of this data
> going forward.

Noted, we will make sure it gets on our list.

> Related, I'd like to see followup bugs to make sure that any metric marked
> as expires_in_version: never has an automated test for the telemetry. It's
> easy and rather common to accidentally break telemetry probes during
> refactoring.

Follow-ups incoming for these measures:

HTTP_TRANSACTION_IS_SSL
HTTP_PAGELOAD_IS_SSL
HTTP_SCHEME_UPGRADE
SSL_HANDSHAKE_VERSION
SSL_HANDSHAKE_RESULT
SSL_TIME_UNTIL_READY
SSL_TIME_UNTIL_HANDSHAKE_FINISHED
SSL_RESUMED_SESSION
PWMGR_NUM_PASSWORDS_PER_HOSTNAME
PWMGR_FORM_AUTOFILL_RESULT
PWMGR_LOGIN_PAGE_SAFETY
Blocks: 1344065
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

https://reviewboard.mozilla.org/r/112928/#review119330

data-review=me for everything *except* PWMGR_NUM_PASSWORDS_PER_HOSTNAME. Might want to punt that one to a different change. There are also some serious technical issues with the gather-telemetry topic which doesn't fire reliably or consistently. For long-term measurement we should probably fix this to use a different mechanism before we make it permanent.

::: toolkit/components/telemetry/Histograms.json:9702
(Diff revision 2)
> +    "releaseChannelCollection": "opt-out",
>      "expires_in_version": "never",
>      "kind": "linear",
>      "high": 21,
>      "n_buckets" : 20,
> -    "description": "The number of passwords per hostname"
> +    "description": "The number of passwords per hostname. Only accumulated when the gather-telemetry topic is observed."

Still a question about this one, because it's so different from what I expected. Does this record a value for every hostname that has any passwords? So we'd never record a 0 value, but every time gather-telemetry runs it could record between one and hundreds of values?
Attachment #8837934 - Flags: review?(benjamin) → review+
Moving PWMGR_NUM_PASSWORDS_PER_HOSTNAME to a follow-on.

PWMGR_NUM_PASSWORDS_PER_HOSTNAME This records a value for every hostname in the database, so theoretically I assume if the hostname was in the database, but had no passwords (for some reason), it would record 0. However, in practice it records up to 20 buckets, with the low value being 1/hostname, and the upper value 20 or more. As I read it should only add up to 20 items each run. For example, if a user had 10 hostnames with 1 password, 3 with 2 passwords, and 1 with 5 passwords, the telemetry would gather 3 values, { 1: 10, 2: 3, 5: 1}. We don't track which hostnames have how many passwords.
Pushed by kmckinley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88c4c948364c
Collect better data regarding internet health on Release r=bsmedberg,chutten data-review=bsmedberg
https://hg.mozilla.org/mozilla-central/rev/88c4c948364c
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

Approval Request Comment
[Feature/Bug causing the regression]: None
[User impact if declined]: Incorrect telemetry for release channel
[Is this code covered by automated tests?]: Bug 1344065 has been filed to improve test coverage for these metrics
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: N/A
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Not risky
[Why is the change risky/not risky?]: This change changes telemetry definitions, not code.
[String changes made/needed]: None (telemetry only)
Attachment #8837934 - Flags: approval-mozilla-beta?
Attachment #8837934 - Flags: approval-mozilla-aurora?
Comment on attachment 8837934 [details]
Bug 1340021 - Collect better data regarding internet health on Release  data-review=bsmedberg

Last minute telemetry changes for 53 to make some items opt-out, passed data review. This should be in tomorrow's beta 9 build.
Attachment #8837934 - Flags: approval-mozilla-beta?
Attachment #8837934 - Flags: approval-mozilla-beta+
Attachment #8837934 - Flags: approval-mozilla-aurora?
Attachment #8837934 - Flags: approval-mozilla-aurora+
Mark, please let us know if you see any problems after this lands on beta.
Flags: needinfo?(mreid)
Flags: needinfo?(mreid)
See Also: → 1414388
Flags: needinfo?(tblow)
You need to log in before you can comment on or make changes to this bug.