Closed Bug 1414839 Opened 3 years ago Closed 3 years ago

Provide public dataset for SSL adoption in Firefox release

Categories

(Data Platform and Tools :: Datasets: General, enhancement, P1)

enhancement
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gfritzsche, Assigned: chutten)

References

(Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

Per bug 1413258, Let's Encrypt's adoption stats are using the current public Telemetry aggregates for release, querying HTTP_PAGELOAD_IS_SSL.

That data is going away per the data preference changes of bug 1406390 et al, see [1].
We should provide an alternative public dataset covering that use-case from release data.

1: https://medium.com/@georg.fritzsche/data-preference-changes-in-firefox-58-2d5df9c428b5
It's no problem for me to change how that code works.
This would probably just be a Spark job that consumes the main_summary data and writes it out in some predictable format (csv or JSON). I would imagine it would be a set of dimensions (os, country, etc.), along with total pageloads, and SSL pageloads. All we need then is the dimensions that the Let's Encrypt team is interested in using.
I'm pretty confident the LE team would be super-pleased with as much data as we could give them. If there were dimensions available, they'd probably graph them. :)
Might be cheaper for now to schedule a sql.tmo query that outputs some JSON?
I currently don't see any dimensions being used for the Let's Encrypt stats:
https://letsencrypt.org/stats/

To keep that use-case working we probably just need a very basic dataset "SSL page load ratio per day".
Whiteboard: [measurement:client]
Whiteboard: [measurement:client] → [measurement:client:tracking]
(In reply to Chris H-C :chutten from comment #5)
> Might be cheaper for now to schedule a sql.tmo query that outputs some JSON?

AFAIK there isn't a way to schedule a query to output publicly accessible (or even privately) files. If you know of a way, I'm all ears.
See https://telemetry.mozilla.org/crashes. It sources its data[1] from the sql.tmo query "Crashes of the last few days"[2] which refreshes itself daily.

[1]: https://sql.telemetry.mozilla.org/api/queries/1092/results.json?api_key=f7dac61893e040ca59c76fd616f082479e2a1c85
[2]: https://sql.telemetry.mozilla.org/queries/1092
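
A minimal sketch of how a client might consume a scheduled query this way. It assumes sql.tmo returns the usual re:dash result envelope (`query_result.data.rows`). The helper names and the sample payload are illustrative, and the API key is a placeholder.

```python
import json
from urllib.request import urlopen

BASE = "https://sql.telemetry.mozilla.org/api/queries"


def results_url(query_id, api_key):
    """Build the results.json URL for a saved query (same pattern as [1])."""
    return f"{BASE}/{query_id}/results.json?api_key={api_key}"


def extract_rows(payload):
    """Return the list of row dicts from a re:dash-style result document.

    Assumes the envelope {"query_result": {"data": {"rows": [...]}}}.
    """
    return payload["query_result"]["data"]["rows"]


def fetch_rows(query_id, api_key):
    """Download the latest cached result of the query and return its rows."""
    with urlopen(results_url(query_id, api_key)) as response:
        return extract_rows(json.load(response))


# Offline demonstration with a made-up payload (no network access needed).
sample = json.loads(
    '{"query_result": {"data": {"rows": [{"date": "2018-01-04", "crashes": 3}]}}}'
)
demo_rows = extract_rows(sample)
```

Because the query refreshes itself on a schedule, the client only ever reads a cached result; it never triggers the query itself.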
Assignee: nobody → gfritzsche
Points: --- → 2
Priority: -- → P1
I will not be working on this.
I assume that this gets picked up next iteration, not in this one.
Assignee: gfritzsche → nobody
Priority: P1 → P2
(In reply to Chris H-C :chutten from comment #8)
> See https://telemetry.mozilla.org/crashes. It sources its data[1] from the
> sql.tmo query "Crashes of the last few days"[2] which refreshes itself daily.
> 
> [1]: https://sql.telemetry.mozilla.org/api/queries/1092/results.json?api_key=f7dac61893e040ca59c76fd616f082479e2a1c85
> [2]: https://sql.telemetry.mozilla.org/queries/1092

This is great, let's absolutely do it this way. Much simpler and easier to maintain.
Note: Whatever we do for HTTP_PAGELOAD_IS_SSL, we should also query and export HTTP_TRANSACTION_IS_SSL. They're both useful counts for the ecosystem, even though currently only the first is graphed.

(Thanks, rlb, for the reminder!)
Based on this medium article, it seems like what's happening is that this metric is set to be "pre-release". Why don't we instead just change it to "release"?
Flags: needinfo?(jjones)
(In reply to Eric Rescorla (:ekr) from comment #12)
> Based on this medium article, it seems like what's happening is that this
> metric is set to be "pre-release". Why don't we instead just change it to
> "release"?

We fixed that particular part of the puzzle in bug 1340021

The problem after that was that we never performed an audit to ensure we were actually collecting it properly on release. It turns out that network telemetry does its own thing (see bug 1413258 comment 3) which needed to be rectified (also bug 1413258, shipping in 58). We now have a proper audit on the books (bug 1414388) that we'll likely not get to until after 57 is on release.
Flags: needinfo?(jjones)
When this aggregate is put together, it would be very helpful if the historical numbers -- even flawed as they are -- can be kept available from Mozilla somehow. (Worst case: Take basically the dataset I'm using currently and serve a static copy as the historical record with a caveat that it has a sampling bias)
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P2 → P1
Whiteboard: [measurement:client:tracking] → [measurement:client]
Alrighty, we have ourselves a preliminary dataset where I provide dimensions for OS and country for all data submitted since last November: https://sql.telemetry.mozilla.org/queries/49323/source#table

Would this, refreshing daily, be an acceptable mechanism for powering Let's Encrypt's data needs?
Flags: needinfo?(jjones)
This looks perfect, thanks! I assume there's a stable URL I could use to source the table? 

And do we need to think ahead to when this table gets a lot bigger? (I'm fine with dealing with that in the future if you are!)
Flags: needinfo?(jjones)
If you'd like the JSON from it, you can get an API Key from the three-dots menu. That'll allow you to request the data in a way similar to how I do it for the crashdash: https://github.com/mozilla/telemetry-dashboard/blob/gh-pages/crashes/index.html#L425

The query hasn't been reviewed, so I wouldn't use it just yet. As for future considerations, I think your frontend will run into problems before sql.tmo does. Will awful things happen if this ends up falling a few days behind schedule very occasionally?

ni?frank for query review.
Flags: needinfo?(fbertsch)
Query looks correct to me. Problem is, without raw counts, these dimensions cannot be combined. For example, if they were interested in all Windows subsessions, they couldn't combine the ratios from all of the dates since they do not all have the same denominator.

:Chutten we could work around that without giving raw counts by adding a normalized count, where the sum across all rows = 1. Then ratios could be combined by re-normalizing the normalized counts for that dimension value (e.g. retrieving all Windows subsession normalized counts, and just normalizing those), and use those re-normalized values as the weight for each row.

I'm assuming that these low reporting ratios seem to be correct considering the network telemetry oddities.

Other points:
- Do we need HTTP_TRANSACTION_IS_SSL as well?
- Were there any other dimensions other than date, os, and country that Let's Encrypt would be interested in using?
Flags: needinfo?(fbertsch)
The idea for normalization is to add another value to each row: sum_for_dimensions(total_pageloads) / sum_over_all(total_pageloads). The value sum_over_all(total_pageloads) will be constant. This new value can be called normalized_pageloads.

Client-side: When combining ratios (say, for all Windows machines), simply renormalize this value: normalized_pageloads / SUM(normalized_pageloads), for each row that has os = 'Windows'. Note that if you were to take all rows, SUM(normalized_pageloads) == 1. This new value can be called dimension_normalized_pageloads.

Then taking the ratio is simply SUM(ratio * dimension_normalized_pageloads). Note that for a single row, dimension_normalized_pageloads == 1, making the ratio unchanged.
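
The scheme above can be sketched end to end as follows. This is only an illustration: the raw counts are invented, and `publish` and `combined_ratio` are hypothetical helper names, not part of the actual query.

```python
# Hypothetical raw counts: (os, country, total_pageloads, ssl_pageloads).
RAW = [
    ("Windows", "US", 8000, 6000),
    ("Windows", "DE", 2000, 1000),
    ("Darwin", "US", 10000, 9000),
]


def publish(raw):
    """Server side: replace raw counts with ratio and normalized_pageloads."""
    grand_total = sum(total for _, _, total, _ in raw)
    return [
        {
            "os": os_name,
            "country": country,
            "ratio": ssl / total,
            "normalized_pageloads": total / grand_total,
        }
        for os_name, country, total, ssl in raw
    ]


def combined_ratio(rows, **dims):
    """Client side: combine ratios over every row matching the dimensions."""
    selected = [r for r in rows if all(r[k] == v for k, v in dims.items())]
    weight_sum = sum(r["normalized_pageloads"] for r in selected)
    # dimension_normalized_pageloads = normalized_pageloads / weight_sum
    return sum(
        r["ratio"] * r["normalized_pageloads"] / weight_sum for r in selected
    )


rows = publish(RAW)
windows_ratio = combined_ratio(rows, os="Windows")
```

With these made-up numbers the combined Windows ratio matches what the raw counts would give, (6000 + 1000) / (8000 + 2000), even though the client never sees a raw count.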
I have updated the query to contain normalized_pageloads. This allows clients to combine ratios (including the reporting ratio) for statistics that require multiple rows.

And, yes, I double-checked with a spreadsheet that the numbers line up between the normalized counts and the true counts (which I temporarily added, and have now removed). This wasn't because I don't trust :frank's math, but because I didn't trust that I didn't find a way to foul this up somehow.

How's it look for your uses, :jcj?
r?:frank
Flags: needinfo?(jjones)
Flags: needinfo?(fbertsch)
Looks good! Thanks :chutten!
Flags: needinfo?(fbertsch)
This looks pretty good to me. It does look like there's less data coming out of this query, though, but perhaps that's just a function of the debugging? For most dates of late we only get Windows numbers from every reporting country. It appears I have to go back to 2017-08-30 to get any Darwin reports.

For these 2058 rows, sum(normalized_pageloads):

select count(1), sum(normalized_pageloads) from tls;
2058, 0.440019469410441

so I think it's getting truncated somewhere?
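
The same sanity check can be run client-side on the fetched rows. A small sketch, assuming each row carries a `normalized_pageloads` field; the function names, tolerance, and sample rows are made up for illustration:

```python
def coverage(rows):
    """Fraction of all pageloads represented by the rows actually returned."""
    return sum(row["normalized_pageloads"] for row in rows)


def looks_truncated(rows, tolerance=1e-6):
    """True when the weights fall short of 1, i.e. some rows are missing."""
    return coverage(rows) < 1.0 - tolerance


# Made-up rows whose weights only add up to 0.44, as in the output above.
partial_rows = [
    {"os": "Windows_NT", "normalized_pageloads": 0.30},
    {"os": "Windows_NT", "normalized_pageloads": 0.14},
]
```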
Flags: needinfo?(jjones)
I was dropping any combination that had less than a certain normalized pageloads count. Maybe I should raise that...
I've changed the query so it excludes any (date, os, country) tuple that had fewer than 5000 pageloads. I figure that's a reasonable privacy barrier to hit. :jcj, does it look any better?
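
A rough sketch of that exclusion rule, not the actual query. It assumes the weights are normalized over all tuples before the small ones are dropped, which would explain published weights summing to less than 1. The counts and the name `published_rows` are hypothetical.

```python
MIN_PAGELOADS = 5000  # threshold quoted in the comment above


def published_rows(raw_counts, min_pageloads=MIN_PAGELOADS):
    """Drop small (date, os, country) tuples, then emit ratio and weight.

    raw_counts maps (date, os, country) -> (total_pageloads, ssl_pageloads).
    Assumption: weights are normalized over ALL tuples, dropped ones included.
    """
    grand_total = sum(total for total, _ in raw_counts.values())
    rows = []
    for (date, os_name, country), (total, ssl) in sorted(raw_counts.items()):
        if total < min_pageloads:
            continue  # too few pageloads to publish
        rows.append({
            "submission_date": date,
            "os": os_name,
            "country": country,
            "ratio": ssl / total,
            "normalized_pageloads": total / grand_total,
        })
    return rows


# Hypothetical counts for one day.
raw_counts = {
    ("2018-01-04", "Windows_NT", "US"): (90000, 63000),
    ("2018-01-04", "Darwin", "IS"): (4999, 4000),
    ("2018-01-04", "Linux", "DE"): (5000, 4500),
}
kept = published_rows(raw_counts)
```

Here the (Darwin, IS) tuple falls one pageload short of the threshold and is withheld, while a tuple with exactly 5000 pageloads is kept.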

If so, I'll start the Data Review process.
Flags: needinfo?(jjones)
I may not be able to check this until Friday; I've got PTO tomorrow and the rest of today looks shot... I hope that's OK. Sorry!
Flags: needinfo?(jjones)
Oops, that shouldn't have cleared. Touchscreen fail.
Flags: needinfo?(jjones)
This looks great. Some basic spreadsheet spot checks match up reasonably well with data from the existing telemetry (https://ct.tacticalsecret.com 's stuff)

So remaining things:

A) We'll want the output of this query to be available publicly somewhere; I don't mind using my API key to get it for Let's Encrypt, but there's interest from other parties too, and eventually Let's Encrypt might not want to be dependent on me for this info

B) I want to get a static form of this data from telemetry from before 1 Nov 2017 and host it someplace for posterity, too. Can we run that (enormous) query and host it someplace? If it's too large to host, I'd guess Let's Encrypt would host it for us.
Flags: needinfo?(jjones)
I'm... not sure how that will go. Here's where I rope in a Data Peer (rweiss in this case) to consult on how we might make a public dataset of release information.
Flags: needinfo?(rweiss)
Redirecting to :liuche
Flags: needinfo?(rweiss) → needinfo?(liuche)
I should explicitly state that there is a deadline of January 23rd for Firefox 58 reaching release, and the aggregates release data going dark.

If we can get a decision and discussion sorted before the end of the week of Jan 8-12, then that gives :jcj the week of the 15-19 to implement the new solution for Let's Encrypt.
Rob: georg tells me that you were recently investigating tools for publishing datasets for public use. Do you have any comments about how we should publish this sort of thing?

Short background: Let's Encrypt needs a public dataset of release SSL use. :frank, :jcj, and I have worked out a dataset format that looks like it ought to work, and I've implemented it as a re:dash query. For Let's Encrypt we can probably "just" publish this on re:dash with an API key, but this seems like the sort of thing we should make more broadly available.
Flags: needinfo?(rmiller)
Right... so we've got a tool called 'ensemble' (https://github.com/mozilla/ensemble) that can be used to generate simple public facing dashboards from existing data sets. You can see a prototype of it serving up a version of the Firefox Hardware Report at https://moz-ensemble.herokuapp.com/dashboard/hardware.

The only catch is that it demands that your data be formatted in a particular way. We're in the process of writing up a description of the format requirements, but the short version is that it should look like:

[
 {"date": "2018-01-04",
  "metrics": {
     "metric_name_0": {
         "value_label_0": 0.10,
         "value_label_1": 0.20,
         "value_label_2": 0.30,
         "value_label_3": 0.40
     },
     "metric_name_1": {
         "foo": 0.05,
         "bar": 0.15,
         "baz": 0.35,
         "bawlp": 0.45
     }
   }
 },

 {"date": "2018-01-05",
  "metrics": {
     "metric_name_0": {
         "value_label_0": 0.10,
         "value_label_1": 0.20,
         "value_label_2": 0.30,
         "value_label_3": 0.40
     },
     "metric_name_1": {
         "foo": 0.05,
         "bar": 0.15,
         "baz": 0.35,
         "bawlp": 0.45
     }
   }
 }
]

If it's relatively easy for you to publish your data in this shape, then it should be pretty easy for us to spin up a site to host your data set. Do you know if your data can map onto this structure at all?
Flags: needinfo?(rmiller)
So for each metric_name there is a graph, with value_label each being their own line on the plot. The x is the date, the y is the numeric value.

We conversed on IRC about ensemble and it is a dashboarding product. For now, for Let's Encrypt, we need a data source more than we need a dashboard, so ensemble will have to wait.

Unless, :jcj, would data in ensemble's format be useful to you? More useful, even, than the JSON coming out of the sql.tmo query?
Flags: needinfo?(jjones)
It would be very cool in that the data would be basically already-graphed, but Let's Encrypt probably would want to self-host the graph JS and the dataset rather than load it from us.

I'm guessing that's not hard -- the main hardware report does it, after all -- so I'd suggest you aim for whatever's easiest to make public -- both for the historical data and what's in sql.tmo.
Flags: needinfo?(jjones)
Sorry for the delay, still catching up from PTO.

Since this is data that is already collected (and thus has passed data review at some point for internal use), the data stewarding question is "how can we share this publicly".

It looks like the data being collected is release opt-in users' [submission_date, os, country, reporting_ratio, normalized_pageloads, ratio] for countries >5000 pageloads, as listed in this dashboard? https://sql.telemetry.mozilla.org/queries/49323/source#table

For data review, I'd need some public documentation on what specific data we'd be releasing. My instinct is that in aggregate this is fine to share, and the benefit is really great. But there definitely needs to be documentation about what values, etc are being shared (since it's already been collected and presumably also passed data review, this shouldn't be too much of a problem). I can see some concerns if sharing this data wasn't covered in a data collection privacy policy, so I'll loop in Marshall - he'll need to see documentation too.

chutten, can you fill out this form in this bug and link to documentation of what is being collected? https://github.com/mozilla/data-review/blob/master/request.md

Marshall, question for you - is there anything that needs to be done to release a dashboard of histogram_parent_http_pageload_is_ssl, or is documenting it publicly good enough?
Flags: needinfo?(merwin)
Flags: needinfo?(liuche)
Flags: needinfo?(chutten)
Public documentation should be fine, but I'd like to actually see the documentation. Seems like this is just the data we previously released publicly, plus os and country. If that is the case, the documentation should be sufficient.
Flags: needinfo?(merwin)
(In reply to Chenxia Liu [:liuche] - not actively working on Fennec from comment #35)
> It looks like the data being collected is release opt-in users'
> [submission_date, os, country, reporting_ratio, normalized_pageloads, ratio]
> for countries >5000 pageloads, as listed in this dashboard?
> https://sql.telemetry.mozilla.org/queries/49323/source#table

That is the data I hope to publish for Let's Encrypt's use, yes. 

> For data review, I'd need some public documentation on what specific data
> we'd be releasing. My instinct is that in aggregate this is fine to share,
> and the benefit is really great. But there definitely needs to be
> documentation about what values, etc are being shared (since it's already
> been collected and presumably also passed data review, this shouldn't be too
> much of a problem). I can see some concerns if sharing this data wasn't
> covered in a data collection privacy policy, so I'll loop in Marshall -
> he'll need to see documentation too.

Public documentation... would a comment on a bug suffice? A blog post? Or should I put a link under "Documentation" under https://telemetry.mozilla.org/ ? Is it enough that it's only expected to be used by Let's Encrypt (for the time being, at least)?
 
> chutten, can you fill out this form in this bug and link to documentation of
> what is being collected?
> https://github.com/mozilla/data-review/blob/master/request.md



    What questions will you answer with this data?

How much of the Web is loaded by Firefox users over SSL?

    Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:

Internet Health, generally speaking.

    What alternative methods did you consider to answer these questions? Why were they not sufficient?

I considered outputting the same data in different forms, but didn't consider trying to answer this question by other means.

    Can current instrumentation answer these questions?

Yes. HTTP_PAGELOAD_IS_SSL has been around for quite some time.

    List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

HTTP_PAGELOAD_IS_SSL. Category 2. bug 1340021 covers when it was made opt-out.

    How long will this data be collected? Choose one of the following:

In perpetuity, as it was already. I'll be responsible for its continued health.

    What populations will you measure?

All Firefox users.

    Which release channels?

From 58 onwards.

    Which countries?

All.

    Which locales?

All.

    Any other filters? Please describe in detail below.

No filters.

    Please provide a general description of how you will analyze this data.

It will be published so that ratios of SSL adoption can be calculated by day, OS, and country.

    Where do you intend to share the results of your analysis?

Let's Encrypt will put it up at https://letsencrypt.org/stats/
Flags: needinfo?(chutten)
A link under Documentation on tmo would be perfect :)
Pending documentation, r+ for data review from me.

> Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate? (see here, here, and here for examples). Refer to the appendix for "documentation" if more detail about documentation standards is needed.

Yes (on telemetry.mozilla.org)

> Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism) Provide details as to the control mechanism available.
Yes, telemetry switch

> If the request is for permanent data collection, is there someone who will monitor the data over time?
chutten

> Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Type 2

> Is the data collection request for default-on or default-off?
Default on

> Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
no

> Is the data collection covered by the existing Firefox privacy notice?
cc-ed Marshall to double check about sharing this data publicly

> Does there need to be a check-in in the future to determine whether to renew the data? (Yes/No) (If yes, set a todo reminder or file a bug if appropriate)

No, will be used to track SSL adoption over time
Blocks: 1430134
Here's the pull request for the documentation page: https://github.com/mozilla/telemetry-dashboard/pull/379

I invite feedback from any interested party: I tried to keep it just a condensate of what has been discussed here.
Note: The change to the Let's Encrypt website -- and the associated JavaScript implementing it -- is now awaiting review/merge:

https://github.com/letsencrypt/website/pull/234

It appears to work fine, following the algorithm from comment 19.
Ooh, just look at all those reporting_ratios climb! US Windows users are up to 17% reporting for Jan 24.

Work here is done.
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED