Closed Bug 1307568 Opened 8 years ago Closed 7 years ago

ship diagnostics system add-on with Firefox to diagnose update problems

Categories

(Toolkit :: Add-ons Manager, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: Felipe, Assigned: rhelmer)

References

(Blocks 1 open bug)

Details

(Whiteboard: [triaged][go-faster-system-addon])

Attachments

(4 files, 7 obsolete files)

We should design an experiment to run on the next release cycle, with a planned system add-on update, to more accurately understand how fast the update reaches users.

It should respect users' telemetry privacy choices, but maybe it should use a different endpoint than telemetry.

And more importantly, it should help answer the question on _why_ some users don't get the updates.
Blocks: 1307569
Rob Strong and I have been looking at the numbers for this, and both for the outofdate-notifications system add-on, and the websense hotfix, we saw the same numbers: only 55% of users received it after a considerable period of time
Depends on: 1305787
From bug 1293368 comment #55 - note: this add-on was pushed on 20160810 and the data was from then until 20161017. I also verified that the add-on is still being sent to Firefox 44.0.2 clients.

Since there have been questions about the effectiveness of the websense system add-on I took a look at the count of client ID's that have sent a ping for this add-on and the count of client ID's reporting 44.0, 44.0.1, and 44.0.2 as the highest version in the MainSummary parquet.

Client ID's reporting 44.0, 44.0.1, and 44.0.2
8167840

Client ID's in the out of date add-on datasource
4478979

This is around 55% of the clients which is the uptake Felipe has seen with the websense add-on.
We did ship the websense hotfix and the add-on to many more than 55% of 47.0.1 users. If we hadn't, then 45% of the usual population would still be held back on 47.0.1. (and 48.0.2).
Liz, what percentage of the population are you seeing that are above 47.0.1? Is it high enough to close bug 1305787 which you opened?
Flags: needinfo?(lhenry)
No, we still need to keep that bug open.
Flags: needinfo?(lhenry)
It appears that I no longer get the out of date hotfix add-on since the websense add-on was deployed but I ran the following queries against MainSummary

SELECT COUNT(DISTINCT client_id) AS Total FROM parquet.`s3://telemetry-parquet/main_summary/v3` WHERE submission_date_s3 > '20160809' AND (app_version = '44.0' OR app_version = '44.0.1' OR app_version = '44.0.2') AND channel = 'release'

which returned
14965024

Note: channel = 'release' should remove esr, etc. from the results.

I ran the following query against the out of date add-ons datasource.
SELECT COUNT(DISTINCT clientId) AS Total FROM tblOutOfDate WHERE `meta/submissionDate` > '20160809'

Note: tblOutOfDate is a table created using registerTempTable from the dataframe of the add-ons data.

which returned
4801478

4761794 out of the 4801478 had client id's in the MainSummary parquet which is 99.17% of the records.

So, either my assumptions are wrong, these simple queries are wrong, the telemetry data is wrong, a significant number of systems aren't receiving the hotfix, or perhaps something else. I need to focus my time on app update and hopefully someone that works on the system add-ons client code can take on the investigation from here.
As I understand it there is concern regarding "default" being excluded for the out of date add-on where "default" is the distribution id and distribution version. The distribution id is in the MainSummary parquet and excluding records where there is a distribution id returned 14028558 distinct client ID's.

The query used
SELECT COUNT(DISTINCT client_id) AS Total FROM parquet.`s3://telemetry-parquet/main_summary/v3` WHERE submission_date_s3 > '20160809' AND (app_version = '44.0' OR app_version = '44.0.1' OR app_version = '44.0.2') AND channel = 'release' AND distribution_id IS NULL

I also reran the first query in comment #6 to get an accurate number of the difference between clients with a distribution id and those without.

Distinct client IDs for 44.0, 44.0.1, and 44.0.2 from MainSummary as of 10/27
Total not excluding records with distribution IDs: 15175211
Total excluding records with a distribution ID   : 14028558
Difference                                       :  1146653

Note that these numbers are for all pings vs. the 1% from the longitudinal datasource.

The out of date add-on was distributed to 44.x excluding clients that sent default in the update url for the distribution id and distribution version. There is the possibility that some clients sent a distribution version without sending a distribution id though if they do which we have no reason to believe they would be a very small subset of clients. The out of date add-on also sends its own ping.

Distinct client IDs in the add-ons datasource as of 10/25
Total 4801478

4761794 of the 4801478 had client ID matches in MainSummary which is 99.17% of all client IDs in the add-ons datasource.

So, of the client IDs reporting 44.0, 44.0.1, and 44.0.2
Total excluding records with a distribution ID: 14028558
Total out of date add-on record               :  4801478
Record not in the out of date add-on records  :  9227080

I'm not trying to say that this is anywhere near an exact percentage of clients that aren't getting system add-ons. I am saying that if this data and the data for the websense is anywhere near correct that there is a serious problem with getting system add-ons pushed to clients and that needs to be investigated, etc.
Assignee: nobody → rhelmer
marking triaged and P3 - as rhelmer already took.  priority could be changed - but setting removes from untriaged list.
Priority: -- → P3
Whiteboard: triaged
rhelmer, any update on this?
Flags: needinfo?(rhelmer)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #9)
> rhelmer, any update on this?

Sorry I was on PTO on Friday, I have been thinking about how we can do this. I think Felipe's suggestion is a good start:

(In reply to :Felipe Gomes (needinfo me!) from comment #0)
> We should design an experiment to run on the next release cycle, with a
> planned system add-on update, to more accurately understand how fast the
> update reaches users.
> 
> It should respect users' telemetry privacy choices, but maybe it should use
> a different endpoint than telemetry.
> 
> And more importantly, it should help answer the question on _why_ some users
> don't get the updates.

The leading theory is that this is caused by MITM proxies (bug 1308251). If this is indeed the case, then it's probably breaking other things like Test Pilot, GMP, etc. so we should try to fix this in a common way (System Add-ons and GMP use the same update mechanism so that part is easy, Test Pilot uses normal add-ons with a custom update URL if I recall correctly.)

I think 52 would be a good release to do this experiment in, especially now that we have restartless system add-on update (bug 1204156). If the problem is that this is caused by MITM proxies that are e.g. breaking our pinned certs, then a lot of clients will just never see the available update, but we'll still get some useful data.

We should also ship a built-in system add-on with Firefox that attempts to report on if our update attempts are getting through at all, which we can use to try to explain the clients that are not getting updates.

I'll start working on the system add-on we'll need for above, and coordinate with folks about putting up a new endpoint as Felipe suggests. We'll need to think about how to set this up in such a way that MITM won't break it, and work with the security team on how exactly we should do that.
Flags: needinfo?(rhelmer)
(In reply to Robert Helmer [:rhelmer] from comment #10)
> The leading theory is that this is caused by MITM proxies (bug 1308251). If
> this is indeed the case, then it's probably breaking other things like Test
> Pilot, GMP, etc. so we should try to fix this in a common way (System
> Add-ons and GMP use the same update mechanism so that part is easy, Test
> Pilot uses normal add-ons with a custom update URL if I recall correctly.)

There's a telemetry experiment going on right now to look into MITM prevalence (bug 1311479) which hopefully should make it clear how much of a problem MITM proxies are.

This is likely to be breaking hotfix as well. I don't know if we have great data on this.

I'd suggest we wait until the data from the above experiment comes in, because I think the privacy issues in this bug might be a bit thorny.

Since telemetry is opt-in on release channels I don't think it'd be a useful signal - maybe it'd be better to honor users update settings instead. We also don't want to generate badwill by pinging a new endpoint that users have never seen before (and it's likely to be blocked in these locked-down environments anyway).

Maybe we could do something simpler like augment the AUS update ping with an extra param that indicates whether the system add-on/GMP update ping was able to connect or not?

We could pull this out of the Balrog logs, and see how it changes when things like bug 1308251 land.

Rob - what do you think about this ^?
Flags: needinfo?(robert.strong.bugs)
Those urls are already unique from the app update url. For example, the GMP url is
https://aus5.mozilla.org/update/3/GMP/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml

and the app update url is
https://aus5.mozilla.org/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%(noBug1296630v1)/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml

I personally think it would be a "good thing" ™ to create a test system add-on to try to figure out what is going on and just instrument the hell out of it. You could do things like check connecting to a site with a pinned certificate, test connecting to a site using the hacky cert pinning that was implemented before real cert pinning was available, etc. It should also have something that validates that how many systems of a specific Firefox version get the system add-on. Since Firefox 47.0.2 is a watershed perhaps all systems running Firefox 48.something or other to limit the number of systems that get that version.

You should check with telemetry folk to see what systems can send this data in case they already do or if there is a way to have all systems send it.
Flags: needinfo?(robert.strong.bugs)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #12)
> Those urls are already unique from the app update url. For example, the GMP
> url is
> https://aus5.mozilla.org/update/3/GMP/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/
> %LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.
> xml
> 
> and the app update url is
> https://aus5.mozilla.org/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/
> %BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%(noBug1296630v1)/
> %SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml
> 
> I personally think it would be a "good thing" ™ to create a test system
> add-on to try to figure out what is going on and just instrument the hell
> out of it. You could do things like check connecting to a site with a pinned
> certificate, test connecting to a site using the hacky cert pinning that was
> implemented before real cert pinning was available, etc. It should also have
> something that validates that how many systems of a specific Firefox version
> get the system add-on. Since Firefox 47.0.2 is a watershed perhaps all
> systems running Firefox 48.something or other to limit the number of systems
> that get that version.


Yep agree about doing a system add-on so we can instrument this - what I was thinking when I suggested this is that we can't make a connection to AUS from the GMP/SystemAddon code, but we can from app update but now I am realizing I can just make a connection from the system add-on so this wasn't well thought-out enough.

> You should check with telemetry folk to see what systems can send this data
> in case they already do or if there is a way to have all systems send it.

 
I have talked to them enough to think this is going to be a privacy problem and also a user perception problem, since we want data from all users not just those who have telemetry enabled... if we start connecting to telemetry servers when it's not enabled then that will be a problem.

Thinking through this more, we can do this by hitting AUS from the system add-on and avoiding pinning... I think we can side-step a lot of privacy concerns that way, and probably get more data too.

I'll start working on a system add-on.
Status: NEW → ASSIGNED
I'm kind of surprised regarding the privacy concern since this info should not be so detailed that it would make it possible to fingerprint them.

Also, the numbers I was able to get from Firefox 44 were in the millions which should be more than adequate for what you are trying to accomplish. See comment #7.

I truly can't imagine that what you will need to collect from these clients are anywhere close to being an issue compared to what we already collect from them.

Also, in comment #7 you can see that a significantly larger number of Firefox 44 clients were reporting standard telemetry than were reporting the out of date add-on telemetry.

Does that assuage your concerns and if not what specifically are your concerns?
Felipe, could you also give your opinion of the approach that should be taken especially as it relates to comment #11 through 14
Flags: needinfo?(felipc)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #14)
> I'm kind of surprised regarding the privacy concern since this info should
> not be so detailed that it would make it possible to fingerprint them.
> 
> Also, the numbers I was able to get from Firefox 44 were in the millions
> which should be more than adequate for what you are trying to accomplish.
> See comment #7.
> 
> I truly can't imagine that what you will need to collect from these clients
> are anywhere close to being an issue compared to what we already collect
> from them.
> 
> Also, in comment #7 you can see that a significantly larger number of
> Firefox 44 clients were reporting standard telemetry than were reporting the
> out of date add-on telemetry.
> 
> Does that assuage your concerns and if not what specifically are your
> concerns?

My primary concern is that people that have not enabled telemetry will see connections going to our telemetry servers. 

That said, it doesn't really matter to me where we collect the data.
(In reply to Robert Helmer [:rhelmer] from comment #16)
<snip>
> My primary concern is that people that have not enabled telemetry will see
> connections going to our telemetry servers. 
> 
> That said, it doesn't really matter to me where we collect the data.
The numbers in comment #7 are for people that are already submitting telemetry (I did nothing to make those clients submit the main telemetry ping and the clients that submitted the out of date add-on telemetry data were for all practical purpose included in those clients) and should more than suffice for what you are trying to accomplish.
It breaks down as follows for Firefox 44.x at the time I created the report in comment #7:
Total submitting telemetry data           : 14028558
Add-on records with matches in total above:  4801478 (99.17 of ALL out of date add-on records)
Clients without out of date add-on records:  9227080

So, for Firefox 44.x there were over 14 million clients submitting telemetry which should be more than enough and when adding this to a newer version there will be a much larger number.

The main issue is that you won't be able to test this by pushing out a system add-on. Instead, it will need to be included in the release similar to what was done in Firefox 47.0.2 with the websense add-on since you will need to test why these clients aren't getting system add-ons that are pushed to them.
Comment on attachment 8817842 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

This introduces a diagnostic system add-on that will ship with Firefox, which uses TelemetryLog to record information about errors and successes in the GMP/System add-on download path.

Felipe, I do intend to fix the TODOs, but what do you think of this as an approach? Anything I am missing that we should be looking for?

Note that I copied the private download functions out of ProductAddonChecker.jsm with the expectation that we'll modify them in future versions of this addon, so we can have the ability to verify changes we'd like to make before changing the code in Firefox.

This add-on sets a pref on first run, so it should only execute a single time. It cleans up after itself when uninstalled, so we should be able to easily add more diagnostics to this and upgrade it for future releases.

I have a few questions:

* Rob Strong suggested we might want to use keyed histograms - what do you think, and can we do this from an extension?

* it would be useful to check the profile to see if any system add-on updates have been installed, but it's hard to do this in a deterministic way and it relies on coordination with update server settings... while desirable I think we should do this later unless you see a good way to do so.
Flags: needinfo?(felipc)
Attachment #8817842 - Flags: feedback?(felipc)
Comment on attachment 8817842 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

https://reviewboard.mozilla.org/r/98060/#review98960

Sorry, lots of nits. I know that's probably not what you're looking for at this point, but...

Anyway, the only major changes I think we should make are:

1) Use a custom redirect handler, and record whether a non-built-in cert is being used, rather than aborting the request, and trying twice.
2) Add checks for the hotfix add-on, and possibly some arbitrary non-hotfix add-on.

Also, it might make sense to change the diagnosticPref to an int, rather than a bool, in case we need to deploy a second iteration of this to gather additional data, without making too much mess.

::: browser/extensions/diagnostics/bootstrap.js:29
(Diff revision 1)
> +  try {
> +    hasRun = Services.prefs.getBoolPref(diagnosticsPref);
> +  } catch (err) {
> +    Services.prefs.setBoolPref(diagnosticsPref, false);
> +  }

Please use Preferences.jsm rather than try-catch.

::: browser/extensions/diagnostics/bootstrap.js:55
(Diff revision 1)
> +  } else {
> +    TelemetryLog.log("DIAGNOSTICS_ERROR_UNEXPECTED_UPDATE_URL", [updateUrl]);
> +  }
> +
> +  // 2: try to connect to AUS.
> +  const url = UpdateUtils.formatUpdateURL(expectedUpdateURL);

Should we maybe try not to do these in parallel?

::: browser/extensions/diagnostics/bootstrap.js:78
(Diff revision 1)
> +
> +  // 3: try to connect to artifact server.
> +  const systemAddonTmpPath = downloadFile(systemAddonURL);
> +  systemAddonTmpPath
> +    .then(() =>
> +      TelemetryLog.log("DIAGNOSTICS_SUCCESS_DOWNLOADED_SYSTEM_ADDON")

We should either not actually save this file, or make sure we delete it on success.

::: browser/extensions/diagnostics/bootstrap.js:98
(Diff revision 1)
> +        TelemetryLog.log("DIAGNOSTICS_ERROR_SYSTEM_ADDON_UPDATES_NOT_RW")
> +      }
> +    } else {
> +      TelemetryLog.log("DIAGNOSTICS_ERROR_SYSTEM_ADDON_UPDATES_DIR_NOT_RW", updatesDir.permissions)
> +    }
> +  }

We should probably also test the hotfix update check and hotfix download, and probably update the update check URL for an arbitrary, non-hotfix add-on too.

::: browser/extensions/diagnostics/bootstrap.js:117
(Diff revision 1)
> +    let xhr = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
> +                  createInstance(Ci.nsISupports);
> +    xhr.onload = function(response) {
> +      console.info("downloadXHR File download. status=" + xhr.status);
> +      if (xhr.status != 200 && xhr.status != 206) {
> +        reject(Components.Exception("File download failed", xhr.status));

Nit: Template string. And please concistently use either Error or Exception.

::: browser/extensions/diagnostics/bootstrap.js:124
(Diff revision 1)
> +        yield f.file.close();
> +        yield OS.File.writeAtomic(path, new Uint8Array(xhr.response));

yield f.file.write(xhr.response);
    yield f.file.close();

::: browser/extensions/diagnostics/bootstrap.js:146
(Diff revision 1)
> +    } catch (ex) {
> +      reject(ex);
> +    }

No need for the try-catch. This happens automatically.

::: browser/extensions/diagnostics/bootstrap.js:169
(Diff revision 1)
> +    let request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
> +                  createInstance(Ci.nsISupports);

Please use `Cu.importGlobalProperties(["XMLHttpRequest"])`

::: browser/extensions/diagnostics/bootstrap.js:171
(Diff revision 1)
> +    // This is here to let unit test code override XHR
> +    if (request.wrappedJSObject) {
> +      request = request.wrappedJSObject;
> +    }

Yuck.

::: browser/extensions/diagnostics/bootstrap.js:176
(Diff revision 1)
> +    // This is here to let unit test code override XHR
> +    if (request.wrappedJSObject) {
> +      request = request.wrappedJSObject;
> +    }
> +    request.open("GET", url, true);
> +    request.channel.notificationCallbacks = new BadCertHandler(allowNonBuiltIn);

I think we'd be better off with a custom redirect handler here, and just recording whether the cert is non-built-in.

::: browser/extensions/diagnostics/bootstrap.js:193
(Diff revision 1)
> +    // HTTP/1.0 servers might not implement Cache-Control and
> +    // might only implement Pragma: no-cache
> +    request.setRequestHeader("Pragma", "no-cache");
> +
> +    let fail = (event) => {
> +      request = event.target;

Why?

::: browser/extensions/diagnostics/bootstrap.js:195
(Diff revision 1)
> +    request.setRequestHeader("Pragma", "no-cache");
> +
> +    let fail = (event) => {
> +      request = event.target;
> +      let status = getRequestStatus(request);
> +      let message = "Failed downloading XML, status: " + status +  ", reason: " + event.type;

Nit: Please use template strings.

::: browser/extensions/diagnostics/bootstrap.js:218
(Diff revision 1)
> +    request.addEventListener("error", fail, false);
> +    request.addEventListener("abort", fail, false);
> +    request.addEventListener("timeout", fail, false);
> +    request.addEventListener("load", success, false);

Nit: No need for third argument.
Comment on attachment 8817842 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

Looks nice! A couple more things that I'd test are:

- expected values of some more prefs, like app.update.enabled, app.update.auto, some of the timer prefs, and extensions.update.background.url (this one is not direct related to system add-ons but would be a nice side effect to also check if this pref is valid in the wild)

- I don't know if this is feasible or not, but it would also be nice to test you leaving the xpi file downloaded, wait a firefox restart, and check if no antivirus has removed the xpi.

Since you're using TelemetryLog, I think that comes as part of the main ping and the data will come too slow.. I suggest using a custom telemetry ping (maybe only in the failures cases?) to try to get a better sense faster of the results.

Also, I suggest including something in the report about the current version of the system add-on. That way, we can release an actual update to it via GoFaster and then see the split of installs reporting the 1.0 shipped and the updated version.
Attachment #8817842 - Flags: feedback?(felipc) → feedback+
Comment on attachment 8817842 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

https://reviewboard.mozilla.org/r/98060/#review98960

> We should probably also test the hotfix update check and hotfix download, and probably update the update check URL for an arbitrary, non-hotfix add-on too.

Agree but let's do this in a follow-up. I'd like to focus on system add-ons first, then start tackling the others.

> Nit: Template string. And please concistently use either Error or Exception.

I want to leave these alone for the first iteration, since it's exactly the same way we do it in Firefox right now. Let's measure before we start changing things.
Great feedback, thanks! 

(In reply to :Felipe Gomes (needinfo me!) from comment #22)
> Comment on attachment 8817842 [details]
> Bug 1307568 - add a diagnostic system add-on to measure uptake of various
> updates
> 
> Looks nice! A couple more things that I'd test are:
> 
> - expected values of some more prefs, like app.update.enabled,
> app.update.auto, some of the timer prefs, and
> extensions.update.background.url (this one is not direct related to system
> add-ons but would be a nice side effect to also check if this pref is valid
> in the wild)
> 
> - I don't know if this is feasible or not, but it would also be nice to test
> you leaving the xpi file downloaded, wait a firefox restart, and check if no
> antivirus has removed the xpi.

I've been thinking about this, it's a good idea... I think that we need to redesign this so it isn't a one-shot diagnostics so I'd like to do this in a followup.


> Since you're using TelemetryLog, I think that comes as part of the main ping
> and the data will come too slow.. I suggest using a custom telemetry ping
> (maybe only in the failures cases?) to try to get a better sense faster of
> the results.


It'll take some time to design this and get sign-off from privacy etc, and I want to push this as a test update now, so let's also do this later (but soon :) )


> Also, I suggest including something in the report about the current version
> of the system add-on. That way, we can release an actual update to it via
> GoFaster and then see the split of installs reporting the 1.0 shipped and
> the updated version.

We already report this in the telemetry environment data, I'd like to standardize on using that for measurement unless there's a reason not to. It makes tracking over time simpler.
Comment on attachment 8817842 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

https://reviewboard.mozilla.org/r/98060/#review107984

This is also going to need data review from a data peer.

::: browser/extensions/diagnostics/bootstrap.js:9
(Diff revision 2)
> +Cu.import("resource://gre/modules/CertUtils.jsm");
> +Cu.import("resource://gre/modules/FileUtils.jsm");
> +Cu.import("resource://gre/modules/osfile.jsm");
> +Cu.import("resource://gre/modules/Preferences.jsm");
> +Cu.import("resource://gre/modules/Task.jsm");
> +Cu.import("resource://gre/modules/TelemetryLog.jsm");
> +Cu.import("resource://gre/modules/UpdateUtils.jsm");
> +Cu.import("resource://gre/modules/Log.jsm");

Can we use lazy imports for most of these?

::: browser/extensions/diagnostics/bootstrap.js:24
(Diff revision 2)
> +const systemAddonURL = "https://ftp.mozilla.org/pub/system-addons/diagnostics/diagnostics@mozilla.org-ff51-v1.0.xpi";
> +const diagnosticsPref = "extensions.diagnostics.v1.hasRun";
> +// TODO use real SHA.
> +const systemAddonSHA = "abcd...";
> +const hashFunction = "sha256";
> +const updatesDir = FileUtils.getDir("ProfD", ["features"], false);

This causes sync IO. Please don't do it until necessary.

::: browser/extensions/diagnostics/bootstrap.js:31
(Diff revision 2)
> +
> +function startup() {
> +  let hasRun = Preferences.get(diagnosticsPref, false);
> +  if (!hasRun) {
> +    testSystemAddons();
> +    Preferences.set(diagnosticsPref, true);

We probably shouldn't set this until the task completes, so we don't wind up missing data from users who quit the browser before it finishes.

::: browser/extensions/diagnostics/bootstrap.js:52
(Diff revision 2)
> +  let serializer = Cc["@mozilla.org/xmlextras/xmlserializer;1"].
> +                   createInstance(Ci.nsIDOMSerializer);

Nit: I know this may be controversial, but please move the dot to the start of the second line. That's what most of our internal code uses these days.

::: browser/extensions/diagnostics/bootstrap.js:54
(Diff revision 2)
> +  // 2: try to connect to AUS.
> +  const url = UpdateUtils.formatUpdateURL(expectedUpdateURL);
> +  // try with default settings.
> +  let serializer = Cc["@mozilla.org/xmlextras/xmlserializer;1"].
> +                   createInstance(Ci.nsIDOMSerializer);
> +  await downloadXML(url)

These are going to happen very early during startup. I know that gives us the best chance of getting good data, but are we OK with the performance hit?

::: browser/extensions/diagnostics/bootstrap.js:55
(Diff revision 2)
> +    .then((xml) =>
> +      TelemetryLog.log("DIAGNOSTICS_SUCCESS_AUS_BUILT_IN_CERT", [serializer.serializeToString(xml)])
> +    )
> +    .catch(err =>
> +      TelemetryLog.log("DIAGNOSTICS_ERROR_AUS_BUILT_IN_CERT", [err])
> +    );

Please use one `.then` handler with success and failure callbacks. Same for the ones below.

::: browser/extensions/diagnostics/bootstrap.js:59
(Diff revision 2)
> +  await downloadXML(url)
> +    .then((xml) =>
> +      TelemetryLog.log("DIAGNOSTICS_SUCCESS_AUS_BUILT_IN_CERT", [serializer.serializeToString(xml)])
> +    )
> +    .catch(err =>
> +      TelemetryLog.log("DIAGNOSTICS_ERROR_AUS_BUILT_IN_CERT", [err])

The error might not be serializable. We should stringify it in some predictable way.

::: browser/extensions/diagnostics/bootstrap.js:62
(Diff revision 2)
> +  // try allowing non-built-in certs.
> +  await downloadXML(url, true)

Do we really need to do this if the built-in-only one passes?

::: browser/extensions/diagnostics/bootstrap.js:75
(Diff revision 2)
> +  // 3: try to connect to artifact server.
> +  await downloadFile(systemAddonURL)
> +    .then(function(file) {
> +      TelemetryLog.log("DIAGNOSTICS_SUCCESS_DOWNLOADED_SYSTEM_ADDON");
> +      let hasher = Cc["@mozilla.org/security/hash;1"].
> +      createInstance(Ci.nsICryptoHash);

Nit: Please move dot to this line and fix indentation.

::: browser/extensions/diagnostics/bootstrap.js:79
(Diff revision 2)
> +      let hasher = Cc["@mozilla.org/security/hash;1"].
> +      createInstance(Ci.nsICryptoHash);
> +      hasher.initWithString(hashFunction);
> +      let bytes;
> +      do {
> +        bytes = yield file.read(HASH_CHUNK_SIZE);

I think you need to make this an async function and use `await` here.

::: browser/extensions/diagnostics/bootstrap.js:88
(Diff revision 2)
> +      if (result) {
> +        TelemetryLog.log("DIAGNOSTICS_SUCCESS_SYSTEM_ADDON_SHA_MATCHES")
> +      } else {
> +        TelemetryLog.log("DIAGNOSTICS_ERROR_SYSTEM_ADDON_BAD_SHA")
> +      }
> +      file.remove(false);

Should probably `await` this just to be safe.

::: browser/extensions/diagnostics/bootstrap.js:97
(Diff revision 2)
> +    );
> +
> +  // 4: check if system add-on update files are present.
> +  // if system addon update dir exists, ensure it is r/w.
> +  if (updatesDir.exists()) {
> +    if (updatesDir.isReadable && updatesDir.isWritable) {

These are functions. Need `isReadable() && isWritable()` or this will always be true.

::: browser/extensions/diagnostics/bootstrap.js:123
(Diff revision 2)
> + *         with a JS exception in case of error.
> + */
> +function downloadFile(url) {
> +  return new Promise((resolve, reject) => {
> +    let xhr = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
> +                  createInstance(Ci.nsISupports);

Nit: No need for the `nsISupports` argument, and same formatting issues as above.

::: browser/extensions/diagnostics/bootstrap.js:130
(Diff revision 2)
> +      Task.spawn(function* () {
> +        let f = yield OS.File.openUnique(OS.Path.join(OS.Constants.Path.tmpDir, "tmpaddon"));
> +        let path = f.path;
> +        console.info(`Downloaded file will be saved to ${path}`);
> +        yield f.file.close();
> +        yield OS.File.writeAtomic(path, new Uint8Array(xhr.response));
> +        return path;
> +      }).then(resolve, reject);

Nit: `resolve(Task.spawn(...))`

::: browser/extensions/diagnostics/bootstrap.js:131
(Diff revision 2)
> +      if (xhr.status != 200 && xhr.status != 206) {
> +        reject(Components.Exception("File download failed", xhr.status));
> +        return;
> +      }
> +      Task.spawn(function* () {
> +        let f = yield OS.File.openUnique(OS.Path.join(OS.Constants.Path.tmpDir, "tmpaddon"));

Can this be `tmpaddon.xpi` just in case that causes interactions with things like security software?

::: browser/extensions/diagnostics/bootstrap.js:134
(Diff revision 2)
> +        yield f.file.close();
> +        yield OS.File.writeAtomic(path, new Uint8Array(xhr.response));

Is there a particular reason you do this rather than just `f.file.write(response)` before you close it?

::: browser/extensions/diagnostics/bootstrap.js:156
(Diff revision 2)
> +    } catch (ex) {
> +      reject(ex);
> +    }

No need for this. It will happen automatically.
Attachment #8817842 - Flags: review?(kmaglione+bmo) → review+
I'll finish up the review so we can land this in the tree, but it occurs to me that we could make a much simpler update-only add-on to ship ASAP that just does a single ping.

We have seen reports of some anti-virus software coming along and removing XPI files from the profile they don't recognize, so seeing telemetry pings that the add-on was installed that don't match the telemetry environment ping would be useful to have right now.
Attached file bootstrap.js (obsolete) —
This is a simple update-only system add-on to diagnose why the uptake rate is lower than expected.

:kmag - does this look reasonable?

rweiss/bsmedberg - do you know who could do a data review on this? I'd like to send this system add-on update to all Firefox users, and get a single custom ping on install and a single custom ping on startup. I don't need the client ID, and the add-on is removed on the next Firefox application update.

I'd like to make this available to both Release and Beta ASAP.
Attachment #8830071 - Flags: review?(kmaglione+bmo)
Attachment #8830071 - Flags: feedback?
(In reply to Robert Helmer [:rhelmer] from comment #28)
> Created attachment 8830071 [details]
> bootstrap.js
> 
> This is a simple update-only system add-on to diagnose why the uptake rate
> is lower than expected.
> 
> kmag - does this look reasonable?
> 
> rweiss/bsmedberg - do you know who could do a data review on this? I'd like
> to send this system add-on update to all Firefox users, and get a single
> custom ping on install and a single custom ping on startup. I don't need the
> client ID, and the add-on is removed on the next Firefox application update.
> 
> I'd like to make this available to both Release and Beta ASAP.
Flags: needinfo?(rweiss)
Flags: needinfo?(benjamin)
Attached file bootstrap.js (obsolete) —
silly typo
Attachment #8830071 - Attachment is obsolete: true
Attachment #8830071 - Flags: review?(kmaglione+bmo)
Attachment #8830071 - Flags: feedback?
Attachment #8830072 - Flags: review?(kmaglione+bmo)
Attached file bootstrap.js (obsolete) —
s/const/let/
Attachment #8830072 - Attachment is obsolete: true
Attachment #8830072 - Flags: review?(kmaglione+bmo)
Attachment #8830075 - Flags: review?(kmaglione+bmo)
Comment on attachment 8830075 [details]
bootstrap.js

I think we should probably use a more detailed name for the ping. Something like "system-addon-deployment-diagnostics", I guess.
Attachment #8830075 - Flags: review?(kmaglione+bmo) → review+
(In reply to Kris Maglione [:kmag] from comment #32)
> Comment on attachment 8830075 [details]
> bootstrap.js
> 
> I think we should probably use a more detailed name for the ping. Something
> like "system-addon-deployment-diagnostics", I guess.

Done. I also removed the "name" field from the payload since it seems redundant, now it just contains "version" and "phase".
Attached file diagnostics@mozilla.org-ff51-v1.0.xpi (obsolete) —
Here's the unsigned extension. I'll get it signed and hosted once we get sign-off from data folks.
Attachment #8830075 - Attachment is obsolete: true
> > rweiss/bsmedberg - do you know who could do a data review on this? I'd like
> > to send this system add-on update to all Firefox users, and get a single
> > custom ping on install and a single custom ping on startup. I don't need the
> > client ID, and the add-on is removed on the next Firefox application update.

I can do the data review. The data review happens on the documentation for the pings you're sending, for examples see:

* https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/addons-malware-ping.html / http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/addons-malware-ping.rst
* https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html / http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/crash-ping.rst

Why don't you want the clientID? It doesn't seem like it would hurt, and it could definitely help if we need to correlate this with usage patterns or other main-ping data.
Flags: needinfo?(rweiss)
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #35)
> > > rweiss/bsmedberg - do you know who could do a data review on this? I'd like
> > > to send this system add-on update to all Firefox users, and get a single
> > > custom ping on install and a single custom ping on startup. I don't need the
> > > client ID, and the add-on is removed on the next Firefox application update.
> 
> I can do the data review. The data review happens on the documentation for
> the pings you're sending, for examples see:
>
> *
> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/
> telemetry/data/addons-malware-ping.html /
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> docs/data/addons-malware-ping.rst
> *
> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/
> telemetry/data/crash-ping.html /
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/
> docs/data/crash-ping.rst

Thanks! I will put together some docs today.

> Why don't you want the clientID? It doesn't seem like it would hurt, and it
> could definitely help if we need to correlate this with usage patterns or
> other main-ping data.

I'll go ahead and include clientID - I agree that it might come in handy and won't hurt in any case.
Attachment #8817842 - Attachment is obsolete: true
Attached file diagnostics@mozilla.org-ff51-v1.0.xpi (obsolete) —
unsigned XPI that matches current telemetry doc
Attachment #8830103 - Attachment is obsolete: true
Comment on attachment 8830466 [details]
Bug 1307568 - document the system add-on diagnostic telemetry ping

https://reviewboard.mozilla.org/r/107214/#review108946

data-r=me
Attachment #8830466 - Flags: review?(benjamin) → review+
Comment on attachment 8830843 [details]
diagnostics@mozilla.org-ff51-v1.0.xpi

Jason, can you please sign this system add-on update?
Attachment #8830843 - Flags: feedback?(jthomas)
Attachment #8830843 - Flags: feedback?(jthomas) → feedback+
Please see attached.
Attachment #8831205 - Attachment mime type: application/x-xpinstall → application/octet-stream
:ryanvm - could you help with verifying the XPI in comment 43?
Flags: needinfo?(ryanvm)
Whiteboard: triaged → [triaged][go-faster-system-addon]
Flags: needinfo?(ryanvm) → needinfo?(rares.bologa)
Comment on attachment 8830948 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

https://reviewboard.mozilla.org/r/107618/#review109038

This mostly looks good, but I'm worried about the sanity of the test.

The telemetry code will also need data review, and I believe also someone assigned to do the analysis.

::: browser/app/profile/firefox.js:65
(Diff revision 2)
>  
>  // Check AUS for system add-on updates.
>  pref("extensions.systemAddon.update.url", "https://aus5.mozilla.org/update/3/SystemAddons/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml");
>  
> +// Diagnostics for system add-on updates.
> +pref("extensions.systemAddon.diagnostics.url", "https://ftp.mozilla.org/pub/system-addons/diagnostics/");

Can we use a default value in the pref getter rather than storing a preference?

::: browser/extensions/diagnostics/bootstrap.js:14
(Diff revision 2)
> +XPCOMUtils.defineLazyModuleGetter(this, "Preferences",
> +  "resource://gre/modules/Preferences.jsm");
> +XPCOMUtils.defineLazyModuleGetter(this, "FileUtils",
> +  "resource://gre/modules/FileUtils.jsm");
> +XPCOMUtils.defineLazyModuleGetter(this, "Task",
> +  "resource://gre/modules/Task.jsm");
> +XPCOMUtils.defineLazyModuleGetter(this, "TelemetryLog",
> +  "resource://gre/modules/TelemetryLog.jsm");
> +XPCOMUtils.defineLazyModuleGetter(this, "UpdateUtils",
> +  "resource://gre/modules/UpdateUtils.jsm");
> +XPCOMUtils.defineLazyModuleGetter(this, "Log",
> +  "resource://gre/modules/Log.jsm");

Please keep imports sorted.

::: browser/extensions/diagnostics/bootstrap.js:54
(Diff revision 2)
> +  for (let i = 0; i < input.length; ++i) {
> +    let hex = input.charCodeAt(i).toString(16);
> +    if (hex.length == 1) {
> +      hex = "0" + hex;
> +    }
> +    result += hex;
> +  }

Not that it matters, but possible alternative:

    let result = "";
    for (let char of input) {
      let code = char.charCodeAt(0);

      result += `0${code.toString(16)}`.slice(-2);
    }

::: browser/extensions/diagnostics/bootstrap.js:77
(Diff revision 2)
> +
> +  // 2: try to connect to AUS.
> +  const url = UpdateUtils.formatUpdateURL(updateUrl);
> +  // try with default settings.
> +  let serializer = Cc["@mozilla.org/xmlextras/xmlserializer;1"]
> +                   .createInstance(Ci.nsIDOMSerializer);

Nit: nsIDOMSerializer is a shim interface now, so just `.createInstance()` should suffice.

::: browser/extensions/diagnostics/bootstrap.js:136
(Diff revision 2)
> +
> +  // 4: check if system add-on update directories are present.
> +  // if system addon update dir exists, ensure it is r/w.
> +  const updatesDir = FileUtils.getDir("ProfD", ["features"], false);
> +  if (updatesDir.exists()) {
> +    if (updatesDir.isReadable() && updatesDir.isWritable()) {

It might also be a good idea to check `isExecutable` on Unixy systems...

::: browser/extensions/diagnostics/bootstrap.js:176
(Diff revision 2)
> +        let path = f.path;
> +        //console.info(`Downloaded file will be saved to ${path}`);
> +        yield f.file.close();
> +        yield OS.File.writeAtomic(path, new Uint8Array(xhr.response));

Same comment as before. Is there a reason you're closing the file and then using `writeAtomic` rather than just `f.file.write(response)`?

::: browser/extensions/diagnostics/bootstrap.js:200
(Diff revision 2)
> +    } catch (ex) {
> +      reject(ex);
> +    }

Still no need for this. The exception will automatically be converted to a rejection.

::: browser/extensions/diagnostics/test/browser/browser_check_installed.js:22
(Diff revision 2)
> +  for (let entry of TelemetryLog.entries()) {
> +    if (entry[0] == "DIAGNOSTICS_ERROR_UNEXPECTED_UPDATE_URL") {
> +      ok(true, "Bad update URL is detected correctly");
> +    } else if (entry[0] == "DIAGNOSTICS_SUCCESS_AUS_BUILT_IN_CERT") {
> +      ok(true, "Cert looks OK in test harness");
> +    } else if (entry[0] == "DIAGNOSTICS_WARN_SYSTEM_ADDON_UPDATES_DIR_DOES_NOT_EXIST") {
> +      ok(true, "Updates dir does not exist in testing environment");
> +    } else if (entry[0] == "DIAGNOSTICS_SUCCESS_DOWNLOADED_SYSTEM_ADDON") {
> +      ok(true, "Diagnostics test file was downloaded");
> +    } else if (entry[0] == "DIAGNOSTICS_SUCCESS_SYSTEM_ADDON_CHECKSUM_MATCHES") {
> +      ok(true, "Checksum for test file matches");
> +    } else {
> +      ok(false, "Unexpected telemetry entry", entry);
> +    }
> +  }

Hm. So this relies on our add-on reporting telemetry by this point, but nothing else... I'm not sure that's a safe bet. I'm also not sure it's safe to rely on all of the extension's startup code having run by this point.
Attachment #8830948 - Flags: review?(kmaglione+bmo)
Flags: needinfo?(rares.bologa) → needinfo?(madalin.cotetiu)
I have installed (used builds: Nightly 54.0a1 (2017-01-29) , FF Beta 52.0b1 and FF Release 51.0.1 in Win7 64 bit) the xpi added in comment 43 with the following results: 

After installing it on a clean profile 2 pings are triggered but both have "phase": "startup"
After removing and re-installing the add-on 2 pings are triggered one with "phase": "install" and one with "phase": "startup"

Is this the expected behavior? On the first install should both pings have the  "phase": "startup"?
Flags: needinfo?(madalin.cotetiu)
Blocks: 1323547
OK figured out why we were sending the same Telemetry payload twice - this fixes that and has r+ from kmag.

Jason could you please sign this system add-on update?
Attachment #8830843 - Attachment is obsolete: true
Attachment #8831205 - Attachment is obsolete: true
Flags: needinfo?(jthomas)
Please see attached.
Flags: needinfo?(jthomas)
Thanks for catching this! Can you please check the new version in comment 49.
Flags: needinfo?(madalin.cotetiu)
Liz, could you please review this for RelMan approval? Here is rhelmer's Intent to Ship email https://mail.mozilla.org/pipermail/gofaster/2017-January/000542.html
Flags: needinfo?(lhenry)
Yes, please release this whenever you are ready.
Flags: needinfo?(lhenry)
I have tested the xpi file from comment 49 (used builds: Nightly 54.0a1 (2017-01-31) , FF Beta 52.0b2 and FF Release 51.0.1 in Win7 64 bit and FF Release 51.0.1 and FF Beta 52.0b2 in Ubuntu 16.04 LTS 32 bit) and now the payloads looks ok. The first one has the "phase": "install" and the second "phase": "startup".
Flags: needinfo?(madalin.cotetiu)
We tested this on the release-sysaddon update channel using various platforms, versions, build architectures and locales, detailed test report available here:

        https://public.etherpad-mozilla.org/p/1335224&1307568
This has been deployed to release and beta.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Sorry, there's a patch that needs to land still, the bit we are going to ship with Firefox.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8830948 [details]
Bug 1307568 - add a diagnostic system add-on to measure uptake of various updates

https://reviewboard.mozilla.org/r/107618/#review110734

r=me with issues addressed, but this also needs data review from a data collection peer.

::: browser/extensions/diagnostics/test/browser/browser_check_installed.js:27
(Diff revision 3)
> +    "DIAGNOSTICS_SUCCESS_DOWNLOADED_SYSTEM_ADDON",
> +    "DIAGNOSTICS_SUCCESS_SYSTEM_ADDON_CHECKSUM_MATCHES",
> +    "DIAGNOSTICS_WARN_SYSTEM_ADDON_UPDATES_DIR_DOES_NOT_EXIST"
> +  ];
> +
> +  while(true) {

Nit: Missing space before opening paren.

::: browser/extensions/diagnostics/test/browser/browser_check_installed.js:33
(Diff revision 3)
> +    let reasons = TelemetryLog.entries().map(entry => {
> +      if (entry[0].startsWith("DIAGNOSTICS")) {
> +        return entry[0];
> +      }
> +    });
> +    if (expected_results.every(elem => reasons.includes(elem))) {

We should also make sure we don't get unexpected DIAGNOSTICS log entries.

::: browser/extensions/diagnostics/test/browser/browser_check_installed.js:35
(Diff revision 3)
> +        return entry[0];
> +      }
> +    });
> +    if (expected_results.every(elem => reasons.includes(elem))) {
> +      break;
> +    }

We need something like a `yield new Promise(resolve => setTimeout(resolve, 0))` here, or this will just spin forever when it runs too early.

::: browser/extensions/diagnostics/test/browser/browser_check_installed.js:38
(Diff revision 3)
> +    if (expected_results.every(elem => reasons.includes(elem))) {
> +      break;
> +    }
> +  }
> +
> +  delete window.TelemetryLog;

Nit: It would probably be better not to do a global import of this in the first place.
Attachment #8830948 - Flags: review?(kmaglione+bmo) → review+
Summary: Understand how fast a system add-on update gets delivered to users → ship diagnostics system add-on with Firefox to diagnose update problems
No longer blocks: 1323547
Depends on: 1342982
See Also: → CVE-2020-12421
From recent telemetry environment data, we see about a persistent ~6% of clients don't ever apply system add-on updates.

I think bug 1346017 covers enough of what I wanted to ship built-in - I'll take a look at the telemetry coming back from that and see if we need to take further action here.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
See Also: CVE-2020-12421
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: