Closed Bug 1079563 Opened 10 years ago Closed 10 years ago

give access to about:support data to specific whitelisted Mozilla websites

Categories

(Firefox :: General, defect)

defect
Not set
normal
Points:
3

Tracking

()

RESOLVED FIXED
Firefox 36
Iteration:
36.2
Tracking Status
firefox35 --- fixed
firefox36 --- fixed

People

(Reporter: Gavin, Assigned: markh)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [fxgrowth])

Attachments

(2 files, 5 obsolete files)

I think this is relatively straightforward, requirements as I see them:
- a whitelist of origins allowed to request the data (to start: input.mozilla.org and support.mozilla.org, SSL-only)
- a mechanism for the site to request the data and receive a JSON response (firing an specially-named event against the window/document, I imagine, and receiving an event in return with the JSON paylod from Troubleshoot.jsm)

To simplify things, I'm going to suggest that we leave the opt-in entirely within the web page's control. That is, if a whitelisted site requests the data, we'll assume that they've received consent from the user to do so (e.g. via the flow described in attachment 800977 [details], or via a checkbox, etc.), and have informed them sufficiently. Since we're restricting this capability to a very few specific sites, we should be able to verify that that assumption holds.
Blocks: 761781
Depends on: 554174
Flags: qe-verify-
Flags: firefox-backlog+
Whiteboard: [fxgrowth]
We could use WebChannel.jsm for this to get E10S support for free.
Adding Will and Ricky for the SUMO/Input side of things.
Can we add BMO to the initial list, or is there some reason that's a bad idea?
I don't think you need my approval, but everything in this bug sounds fine with me from the Input side of things.
(In reply to :Gijs Kruitbosch from comment #4)
> Can we add BMO to the initial list, or is there some reason that's a bad
> idea?

Let's not get held up expanding the initial list - we can always do that later. I think the cost/benefit evaluation differs significantly per-site, so we should consider those separately.
I'm not convinced we should do this, but that might depend on a discussion about the product features.

My intention, and the thing that we're building out in self-support, is to expose this kind of data in the existing support API which is exposed to about:healthreport. This will also allow the site to take remedial actions. However, I am concerned about the sensitivity of much of this information and so we are requiring that the code which accesses this information be code-signed (implemenation looks it will be signed JARs), not just pinned via HTTPS.

If possible, I'd like to focus on building this infrastructure into the signed-code platform and not into input/SUMO.

Perhaps there is a smaller set of data (buildIDs) which is less sensitive and could be exposed to whitelisted sites, but we should still be very explicit about the data exposure.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)

> My intention, and the thing that we're building out in self-support, is to
> expose this kind of data in the existing support API which is exposed to
> about:healthreport.

What API is this? Or do are you just talking about the existing FHR data reporting mechanism?

I don't think we should block this work on FHR. This is a lightweight "pave the cowpath" improvement of an existing workflow. I think the evolution of it has basically looked like:

  1) Getting info about a user's install results in a tedious discussion, so we created about:support.
  2) Asking a user to cut'n'paste from about:support can still be complicated and confusing for users.
     An in-product API to automate this was requested, but we never prioritized it, so SUMO cleverly
     routed around the problem by making a restartless addon to do so.
  3) Needing to install an addon is still an undesirable multi-step UX, so an in-product API is still best.

IMO, this work will actually end up complementing FHR. I think there's always going to be a need for a human-in-the-loop to diagnose unusual/rare problems, making educated guesses, and more importantly for _finding_ new problems that can be automated by FHR. The last being particularly important -- an automated "$ADDON is causing your $PROBLEM" system is great, but needs someone to discover that $ADDON is the cause of $PROBLEM in the first place. SUMO is on the front-lines for that, and so we should help streamline that process.

Now, at some point we'll want tighter integration of the two -- some way of first checking to see if FHR can solve a user's problem, and if not collect more info for a human to help with. But my impression is that FHR is not currently useful to SUMO, and that making it useful will be a _lot_ more work than what this bug about. (FHR can certainly improve in parallel, but it just shouldn't block easy incremental improvements.)
Points: --- → 3
Assignee: nobody → mhammond
Status: NEW → ASSIGNED
Iteration: --- → 36.1
Matt, I have a couple of questions based on a conversation with bsmedberg about his concerns:

- can you describe some example use cases for how you would use this data? For example, one I know we talked about was "determine which addons lead to the most support requests", or "find out which addons are most highly correlated with specific support requests/feedback". Do you have a list of other examples?

- can you elaborate on which parts of the about:support data you consider highest-value? From our discussion, and based on the use cases above, I imagine modified prefs and add-ons are up there. I see less potential value in e.g. crash report IDs, detailed graphics driver data, and maybe others, but I wonder if you see things differently.
Flags: needinfo?(mgrimes)
Matt's at tribe so I'll fill in. In the future you can ping me if needed

The main example use cases will be:
* User comes to SUMO. As part of the Ask a Question flow, they provide their about:support date. We can use this to properly categorize and tag their question, analyze the data to detect known problems (like malicious extensions), and when used in aggregate this will provide a large field of data that we can analyze (see trends such as "70% of users complaining about slowness have addon X"). This will also help SUMO contributors give a higher level of support with  the first post instead of having to make guesses or request the user provide this data.

* On Input, we will be able to use this data to draw similar correlations but with a much larger data set.

As for what is high value, obviously, if we can have all of it and parse it on server side as needed that would be best. Lowest value to us will be Accessibility and Library Versions, but like I said, we can parse this on a case by case basis server side as well.
Flags: needinfo?(mgrimes)
Crash ID's and graphics information is actually extremely valuable for troubleshooting and data aggregation.
Graphics card info is really important for example right now where we have OMTC-related black screens on certain graphics cards but users are notoriously poor reporters of what kind of graphics card and driver they have. We basically need enough info to pass along potential blocklist candidates. I don't know if the more detailed graphics info is useful (like AzureFallbackCanvasBackend)

We don't need all of the crash reports total but reports in the last 3 days (which is what is currently in the visible about:support) would be nice. If someone comes in complaining that "Firefox crashes all the time", it'd be nice to know what they mean by "all the time" and what could be causing that crash.

Accessibility isn't terribly useful because most people know if they have it turned on and will tell us, but back when F7 was caret browsing, that was a big problem so there is some historical precedent.

I actually think most modified prefs aren't that useful. Especially ones that everyone has changed (like print. settings). There are very rare cases where someone has print problems that can be fixed by looking at the print prefs but there are usually dozens of lines of print prefs that mean little.

I'm fairly certain that library versions tie so strongly with browser version that we probably can't get useful info there.
I started looking at the breakdown for this.  ISTM that to be consistent we'd use the permission manager as the "whitelist", in the same way UITour does.

However, this doesn't seem like it will play well with WebChannel - which I agree would make sense here - but it wants a single origin.  A solution to that might be to allow the WebChannel "origin" to be either a URI or a callback - we'd use a callback.  This sounds better than enumerating permissions and adding one channel per found "allow" entry.

Matt, any thoughts on that?
Flags: needinfo?(MattN+bmo)
Also, can we make sure that the JSON returned has non-localized fields? It would be very very annoying to parse on the web side if we have to deal with all possible localizations.
(In reply to [:Cww] from comment #15)
> Also, can we make sure that the JSON returned has non-localized fields? It
> would be very very annoying to parse on the web side if we have to deal with
> all possible localizations.

It looks to me like Troubleshooting.jsm always returns the same data structure without consideration of locales, while aboutSupport itself is what applies localization to this raw data.  So this is fine (and indeed would make it more complicated if we *did* want localized values)
This is a rough sketch for what I had in mind in comment 14.  I think it implements everything mentioned in this bug - but it returns all Troubleshooting.jsm data rather than a subset pending a decision on that.  It's fairly light-weight:

 browser/app/default_permissions |  4 ++++
 browser/base/content/browser.js | 27 +++++++++++++++++++++++++++
 toolkit/modules/WebChannel.jsm  | 16 ++++++++++++----
 3 files changed, 43 insertions(+), 4 deletions(-)

It changes WebChannel.jsm to allow a callback instead of only allowing a single origin.  The callback used here explicitly checks for https and a default port, then checks the permission manager to see if there is an explicit "allow" for the host.

The HTML code to consume this would look something like:

// Code to get the JSON response - here we just stick it directly in the DOM.
window.addEventListener("WebChannelMessageToContent", function (e) {
  if (e.detail.id == "remote-troubleshooting") {
    document.getElementById("troubleshooting").textContent = JSON.stringify(e.detail.message, null, 2);
  }
});

// Code to initiate the request for the data:
var event = new window.CustomEvent("WebChannelMessageToChrome", {
  detail: {
    id: "remote-troubleshooting",
    message: {
      command: "request",
    },
  },
});
window.dispatchEvent(event);

Note that there are no tests (they'll be needed for both the new command and the changes to WebChannel) but thought I'd put it up here for feedback now.
Flags: needinfo?(MattN+bmo)
Attachment #8507585 - Flags: feedback?(MattN+bmo)
Comment on attachment 8507585 [details] [diff] [review]
0001-Bug-1079563-WIP-experiment-allow-sites-to-get-troubl.patch

Review of attachment 8507585 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/browser.js
@@ +1259,5 @@
>      gNavToolbox.addEventListener("customizationending", CustomizationHandler);
>  
> +    // Initialize "remote troubleshooting" code...
> +    let channel = new WebChannel("remote-troubleshooting", requestPrincipal => {
> +      // The permission manager operates on domain names rather than true

What the reason for not just registering 2 WebChannels? WebChannel doing the origin checking for you is one of the advantages and it seems like the overhead of a second WebChannel is minimal since it uses one global message manager listener at the broker. Am I missing a downside?

Untested example:

function remoteTroubleshootingHandler(id, data, target) {
 …
}

let remoteTroubleshootingChannels = new Map();
let remoteTroubleshootingOrigins = ["input.mozilla.org", "support.mozilla.org"];
for (let origin of remoteTroubleshootingOrigins) {
  let channel = new WebChannel("remote-troubleshooting", Service.io.newURI(origin, null, null));
  channel.listen(remoteTroubleshootingHandler);
  remoteTroubleshootingChannels.set(origin, channel);
}

You could populate remoteTroubleshootingOrigins from the Permission Manager if you want to use it.
Attachment #8507585 - Flags: feedback?(MattN+bmo) → feedback+
(In reply to Matthew N. [:MattN] (behind on reviews - focused on Loop) from comment #18)
> What the reason for not just registering 2 WebChannels? WebChannel doing the
> origin checking for you is one of the advantages and it seems like the
> overhead of a second WebChannel is minimal since it uses one global message
> manager listener at the broker. Am I missing a downside?

Not really - I just thought it a worthwhile improvement and well suited to interfacing this with the permissions manager, and any other code with more strict requirements than a simple origin check - eg, you will note this patch explicitly checks for https.

> You could populate remoteTroubleshootingOrigins from the Permission Manager
> if you want to use it.

That seems a little like overkill - I'd need to enumerate all PM entries which doesn't sound great.  If you really don't like the channel changes, I think it would be better to have, eg, a single pref with space delimited URLs and use that, creating as many channels as we find.

Personally I still prefer the channel change, but if you continue to think it's not worthwhile, how do you feel about a single pref?
Flags: needinfo?(MattN+bmo)
(In reply to Mark Hammond [:markh] from comment #19)
> (In reply to Matthew N. [:MattN] (behind on reviews - focused on Loop) from
> comment #18)
> > What the reason for not just registering 2 WebChannels? WebChannel doing the
> > origin checking for you is one of the advantages and it seems like the
> > overhead of a second WebChannel is minimal since it uses one global message
> > manager listener at the broker. Am I missing a downside?
> 
> Not really - I just thought it a worthwhile improvement and well suited to
> interfacing this with the permissions manager, and any other code with more
> strict requirements than a simple origin check - eg, you will note this
> patch explicitly checks for https.

The scheme (https) is already included in origin checks though so you're just re-implementing this for the permission manager usage AFAICT.

> > You could populate remoteTroubleshootingOrigins from the Permission Manager
> > if you want to use it.
> 
> That seems a little like overkill - I'd need to enumerate all PM entries
> which doesn't sound great.

It doesn't sound great but I was thinking we could add an API to the permission manager to be able to get an enumerator for a single permission if we want to avoid enumerating all.

> If you really don't like the channel changes, I
> think it would be better to have, eg, a single pref with space delimited
> URLs and use that, creating as many channels as we find.

I think using the permission manager makes sense but I dislike the callback approach where the implementer is left doing the origin checking themselves which is a footgun that WebChannel is supposed to be taking away.

> Personally I still prefer the channel change, but if you continue to think
> it's not worthwhile, how do you feel about a single pref?

I think a pref is better than the origin checking callback but I think enumerating the specific permission entries is best IMO.
Flags: needinfo?(MattN+bmo)
As discussed with Matt on IRC, I'm reluctant to import the permission manager and enumerate entries as part of delayed-startup and would much rather do some of that once an incoming request is made.  Matt indicated he'd possibly be happy with allowing an nsIURI or a string permission name and having the logic contained in WebChannel, which removes the foot-gun of allowing the callback to be defined externally.

This patch also makes a change to Troubleshoot.jsm to replace all occurrences of the profile dir with %PROFILE_DIR%.  This change is unconditional, so the sanitized value will also appear in about:support or if an addon etc uses Troubleshoot.jsm directly.

Matt, let me know if you are happy with this approach and I'll work on tests.
Attachment #8507585 - Attachment is obsolete: true
Attachment #8509202 - Flags: feedback?(MattN+bmo)
Comment on attachment 8509202 [details] [diff] [review]
0001-Bug-1079563-WIP-experiment-allow-sites-to-get-troubl.patch

Review of attachment 8509202 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/modules/WebChannel.jsm
@@ +133,5 @@
>   *
>   * @param id {String}
>   *        WebChannel id
>   * @param origin {nsIURI}
>   *        Valid origin that should be part of requests for this channel

Don't forget to update the JSDoc

@@ +142,1 @@
>      throw new Error("WebChannel id and origin are required.");

This comment maybe should be updated too

@@ +142,5 @@
>      throw new Error("WebChannel id and origin are required.");
>    }
>  
>    this.id = id;
> +  // origin can be either an nsIURI or a string represending a permission name.

Update comment

@@ +147,5 @@
> +  if (typeof originOrPermission == "string") {
> +    this._originCheckCallback = requestPrincipal => {
> +      // The permission manager operates on domain names rather than true
> +      // origins (bug 1066517).  To mitigate that, we explicitly check that
> +      // the scheme is https://.  This can be possibly removed once 1066517.

Remove: "This can be possibly removed once 1066517."

@@ +150,5 @@
> +      // origins (bug 1066517).  To mitigate that, we explicitly check that
> +      // the scheme is https://.  This can be possibly removed once 1066517.
> +      let uri = Services.io.newURI(requestPrincipal.origin, null, null);
> +      let result = false;
> +      if (uri.scheme == "https") {

Negate this and do an early return if uri.scheme != "https". Then you don't need the temporary |result| variable.

@@ +158,5 @@
> +        result = perm == Ci.nsIPermissionManager.ALLOW_ACTION;
> +      }
> +      return result;
> +    }
> +  } else {

You don't need an |else| after a |return|

@@ -158,2 @@
>     */
> -  origin: null,

For debugging I can see it being useful to know what origin a webchannel is for. Some tests may have been accessing this.
Attachment #8509202 - Flags: feedback?(MattN+bmo) → feedback+
The WebChannel tests used services-common/async.js and event-loop spinning to make tests async.  This is a trivial test-only change that moves to using promises and is broken out so the "real" changes in the next patch are obvious.  This is trivial enough it's r=me, but feel free to look if you like ;)
Attachment #8509883 - Flags: review+
All feedback addressed, although:

> You don't need an |else| after a |return|

We do in that case - the return is part of the local function, not the control flow of the outer function.

This version also has lots of tests:

* troubleshooting tests check the ProfD is indeed replaced with %PROFILE_DIR%
* webchannel tests grow tests for the permission string.
* new remoteTroubleshooting tests to test this specific functionality.

(I could split this a little, particularly the ProfD change and tests, let me know if you'd prefer I did that).

I think this is ready for review, but not formally requesting it until we agree we actually want to land this (Matt - please feel free to ignore this feedback request until then if you are busy).  needinfo Gavin to push this forward and let us know when it's OK to land.

Note specifically:

* No prefs are removed from the set we return, but note that *all* prefs returned by Troubleshoot.jsm have ProfD replaced with %PROFILE_DIR%, which IIUC addresses the identified privacy risks with the prefs.  Note this means the substitutions are made for about:support and for any addons that use Troubleshoot.jsm.

* Crash reports are removed before being returned to the remote requestor (ie, they will remain in about:support and addons using the module directly)

* Pinning these sites remains unresolved.  Discussion is ongoing but the timing of that is largely out of our control.
Attachment #8509202 - Attachment is obsolete: true
Flags: needinfo?(gavin.sharp)
Attachment #8509892 - Flags: feedback?(MattN+bmo)
Attachment #8509883 - Flags: review+
Summary: breakdown: give access to about:support data to specific whitelisted Mozilla websites → give access to about:support data to specific whitelisted Mozilla websites
Iteration: 36.1 → 36.2
Attachment #8509892 - Flags: feedback?(MattN+bmo) → feedback+
Blocks: 1091944
Chatted with Mark and bsmedberg offline - we'll omit prefs for now (Mark filed bug 1091944 as a followup). Otherwise this is good to go.
Flags: needinfo?(gavin.sharp)
Based on discussion with Gavin and Benjamin, we are excluding prefs completely (see bug 1091944).  This version removes the ProfD-sanitization done by the previous version and removes prefs from the payload (with tests updated accordingly)

Try at https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=6011a4c00710
Attachment #8509892 - Attachment is obsolete: true
Attachment #8514701 - Flags: review?(MattN+bmo)
Attachment #8514701 - Flags: review?(MattN+bmo) → review+
(In reply to Mark Hammond [:markh] from comment #26)
> Try at
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=6011a4c00710

I wasn't expecting such a quick review (thanks!) so I killed that try and pushed it.  "fx-team" is just an alias for "try" anyway, right? ;)

https://hg.mozilla.org/integration/fx-team/rev/cb3179f8ea2b
https://hg.mozilla.org/integration/fx-team/rev/780d8f101fe0
sorry mark had to backout this out for test failures/ memory leak reports like https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1051353&repo=fx-team
The problem I've been having is that the leak detector code runs while the last window remains open, and this works as the detector looks for references to windows, not the windows themselves.

The tests work around problems doing this by explicitly calling some .uninit()/.unload() functions just before it does the leak detection.

This patch adds a new top-level object to browser.js - gBrowserWebChannelManager - which currently manages exactly 1 channel, but it seems likely this will grow over time.  The .init() method sets them up and .uninit() tears them down.  This .uninit() method is called both by Unload() and also by browser-test.js just before it does the leak checking.  This cleans everything up and seems to make everyone happy.

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e11f4292cdc7
Attachment #8514701 - Attachment is obsolete: true
Attachment #8517936 - Flags: review?(MattN+bmo)
Comment on attachment 8517936 [details] [diff] [review]
0002-Bug-1079563-part-2-allow-white-listed-sites-to-reque.patch

Review of attachment 8517936 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking the time to figure this out Mark.
Attachment #8517936 - Flags: review?(MattN+bmo) → review+
The problem with having the listener in browser.js is that each window registers a new channel with the same ID, so each window's listener gets called for a single content event - meaning the content gets 1 response from each opened window.  Moving the channel to nsBrowserGlue solves this problem and also bypasses all the leak problems we have (no window was harmed in the creation of this channel!)  This version never unregisters the channel, but that's fine.

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=469d7ae57de1
Attachment #8517936 - Attachment is obsolete: true
Attachment #8518537 - Flags: review?(MattN+bmo)
Comment on attachment 8518537 [details] [diff] [review]
0002-Bug-1079563-part-2-allow-white-listed-sites-to-reque.patch

Review of attachment 8518537 [details] [diff] [review]:
-----------------------------------------------------------------

A test for the multiple window case would have been nice but it's probably unlikely to regress.
Attachment #8518537 - Flags: review?(MattN+bmo) → review+
https://hg.mozilla.org/mozilla-central/rev/df923a9845b9
https://hg.mozilla.org/mozilla-central/rev/f70a6f39c278
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 36
Gavin, I seem to recall that we wanted this on Aurora - should I request uplift?
Flags: needinfo?(gavin.sharp)
Yes please.
Flags: needinfo?(gavin.sharp)
Comment on attachment 8518537 [details] [diff] [review]
0002-Bug-1079563-part-2-allow-white-listed-sites-to-reque.patch

NOTE: This request is for both patches in this bug (the first is necessary for the second to apply)

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: Users will not be able to use "remote troubleshooting" facilities on sumo.
[Describe test coverage new/current, TBPL]: has test coverage.
[Risks and why]: Risk is limited to the "remote troubleshooting" feature and should have zero impact in normal operations.
[String/UUID change made/needed]: None
Attachment #8518537 - Flags: approval-mozilla-aurora?
Attachment #8518537 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: