Closed Bug 1211542 Opened 9 years ago Closed 8 years ago

Collect domain of shared URL in Hello through a whitelist

Categories

(Hello (Loop) :: Client, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RT, Assigned: Mardak)

References

Details

User Story

In order to understand what people share as URL context on Hello, we should collect the domains of the context URLs that are part of Hello rooms, though only if they are generic enough:

- Use a whitelist database of the top 2000 most-used websites from Alexa, customized as follows (a construction sketch follows this list):
--- Subtract the sites in the Alexa Adult category
--- Add all the subdomains of:
    google.com (e.g., drive.google.com - I don't believe there are nation-specific versions of these subproperties)
    yahoo.com (e.g., games.yahoo.com; find via pentest-tools.com)
    developer.mozilla.org, bugzilla.mozilla.org, wiki.mozilla.org; out of curiosity, maybe other mozilla.org or firefox.com subdomains
    hello.firefox.com (why would this even happen? I don't know, but I would be curious if it does)
    itunes.apple.com
--- Add the top 500 sites of these categories:
    Home
    Health
    Recreation > Travel
    Shopping
    Business > Real Estate
    Business > Consumer Goods and Services
    Business > E-Commerce
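As a rough illustration, the list construction could be scripted like this (a minimal Node.js sketch; the file names, the one-domain-per-line format, and the category file naming are all assumptions, not an actual pipeline):

  // Sketch: assemble the whitelist from Alexa exports.
  // All file names and the one-domain-per-line format are hypothetical.
  "use strict";
  const fs = require("fs");

  function loadList(path) {
    return fs.readFileSync(path, "utf8").trim().split("\n");
  }

  // Start from the top 2000, then drop the Adult category.
  const whitelist = new Set(loadList("alexa-top-2000.txt"));
  loadList("alexa-adult.txt").forEach(d => whitelist.delete(d));

  // Add the top sites of the extra categories (the Set removes duplicates).
  for (const category of ["home", "health", "travel", "shopping",
                          "real-estate", "consumer-goods", "e-commerce"]) {
    loadList(`alexa-${category}-top500.txt`).forEach(d => whitelist.add(d));
  }

  // Hand-picked subdomains from the user story above.
  ["drive.google.com", "games.yahoo.com", "developer.mozilla.org",
   "bugzilla.mozilla.org", "wiki.mozilla.org", "hello.firefox.com",
   "itunes.apple.com"].forEach(d => whitelist.add(d));

  fs.writeFileSync("whitelist.txt", [...whitelist].sort().join("\n") + "\n");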

- From the desktop client, when a room gets created or a new tab is shared, get the domain name of the context URL and check whether it belongs to the whitelist database (see the sketch below):
--- If the context URL's domain name does not belong to the whitelist, do nothing
--- If the context URL's domain name belongs to the whitelist, send that domain name to Google Analytics through its API
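A minimal sketch of that client-side flow, assuming a Set-based whitelist; the function and parameter names are illustrative, not the landed implementation:

  // Sketch only: check a shared context URL against the whitelist and
  // report the domain if (and only if) it is whitelisted.
  function maybeReportSharedUrl(contextUrl, whitelist, report) {
    let domain;
    try {
      domain = new URL(contextUrl).hostname;
    } catch (e) {
      return; // not a parseable URL, nothing to report
    }
    // Non-whitelisted domains never leave the client.
    if (whitelist.has(domain)) {
      report(domain);
    }
  }

  // Usage (sendToAnalytics is a hypothetical reporting callback):
  // maybeReportSharedUrl(sharedTab.url, whitelist,
  //                      domain => sendToAnalytics(domain));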

This implementation should be documented on readthedocs.org for openness reasons.
See https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/ as a reference


As a secondary step we should look into using the RAPPOR algorithm (https://github.com/google/rappor). Note, however, that RAPPOR requires a large number of samples, and our current traffic would not give us anything usable with it.

Attachments

(4 files)

      No description provided.
User Story: (updated)
Depends on: 1202338
User Story: (updated)
Ed, could the top 500/1000 websites come from https://hg.mozilla.org/mozilla-central/file/ad64b12d3b01/browser/modules/DirectoryLinksProvider.jsm#l68 or from the CS database, keeping in mind this would be a clicker-website implementation?
Flags: needinfo?(edilee)
User Story: (updated)
Rank: 25
Priority: -- → P2
User Story: (updated)
Rank: 25 → 22
Rank: 22 → 17
Priority: P2 → P1
Alexa top lists can be found here: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-

I think we should use the Alexa list, and do a pass through it to clean the list up slightly. Also, we should add some second-level domains to the list, such as drive.google.com, games.yahoo.com, etc. (The Alexa list only has top-level domains.)
User Story: (updated)
Rank: 17 → 22
Priority: P1 → P2
Rank: 22 → 18
Priority: P2 → P1
I have a lot of context on this bug that isn't yet in the ticket; be sure to contact me before starting any work.
User Story: (updated)
Updated user story field with latest details from Ian's investigations.
User Story: (updated)
Assignee: nobody → edilee
Flags: needinfo?(edilee)
Flags: needinfo?(ianb)
A list of sites was used to power Content Services inadjacency matching for bug 1159884. Bug 1159884 comment 0 had some ideas about using existing lists, while bug 1160596 created a list usable as a plain list of domains as well as an MD5-hashed version.
For "--- Add all the subdomains of:" does that mean we shouldn't look at subdomain matches of other sites, i.e., only match exact domains (and +www.)?
US updated with a requirement to document the implementation on readthedocs.org
User Story: (updated)
For the server: we'll implement a new endpoint on the loop server. We should decide whether we want to aggregate these pings or not; if we don't aggregate, we could do something like GET https://loop-server/log-domains?domain=google.com - but if we want to reduce the number of pings (which I'm leaning towards), then maybe PUT {endpoint} with body {"google.com": 3, "stackoverflow.com": 4, "other": 10}
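As a sketch of the aggregated variant from the client side (the endpoint path and payload shape come from this comment; the helper name and the bare use of fetch are illustrative, since the real request would be authenticated, see the HAWK discussion below):

  // Illustrative only; the landed code would go through the loop server's
  // authenticated request path rather than a bare fetch().
  async function flushDomainCounts(serverUrl, counts) {
    // counts example: {"google.com": 3, "stackoverflow.com": 4, "other": 10}
    await fetch(serverUrl + "/log-domains", {
      method: "PUT",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify(counts),
    });
  }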

I've described the server component in Bug 1246728
Flags: needinfo?(ianb)
See Also: → 1246728
There are 4275 .com/.net/.org/.edu/.gov sites and 830 others. Even if we were treating this as US-focused, it's not simple to select certain TLDs, as some sites are popular worldwide and some use "odd" TLDs.
Attachment #8717933 - Flags: feedback?(rtestard)
Attachment #8717933 - Flags: feedback?(ianb)
When should the domains be sent? After a call has ended?

I'm assuming we would send to <loop.server pref>/log-domains, i.e., https://loop.services.mozilla.com/v0/log-domains

Bug 1246728 mentions authentication. What should Firefox be using?
Flags: needinfo?(ianb)
Comment on attachment 8717933 [details]
top2000 + top500 of 7 categories - duplicates - adult = 5105 domains

That is considerably more domains than I had expected - I thought adding those extra categories would only add a handful of sites. I've also realized that the Alexa categories are old and poorly maintained (based on old dmoz categories, I think). I think this means many modern sites are not categorized, so the top 500 of these categories are old or relatively obscure sites that we don't need to track.

As such, I think it would be better to construct the list without any of the added categories (though if you could save some of the sites that wouldn't be on the list otherwise, that would be interesting to review in detail).

I'd like to aim for no more than 2500 domains.

The HAWK authentication is something we send with other requests to the server. I'm not sure how that is sent, but I'm assuming there are other examples of a request that we can copy for this case? Mark should be able to point out an example.
Flags: needinfo?(ianb) → needinfo?(standard8)
Attachment #8717933 - Flags: feedback?(ianb) → feedback-
A quick glance shows that adding the top 200 shopping sites includes sites like these, which aren't in the top 2000:

petsmart.com
joann.com
carmax.com
saksfifthavenue.com
Attachment #8718163 - Flags: feedback?(ianb)
Attachment #8717933 - Flags: feedback?(rtestard)
Comment on attachment 8718163 [details]
top2000 + top200 of 7 categories - duplicates - adult = 2484 domains

Great, that's a good-sized list, and those are good extra sites to include.
Attachment #8718163 - Flags: feedback?(ianb) → feedback+
Blocks: 1248602
(In reply to Ian Bicking (:ianb) from comment #11)
> The HAWK authentication is something we send with other requests to the
> server. I'm not sure how that is sent, but I'm assuming there are other
> examples of a request that we can copy for this case? Mark should be able
> to point out an example.

Ah, if we're using HAWK authentication, then we just need to make sure we go through hawkRequestInternal - that will use the correct creds depending on whether we're signed in or not.
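For illustration, a call through that path might look roughly like this, assuming MozLoopService exposes a hawkRequest helper with a (sessionType, path, method, payload) signature; verify the exact signature against hawkRequestInternal in the tree before relying on it:

  // Assumed signature and constants; check MozLoopService.jsm for the real
  // ones. The session type selects guest vs. FxA credentials automatically.
  MozLoopService.hawkRequest(LOOP_SESSION_TYPE.GUEST,
                             "/log-domains", "PUT",
                             {"google.com": 3, "other": 10})
                .catch(error => console.error("domain log failed", error));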
Flags: needinfo?(standard8)
Summary: Collect domain of shared URL in Hello through a whitelist → [Meta] Collect domain of shared URL in Hello through a whitelist
Depends on: 1246728
Rank: 18 → 10
Attachment #8731216 - Flags: review?(dmose)
Attachment #8731216 - Flags: review?(dmose) → review?(dcritchley)
Comment on attachment 8731216 [details] [review]
[loop] Mardak:bug-1211542-domains > mozilla:master

Looks good.
Attachment #8731216 - Flags: review?(dcritchley) → review+
Summary: [Meta] Collect domain of shared URL in Hello through a whitelist → Collect domain of shared URL in Hello through a whitelist
Ed: the try builds failed - they need the patch from bug 1256694.
https://github.com/mozilla/loop/commit/2f196ca4578b1686d93781708d47c8d2fce0efc5
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1257817
Blocks: 1260857
Blocks: 1261467
There is some feedback in the dev.platform thread requesting that this get backed out, at least until the privacy review has been completed. Is that being considered?
(In reply to chris hofmann from comment #20)
> There is some feedback in the dev.platform thread requesting that this get
> backed out, at least until the privacy review has been completed. Is that
> being considered?

I've just responded to that thread. Currently this logging is disabled by default (by preference). At this stage, we won't enable it until the review is completed.

Hence, I don't think we need to back out for the time being.
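For reference, a pref gate along these lines is the usual way such logging stays off by default (the pref name here is hypothetical; check the landed patch for the real one):

  // Hypothetical pref name: bail out unless logging was explicitly enabled.
  if (!Services.prefs.getBoolPref("loop.logDomains", false)) {
    return; // disabled by default; nothing is ever reported
  }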
(In reply to Ian Bicking (:ianb) from comment #13)
> Comment on attachment 8718163 [details]
> top2000 + top200 of 7 categories - duplicates - adult = 2484 domains
> 
> Great, that's a good sized list, and those are good extra sites to include

Does this list have some bias against, or inaccurately reflect, our pool of international users?

I haven't checked yet, but if this gets turned on we should evaluate it against a list like the one at
https://wiki.mozilla.org/L10n:TopSiteTesting to see whether it accounts for top sites visited in our various supported languages, and does so without introducing bias or user-data leaks.

Just checking the list against TLDs suggests that sites in Japan are over-represented compared to other places like Germany, France, Russia, Brazil, China, and others where we have more users.

Here are counts of the various TLDs in the list:

1638 com   110 net    80 org    78 jp
  46 ru     40 cn     37 uk     30 de
  24 in     22 gov    22 br     19 edu
  18 fr     17 it     15 pl     14 tw
  13 ir     12 tv     12 me     11 tr
   9 au      8 ua      8 to      8 id
   8 es      7 vn      7 kr      7 co
   6 se      6 cz      6 cc      6 ca
   5 nl      5 io      5 gr      4 nz
   4 no      4 mx      4 info    4 eu
   4 by      4 ar      3 za      3 us
   3 sk      3 pk      3 my      3 ma
   3 il      3 hk      3 ag      2 ve
   2 st      2 ro      2 pt      2 ph
   2 ng      2 ly      2 li      2 la
   2 kz      2 is      2 fm      2 fi
   2 cl      2 biz

Many TLDs have only one domain represented.
The list also implies that we will get bias toward this set of educational institutions:

academia.edu berkeley.edu cimic.rutgers.edu  cornell.edu ebusiness.mit.edu
euro.ecom.cmu.edu gia.edu goizueta.emory.edu harvard.edu
knowledge.wharton.upenn.edu mba.ncsu.edu
med.umich.edu mheducation.com mit.edu psu.edu purdue.edu
reduxmediia.com sdsu.edu stanford.edu umich.edu urmc.rochester.edu

It also misses a potentially interesting use case at some institutions, where students would use Hello as a shared browsing experience to do homework with a partner or study group. Again, if this gets expanded, it should be done with international considerations in mind.

Wildcarding for .edu might be a possibility, but that puts us at risk of leaking user data for very small institutions if that gets passed to our servers.
Looks like we will only match on "facebook.com".

That would seem to miss the wide variety of activity (social messaging and status reporting vs. game playing vs. picture sharing vs. ...) that happens on Facebook.

If we do pass along the directory info with this domain, it may also be difficult to ascertain what task the user is trying to accomplish without some deep analysis of the full URLs and some deconstruction of how particular apps work on Facebook.

https://www.google.com/search?q=farmville+url tells of a variety of things that might be done with just Facebook FarmVille URLs, e.g., inviting others to a game, playing against others, playing with others to reach a common goal, or sharing/showing off the results of your game playing with others. How might we sort out all those possibilities from just this game-playing analysis, and how might they guide UI/UX within Hello? Again, it seems better, faster, and more respectful of privacy to just ask users what they were trying to accomplish.
The adult list used to filter out sites came from bug 1160596's Adult category: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

The whitelist landed with the fix is slightly different from what was originally attached to the bug, in order to work with the format expected by the whitelist checking code (eTLD+1): https://github.com/mozilla/loop/blob/2f196ca4578b1686d93781708d47c8d2fce0efc5/add-on/chrome/modules/DomainWhitelist.jsm
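A minimal sketch of an eTLD+1 lookup using Gecko's public suffix service (Services.eTLD and Services.io are real Gecko APIs; the function here stands in for the landed DomainWhitelist.jsm and is not its actual code):

  // Sketch of an eTLD+1 whitelist check; DomainWhitelist.jsm is the
  // authoritative implementation.
  function isWhitelisted(urlString, whitelistSet) {
    try {
      let uri = Services.io.newURI(urlString);
      // getBaseDomain collapses e.g. "drive.google.com" to "google.com",
      // which is why the landed list is stored as eTLD+1 entries.
      return whitelistSet.has(Services.eTLD.getBaseDomain(uri));
    } catch (e) {
      return false; // IP literals and malformed URLs have no base domain
    }
  }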
I still don't understand the rationale for filtering adult sites. Can someone explain?

Also, here is an example of the Facebook/game problem.
If we are somehow able to determine that many users are sharing sites like

https://apps.facebook.com/farmville-two/ 
https://apps.facebook.com/inthemafia/
...

we still lack important context around why the link was shared.

Was it to invite/entice another person to the game?
Was it to link to instructions/help on how to play the game?
Was it to play the game with or against another person, using Hello as an out-of-band communication channel?
Was it to boast to another person about a high score or some achievement in a recent game-playing session?
Was it to talk about a co-worker's or family member's addiction/obsession with the game?
Was it to rant about a bug in the game, or browser instability recently experienced while trying to play the game?

All of these would be common themes across many sites and applications, and potentially a set of common use cases that Hello might be optimizing for, but we would have no way to break down the frequency of each of these types of interaction, either in total or by individual user. If I understand the design right, we won't even get this level of detail about game playing on Facebook; we will just get an indication of facebook.com. Is that correct?

The same kind of problem would arise for Google, Bing, and Yahoo searches. We will know users might be sharing search results, but have little insight into the rationale for sharing, the details of what's being shared, or to what end (e.g., evaluating a set of products, sharing some news stories, etc.).

If Hello's use is a reflection of general browsing use, and Facebook and Google are involved in most Hello sessions, we will gain very little from this data collection or analysis.
I don't know whether I should try to reopen this bug, since the list has landed but the logging feature is turned off; in any case, I found a few problems with at least the porn-filtering intent of the current list, and maybe more categories that should be considered sensitive.

I put those in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467#c5

Should this bug be re-opened to fix these problems, or should I file one or more new bugs?
Flags: needinfo?(edilee)
We can do followups in separate bugs, e.g., bug 1260857 and 1263020.
Blocks: 1263020
Flags: needinfo?(edilee)
No longer blocks: 1260857
(In reply to Mark Banner (:standard8) from comment #21)
> (In reply to chris hofmann from comment #20)
> > There is some feedback in the dev.platform thread requesting that this
> > get backed out, at least until the privacy review has been completed.
> > Is that being considered?
> 
> I've just responded to that thread. Currently this logging is disabled by
> default (by preference). At this stage, we won't enable it until the review
> is completed.
> 
> Hence, I don't think we need to back out for the time being.

I'm concerned that this code landed without privacy review, and about the optics of having it in the tree and including it in a release (even if it is preffed off). Others have shared further concerns in the dev-platform thread. The fact that it's behind a pref is good, but what happens if the pref is flipped unintentionally (accidentally or by a malicious actor)? Is this code isolated enough that it can be easily backed out? What is the value of keeping it in the tree, given that we're not going to use it in Firefox 46 without shipping an update for the add-on?

https://groups.google.com/forum/#!topic/mozilla.dev.platform/nyVkCx-_sFw
Flags: needinfo?(edilee)
Seems like this should have privacy review.
Flags: needinfo?(benjamin)
Here is the result of the quick-and-dirty spider that was intended to gather site metadata to assist in categorizing the sites, and maybe help better understand what kinds of sites are in the current whitelist. It did help turn up another porn site that needs to be removed, but not as many sites have "meta keyword=" and "meta description=" data as I would have hoped.

The scan did turn up lots of cruft in the whitelist that we are unlikely to get hits on and that might be worth pruning. These are website development companies that inject info and links back to their firms as a way of spamming their way into Alexa top-site rankings, and the current whitelist has quite a few ad networks that users are unlikely to be sharing with Hello.

It would be good to get some additional eyes on this list, especially for foreign-language sites, to make sure we aren't doing anything unintended to the whitelist for these.
Attached file spider-summary.txt
Rough spider summary with self-reported categorization info about sites in the current whitelist.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> Seems like this should have privacy review.

privacy review is in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467 and https://bugzilla.mozilla.org/show_bug.cgi?id=1262284
To answer comment 29, I have no problem shipping this code preffed off, and I'm not particularly worried about optics. If the Hello team is still working towards having this on, then removing the code seems like unnecessary churn.
Flags: needinfo?(benjamin)
Cancelling request per comment 34
Flags: needinfo?(edilee)