Closed Bug 1211542 Opened 9 years ago Closed 8 years ago

Collect domain of shared URL in Hello through a whitelist

Categories

(Hello (Loop) :: Client, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RT, Assigned: Mardak)

References

Details

User Story

In order to understand what people share as URL context on Hello, we should collect the domains of the context URLs that are part of Hello rooms, though only if they are generic enough:

- Use a whitelist database of the top 2000 most-used websites from Alexa, customized as follows (a construction sketch follows this list):
--- Subtract the sites in the Alexa Adult category
--- Add all the subdomains of:
    google.com (e.g., drive.google.com - I don't believe there are nation-specific versions of these subproperties)
    yahoo.com (e.g., games.yahoo.com; find via pentest-tools.com)
    developer.mozilla.org, bugzilla.mozilla.org, wiki.mozilla.org; out of curiosity, maybe other mozilla.org or firefox.com subdomains
    hello.firefox.com (why would this even happen? I don't know, but I would be curious if it does)
    itunes.apple.com
--- Add the top 500 sites of these categories:
    Home
    Health
    Recreation > Travel
    Shopping
    Business > Real Estate
    Business > Consumer Goods and Services
    Business > E-Commerce
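As a rough illustration, the list construction could be scripted like this (a minimal Node.js sketch; the file names, the one-domain-per-line format, and the category file naming are all assumptions, not an actual pipeline):

  // Sketch: assemble the whitelist from Alexa exports.
  // All file names and the one-domain-per-line format are hypothetical.
  "use strict";
  const fs = require("fs");

  function loadList(path) {
    return fs.readFileSync(path, "utf8").trim().split("\n");
  }

  // Start from the top 2000, then drop the Adult category.
  const whitelist = new Set(loadList("alexa-top-2000.txt"));
  loadList("alexa-adult.txt").forEach(d => whitelist.delete(d));

  // Add the top sites of the extra categories (the Set removes duplicates).
  for (const category of ["home", "health", "travel", "shopping",
                          "real-estate", "consumer-goods", "e-commerce"]) {
    loadList(`alexa-${category}-top500.txt`).forEach(d => whitelist.add(d));
  }

  // Hand-picked subdomains from the user story above.
  ["drive.google.com", "games.yahoo.com", "developer.mozilla.org",
   "bugzilla.mozilla.org", "wiki.mozilla.org", "hello.firefox.com",
   "itunes.apple.com"].forEach(d => whitelist.add(d));

  fs.writeFileSync("whitelist.txt", [...whitelist].sort().join("\n") + "\n");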

- From the desktop client, when a room gets created or a new tab is shared, get the domain name of the context URL and check whether it belongs to the whitelist database (see the sketch below):
--- If the context URL's domain name does not belong to the whitelist, do nothing
--- If the context URL's domain name belongs to the whitelist, send that domain name to Google Analytics through its API
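A minimal sketch of that client-side flow, assuming a Set-based whitelist; the function and parameter names are illustrative, not the landed implementation:

  // Sketch only: check a shared context URL against the whitelist and
  // report the domain if (and only if) it is whitelisted.
  function maybeReportSharedUrl(contextUrl, whitelist, report) {
    let domain;
    try {
      domain = new URL(contextUrl).hostname;
    } catch (e) {
      return; // not a parseable URL, nothing to report
    }
    // Non-whitelisted domains never leave the client.
    if (whitelist.has(domain)) {
      report(domain);
    }
  }

  // Usage (sendToAnalytics is a hypothetical reporting callback):
  // maybeReportSharedUrl(sharedTab.url, whitelist,
  //                      domain => sendToAnalytics(domain));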

This implementation should be documented on readthedocs.org for openness reasons.
See https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/ as a reference


As a secondary step we should look into using the RAPPOR algorithm (https://github.com/google/rappor). Note, however, that RAPPOR requires a large number of samples, and our current traffic would not give us anything usable with it.

Attachments

(4 files)

      No description provided.
User Story: (updated)
Depends on: 1202338
User Story: (updated)
Ed, could the top 500/1000 websites come from https://hg.mozilla.org/mozilla-central/file/ad64b12d3b01/browser/modules/DirectoryLinksProvider.jsm#l68 or from the CS database, keeping in mind this would be a clicker-website implementation?
Flags: needinfo?(edilee)
User Story: (updated)
Rank: 25
Priority: -- → P2
User Story: (updated)
Rank: 25 → 22
Rank: 22 → 17
Priority: P2 → P1
Alexa top lists can be found here: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-

I think we should use the Alexa list, and do a pass through it to clean the list up slightly. Also, we should add some second-level domains to the list, such as drive.google.com, games.yahoo.com, etc. (The Alexa list only has top-level domains.)
User Story: (updated)
Rank: 17 → 22
Priority: P1 → P2
Rank: 22 → 18
Priority: P2 → P1
I have a lot of context on this bug that isn't yet in the ticket; be sure to contact me before starting any work.
User Story: (updated)
Updated user story field with latest details from Ian's investigations.
User Story: (updated)
Assignee: nobody → edilee
Flags: needinfo?(edilee)
Flags: needinfo?(ianb)
A list of sites was used to power Content Services inadjacency matching for bug 1159884. Bug 1159884 comment 0 had some ideas about using existing lists, while bug 1160596 created a list usable as a plain list of domains as well as an MD5-hashed version.
For "--- Add all the subdomains of:" does that mean we shouldn't look at subdomain matches of other sites, i.e., only match exact domains (and +www.)?
US updated with a requirement to document the implementation on readthedocs.org
User Story: (updated)
For the server: we'll implement a new endpoint on the loop server. We should decide whether we want to aggregate these pings or not; if we don't aggregate, we could do something like GET https://loop-server/log-domains?domain=google.com - but if we want to reduce the number of pings (which I'm leaning towards), then maybe PUT {endpoint} with body {"google.com": 3, "stackoverflow.com": 4, "other": 10}
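As a sketch of the aggregated variant from the client side (the endpoint path and payload shape come from this comment; the helper name and the bare use of fetch are illustrative, since the real request would be authenticated, see the HAWK discussion below):

  // Illustrative only; the landed code would go through the loop server's
  // authenticated request path rather than a bare fetch().
  async function flushDomainCounts(serverUrl, counts) {
    // counts example: {"google.com": 3, "stackoverflow.com": 4, "other": 10}
    await fetch(serverUrl + "/log-domains", {
      method: "PUT",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify(counts),
    });
  }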

I've described the server component in Bug 1246728
Flags: needinfo?(ianb)
See Also: → 1246728
There are 4275 .com/.net/.org/.edu/.gov sites and 830 others. Even if we were treating this as US-focused, it's not simple to select certain TLDs, as some sites are popular worldwide and some use "odd" TLDs.
Attachment #8717933 - Flags: feedback?(rtestard)
Attachment #8717933 - Flags: feedback?(ianb)
When should the domains be sent? After a call has ended?

I'm assuming we would send to <loop.server pref>/log-domains, i.e., https://loop.services.mozilla.com/v0/log-domains

Bug 1246728 mentions authentication. What should Firefox be using?
Flags: needinfo?(ianb)
Comment on attachment 8717933 [details]
top2000 + top500 of 7 categories - duplicates - adult = 5105 domains

That is considerably more domains than I had expected - I thought adding those extra categories would only add a handful of sites. I've also realized that the Alexa categories are old and poorly maintained (based on old dmoz categories, I think). I think this means many modern sites are not categorized, so the top 500 of these categories are old or relatively obscure sites that we don't need to track.

As such, I think it would be better to construct the list without any of the added categories (though if you could save some of the sites that wouldn't be on the list otherwise, that would be interesting to review in detail).

I'd like to aim for no more than 2500 domains.

The HAWK authentication is something we send with other requests to the server. I'm not sure how that is sent, but I'm assuming there are other examples of a request that we can copy for this case? Mark should be able to point out an example.
Flags: needinfo?(ianb) → needinfo?(standard8)
Attachment #8717933 - Flags: feedback?(ianb) → feedback-
A quick glance shows that adding the top 200 shopping sites includes sites like these, which aren't in the top 2000:

petsmart.com
joann.com
carmax.com
saksfifthavenue.com
Attachment #8718163 - Flags: feedback?(ianb)
Attachment #8717933 - Flags: feedback?(rtestard)
Comment on attachment 8718163 [details]
top2000 + top200 of 7 categories - duplicates - adult = 2484 domains

Great, that's a good-sized list, and those are good extra sites to include.
Attachment #8718163 - Flags: feedback?(ianb) → feedback+
Blocks: 1248602
(In reply to Ian Bicking (:ianb) from comment #11)
> The HAWK authentication is something we send with other requests to the
> server. I'm not sure how that is sent, but I'm assuming there are other
> examples of a request that we can copy for this case? Mark should be able
> to point out an example.

Ah, if we're using HAWK authentication, then we just need to make sure we go through hawkRequestInternal - that will use the correct creds depending on whether we're signed in or not.
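For illustration, a call through that path might look roughly like this, assuming MozLoopService exposes a hawkRequest helper with a (sessionType, path, method, payload) signature; verify the exact signature against hawkRequestInternal in the tree before relying on it:

  // Assumed signature and constants; check MozLoopService.jsm for the real
  // ones. The session type selects guest vs. FxA credentials automatically.
  MozLoopService.hawkRequest(LOOP_SESSION_TYPE.GUEST,
                             "/log-domains", "PUT",
                             {"google.com": 3, "other": 10})
                .catch(error => console.error("domain log failed", error));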
Flags: needinfo?(standard8)
Summary: Collect domain of shared URL in Hello through a whitelist → [Meta] Collect domain of shared URL in Hello through a whitelist
Depends on: 1246728
Rank: 18 → 10
Attachment #8731216 - Flags: review?(dmose)
Attachment #8731216 - Flags: review?(dmose) → review?(dcritchley)
Comment on attachment 8731216 [details] [review]
[loop] Mardak:bug-1211542-domains > mozilla:master

Looks good.
Attachment #8731216 - Flags: review?(dcritchley) → review+
Summary: [Meta] Collect domain of shared URL in Hello through a whitelist → Collect domain of shared URL in Hello through a whitelist
Ed: the try builds failed - they need the patch from bug 1256694.
https://github.com/mozilla/loop/commit/2f196ca4578b1686d93781708d47c8d2fce0efc5
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1257817
Blocks: 1260857
Blocks: 1261467
There is some feedback in the dev.platform thread requesting that this get backed out, at least until the privacy review has been completed. Is that being considered?
(In reply to chris hofmann from comment #20)
> There is some feedback in the dev.platform thread requesting that this get
> backed out, at least until the privacy review has been completed. Is that
> being considered?

I've just responded to that thread. Currently this logging is disabled by default (by preference). At this stage, we won't enable it until the review is completed.

Hence, I don't think we need to back out for the time being.
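For reference, a pref gate along these lines is the usual way such logging stays off by default (the pref name here is hypothetical; check the landed patch for the real one):

  // Hypothetical pref name: bail out unless logging was explicitly enabled.
  if (!Services.prefs.getBoolPref("loop.logDomains", false)) {
    return; // disabled by default; nothing is ever reported
  }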
(In reply to Ian Bicking (:ianb) from comment #13)
> Comment on attachment 8718163 [details]
> top2000 + top200 of 7 categories - duplicates - adult = 2484 domains
> 
> Great, that's a good sized list, and those are good extra sites to include

Does this list have some bias against, or inaccurately reflect, our pool of international users?

I haven't checked yet, but if this gets turned on we should evaluate it against a list like the one at
https://wiki.mozilla.org/L10n:TopSiteTesting to see whether it accounts for top sites visited in our various supported languages, and does so without introducing bias or user-data leaks.

Just checking the list against TLDs suggests that sites in Japan are over-represented compared to other places like Germany, France, Russia, Brazil, China, and others where we have more users.

Here are counts of the various TLDs in the list:

1638 com   110 net    80 org    78 jp
  46 ru     40 cn     37 uk     30 de
  24 in     22 gov    22 br     19 edu
  18 fr     17 it     15 pl     14 tw
  13 ir     12 tv     12 me     11 tr
   9 au      8 ua      8 to      8 id
   8 es      7 vn      7 kr      7 co
   6 se      6 cz      6 cc      6 ca
   5 nl      5 io      5 gr      4 nz
   4 no      4 mx      4 info    4 eu
   4 by      4 ar      3 za      3 us
   3 sk      3 pk      3 my      3 ma
   3 il      3 hk      3 ag      2 ve
   2 st      2 ro      2 pt      2 ph
   2 ng      2 ly      2 li      2 la
   2 kz      2 is      2 fm      2 fi
   2 cl      2 biz

Many TLDs have only one domain represented.
The list also implies that we will get bias toward this set of educational institutions:

academia.edu berkeley.edu cimic.rutgers.edu  cornell.edu ebusiness.mit.edu
euro.ecom.cmu.edu gia.edu goizueta.emory.edu harvard.edu
knowledge.wharton.upenn.edu mba.ncsu.edu
med.umich.edu mheducation.com mit.edu psu.edu purdue.edu
reduxmediia.com sdsu.edu stanford.edu umich.edu urmc.rochester.edu

It also misses a potentially interesting use case at some institutions, where students would use Hello as a shared browsing experience to do homework with a partner or study group. Again, if this gets expanded, it should be done with international considerations in mind.

Wildcarding for .edu might be a possibility, but that puts us at risk of leaking user data for very small institutions if that gets passed to our servers.
Looks like we will only match on "facebook.com".

That would seem to miss the wide variety of activity (social messaging and status reporting vs. game playing vs. picture sharing vs. ...) that happens on Facebook.

If we do pass along the directory info with this domain, it may also be difficult to ascertain what task the user is trying to accomplish without some deep analysis of the full URLs and some deconstruction of how particular apps work on Facebook.

https://www.google.com/search?q=farmville+url tells of a variety of things that might be done with just Facebook FarmVille URLs, e.g., inviting others to a game, playing against others, playing with others to reach a common goal, or sharing/showing off the results of your game playing with others. How might we sort out all those possibilities from just this game-playing analysis, and how might they guide UI/UX within Hello? Again, it seems better, faster, and more respectful of privacy to just ask users what they were trying to accomplish.
The adult list used to filter out sites came from bug 1160596's Adult category: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

The whitelist landed with the fix is slightly different from what was originally attached to the bug, in order to work with the format expected by the whitelist checking code (eTLD+1): https://github.com/mozilla/loop/blob/2f196ca4578b1686d93781708d47c8d2fce0efc5/add-on/chrome/modules/DomainWhitelist.jsm
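A minimal sketch of an eTLD+1 lookup using Gecko's public suffix service (Services.eTLD and Services.io are real Gecko APIs; the function here stands in for the landed DomainWhitelist.jsm and is not its actual code):

  // Sketch of an eTLD+1 whitelist check; DomainWhitelist.jsm is the
  // authoritative implementation.
  function isWhitelisted(urlString, whitelistSet) {
    try {
      let uri = Services.io.newURI(urlString);
      // getBaseDomain collapses e.g. "drive.google.com" to "google.com",
      // which is why the landed list is stored as eTLD+1 entries.
      return whitelistSet.has(Services.eTLD.getBaseDomain(uri));
    } catch (e) {
      return false; // IP literals and malformed URLs have no base domain
    }
  }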
I still don't understand the rationale for filtering adult sites. Can someone explain?

Also, here is an example of the Facebook/game problem.
If we are somehow able to determine that many users are sharing sites like

https://apps.facebook.com/farmville-two/ 
https://apps.facebook.com/inthemafia/
...

we still lack important context around why the link was shared.

Was it to invite/entice another person to the game?
Was it to link to instructions/help on how to play the game?
Was it to play the game with or against another person, using Hello as an out-of-band communication channel?
Was it to boast to another person about a high score or some achievement in a recent game-playing session?
Was it to talk about a co-worker's or family member's addiction/obsession with the game?
Was it to rant about a bug in the game, or browser instability recently experienced while trying to play the game?

All of these would be common themes across many sites and applications, and potentially a set of common use cases that Hello might be optimizing for, but we would have no way to break down the frequency of each of these types of interaction, either in total or by individual user. If I understand the design right, we won't even get this level of detail about game playing on Facebook; we will just get an indication of facebook.com. Is that correct?

The same kind of problem would arise for Google, Bing, and Yahoo searches. We will know users might be sharing search results, but have little insight into the rationale for sharing, the details of what's being shared, or to what end (e.g., evaluating a set of products, sharing some news stories, etc.).

If Hello's use is a reflection of general browsing use, and Facebook and Google are involved in most Hello sessions, we will gain very little from this data collection or analysis.
I don't know whether I should try to reopen this bug, since the list has landed but the logging feature is turned off; in any case, I found a few problems with at least the porn-filtering intent of the current list, and maybe more categories that should be considered sensitive.

I put those in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467#c5

Should this bug be re-opened to fix these problems, or should I file one or more new bugs?
Flags: needinfo?(edilee)
We can do followups in separate bugs, e.g., bug 1260857 and 1263020.
Blocks: 1263020
Flags: needinfo?(edilee)
No longer blocks: 1260857
(In reply to Mark Banner (:standard8) from comment #21)
> (In reply to chris hofmann from comment #20)
> > There is some feedback in the dev.platform thread requesting that this
> > get backed out, at least until the privacy review has been completed.
> > Is that being considered?
> 
> I've just responded to that thread. Currently this logging is disabled by
> default (by preference). At this stage, we won't enable it until the review
> is completed.
> 
> Hence, I don't think we need to back out for the time being.

I'm concerned that this code landed without privacy review, and about the optics of having it in the tree and including it in a release (even if it is preffed off). Others have shared further concerns in the dev-platform thread. The fact that it's behind a pref is good, but what happens if the pref is flipped unintentionally (accidentally or by a malicious actor)? Is this code isolated enough that it can be easily backed out? What is the value of keeping it in the tree, given that we're not going to use it in Firefox 46 without shipping an update for the add-on?

https://groups.google.com/forum/#!topic/mozilla.dev.platform/nyVkCx-_sFw
Flags: needinfo?(edilee)
Seems like this should have privacy review.
Flags: needinfo?(benjamin)
Here is the result of the quick-and-dirty spider that was intended to gather site metadata to assist in categorizing the sites, and maybe help better understand what kinds of sites are in the current whitelist. It did help turn up another porn site that needs to be removed, but not as many sites have "meta keyword=" and "meta description=" data as I would have hoped.

The scan did turn up lots of cruft in the whitelist that we are unlikely to get hits on and that might be worth pruning. These are website development companies that inject info and links back to their firms as a way of spamming their way into Alexa top-site rankings, and the current whitelist has quite a few ad networks that users are unlikely to be sharing with Hello.

It would be good to get some additional eyes on this list, especially for foreign-language sites, to make sure we aren't doing anything unintended to the whitelist for these.
Attached file spider-summary.txt
Rough spider summary with self-reported categorization info about sites in the current whitelist.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> Seems like this should have privacy review.

privacy review is in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467 and https://bugzilla.mozilla.org/show_bug.cgi?id=1262284
To answer comment 29, I have no problem shipping this code preffed off, and I'm not particularly worried about optics. If the Hello team is still working towards having this on, then removing the code seems like unnecessary churn.
Flags: needinfo?(benjamin)
Cancelling request per comment 34
Flags: needinfo?(edilee)