Closed
Bug 1211542
Opened 9 years ago
Closed 8 years ago
Collect domain of shared URL in Hello through a whitelist
Categories
(Hello (Loop) :: Client, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RT, Assigned: Mardak)
References
Details
User Story
In order to understand what people share as URL context on Hello, we should collect the domain of the context URLs that are part of Hello rooms, although only if they are generic enough:
- Use a whitelist database of the top 2000 most-used websites from Alexa, customized as follows:
--- Subtract the sites from the Alexa Adult category
--- Add all the subdomains of:
    google.com (e.g., drive.google.com; I don't believe there are nation-specific versions of these subproperties)
    yahoo.com (e.g., games.yahoo.com; find via pentest-tools.com)
    developer.mozilla.org, bugzilla.mozilla.org, wiki.mozilla.org; out of curiosity, maybe other mozilla.org or firefox.com subdomains
    hello.firefox.com (why would this even happen? I don't know, but I would be curious if it does)
    itunes.apple.com
--- Add the top 500 of these categories: Home; Health; Recreation > Travel; Shopping; Business > Real Estate; Business > Consumer Goods and Services; Business > E-Commerce
- From the desktop client, when a room gets created or a new tab is shared, get the domain name of the context URL and check whether it belongs to the whitelist database:
--- If the context URL's domain name does not belong to the whitelist, do nothing
--- If the context URL's domain name belongs to the whitelist, send that domain name to Google Analytics through its API
This implementation should be documented on readthedocs.org for openness reasons. See https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/ as a reference.
As a secondary step we should look into using the RAPPOR algorithm (https://github.com/google/rappor). However, RAPPOR requires a large number of samples, and our current traffic would not give us anything usable with it.
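A minimal sketch of the client-side check described above, in plain JavaScript. `WHITELIST`, `baseDomain`, and `domainToLog` are illustrative names only, and the naive two-label split stands in for Firefox's real eTLD handling:

```javascript
// Illustrative whitelist of base-domain entries.
const WHITELIST = new Set(["google.com", "yahoo.com", "mozilla.org"]);

// Naive base-domain extraction for illustration only; real code would use
// the browser's eTLD service to handle multi-part suffixes like ".co.uk".
function baseDomain(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

// Returns the domain to report, or null when nothing should be sent.
function domainToLog(contextUrl) {
  const domain = baseDomain(new URL(contextUrl).hostname);
  return WHITELIST.has(domain) ? domain : null;
}
```

The key property is that a non-whitelisted URL produces no data at all, rather than a redacted or hashed value.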
Attachments
(4 files)
No description provided.
Reporter
Updated•9 years ago
User Story: (updated)
Reporter
Comment 1•9 years ago
Ed, could the top 500/1000 websites come from https://hg.mozilla.org/mozilla-central/file/ad64b12d3b01/browser/modules/DirectoryLinksProvider.jsm#l68 or from the CS database, keeping in mind this would be a clicker-website implementation?
Flags: needinfo?(edilee)
Reporter
Updated•9 years ago
User Story: (updated)
Reporter
Updated•9 years ago
Rank: 25
Priority: -- → P2
Reporter
Updated•9 years ago
User Story: (updated)
Reporter
Updated•9 years ago
Rank: 25 → 22
Reporter
Updated•9 years ago
Rank: 22 → 17
Priority: P2 → P1
Comment 2•9 years ago
Alexa top lists can be found here: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-
I think we should use the Alexa list and do a run through it to clean it up slightly. We should also add some second-level domains to the list, such as drive.google.com, games.yahoo.com, etc. (The Alexa list only has top-level domains.)
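The construction described here amounts to simple set operations; a minimal sketch, where the input lists and the `buildWhitelist` helper are illustrative assumptions rather than the actual tooling used:

```javascript
// Hypothetical sketch: start from the Alexa top list, remove adult-category
// sites, and append hand-picked second-level domains (which the Alexa list,
// being top-level-domain only, cannot contain).
function buildWhitelist(alexaTop, adultSites, extraSubdomains) {
  const cleaned = alexaTop.filter(domain => !adultSites.has(domain));
  // A Set removes any duplicates introduced by the extra entries.
  return [...new Set([...cleaned, ...extraSubdomains])];
}
```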
Reporter
Updated•9 years ago
User Story: (updated)
Reporter
Updated•9 years ago
Rank: 17 → 22
Priority: P1 → P2
Updated•8 years ago
Rank: 22 → 18
Priority: P2 → P1
Comment 3•8 years ago
I have a lot of context on this bug that isn't yet in the ticket; be sure to contact me before starting any work.
Reporter
Updated•8 years ago
User Story: (updated)
Reporter
Comment 4•8 years ago
Updated user story field with latest details from Ian's investigations.
User Story: (updated)
Assignee
Updated•8 years ago
Assignee: nobody → edilee
Flags: needinfo?(edilee)
Updated•8 years ago
Flags: needinfo?(ianb)
Assignee
Comment 5•8 years ago
A list of sites was used to power content services' inadjacency matching for bug 1159884. Bug 1159884 comment 0 had some ideas about using existing lists, while bug 1160596 created a list usable as a plain list of domains as well as an md5-hashed version.
Assignee
Comment 6•8 years ago
For "--- Add all the subdomains of:", does that mean we shouldn't look at subdomain matches of other sites, i.e., only match exact domains (and +www.)?
Reporter
Comment 7•8 years ago
User Story updated with a requirement to document the implementation on readthedocs.org.
User Story: (updated)
Comment 8•8 years ago
For the server: we'll implement a new endpoint on the loop server. We should decide whether we want to aggregate these pings. If we don't aggregate, we could do something like GET https://loop-server/log-domains?domain=google.com, but if we want to reduce the number of pings (which I'm leaning towards), then maybe PUT {endpoint} with body {"google.com": 3, "stackoverflow.com": 4, "other": 10}. I've described the server component in bug 1246728.
Flags: needinfo?(ianb)
See Also: → 1246728
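The aggregated variant from comment 8 can be sketched like this; `buildPayload` and the bucket names are assumptions following the example payload in that comment, not a confirmed loop-server API:

```javascript
// Count shared domains into a single payload object, bucketing anything
// not on the whitelist under "other", as in the example PUT body above.
function buildPayload(sharedDomains, whitelist) {
  const counts = {};
  for (const domain of sharedDomains) {
    const key = whitelist.has(domain) ? domain : "other";
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

Sending one aggregated object per session, rather than one ping per shared URL, reduces both request volume and the granularity of what the server can observe.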
Assignee
Comment 9•8 years ago
There are 4275 .com/.net/.org/.edu/.gov sites and 830 others. Even if we were treating this as US-focused, it's not simple to select certain TLDs, as some sites are popular worldwide and some use "odd" TLDs.
Attachment #8717933 - Flags: feedback?(rtestard)
Attachment #8717933 - Flags: feedback?(ianb)
Assignee
Comment 10•8 years ago
When should the domains be sent? After a call has ended? I'm assuming we would send to <loop.server pref>/log-domains, i.e., https://loop.services.mozilla.com/v0/log-domains. Bug 1246728 mentions authentication. What should Firefox be using?
Flags: needinfo?(ianb)
Comment 11•8 years ago
Comment on attachment 8717933 [details]
top2000 + top500 of 7 categories - duplicates - adult = 5105 domains
That is considerably more domains than I had expected; I thought adding those extra categories would only add a handful of sites. I have also realized that the Alexa categories are old and poorly maintained (based on old dmoz categories, I think). I think this means many modern sites are not categorized, so the top 500 of these categories are old or relatively obscure sites that we don't need to track.
As such, I think it would be better to construct the list without any of the added categories (though if you could name some of the sites that wouldn't be on the list otherwise, that would be interesting to review in detail).
I'd like to aim for no more than 2500 domains.
The HAWK authentication is something we send with other requests to the server. I'm not sure how it is sent, but I'm assuming there are other examples of a request that we can copy for this case? Mark should be able to point out an example.
Flags: needinfo?(ianb) → needinfo?(standard8)
Attachment #8717933 - Flags: feedback?(ianb) → feedback-
Assignee
Comment 12•8 years ago
A quick glance shows that adding the top 200 shopping sites includes sites like these that aren't in the top 2000: petsmart.com, joann.com, carmax.com, saksfifthavenue.com
Attachment #8718163 - Flags: feedback?(ianb)
Assignee
Updated•8 years ago
Attachment #8717933 - Flags: feedback?(rtestard)
Comment 13•8 years ago
Comment on attachment 8718163 [details]
top2000 + top200 of 7 categories - duplicates - adult = 2484 domains
Great, that's a good-sized list, and those are good extra sites to include.
Attachment #8718163 - Flags: feedback?(ianb) → feedback+
Comment 14•8 years ago
(In reply to Ian Bicking (:ianb) from comment #11)
> The HAWK authentication is something we send with other requests to the
> server. I'm not sure how that is sent, but I'm assuming there's other
> examples of a request that we can copy for this case? Mark should be able
> to point out an example.
Ah, if we're using HAWK authentication, then we just need to make sure we go through hawkRequestInternal; that will use the correct creds depending on whether we're signed in or not.
Flags: needinfo?(standard8)
Reporter
Updated•8 years ago
Summary: Collect domain of shared URL in Hello through a whitelist → [Meta] Collect domain of shared URL in Hello through a whitelist
Reporter
Updated•8 years ago
Rank: 18 → 10
Comment 15•8 years ago
Assignee
Updated•8 years ago
Attachment #8731216 - Flags: review?(dmose)
Assignee
Updated•8 years ago
Attachment #8731216 - Flags: review?(dmose) → review?(dcritchley)
Comment 16•8 years ago
Comment on attachment 8731216 [details] [review]
[loop] Mardak:bug-1211542-domains > mozilla:master
Looks good.
Attachment #8731216 - Flags: review?(dcritchley) → review+
Assignee
Updated•8 years ago
Summary: [Meta] Collect domain of shared URL in Hello through a whitelist → Collect domain of shared URL in Hello through a whitelist
Assignee
Comment 17•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=66fc56e9206a
Comment 18•8 years ago
Ed: the try builds failed - they need the patch from bug 1256694.
Assignee
Comment 19•8 years ago
https://github.com/mozilla/loop/commit/2f196ca4578b1686d93781708d47c8d2fce0efc5
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 20•8 years ago
There is some feedback in the dev.platform thread requesting that this get backed out, at least until the privacy review has been completed. Is that being considered?
Comment 21•8 years ago
(In reply to chris hofmann from comment #20)
> there is some feedback in the dev.platform thread that requests that this
> get backed out, at least until the privacy review has been completed. is
> that being considered?
I've just responded to that thread. Currently this logging is disabled by default (by preference). At this stage, we won't enable it until the review is completed. Hence, I don't think we need to back it out for the time being.
Comment 22•8 years ago
(In reply to Ian Bicking (:ianb) from comment #13)
> Comment on attachment 8718163 [details]
> top2000 + top200 of 7 categories - duplicates - adult = 2484 domains
>
> Great, that's a good sized list, and those are good extra sites to include
Does this list have some bias against, or inaccurately reflect, our pool of international users? I haven't checked yet, but if this gets turned on we should evaluate it against a list like the one at https://wiki.mozilla.org/L10n:TopSiteTesting to see whether it accounts for the top sites visited in our various supported languages, and does so without introducing bias or user data leaks.
Just checking the list against TLDs suggests that sites in Japan are over-represented compared to places like Germany, France, Russia, Brazil, China, and others where we have more users. Here are the counts of the various TLDs in the list:
1638 com, 110 net, 80 org, 78 jp, 46 ru, 40 cn, 37 uk, 30 de, 24 in, 22 gov, 22 br, 19 edu, 18 fr, 17 it, 15 pl, 14 tw, 13 ir, 12 tv, 12 me, 11 tr, 9 au, 8 ua, 8 to, 8 id, 8 es, 7 vn, 7 kr, 7 co, 6 se, 6 cz, 6 cc, 6 ca, 5 nl, 5 io, 5 gr, 4 nz, 4 no, 4 mx, 4 info, 4 eu, 4 by, 4 ar, 3 za, 3 us, 3 sk, 3 pk, 3 my, 3 ma, 3 il, 3 hk, 3 ag, 2 ve, 2 st, 2 ro, 2 pt, 2 ph, 2 ng, 2 ly, 2 li, 2 la, 2 kz, 2 is, 2 fm, 2 fi, 2 cl, 2 biz
Many TLDs have only one domain represented.
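Counts like the above can be produced from a plain list of domains; a minimal sketch, not the script actually used for this comment:

```javascript
// Tally the last label (TLD) of each domain and sort descending by count,
// matching the shape of the listing above.
function countTlds(domains) {
  const counts = new Map();
  for (const d of domains) {
    const tld = d.slice(d.lastIndexOf(".") + 1);
    counts.set(tld, (counts.get(tld) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Note that counting by last label treats "co.uk" sites as "uk", which matches how the counts above appear to have been tallied.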
Comment 23•8 years ago
The list also implies that we will get a bias toward this set of educational institutions:
academia.edu, berkeley.edu, cimic.rutgers.edu, cornell.edu, ebusiness.mit.edu, euro.ecom.cmu.edu, gia.edu, goizueta.emory.edu, harvard.edu, knowledge.wharton.upenn.edu, mba.ncsu.edu, med.umich.edu, mheducation.com, mit.edu, psu.edu, purdue.edu, reduxmediia.com, sdsu.edu, stanford.edu, umich.edu, urmc.rochester.edu
It also misses a potentially interesting use case at other institutions, where Hello might be used as a shared browsing experience to do homework with a partner or study group. Again, if this gets expanded, it should be done with international considerations in mind. Wildcarding for .edu might be a possibility, but that puts us at risk of leaking user data for very small institutions if that gets passed to our servers.
Comment 24•8 years ago
It looks like we will only match on "facebook.com". That would seem to miss a wide variety of activities: social messaging and status reporting vs. game playing vs. picture sharing vs. everything else that happens on Facebook. Even if we did pass along the path info with the domain, it may be difficult to ascertain what task the user is trying to accomplish without some deep analysis of the full URLs and some deconstruction of how particular apps work on Facebook.
https://www.google.com/search?q=farmville+url shows a variety of things that might be done with just Facebook FarmVille URLs: inviting others to a game, playing against others, playing with others to reach a common goal, or sharing/showing off the results of your game playing. How might we sort out all those possibilities from just this game-playing analysis, and how might they guide UI/UX within Hello?
Again, it seems better, faster, and more respectful of privacy to just ask users what they were trying to accomplish.
Assignee
Comment 25•8 years ago
The adult list used to filter out sites came from the adult category of bug 1160596: https://github.com/matthewruttley/contentfilter/blob/master/sites.json
The whitelist that landed with the fix is slightly different from what was originally attached to the bug, to work with the format expected by the whitelist-checking code (eTLD+1): https://github.com/mozilla/loop/blob/2f196ca4578b1686d93781708d47c8d2fce0efc5/add-on/chrome/modules/DomainWhitelist.jsm
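An eTLD+1 style lookup of the kind this comment describes can be sketched as a suffix walk over the hostname's labels; `findWhitelisted` is an illustrative name, not the actual DomainWhitelist.jsm code, and a real implementation would consult the public suffix list rather than stopping at two labels:

```javascript
// Hypothetical sketch: walk the hostname's labels from the right and
// return the first (longest-suffix-last) whitelist entry that matches,
// so "apps.facebook.com" matches a "facebook.com" entry.
function findWhitelisted(hostname, whitelist) {
  const labels = hostname.split(".");
  for (let i = labels.length - 2; i >= 0; i--) {
    const candidate = labels.slice(i).join(".");
    if (whitelist.has(candidate)) {
      return candidate;
    }
  }
  return null;
}
```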
Comment 26•8 years ago
I still don't understand the rationale for filtering adult sites. Can someone explain?
Also, here is an example of the Facebook/game problem. If we are somehow able to determine that many users are sharing sites like https://apps.facebook.com/farmville-two/ or https://apps.facebook.com/inthemafia/, we still lack important context around why the link was shared. Was it to invite/entice another to the game? Was it to link instruction/help on how to play the game? Was it to play the game with or against another and use Hello as an out-of-band communication channel? Was it to boast of a high score or some achievement in a recent game-playing session? Was it to talk about a co-worker's/family member's addiction/obsession with the game? Was it to rant about a bug in the game or browser instability recently experienced while trying to play it?
All these would be common themes across many sites and applications, and potentially a set of common use cases that Hello might be optimizing for, but we would have no way to break down the frequency of each of these types of interaction in total or by individual user. If I understand the design right, we won't even get this level of detail about game playing on Facebook; we will just get an indication of facebook.com. Is that correct?
The same kind of problem would result for Google, Bing, and Yahoo searches. We will know users might be sharing search results, but we will have little insight into the rationale for sharing, the details of what's being shared, or to what end (e.g., evaluating a set of products, sharing some news stories, etc.). If Hello's use is a reflection of general browsing use, and Facebook and Google are involved in most Hello sessions, we will gain very little from this data collection or analysis.
Comment 27•8 years ago
I don't know whether I should try to reopen this bug, since the list has landed but the logging feature is turned off. However, I found a few problems with at least the porn-filtering intent of the current list, and maybe more categories should be considered sensitive. I put those in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467#c5
Should this bug be re-opened to fix these problems, or should I file one or more new bugs?
Updated•8 years ago
Flags: needinfo?(edilee)
Assignee
Comment 28•8 years ago
We can do followups in separate bugs, e.g., bug 1260857 and 1263020.
Blocks: 1263020
Flags: needinfo?(edilee)
Comment 29•8 years ago
(In reply to Mark Banner (:standard8) from comment #21)
> (In reply to chris hofmann from comment #20)
> > there is some feedback in the dev.platform thread that requests that this
> > get backed out, at least until the privacy review has been completed. is
> > that being considered?
>
> I've just responded to that thread. Currently this logging is by default
> disabled (by preference). At this stage, we won't enable it until the review
> is completed.
>
> Hence, I don't think we need to backout for the time being.
I'm concerned that this code landed without privacy review, and about the optics of having it in the tree and including it in a release (even if it is preffed off). Others have shared further concerns in the dev-platform thread.
The fact that it's behind a pref is good, but what happens if the pref is flipped unintentionally (accidentally or by a malicious actor)? Is this code isolated enough that it can be easily backed out? What is the value of keeping it in the tree, given that we're not going to use it in Firefox 46 without shipping an update for the add-on?
https://groups.google.com/forum/#!topic/mozilla.dev.platform/nyVkCx-_sFw
Flags: needinfo?(edilee)
Comment 31•8 years ago
Here is the result of the quick-and-dirty spider that was intended to get site metadata to assist in categorizing the sites and perhaps help us better understand what kinds of sites are in the current whitelist. It did help turn up another porn site that needs to be removed, but not as many sites have "meta keyword=" and "meta description=" data as I would have hoped.
The scan did turn up lots of cruft in the whitelist that we are unlikely to get hits on and that might be worth pruning. These are website development companies that inject info and links back to their firms as a way of spamming their way into Alexa top-site rankings, and the current whitelist has quite a few ad networks that users are unlikely to be sharing with Hello.
It would be good to get some additional eyes on this list, especially for foreign-language sites, to make sure we aren't doing anything unintended to the whitelist for these.
Comment 32•8 years ago
Rough spider summary with self-reported categorization info about sites in the current whitelist.
Comment 33•8 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #30)
> Seems like this should have privacy review.
The privacy review is in https://bugzilla.mozilla.org/show_bug.cgi?id=1261467 and https://bugzilla.mozilla.org/show_bug.cgi?id=1262284
Comment 34•8 years ago
To answer comment 29, I have no problem shipping this code preffed off, and I'm not particularly worried about optics. If the Hello team is still working towards having this on, then removing the code seems like unnecessary churn.
Flags: needinfo?(benjamin)