Closed Bug 1159884 Opened 9 years ago Closed 9 years ago

Implement inadjacency with a hardcoded list of hashed sites

Categories

(Firefox :: New Tab Page, defect)

defect
Not set
normal
Points:
13

Tracking

()

VERIFIED FIXED
Firefox 41
Iteration:
41.1 - May 25
Tracking Status
firefox40 --- verified
firefox41 --- verified

People

(Reporter: kghim, Assigned: mzhilyaev)

References

(Blocks 1 open bug)

Details

(Whiteboard: .002)

Attachments

(5 files, 6 obsolete files)

Problem: commercial partners do not want their content to be negatively associated with adult content. In the context of Suggested Tiles, this means no sponsored or affiliate tiles should appear within same browser viewport. As an example, MGM would not want a 007 DVD release to be appearing within the same page (in particular, directly next to) where PirateBay tile appears.

Scope: 
- Prevent Suggested Tiles from appearing when a blacklisted site appears within the user top 100 frecent history
- Limit to adult content sites for 39 release
- Utilize preexisting open source blacklists such as:
http://urlblacklist.com/?sec=download
http://dsi.ut-capitole.fr/blacklists/index_en.php
http://www.squidblacklist.org/

Success criteria:
- Physical verification: with a test campaign Suggested Tile, verify that the test tile does not appear with the blacklisted site within the viewport of the new tab page
- Data verification: with a given test campaign's report, verify that the blacklisted site did not record an impression. I assume we would have to use Rappor or RR for this.
Whiteboard: .002
(In reply to Kevin Ghim from comment #0)
> - Data verification: with a given test campaign's report, verify that the
> blacklisted site did not record an impression. I assume we would have to use
> Rappor or RR for this.
We have not discussed using RAPPOR or randomized response for sending back user history, so I'm not sure what you're suggesting for "test campaign's report." And even with RAPPOR/RR, it wouldn't tell us for sure if a blacklisted site was visible.

We don't send impressions about which history tiles were shown.
jterry, is there an acceptable way to shorten the lists? http://dsi.ut-capitole.fr/blacklists/index_en.php has an adult list with over 1million sites. Can we do just the top X number of adult sites?
Flags: needinfo?(jterry)
Taking this off of priority list until we hear back from Lass on how agencies work with Twitter and Facebook's user generated content adjacency issues.
Whiteboard: .002 → .?
In that we might not need to implement this at all? Or we just don't know the details of the list?

If it's the latter, maksik can continue implementing something to support a list that we decide to go with.
Just for posterity, what we discussed earlier today and my thoughts:

So there are several lines of thinking that interrelate on this issue I think that it might be best if we set aside some time just on this to discuss further and solve in more detail. If needed I am in.

BD side - as we discussed, we can't and therefore shouldn't agree to terms that guarantee to a partner that we can prevent any type of adjacency issues (mainly pinning is a user control feature that we do not have any control on). But, to close this issue with them and help we can offer that we will make reasonable efforts to control for certain types of adjaceny (if they ask what this means we would have to be more and more specific so the devil is in the details of what we can control technologically and I would need to know if we can do this if at all...how?), and that upon a notice from the partner we can implement certain take-down control (i.e., turning a Pinned Tile into a History Tile (Ed to confirm this can be done easily), or perhaps moving the partner Tile to a new location in the grid).

Blacklisting - I think we need to think through carefully again and operate very transparently and share with the world our plan. In my view, Mozilla should curate this list. We can check out publicly available data and then create our own. Any lists that we get from others or that have license terms I need to be looped in and let's discuss the rules of the road more deeply for each as we go. I think we should involve policy on this early as well.

I think we need to define what we mean when we say "adult content" or any other similarly blacklisted category or site and the who, what and why very carefully.

I don't think we should let partners dictate at all what goes on our lists.
I don't think we should rely on others lists or license them as there are many problems with that (cost, SLA, 3rd party curating the web for Mozilla and policy problems etc..)

This is probably the flip of the "what partners we will work with list" / whitelist, so perhaps we can leverage that as we build this one and again policy needs to be involved.

I'd like to understand how RAPPOR works here a bit more. If we are doing user data stuff here too btw let's add Marshall and make sure he understands this piece.
Whiteboard: .? → .002
Blocks: 1160372
Blocks: 1140185
(In reply to Ed Lee :Mardak from comment #2)
> jterry, is there an acceptable way to shorten the lists?
> http://dsi.ut-capitole.fr/blacklists/index_en.php has an adult list with
> over 1million sites. Can we do just the top X number of adult sites?
As discussed, we'll do a best effort while being mindful of disk usage. We'll produce a list in bug 1160596 to hardcode into Firefox in this bug.
Depends on: 1160596
Flags: needinfo?(jterry)
maksik said that MattN had some comments about blacklisting:

1) don't have a cleartext list of adult sites in Firefox
2) potential size/disk/memory issues with having a large list (100k+?)
3) could potentially reuse some of safebrowsing

For (1), what's the concern here? We could hash eTLD+1 and match, but it's not really hiding anything. A list of porn sites is not hard to find, and hashes as strings probably makes (2) worse.

There was a suggestion from dev.planning on using a bloom filter, which should help address (1) and (2) and provide some false positives to add to some deniability if somehow we tracked that the user triggered on the blacklist.
Flags: needinfo?(MattN+bmo)
Oh I see MattN's comments in bug 1105376.
Flags: needinfo?(MattN+bmo)
Depends on: 1161201
(In reply to Ed Lee :Mardak from comment #7)
> maksik said that MattN had some comments about blacklisting:
> 
> 1) don't have a cleartext list of adult sites in Firefox

I agree that we shouldn't do this.

We need to either push back on the requirement or find another solution.
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #9)
> (In reply to Ed Lee :Mardak from comment #7)
> > 1) don't have a cleartext list of adult sites in Firefox
> I agree that we shouldn't do this.
Just to add some notes from IRC:

The concern isn't so much of plaintext or not, i.e., encoding/decoding (rot13,atob) vs 1-way hash doesn't change the worry.

Part of the reason why this is required is for legal compliance of contracts from advertisers that want to protect their brand, so Mozilla will provide a best effort of not juxtaposing questionable content with the paid content.

We could try to avoid the problem by not having it in the contract in the first place, but this then leaves a significant amount of money on the table from brands that would pay for this feature/protection. We have foregone money in building a system that protects user data and privacy, but I don't think this feature impacts users much at all.

This lack of user impact is also an argument against this because there isn't much user value gained for implementing this either.

From the business side, I hear that negative adjacency is one of the largest if not the largest limiter of how much money can be made.
No longer blocks: 1160372
Following up from earlier, I think it might be time to have a meeting on this topic as a group so we can be on the same page, avoid surprises and plan the right path to get to release without hiccups or stalls, Kevin will you lead this on this topic? Add all the relevant / key folks here so we can continue the discussion from earlier today.
Flags: needinfo?(kghim)
We'll include example.com to the list for people to be able to test without actually going to an adult site.
Assignee: nobody → mzhilyaev
Iteration: --- → 40.2 - 27 Apr
Points: --- → 13
Component: Tiles → New Tab Page
Product: Content Services → Firefox
Flags: needinfo?(kghim)
(In reply to Geoff Piper from comment #11)
> Following up from earlier, I think it might be time to have a meeting on
> this topic as a group so we can be on the same page, avoid surprises and
> plan the right path to get to release without hiccups or stalls, Kevin will
> you lead this on this topic? Add all the relevant / key folks here so we can
> continue the discussion from earlier today.

One possible technical approach here is to use the safe browsing technology
to carry the blacklist. This would avoid needing to have any list on the browser
and would also let us leverage existing technology. Geoff, I'd be happy to help
out here as needed.
Iteration: 40.2 - 27 Apr → 41.1 - May 25
Attachment #8604971 - Flags: review?(msamuel)
Blocks: 1164303
Comment on attachment 8604971 [details] [diff] [review]
v1. initial implementation of negative adjacency

Review of attachment 8604971 [details] [diff] [review]:
-----------------------------------------------------------------

Max,

Are you planning to land a non-invertible transform before shipping?
(In reply to Eric Rescorla (:ekr) from comment #15)
> Are you planning to land a non-invertible transform before shipping?
We could, what's the benefit of doing that?
> Are you planning to land a non-invertible transform before shipping?

just to clarify that this patch is for getting client's negative adjacency logic right.
it will be trivial to make client download black list from "save browsing" endpoint and/or use crypto hash.

To Ed's point, the black list of ill reputable sites is available: anyone can reconstruct the blacklist client has (or downloads).   "non-invertible transform" does not seem to provide any more protection as the simplest clear-text obfuscation (such as base64)... right?
My understanding was that there was some concern about having the list of
blacklisted sites on the client (hence the interest in Bloom filtrs,
hashing, etc.) I don't personally have a strong opinion on whether this
is necessary, but I would observe that having the data base64ed
(or any trivially invertible transform) looks a lot more like having
the list on the client than not.
One specific example I heard was to avoid false positive matches of some computer file scanner searching for porn domains triggering on plaintext lists in Firefox. So basic base64 could avoid that.

The bloom filter was more in the context of adding false positive site matches to add noise to when a site isn't shown, so that a malicious listener wouldn't know for sure if it was truly a blacklisted site triggering.
Comment on attachment 8604971 [details] [diff] [review]
v1. initial implementation of negative adjacency

cancelling review until safe browsing issues are considered (see Bug 1164303)
Attachment #8604971 - Flags: review?(msamuel)
Comment on attachment 8604971 [details] [diff] [review]
v1. initial implementation of negative adjacency

Requesting review again since synchronous calls to nsIURIClassifier.idl is OK.
Attachment #8604971 - Flags: review?(msamuel)
Comment on attachment 8604971 [details] [diff] [review]
v1. initial implementation of negative adjacency

Review of attachment 8604971 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/modules/DirectoryLinksProvider.jsm
@@ +745,5 @@
>        this._topSitesWithSuggestedLinks.add(changedLinkSite);
>        return true;
>      }
> +
> +    // always run _updateSuggestedTile if changedLinkSite is black listed

You're sending aLink to _isSiteBlackListed() not changedLinkSite so this comment is a bit confusing.

@@ +747,5 @@
>      }
> +
> +    // always run _updateSuggestedTile if changedLinkSite is black listed
> +    // and there are tiles configured to avoid it
> +    if (this._avoidNegativeSites && this._isSiteBlackListed(aLink)) {

`_isSiteBlackListed` seems like a misleading name since you're passing it a link object rather than a site

@@ +803,5 @@
>        }
>      }
> +    // since newTabLinks are available, set _newTabHasNegativeSite here
> +    // note that _shouldUpdateSuggestedTile is called by _updateSuggestedTile
> +    this._newTabHasNegativeSite = this._avoidNegativeSites && this._checkForNegativeSites(newTabLinks);

I would expect this._newTabHasNegativeSite to get set whenever newtab links change. _getCurrentTopSiteCount() does get called indirectly whenever that happens, but it's not super clear.

Maybe a comment here that we are setting _newTabHasNegativeSite when links change or to move this code to a more explicit location?

@@ +942,5 @@
>  
> +  /****** Negative adjacency *********/
> +
> +  /**
> +   * Loads sites balcklist for negative adjacency

balcklist -> blacklist

@@ +980,5 @@
> +      }
> +      // get the domain from host
> +      try {
> +        // extract base domain from host
> +        baseDomain = eTLD.getBaseDomainFromHost(host);

You would only get here if !link.baseDomain. So by the end of this try/catch, you might either have a baseDomain that was achieved through NewTabUtils.extractSite() OR eTLD.getBaseDomainFromHost(host). Was this on purpose? If so, could you explain in a comment in which case you use each approach?

@@ +997,5 @@
> +   * Checks if new tab has black listed site
> +   * @param new tab links (or nothing, in which case NewTabUtils.links.getLinks() is called
> +   * @return true if new tab shows negative site
> +   */
> +  _checkForNegativeSites: function DirectoryLinksProvider_checkForNegativeSites(newTabLink) {

newTabLink -> newTabLinks
Attachment #8604971 - Flags: review?(msamuel)
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #9)
> I agree that we shouldn't do this.
Could you explain your concern, so we can figure out how we should move forwards?
Flags: needinfo?(gavin.sharp)
dcamp pointed out how the blacklist is not quite the right name and could cause people to think the wrong thing. We're not actually blocking content from a list of sites, and we're actually closer to blocking ads. (Hey! We're shipping with a built in ad blocker? ;))

It sounds like we need a name for the list and the behavior it self.

"safety list" and "check safety"

e.g., if (tile.checkSafety) SAFETY_LIST.enforce()
maksik suggested inadjacency. That seems plain enough. :) We've got frecency and now inadjacency! Hopefully people don't get confused with adjacency lists or frecent/recency. ;)
I actually like safety theme better.  inadjacency seems over-academic (and hard to spell) :)
Comment on attachment 8604971 [details] [diff] [review]
v1. initial implementation of negative adjacency

>         link.lastVisitDate = rawLinks.suggested.length - position;
>+        // check if link wants to avoid negative sites
>+        if (link.avoidNegativeSites) {
>+          this._avoidNegativeSites = true;
The naming convention of values from the server aren't camel case, so use:

if (link.check_inadjacency)
Attached patch Bug_1159884.patch.v2 (obsolete) — Splinter Review
replaced mentions of black and negative with inadjacency
Attachment #8604971 - Attachment is obsolete: true
Attachment #8608379 - Flags: review?(adw)
Attachment #8608379 - Attachment is obsolete: true
Attachment #8608379 - Flags: review?(adw)
Attachment #8608389 - Flags: review?(adw)
I tried to catch you on IRC, but what I think is:
- shipping a not-too-big, non-reversible in-product blacklist "for now" is acceptable, assuming technically feasible and not too risky for your timeline etc.
- we should really move away from that towards something like bug 1164303 in the not-too-distant future
Flags: needinfo?(gavin.sharp)
make file names more explicit
Attachment #8608389 - Attachment is obsolete: true
Attachment #8608389 - Flags: review?(adw)
Attachment #8608392 - Flags: review?(adw)
Attachment #8608392 - Attachment is patch: true
Attachment #8608392 - Attachment mime type: text/x-patch → text/plain
Comment on attachment 8608392 [details] [diff] [review]
v4.  explicit naming of newTab.inadjacent.json file

Review of attachment 8608392 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/modules/DirectoryLinksProvider.jsm
@@ +976,5 @@
> +      this._inadjacentSites = new Set();
> +      if (jsonObject && jsonObject["domains"]) {
> +        // iterate through inadjacent sites
> +        jsonObject["domains"].forEach(site => {
> +          this._inadjacentSites.add(site);

You can just do new Set(jsonObject.domains) since Sets can be initialized with arrays.
Attachment #8608392 - Flags: review?(adw) → review+
Summary: Implement adult related content negative adjacency → Implement inadjacency with a hardcoded list of sites
Summary: Implement inadjacency with a hardcoded list of sites → Implement inadjacency with a hardcoded list of hashed sites
(In reply to Drew Willcoxon :adw from comment #32)
> You can just do new Set(jsonObject.domains) since Sets can be initialized
> with arrays.
Neat. ;)

-      this._inadjacentSites = new Set();
-      if (jsonObject && jsonObject["domains"]) {
-        // iterate through inadjacent sites
-        jsonObject["domains"].forEach(site => {
-          this._inadjacentSites.add(site);
-        });
-      }
+      this._inadjacentSites = new Set(jsonObject && jsonObject.domains);

new Set(undefined).size == 0

Saved some more lines :)
Attached patch v4.1 (obsolete) — Splinter Review
Attachment #8608392 - Attachment is obsolete: true
Attached patch v4.2 (obsolete) — Splinter Review
Attachment #8608425 - Attachment is obsolete: true
Attached patch v4.3Splinter Review
Attachment #8608427 - Attachment is obsolete: true
Attached patch documentation v1Splinter Review
Attachment #8608564 - Flags: feedback?(benjamin)
Comment on attachment 8608564 [details] [diff] [review]
documentation v1

f? per https://wiki.mozilla.org/Firefox/Data_Collection

There's no additional data being collected, but the server provides an extra boolean to restrict when a suggestion can be shown.
https://hg.mozilla.org/mozilla-central/rev/3257b096219c
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 41
Comment on attachment 8608564 [details] [diff] [review]
documentation v1

"The inadjacency list is packaged with Firefox" would be better if you specified the source file which contains the list.
Attachment #8608564 - Flags: feedback?(benjamin) → feedback+
Attached image verify screenshot
verified on osx nightly 2015-05-26

1) create new profile
2) visit enough mozilla related sites (at least 8) to get suggestion to appear on new tab page [see attachment first row]
3) visit example.com
4) open new tab page and see example.com but no suggestion [see attachment second row]
5) block example.com tile and open new tab to see suggestion return
6) drag example.com from another tab into newtab to pin tile and cause suggestion to disappear [see row third row]
Sorry step 1a:

change pref in about:config
browser.newtabpage.directory.source
to
https://people.mozilla.org/~elee/suggested.1159884.json
Status: RESOLVED → VERIFIED
Comment on attachment 8610024 [details] [diff] [review]
for aurora (Mardak will land)

Approval Request Comment: See bug 1140185 comment 11
Attachment #8610024 - Flags: approval-mozilla-aurora?
Comment on attachment 8610024 [details] [diff] [review]
for aurora (Mardak will land)

approval-mozilla-aurora+ granted in bug 1140185 comment 14
Attachment #8610024 - Flags: approval-mozilla-aurora?
Confirming the fix using the STR from comment 46 and comment 47 on latest Aurora, build ID: 20150528004000.

Testing was performed on Windows 7 64-bit, Mac OS X 10.9.5 and Ubuntu 14.04 32-bit.
QA Contact: cornel.ionce
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.