Closed Bug 1160596 Opened 5 years ago Closed 5 years ago

Create a general blacklist for negative adjacency

Categories

(Content Services Graveyard :: Tiles, defect)

Points:
5

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
41.1 - May 25

People

(Reporter: Mardak, Assigned: mruttley)

References

Details

(Whiteboard: .001)

Attachments

(1 file, 3 obsolete files)

We need a blacklist to hardcode in bug 1159884. We have various sources of blacklist data that we want to combine. We also probably want to filter based on site ranking/popularity to reduce the total number of entries and lessen the impact on Firefox's disk/memory usage.
Blocks: 1159884
Here's the most current blacklist: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

I'm continually updating it. In terms of the bloom filter, we have to decide at a legal level what error rate is acceptable (perhaps none?). Maksik mentioned that he will investigate whether there is an existing bloom filter implementation in Firefox. We also have to decide which hashing function is best; my suggestion is to use a popular function like murmur: https://gist.github.com/raycmorgan/588423
mxr points to a C++ implementation of a bloom filter:
http://mxr.mozilla.org/mozilla-central/source/mfbt/BloomFilter.h

I am not sure how to surface it to the jsm module
When checking the false positive rate, please run BloomFilter on the Alexa top 1m with adult sites removed and see how many actual false positives we get.
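To make the false-positive check concrete, here is a minimal bloom filter sketch in Python (not the mfbt/BloomFilter.h code, and the domain names are made up). It uses the Kirsch-Mitzenmacher double-hashing trick over SHA-256 in place of murmur; the exact hash only needs to be well distributed.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity, error_rate=0.02):
        # optimal bit count m and hash count k for the target error rate
        self.m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # double hashing: derive k indexes from two 64-bit halves of SHA-256
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[i // 8] >> (i % 8) & 1 for i in self._indexes(item))

# hypothetical blacklist entries for illustration
blacklist = ["badsite.example", "other-bad.example", "example.com"]
bf = BloomFilter(capacity=len(blacklist))
for domain in blacklist:
    bf.add(domain)

# a bloom filter never yields false negatives; false positives are
# bounded (probabilistically) by the configured error rate
assert all(d in bf for d in blacklist)
```

Running the same check over the Alexa top 1m would just mean feeding each domain through `__contains__` and counting hits that aren't actually blacklisted.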
There's a bloom filter implemented in bug 1138022 https://bugzilla.mozilla.org/attachment.cgi?id=8571008&action=diff

And I don't think it matters too much which exact bloom filter function is used. I expect it to be more of a calculation that can be confirmed with an actual implementation. E.g., given the Alexa top 1m domains, what's the collision/false positive rate for a bloom filter with X bits?

Sure, different functions will generate different hashes, and which exact sites collide will differ, but in general I'd think it comes down to the size of the blacklist determining how many bits are needed, plus how many additional bits to lessen false positives.
This article has a rule of thumb:

http://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
One byte per item in the input set gives about a 2% false positive rate.

If we had 1024 blacklisted items, we would get a 2% false positive rate with a 1KB bloom filter. If we wanted 100k items, we would need roughly a 100KB bloom filter for the same 2% rate. Accepting a 10% false positive rate roughly halves the size.
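The rule of thumb can be double-checked against the standard bloom filter approximation p = (1 - e^(-kn/m))^k, with the optimal hash count k = (m/n) * ln 2. A quick sketch:

```python
import math

def false_positive_rate(n_items, n_bits, n_hashes):
    # standard approximation: p = (1 - e^(-kn/m))^k
    return (1 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes

def optimal_hashes(n_items, n_bits):
    # k = (m/n) * ln 2, rounded to a whole number of hash functions
    return max(1, round(n_bits / n_items * math.log(2)))

# one byte (8 bits) per item, as in the rule of thumb:
n = 100_000
m = 8 * n
k = optimal_hashes(n, m)
p = false_positive_rate(n, m, k)
print(f"k={k}, p={p:.4f}")  # about 2%, matching the rule of thumb
```

So 8 bits/item with ~6 hashes lands at roughly a 2% false positive rate, independent of which concrete hash function is chosen.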
Mardak: Are we doing domain, subdomain or path level matching? 

This is a great list we can integrate: http://dsi.ut-capitole.fr/blacklists/index_en.php but it requires a bit more than just domain matching to get at various Google Groups and live.com specifics.
I believe we'll start with just the domain matching for now.
I've integrated some more data from a comscore source and we now have over 2900 domains: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

The counts per category are as follows:
 - drugs: 19
 - gambling: 154
 - adult: 2743
 - alcohol: 50

All domains are in the latest daily Alexa top 1m sites.
mruttley, can you calculate the maxP of those blacklisted sites as one adgroup?
The MaxP of blacklist groups is currently:

Drugs  0.114959
Gambling  0.094614
Adult  0.020494
Alcohol  0.074813

All Together  0.019026
Matthew, could you generate json file of the form:

{"blacklist": [ base64(bad_site_1), base64(bad_site_2), .... , base64("example.com") ]}

Note that I included "example.com" in the list, so we can browser-test negative adjacency without actually going to pron sites. I am using the btoa() call to generate base64 on the client, so it would be advisable to use that for the JSON file generation.
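For reference, btoa() applied to an ASCII domain name is plain standard base64 of the underlying bytes, so the file could be generated server-side with something like this sketch (the first domain is a placeholder):

```python
import base64
import json

def btoa(s):
    # JavaScript's btoa() over a latin-1/ASCII string is standard base64
    return base64.b64encode(s.encode("latin-1")).decode("ascii")

# hypothetical entry plus the browser-test marker from the comment above
domains = ["badsite.example", "example.com"]
payload = {"blacklist": [btoa(d) for d in domains]}
print(json.dumps(payload))
```

In the client, `atob()` on each entry recovers the original domain string, so the two sides stay symmetric.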

When the file is generated, please attach it to the bug.
Flags: needinfo?(mruttley)
maksik: Here is the file: https://github.com/matthewruttley/contentfilter/blob/master/sitesb64.json
Flags: needinfo?(mruttley)
Updated with b64 encoding of example.com as requested by maksik
Status: NEW → RESOLVED
Iteration: 40.3 - 11 May → 41.1 - May 25
Points: --- → 5
Closed: 5 years ago
Resolution: --- → FIXED
Why are we reinventing the wheel and not using the Safe Browsing data format and server? It supports more than just domain matches and is already usable by JS in Firefox. I don't understand how one second it's argued that we need this ASAP, and then on the other hand we make extra work for ourselves.
Do you know who can help with using the existing safe browsing data format and server? If it is indeed simple and low risk to uplift to 39, then we could go that route.
:gcp or :mmc (not sure of availability) know the client code in toolkit, and looking at the server code [1] that I linked to before, it would seem rtilder does too. The client changes look low risk for uplift IMO. I didn't realize you were trying to rush this into Beta, which doesn't seem like a good idea for any of these solutions, as there could be performance issues to handle (especially with a custom solution).

[1] https://github.com/mozilla-services/shavar
Attached file md5.json (obsolete) —
md5 encoding of the blacklist
Attached file md5 and b64 version of the blacklist (obsolete) —
Attachment #8604906 - Attachment is obsolete: true
Attachment #8608241 - Attachment is obsolete: true
We need to change the format of the json file. "blacklist" needs to be replaced with "domains". It should be

{
  "domains": [
    .....
  ]
}

Could you please make the correction?
Flags: needinfo?(mruttley)
Also available here: https://github.com/matthewruttley/contentfilter/blob/master/md5_b64.json
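The 24-character, `==`-terminated entries in that file are consistent with base64 over a raw 16-byte MD5 digest. Assuming that recipe (an assumption, not confirmed in this bug), a sketch of generating the corrected "domains" format with hypothetical entries:

```python
import base64
import hashlib
import json

def md5_b64(domain):
    # assumed recipe: base64 of the raw 16-byte MD5 digest, which yields
    # 24-character tokens ending in "==" like those in md5_b64.json
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

# placeholder domains; the real file hashes the blacklisted sites
domains = ["badsite.example", "example.com"]
payload = {"domains": [md5_b64(d) for d in domains]}
print(json.dumps(payload, indent=2))
```

Hashing before shipping means the client can test membership without the plaintext blacklist ever appearing in the file.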
Attachment #8608315 - Attachment is obsolete: true
Flags: needinfo?(mruttley)
What are the changes?

939a940,941
>         "mNlYGAOPc6KIMW8ITyBzIg==", 
>         "bK045TkBlz+/3+6n6Qwvrg==", 
1065d1066
<         "t1O9jSNjg4DTIv/Za4NbtA==", 
1163d1163
<         "+k5lDb+QdNc9iZ01hL5yBg==", 
1244d1243
<         "2qK2ZEY9LgdKSTaLf6VnLA==", 
1369a1369
>         "hIJA+1QGuKEj+3ijniyBSQ==", 
1528d1527
<         "E2lvMXqHdTw0x+KCKVnblg==", 
1973a1973
>         "8DtgIyYiNFqDc5qVrpFUng==", 
2070d2069
<         "RzX2OfSFEd//LhZwRwzBVw==", 
2211a2211
>         "O7JiE0bbp583G6ZWRGBcfw==", 
2925a2926
>         "gYgCu/qUpXWryubJauuPNw==", 
3010a3012
>         "+YVxSyViJfrme/ENe1zA7A==", 
3089a3092
>         "VZX1FnyC8NS2k3W+RGQm4g==",
Ah I see the commit:

https://github.com/matthewruttley/contentfilter/commit/048e1410da2408d4e79ef0ea8001691606fe9af4

The commit message doesn't quite explain the changes in the list though. Any reason for those changes?
Flags: needinfo?(mruttley)
The changes reflect updates in how the contentfilter is created. The contentfilter only includes sites found in the Alexa top 1m, which changes every day. Thus each time I regenerate the list, a few sites change.

I used backticks in the commit message which seem to have eliminated most/all of it :/ TIL.
Flags: needinfo?(mruttley)