Closed Bug 1160596 Opened 5 years ago Closed 5 years ago

Create a general blacklist for negative adjacency

Categories

(Content Services Graveyard :: Tiles, defect)

Points:
5

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
41.1 - May 25

People

(Reporter: Mardak, Assigned: mruttley)

References

Details

(Whiteboard: .001)

Attachments

(1 file, 3 obsolete files)

We need a blacklist to hardcode in bug 1159884. We have various sources of blacklist data that we want to combine. We also probably want to filter based on site ranking/popularity to reduce the total number of entries and lessen the impact on Firefox's disk/memory usage.
Blocks: 1159884
Here's the most current blacklist: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

I'm continually updating it. In terms of the bloom filter, we have to decide at a legal level what error rate is acceptable (perhaps none?). Maksik mentioned that he will investigate whether there is an existing bloom filter implementation in Firefox. We also have to decide which hashing function is best; my suggestion is to use a popular function like murmur: https://gist.github.com/raycmorgan/588423
mxr points to a C++ implementation of a bloom filter:
http://mxr.mozilla.org/mozilla-central/source/mfbt/BloomFilter.h

I am not sure how to surface it to the jsm module
When checking the false positive rate, please run BloomFilter on the Alexa top 1m with adult sites removed and see how many actual false positives we get.
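To make the false-positive check concrete, here is a minimal bloom filter sketch in Python (not the mfbt/BloomFilter.h code, and the domain names are made up). It uses the Kirsch-Mitzenmacher double-hashing trick over SHA-256 in place of murmur; the exact hash only needs to be well distributed.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity, error_rate=0.02):
        # optimal bit count m and hash count k for the target error rate
        self.m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # double hashing: derive k indexes from two 64-bit halves of SHA-256
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[i // 8] >> (i % 8) & 1 for i in self._indexes(item))

# hypothetical blacklist entries for illustration
blacklist = ["badsite.example", "other-bad.example", "example.com"]
bf = BloomFilter(capacity=len(blacklist))
for domain in blacklist:
    bf.add(domain)

# a bloom filter never yields false negatives; false positives are
# bounded (probabilistically) by the configured error rate
assert all(d in bf for d in blacklist)
```

Running the same check over the Alexa top 1m would just mean feeding each domain through `__contains__` and counting hits that aren't actually blacklisted.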
There's a bloom filter implemented in bug 1138022 https://bugzilla.mozilla.org/attachment.cgi?id=8571008&action=diff

And I don't think it matters too much which exact bloom filter function is used. I expect it to be more of a calculation that can be confirmed with an actual implementation. E.g., given the Alexa top 1m domains, what's the collision/false positive rate for a bloom filter with X bits?

Sure, different functions will generate different hashes, and which exact sites collide will differ, but in general I'd think it comes down to the size of the blacklist determining how many bits are needed, plus how many additional bits to lessen false positives.
This article has a rule of thumb:

http://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
One byte per item in the input set gives about a 2% false positive rate.

If we had 1024 blacklisted items, we would get a 2% false positive rate with a 1KB bloom filter. If we wanted 100k items, we would need roughly a 100KB bloom filter for the same 2% rate. Accepting a 10% false positive rate roughly halves the size.
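The rule of thumb can be double-checked against the standard bloom filter approximation p = (1 - e^(-kn/m))^k, with the optimal hash count k = (m/n) * ln 2. A quick sketch:

```python
import math

def false_positive_rate(n_items, n_bits, n_hashes):
    # standard approximation: p = (1 - e^(-kn/m))^k
    return (1 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes

def optimal_hashes(n_items, n_bits):
    # k = (m/n) * ln 2, rounded to a whole number of hash functions
    return max(1, round(n_bits / n_items * math.log(2)))

# one byte (8 bits) per item, as in the rule of thumb:
n = 100_000
m = 8 * n
k = optimal_hashes(n, m)
p = false_positive_rate(n, m, k)
print(f"k={k}, p={p:.4f}")  # about 2%, matching the rule of thumb
```

So 8 bits/item with ~6 hashes lands at roughly a 2% false positive rate, independent of which concrete hash function is chosen.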
Mardak: Are we doing domain, subdomain or path level matching? 

This is a great list we can integrate: http://dsi.ut-capitole.fr/blacklists/index_en.php but it requires a bit more than just domain matching to get at various Google Groups and live.com specifics.
I believe we'll start with just the domain matching for now.
I've integrated some more data from a comscore source and we now have over 2900 domains: https://github.com/matthewruttley/contentfilter/blob/master/sites.json

The counts per category are as follows:
 - drugs: 19
 - gambling: 154
 - adult: 2743
 - alcohol: 50

All domains are in the latest daily Alexa top 1m sites.
mruttley, can you calculate the maxP of those blacklisted sites as one adgroup?
The MaxP of blacklist groups is currently:

Drugs  0.114959
Gambling  0.094614
Adult  0.020494
Alcohol  0.074813

All Together  0.019026
Matthew, could you generate json file of the form:

{"blacklist": [ base64(bad_site_1), base64(bad_site_2), .... , base64("example.com") ]}

Note that I included "example.com" in the list, so we can browser-test negative adjacency without actually going to pron sites. I am using the btoa() call to generate base64 on the client, so it would be advisable to use that for the JSON file generation.
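For reference, btoa() applied to an ASCII domain name is plain standard base64 of the underlying bytes, so the file could be generated server-side with something like this sketch (the first domain is a placeholder):

```python
import base64
import json

def btoa(s):
    # JavaScript's btoa() over a latin-1/ASCII string is standard base64
    return base64.b64encode(s.encode("latin-1")).decode("ascii")

# hypothetical entry plus the browser-test marker from the comment above
domains = ["badsite.example", "example.com"]
payload = {"blacklist": [btoa(d) for d in domains]}
print(json.dumps(payload))
```

In the client, `atob()` on each entry recovers the original domain string, so the two sides stay symmetric.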

When the file is generated, please attach it to the bug.
Flags: needinfo?(mruttley)
maksik: Here is the file: https://github.com/matthewruttley/contentfilter/blob/master/sitesb64.json
Flags: needinfo?(mruttley)
Updated with b64 encoding of example.com as requested by maksik
Status: NEW → RESOLVED
Iteration: 40.3 - 11 May → 41.1 - May 25
Points: --- → 5
Closed: 5 years ago
Resolution: --- → FIXED
Why are we reinventing the wheel and not using the Safe Browsing data format and server? It supports more than just domain matches and is already usable by JS in Firefox. I don't understand how one second it's argued that we need this ASAP, and then on the other hand we make extra work for ourselves.
Do you know who can help with using the existing safe browsing data format and server? If it is indeed simple and low risk to uplift to 39, then we could go that route.
:gcp or :mmc (not sure of availability) know the client code in toolkit, and looking at the server code [1] that I linked to before, it would seem rtilder does too. The client changes look low risk for uplift IMO. I didn't realize you were trying to rush this into Beta, which doesn't seem like a good idea for any of these solutions, as there could be performance issues to handle (especially with a custom solution).

[1] https://github.com/mozilla-services/shavar
Attached file md5.json (obsolete) —
md5 encoding of the blacklist
Attached file md5 and b64 version of the blacklist (obsolete) —
Attachment #8604906 - Attachment is obsolete: true
Attachment #8608241 - Attachment is obsolete: true
We need to change the format of the json file. "blacklist" needs to be replaced with "domains". It should be

{
  "domains": [
    .....
  ]
}

Could you please make the correction?
Flags: needinfo?(mruttley)
Also available here: https://github.com/matthewruttley/contentfilter/blob/master/md5_b64.json
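The 24-character, `==`-terminated entries in that file are consistent with base64 over a raw 16-byte MD5 digest. Assuming that recipe (an assumption, not confirmed in this bug), a sketch of generating the corrected "domains" format with hypothetical entries:

```python
import base64
import hashlib
import json

def md5_b64(domain):
    # assumed recipe: base64 of the raw 16-byte MD5 digest, which yields
    # 24-character tokens ending in "==" like those in md5_b64.json
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

# placeholder domains; the real file hashes the blacklisted sites
domains = ["badsite.example", "example.com"]
payload = {"domains": [md5_b64(d) for d in domains]}
print(json.dumps(payload, indent=2))
```

Hashing before shipping means the client can test membership without the plaintext blacklist ever appearing in the file.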
Attachment #8608315 - Attachment is obsolete: true
Flags: needinfo?(mruttley)
What are the changes?

939a940,941
>         "mNlYGAOPc6KIMW8ITyBzIg==", 
>         "bK045TkBlz+/3+6n6Qwvrg==", 
1065d1066
<         "t1O9jSNjg4DTIv/Za4NbtA==", 
1163d1163
<         "+k5lDb+QdNc9iZ01hL5yBg==", 
1244d1243
<         "2qK2ZEY9LgdKSTaLf6VnLA==", 
1369a1369
>         "hIJA+1QGuKEj+3ijniyBSQ==", 
1528d1527
<         "E2lvMXqHdTw0x+KCKVnblg==", 
1973a1973
>         "8DtgIyYiNFqDc5qVrpFUng==", 
2070d2069
<         "RzX2OfSFEd//LhZwRwzBVw==", 
2211a2211
>         "O7JiE0bbp583G6ZWRGBcfw==", 
2925a2926
>         "gYgCu/qUpXWryubJauuPNw==", 
3010a3012
>         "+YVxSyViJfrme/ENe1zA7A==", 
3089a3092
>         "VZX1FnyC8NS2k3W+RGQm4g==",
Ah I see the commit:

https://github.com/matthewruttley/contentfilter/commit/048e1410da2408d4e79ef0ea8001691606fe9af4

The commit message doesn't quite explain the changes in the list though. Any reason for those changes?
Flags: needinfo?(mruttley)
The changes reflect updates in how the contentfilter is created. The contentfilter only includes sites found in the Alexa top 1m, which changes every day. Thus each time I regenerate the list, a few sites change.

I used backticks in the commit message which seem to have eliminated most/all of it :/ TIL.
Flags: needinfo?(mruttley)