Generate white list from similar sites data and compare with alexa top-1m list

RESOLVED FIXED

Status

Content Services Graveyard
Tiles
P1
normal
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: maxim zhilyaev, Assigned: maxim zhilyaev)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: .?)

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Comment 1

3 years ago
Created attachment 8584843 [details]
sim-sites.whitelist - contains the white list extracted from similar-sites data

Format of the file:
site | occurrences in sites listing | alexa rank
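A minimal sketch of reading one line in this format, assuming the fields are separated by a literal `|` with optional surrounding whitespace and that both numeric fields are plain integers. The names `WhitelistEntry` and `parse_whitelist_line` are hypothetical and do not come from the attachment.

```python
from typing import NamedTuple


class WhitelistEntry(NamedTuple):
    site: str          # domain name
    occurrences: int   # number of mentions in the similar-sites listings
    alexa_rank: int    # Alexa rank


def parse_whitelist_line(line: str) -> WhitelistEntry:
    """Parse one 'site | occurrences | alexa rank' line (assumed delimiter: '|')."""
    site, occurrences, rank = (field.strip() for field in line.split("|"))
    return WhitelistEntry(site, int(occurrences), int(rank))
```

For example, `parse_whitelist_line("example.com | 3 | 1234")` yields an entry with `site="example.com"`, `occurrences=3`, and `alexa_rank=1234`.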
(Assignee)

Comment 2

3 years ago
The file was generated by collecting all sites in the similar-sites data and counting the number of times each site was mentioned across all of the lists. Each site was then assigned its Alexa rank if it was listed in alexa top-1m.csv; otherwise 9999999 was assigned.

Sites were chosen if:
- the sim-sites occurrence count is 2 or higher
- or the Alexa rank is below 200000

The resulting list has 50557 entries and covers 86% of EdRules (which is better than Alexa's original white list). I would recommend using the attached white list over the top 50K of alexa top-1m.csv.
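The selection procedure described in this comment can be sketched roughly as follows. This is not the script that produced the attachment: the function name `build_whitelist`, the in-memory input shapes (a list of site lists and a site-to-rank dict), and the alphabetical output order are all assumptions made for illustration. Only the two thresholds and the 9999999 placeholder come from the comment.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

UNRANKED = 9999999        # rank assigned when a site is not in alexa top-1m.csv
MIN_OCCURRENCES = 2       # "sim-sites occurrence is 2 or higher"
MAX_ALEXA_RANK = 200000   # "alexa rank is below 200000"


def build_whitelist(
    similar_site_lists: Iterable[Iterable[str]],
    alexa_ranks: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (site, occurrences, alexa_rank) rows for the selected sites."""
    # Count every mention of a site across all similar-sites lists.
    counts = Counter(site for sites in similar_site_lists for site in sites)

    selected = []
    for site, occurrences in counts.items():
        rank = alexa_ranks.get(site, UNRANKED)
        # Keep a site if either criterion holds.
        if occurrences >= MIN_OCCURRENCES or rank < MAX_ALEXA_RANK:
            selected.append((site, occurrences, rank))

    # Sorting by site name is an assumption, used here only for stable output.
    return sorted(selected)
```

Note that only sites appearing in the similar-sites data are candidates here; a site that is highly ranked by Alexa but never mentioned in any list is not added.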
(Assignee)

Comment 3

3 years ago
Created attachment 8585786 [details]
remove some junk domains

Removed junk domains like:
4.cn
6.cn
com.
d.cn
g.cn
i.ua
j.mp
o.cn
org.
q.gs
t.cn
t.co
u.tv
w.cn
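The comment does not state what rule was used to pick these entries. As a purely hypothetical heuristic that happens to match every example listed above, one could drop entries that end with a dot (bare TLDs such as `com.`) or whose first label is a single character. The function name `is_junk_domain` is an assumption, and this rule would also drop real short domains, so it is a sketch rather than the method actually used.

```python
def is_junk_domain(site: str) -> bool:
    """Hypothetical heuristic: bare TLD with trailing dot, or one-character first label."""
    if site.endswith("."):
        return True
    first_label = site.split(".", 1)[0]
    return len(first_label) == 1


def remove_junk(sites):
    """Return the sites that the heuristic does not flag, preserving order."""
    return [site for site in sites if not is_junk_domain(site)]
```

For instance, `remove_junk(["example.com", "t.co", "org."])` keeps only `example.com`.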
Attachment #8584843 - Attachment is obsolete: true
(Assignee)

Comment 4

3 years ago
Per conversation with Mardak, closing this bug: the white list seems to perform well, and it is difficult to improve it by adding more sites from Alexa or by increasing the selection rank threshold for sites with a single sim-site occurrence.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED