Closed Bug 1148474 Opened 9 years ago Closed 9 years ago

Generate white list from similar sites data and compare with alexa top-1m list

Categories

(Content Services Graveyard :: Tiles, defect, P1)

defect
Points:
8

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
39.3 - 30 Mar

People

(Reporter: mzhilyaev, Assigned: mzhilyaev)

References

Details

(Whiteboard: .?)

Attachments

(1 file, 1 obsolete file)

      No description provided.
Format of the file:
site | occurrences in sites listing | alexa rank
The file was generated by collecting all sites in the similar-sites data and counting the number of times each site was mentioned in any of the lists. An Alexa rank was then assigned to each site listed in Alexa's top-1m.csv; otherwise 9999999 was assigned.

The sites were chosen if:
- sim-sites occurrence is 2 or higher
- or if the Alexa rank is below 200000
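
The counting and selection steps described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the attachment: the function names, the in-memory list-of-lists input, and the `rank,site` row layout of top-1m.csv are assumptions.

```python
import csv
from collections import Counter

UNRANKED = 9999999  # rank assigned when a site is absent from top-1m.csv


def load_alexa_ranks(path):
    """Read rows assumed to look like `rank,site` into a {site: rank} dict."""
    with open(path, newline="") as f:
        return {row[1]: int(row[0]) for row in csv.reader(f) if len(row) == 2}


def build_white_list(sim_site_lists, alexa_ranks, min_occurrences=2, max_rank=200000):
    """Return (site, occurrences, alexa_rank) rows for the selected sites.

    sim_site_lists is a hypothetical in-memory form of the similar-sites
    data: an iterable of site lists.
    """
    # Count every mention of a site across all similar-sites lists.
    occurrences = Counter(site for sites in sim_site_lists for site in sites)
    rows = []
    for site, count in occurrences.items():
        rank = alexa_ranks.get(site, UNRANKED)
        # Keep the site if it occurs often enough OR ranks high enough.
        if count >= min_occurrences or rank < max_rank:
            rows.append((site, count, rank))
    return rows
```

Only sites that appear in the similar-sites data are candidates here, which matches the description: the Alexa rank is used as a second admission criterion, not as an extra source of sites.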

The resulting list has 50557 entries and covers 86% of EdRules (which is better than Alexa's original white list). I would recommend using the attached white list over the top 50K entries of Alexa's top-1m.csv.
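
For reference, a coverage figure like the one above could be computed along these lines. This is only a sketch: it assumes "covers" means the share of distinct domains referenced by EdRules that also appear in the white list, and the `coverage` helper and its inputs are hypothetical.

```python
def coverage(white_list, rule_domains):
    """Fraction of distinct rule domains that are present in the white list."""
    rule_domains = set(rule_domains)
    if not rule_domains:
        return 0.0
    return len(rule_domains & set(white_list)) / len(rule_domains)
```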
Removed junk domains like:
4.cn
6.cn
com.
d.cn
g.cn
i.ua
j.mp
o.cn
org.
q.gs
t.cn
t.co
u.tv
w.cn
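
The bug does not state the rule used to pick these domains. One heuristic that happens to match every example listed above is sketched below: flag entries that end in a dot (bare TLDs such as `com.`) and entries with a one-character name directly under a TLD. The `looks_like_junk` name and the rule itself are assumptions, and such a rule may also flag legitimate one-letter domains.

```python
def looks_like_junk(domain):
    """Heuristic guess at a junk entry; not the rule used for the attachment."""
    # Trailing dot: a bare TLD such as "com." or "org."
    if domain.endswith("."):
        return True
    labels = domain.split(".")
    # One-character name directly under a TLD, e.g. "t.co" or "4.cn"
    return len(labels) == 2 and len(labels[0]) == 1
```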
Attachment #8584843 - Attachment is obsolete: true
Per conversation with Mardak, closing this bug, as the white list seems to perform well and it's difficult to improve it by adding more sites from Alexa or by increasing the selection rank for a single sim-site occurrence.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED