Let's take the subdomain impressions from bug 1062708 and calculate some classification coverage precision/recall. We'll probably have at least 2 sets of numbers: one for just unique subdomain coverage and another weighted by the number of impressions per subdomain.
Resetting to the next iteration, as I am currently working on site co-occurrence data, infernyx rules, and analytics. If there's a pressing business need to get this data soon, let me know; otherwise I would like to look at it next iteration.
site_stats_daily needs to be rebuilt before classification can run. Currently there are only 20 sites in this table:

psql (9.3.1, server 8.0.2)
Type "help" for help.

tiles=> select count(distinct(url)) from site_stats_daily where url != '';
 count
-------
    20
(1 row)
Only 6% of all tile URLs were classified by the UP classification we use in ID.

The list of categorized sites is here:
https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.sites_categorized

Statistics per UP category are here:
https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

The UP classifier will perform poorly on just domains or hosts. Also note that site impressions cannot simply be added up to get the impression count for a category: several sports sites may occur on the same new-tab page, so a single page view would be counted more than once. We need a better strategy for audience sizing, and to do so we need to understand audience segmentation.

I am closing this bug as it's formally fixed, but we MUST talk about how we segment and size the audience. This deserves a separate story bug in my opinion.
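To illustrate the double-counting concern, here is a minimal Python sketch. It assumes per-page-view data (the list of hosts shown on each new-tab page), which is not what the per-site counts in this bug provide; the `page_views` layout, the `category_reach` helper, and the `nba.com` and `example.org` hosts are hypothetical and exist only for the example.

```python
def category_reach(page_views, site_category, category):
    """Compare two ways of sizing a category.

    page_views    -- hypothetical input: one list of hosts per new-tab page view
    site_category -- mapping host -> category for the classified hosts
    Returns (summed_site_impressions, page_views_with_category).
    """
    summed = 0   # what you get by adding per-site impression counts
    pages = 0    # page views that showed at least one site of the category
    for hosts in page_views:
        hits = sum(1 for h in hosts if site_category.get(h) == category)
        summed += hits
        if hits:
            pages += 1
    return summed, pages


# Illustrative data only: two sports tiles share the first page view.
site_category = {"espn.go.com": "sports", "nba.com": "sports"}
page_views = [
    ["espn.go.com", "nba.com", "example.org"],
    ["espn.go.com"],
    ["example.org"],
]
print(category_reach(page_views, site_category, "sports"))
```

In this toy input the summed site impressions come to 3 while only 2 page views actually showed a sports tile, which is the gap that a proper audience-sizing approach would have to account for.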
What's the coverage weighted by impression? E.g., espn.go.com has 30789 impressions and was categorized but some.random.site.com with 1 impression wasn't categorized.
These will be columns 3 and 5 of https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

category,sites,impressions,% of total sites,% of total impressions
UNCATEGORIZED,38043,5191729,93.38,86.66
fashion,7,61,0.02,0.01
...
sports,395,111758,0.97,15.94

The number in the third column is the sum of all site impressions falling into the category.
The number in the fifth column is the category's impression sum as a % of the total impression sum.

All uncategorized sites were put into the UNCATEGORIZED category, which covers 93.38% of all sites and 86.66% of the impression sum over all sites.
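For reference, the column definitions above can be sketched in Python roughly as follows. This is not the script that produced cats_stats; the `category_stats` helper and the `(host, category, impressions)` input layout are assumptions, and the sample rows reuse only the two hosts mentioned earlier in this bug.

```python
from collections import defaultdict


def category_stats(sites):
    """Build rows of (category, sites, impressions, % of sites, % of impressions).

    sites -- iterable of (host, category_or_None, impressions); hosts without
             a category are counted under UNCATEGORIZED.
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [site count, impression sum]
    total_sites = 0
    total_impressions = 0
    for _host, category, impressions in sites:
        cat = category or "UNCATEGORIZED"
        per_cat[cat][0] += 1
        per_cat[cat][1] += impressions
        total_sites += 1
        total_impressions += impressions

    rows = []
    for cat, (n_sites, imp) in sorted(per_cat.items()):
        pct_sites = round(100.0 * n_sites / total_sites, 2) if total_sites else 0.0
        pct_imp = round(100.0 * imp / total_impressions, 2) if total_impressions else 0.0
        rows.append((cat, n_sites, imp, pct_sites, pct_imp))
    return rows


# Two hosts taken from the comment above; a real run would use the full site list.
sample = [
    ("espn.go.com", "sports", 30789),
    ("some.random.site.com", None, 1),
]
for row in category_stats(sample):
    print(",".join(str(v) for v in row))
```

The unweighted coverage is 100 minus the UNCATEGORIZED value in the fourth column, and the impression-weighted coverage is 100 minus its value in the fifth column, provided each site is assigned to exactly one category.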