Analyze top sites experiment data with Interest Dashboard classifier



Content Services Graveyard
Classification Engine
3 years ago
3 years ago


(Reporter: Mardak, Assigned: maxim zhilyaev)


Firefox Tracking Flags

(Not tracked)


(Whiteboard: .003)



3 years ago
Let's take the subdomain impressions from bug 1062708 and calculate some classification coverage precision/recall. We'll probably have at least 2 sets of numbers: one for just unique subdomain coverage and another weighted by the number of impressions per subdomain.
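
To make the two sets of numbers concrete, here is a minimal sketch of unique-subdomain coverage versus impression-weighted coverage. It assumes the data has been reduced to a mapping of subdomain to impression count plus a set of subdomains the classifier labeled; the function name, argument shapes, and sample hosts are hypothetical, not taken from bug 1062708. It measures coverage only; precision and recall would additionally need ground-truth labels.

```python
def coverage(impressions, classified):
    """Return (unique_coverage, weighted_coverage) as fractions in [0, 1].

    impressions: dict of subdomain -> impression count (assumed input shape)
    classified:  set of subdomains that received at least one category
    """
    total_sites = len(impressions)
    total_impressions = sum(impressions.values())
    hit_sites = [site for site in impressions if site in classified]

    # Every subdomain counts once, regardless of traffic.
    unique = len(hit_sites) / total_sites if total_sites else 0.0
    # Every subdomain counts in proportion to its impressions.
    weighted = (
        sum(impressions[site] for site in hit_sites) / total_impressions
        if total_impressions
        else 0.0
    )
    return unique, weighted


# Made-up sample data for illustration only.
sample = {
    "sports.example.com": 300,
    "news.example.org": 100,
    "blog.example.net": 100,
}
unique_cov, weighted_cov = coverage(sample, {"sports.example.com"})
print(f"unique: {unique_cov:.2%}, weighted: {weighted_cov:.2%}")
```

The two numbers can diverge widely when a few high-traffic subdomains are classified and a long tail is not.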

Comment 1

3 years ago
Resetting to next iteration, as I am currently working on site co-occurrence data, infernyx rules, and analytics. Unless there's a pressing business need to get this data sooner, I would like to look at it next iteration.
Iteration: 38.2 - 9 Feb → 38.3 - 23 Feb
Points: --- → 13

Comment 2

3 years ago
site_stats_daily needs to be rebuilt before classification can be run; currently there are only 20 sites in this table:
psql (9.3.1, server 8.0.2)
Type "help" for help.

tiles=> select count(distinct(url)) from site_stats_daily where url != '';
(1 row)

Comment 3

3 years ago
Only 6% of all tile URLs were classified by the UP classifier we use in the Interest Dashboard (ID).

The list of categorized sites here:
Statistics per UP category is here:

The UP classifier will perform poorly when given only domains or hosts.
Also note that site impressions cannot simply be added up to get the impression count for a category: several sports sites may appear on the same new-tab page, so summing their impressions counts that page more than once.
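
The double-counting problem can be illustrated with a small sketch. It assumes page-level data is available as a list of new-tab page views, each listing the sites shown on that page; that input shape, the function name, and the sample hosts are hypothetical and are not what site_stats_daily stores.

```python
from collections import defaultdict


def category_counts(page_views, site_category):
    """Compare summed site impressions with distinct page views per category.

    page_views:    list of lists, each inner list is the sites on one page view
    site_category: dict of site -> category (uncategorized sites are absent)
    """
    summed = defaultdict(int)    # one count per site impression
    distinct = defaultdict(int)  # one count per page view showing the category
    for sites in page_views:
        categories_on_page = set()
        for site in sites:
            category = site_category.get(site)
            if category is None:
                continue
            summed[category] += 1
            categories_on_page.add(category)
        for category in categories_on_page:
            distinct[category] += 1
    return dict(summed), dict(distinct)


# Made-up data: the first page view shows two sports sites at once.
views = [["espn.example", "nba.example"], ["espn.example"], ["news.example"]]
cats = {"espn.example": "sports", "nba.example": "sports"}
summed, distinct = category_counts(views, cats)
print(summed, distinct)
```

Here the summed count for sports is 3 while only 2 page views actually showed a sports site.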

We need a better strategy for audience sizing, and to do so we need to understand audience segmentation.
I am closing this bug as it's formally fixed, but we MUST talk about how we segment and size the audience.
This deserves a separate story bug, in my opinion.
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 4

3 years ago
What's the coverage weighted by impression? E.g., has 30789 impressions and was categorized but with 1 impression wasn't categorized.

Comment 5

3 years ago
These will be columns 3 and 5 of

category,sites,impressions,% of total sites,% of total impressions

The number in the third column is the sum of impressions of all sites falling into the category.
The number in the fifth column is the category's impression sum as a percentage of the total impression sum.

All uncategorized sites were put into the UNCATEGORIZED category, which covers 93.38% of all sites and 86.66% of the total impressions across all sites.
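
For reference, the five columns described above could be produced along these lines. This is a sketch under assumptions: each site maps to at most one category, the inputs are plain dictionaries, and all names and sample values are hypothetical rather than the actual script used for the published statistics.

```python
from collections import defaultdict


def category_table(impressions, site_category):
    """Build rows of (category, sites, impressions, % of sites, % of impressions).

    impressions:   dict of site -> impression count
    site_category: dict of site -> category; missing sites become UNCATEGORIZED
    """
    site_counts = defaultdict(int)
    impression_sums = defaultdict(int)
    for site, count in impressions.items():
        category = site_category.get(site, "UNCATEGORIZED")
        site_counts[category] += 1
        impression_sums[category] += count

    total_sites = len(impressions)
    total_impressions = sum(impressions.values())
    rows = []
    for category in sorted(site_counts):
        pct_sites = 100.0 * site_counts[category] / total_sites
        pct_impressions = (
            100.0 * impression_sums[category] / total_impressions
            if total_impressions
            else 0.0
        )
        rows.append((
            category,
            site_counts[category],
            impression_sums[category],
            round(pct_sites, 2),
            round(pct_impressions, 2),
        ))
    return rows


# Made-up sample data for illustration only.
rows = category_table(
    {"a.example": 30, "b.example": 10, "c.example": 60},
    {"a.example": "sports"},
)
print("category,sites,impressions,% of total sites,% of total impressions")
for row in rows:
    print(",".join(str(value) for value in row))
```

Note that the impression column is a plain sum of site impressions, so it inherits the double-counting caveat raised in comment 3.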