Closed Bug 1126891 Opened 9 years ago Closed 9 years ago

Analyze top sites experiment data with Interest Dashboard classifier

Categories

(Content Services Graveyard :: Classification Engine, defect)

defect
Not set
normal
Points:
13

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
38.3 - 23 Feb

People

(Reporter: Mardak, Assigned: mzhilyaev)

References

Details

(Whiteboard: .003)

Let's take the subdomain impressions from bug 1062708 and calculate some classification coverage and precision/recall numbers. We'll probably have at least 2 sets of numbers: one for just unique subdomain coverage and another weighted by the number of impressions per subdomain.
Resetting to next iteration as I am currently working on infernyx rules and analytics for sites co-occurrence data. Unless there's a pressing business need to get this data sooner, I would like to look at it next iteration.
Iteration: 38.2 - 9 Feb → 38.3 - 23 Feb
Points: --- → 13
site_stats_daily needs to be rebuilt before classification can be run. Currently there are only 20 sites in this table:
psql (9.3.1, server 8.0.2)
Type "help" for help.

tiles=> select count(distinct(url)) from site_stats_daily where url != '';
 count 
-------
    20
(1 row)
Only about 6% of all the tile URLs were classified by the UP classifier we use in ID (Interest Dashboard).

The list of categorized sites here: https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.sites_categorized
Statistics per UP category is here: https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

The UP classifier will perform poorly on just domains or hosts.
Also note that site impressions cannot simply be added to get the impression count for a category: several sports sites may occur on the same new-tab page, so that page would be counted more than once.
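
The double-counting caveat above can be illustrated with a minimal sketch. All data here is hypothetical: the page views, the site names other than espn.go.com, and the `sports_sites` set are made up for illustration, and the real tiles data may not expose per-page-view records at all.

```python
# Hypothetical data: each inner list is the set of sites shown on one new-tab page view.
page_views = [
    ["espn.go.com", "sports.example.com", "news.example.com"],
    ["espn.go.com", "shop.example.com"],
    ["news.example.com"],
]

# Hypothetical category membership (not taken from the real classifier output).
sports_sites = {"espn.go.com", "sports.example.com"}


def summed_site_impressions(views, category_sites):
    """Add up per-site impressions; a page with two sports tiles counts twice."""
    return sum(1 for view in views for site in view if site in category_sites)


def page_views_with_category(views, category_sites):
    """Count page views that showed at least one site from the category."""
    return sum(1 for view in views if any(s in category_sites for s in view))


naive = summed_site_impressions(page_views, sports_sites)
deduplicated = page_views_with_category(page_views, sports_sites)
print("summed site impressions:", naive)
print("page views with a sports tile:", deduplicated)
```

In this toy input the first page view contains two sports tiles, so the summed figure is larger than the number of page views that actually showed sports content.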

We need a better strategy for audience sizing, and to do so we need to understand audience segmentation.
I am closing this bug as it's formally fixed, but we MUST talk about how we segment and size the audience.
This deserves a separate story bug, in my opinion.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
What's the coverage weighted by impression? E.g., espn.go.com has 30789 impressions and was categorized but some.random.site.com with 1 impression wasn't categorized.
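
As a rough sketch of the difference between the two coverage numbers asked about here, the snippet below computes both for just the two example sites from the question. The `sites` dictionary layout and the function names are assumptions made for illustration; a real run would load every subdomain from the experiment data.

```python
# Per-site records: (impressions, was_categorized).
# Only the two examples from the question are included.
sites = {
    "espn.go.com": (30789, True),
    "some.random.site.com": (1, False),
}


def unique_coverage(records):
    """Fraction of distinct sites that received a category."""
    categorized = sum(1 for _, was_cat in records.values() if was_cat)
    return categorized / len(records)


def weighted_coverage(records):
    """Fraction of total impressions that belong to categorized sites."""
    total = sum(imps for imps, _ in records.values())
    covered = sum(imps for imps, was_cat in records.values() if was_cat)
    return covered / total


print("unique coverage:   {:.2%}".format(unique_coverage(sites)))
print("weighted coverage: {:.2%}".format(weighted_coverage(sites)))
```

With these two sites the unique coverage is one half, while the weighted coverage is close to one because nearly all impressions come from the categorized site.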
These will be columns 3 and 5 of https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

category,sites,impressions,% of total sites,% of total impressions
UNCATEGORIZED,38043,5191729,93.38,86.66
fashion,7,61,0.02,0.01
...
sports,395,111758,0.97,15.94


The number in the third column is the sum of all site impressions falling into the category.
The number in the fifth column is the category's impression sum as a percentage of the total impression sum.

All uncategorized sites were put into the UNCATEGORIZED category, which covers 93.38% of all sites and 86.66% of the sum of all sites' impressions.
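
The overall coverage can be read off the UNCATEGORIZED row as its complement. A minimal sketch, assuming the cats_stats file is plain CSV with the header quoted above (only the UNCATEGORIZED row is embedded here; the `coverage` helper is a hypothetical name):

```python
import csv
import io

# Header and UNCATEGORIZED row as quoted in the comment above.
CATS_STATS = """category,sites,impressions,% of total sites,% of total impressions
UNCATEGORIZED,38043,5191729,93.38,86.66
"""


def coverage(stats_csv):
    """Return (site coverage %, impression coverage %) as 100 minus the UNCATEGORIZED share."""
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row["category"] == "UNCATEGORIZED":
            site_cov = 100.0 - float(row["% of total sites"])
            imp_cov = 100.0 - float(row["% of total impressions"])
            return round(site_cov, 2), round(imp_cov, 2)
    raise ValueError("no UNCATEGORIZED row found")


site_cov, imp_cov = coverage(CATS_STATS)
print("categorized sites: {}%".format(site_cov))
print("categorized impressions: {}%".format(imp_cov))
```

By this reading, 6.62% of sites and 13.34% of the impression sum were categorized.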