Analyze top sites experiment data with Interest Dashboard classifier

RESOLVED FIXED

Status

Content Services Graveyard
Classification Engine
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Mardak, Assigned: maxim zhilyaev)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: .003)

(Reporter)

Description

2 years ago
Let's take the subdomain impressions from bug 1062708 and calculate classification coverage and precision/recall. We'll probably have at least 2 sets of numbers: one for unique subdomain coverage only and another weighted by the number of impressions per subdomain.
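
A minimal sketch of the two coverage numbers, assuming the impression data can be reduced to a mapping from subdomain to impression count and that the classifier output can be reduced to the set of subdomains that received at least one category. The names `coverage`, `impressions`, and `categorized` are hypothetical, and the sample counts are made up for illustration. This measures coverage only; precision and recall would additionally need hand-labeled ground truth.

```python
def coverage(impressions, categorized):
    """Return (unique_coverage, weighted_coverage) as fractions in [0, 1].

    impressions: dict mapping subdomain -> impression count (assumed input shape)
    categorized: set of subdomains the classifier assigned a category to
    """
    total_sites = len(impressions)
    total_impressions = sum(impressions.values())
    if total_sites == 0 or total_impressions == 0:
        return 0.0, 0.0
    hit_sites = [site for site in impressions if site in categorized]
    # Unweighted: every subdomain counts once.
    unique = len(hit_sites) / total_sites
    # Weighted: every subdomain counts in proportion to its impressions.
    weighted = sum(impressions[site] for site in hit_sites) / total_impressions
    return unique, weighted


# Illustrative data only; not taken from the experiment.
sample = {
    "sports.example.com": 900,
    "news.example.org": 90,
    "misc.example.net": 10,
}
unique, weighted = coverage(sample, {"sports.example.com"})
```

With this sample, one of three subdomains is categorized, but it carries most of the impressions, so the weighted number is much higher than the unweighted one.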
(Assignee)

Comment 1

2 years ago
Resetting to the next iteration, as I am currently working on site co-occurrence data, infernyx rules, and analytics. Unless there's a pressing business need to get this data sooner, I would like to look at it next iteration.
Iteration: 38.2 - 9 Feb → 38.3 - 23 Feb
Points: --- → 13
(Assignee)

Comment 2

2 years ago
site_stats_daily needs to be rebuilt before classification can run; currently there are only 20 sites in this table:
psql (9.3.1, server 8.0.2)
Type "help" for help.

tiles=> select count(distinct(url)) from site_stats_daily where url != '';
 count 
-------
    20
(1 row)
(Assignee)

Comment 3

2 years ago
Only 6% of all the tile URLs were classified by the UP classification we use in the Interest Dashboard (ID).

The list of categorized sites is here: https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.sites_categorized
Statistics per UP category are here: https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

The UP classifier will perform poorly on just domains or hosts.
Also note that site impressions cannot simply be added to get the impression count for a category: several sports sites may occur on the same new-tab page.
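
To illustrate the overcounting, here is a rough sketch that assumes we had per-page-view records listing which tiles were shown together. That input shape, the function name `category_reach`, and the example domains are all hypothetical. It compares the naive sum of site impressions with the number of distinct new-tab pages on which a category appeared at least once.

```python
from collections import defaultdict


def category_reach(page_views, site_category):
    """Return (summed, distinct) impression counts per category.

    page_views: list of lists, each inner list holding the sites shown
                on one new-tab page (assumed input shape)
    site_category: dict mapping site -> category
    """
    summed = defaultdict(int)    # naive sum of site impressions
    distinct = defaultdict(int)  # pages showing the category at least once
    for tiles in page_views:
        seen = set()
        for site in tiles:
            category = site_category.get(site)
            if category is None:
                continue
            summed[category] += 1
            seen.add(category)
        for category in seen:
            distinct[category] += 1
    return dict(summed), dict(distinct)


# Illustrative data only: two sports sites share the first page.
site_category = {"sports-a.example": "sports", "sports-b.example": "sports"}
page_views = [
    ["sports-a.example", "sports-b.example"],
    ["sports-a.example"],
]
summed, distinct = category_reach(page_views, site_category)
```

Here the summed count for sports is 3 while only 2 pages actually showed a sports tile, which is the gap that matters for audience sizing.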

We need a better strategy for audience sizing, and to do so we need to understand audience segmentation.
I am closing this bug as it's formally fixed, but we MUST talk about how we segment and size the audience.
This deserves a separate story bug, in my opinion.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Reporter)

Comment 4

2 years ago
What's the coverage weighted by impression? E.g., espn.go.com has 30789 impressions and was categorized but some.random.site.com with 1 impression wasn't categorized.
(Assignee)

Comment 5

2 years ago
These will be columns 3 and 5 of https://people.mozilla.org/~mzhilyaev/tiles/en-US.US.cats_stats

category,sites,impressions,% of total sites,% of total impressions
UNCATEGORIZED,38043,5191729,93.38,86.66
fashion,7,61,0.02,0.01
...
sports,395,111758,0.97,15.94


The number in the third column is the sum of all site impressions falling into the category.
The number in the fifth column is the category's impression sum as a % of the total impression sum.

All uncategorized sites were put into the UNCATEGORIZED category, which covers 93.38% of all sites and 86.66% of the impression sum across all sites.
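
As a quick check, the categorized share can be read off as the complement of the UNCATEGORIZED row. The sketch below assumes that every site falls either into UNCATEGORIZED or into at least one UP category; the helper name `categorized_share` is hypothetical.

```python
def categorized_share(uncategorized_pct):
    """Return the categorized percentage given the UNCATEGORIZED percentage."""
    return round(100 - uncategorized_pct, 2)


# Percentages taken from the UNCATEGORIZED row above.
site_coverage_pct = categorized_share(93.38)        # share of unique sites
impression_coverage_pct = categorized_share(86.66)  # share weighted by impressions
```

Under that assumption, roughly 6.62% of sites and 13.34% of impressions were categorized.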