Closed Bug 1136234 Opened 11 years ago Closed 11 years ago

Estimate amount of overcounting/undercounting of impressions for a set of sites

Categories

(Content Services Graveyard :: Tiles, defect)

Type: defect
Priority: Not set
Severity: normal
Points: 13

Tracking

(Not tracked)

RESOLVED FIXED
Iteration: 39.1 - 9 Mar

People

(Reporter: Mardak, Assigned: mzhilyaev)

References

Details

(Whiteboard: .006)

Because we normally only save co-occurrence (pairwise) data and individual site data, it's hard to compute the total number of impressions for a set of sites that has 3 or more items. The difficulty comes from needing to subtract the impressions that overlap multiple sites, e.g., the impression count of siteA OR siteB = countA + countB - countAB. For 3 sites, we would need countA + countB + countC - countAB - countAC - countBC + countABC, but we don't have countABC. The upper bound would be just countA + countB + countC. The lower bound would be countA + countB + countC - countAB - countAC - countBC.
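To make the bounds above concrete, here is a minimal Python sketch. The impression data, the site names, and the helper `union_bounds` are all hypothetical and only illustrate the arithmetic; this is not how the Tiles pipeline stores or processes data.

```python
from itertools import combinations


def union_bounds(counts, pair_counts):
    """Bound the impressions for a set of sites using only
    per-site counts and pairwise co-occurrence counts."""
    total = sum(counts.values())
    pair_total = sum(pair_counts.values())
    upper = total               # ignores all overlap
    lower = total - pair_total  # subtracts every pairwise overlap
    return lower, upper


# Hypothetical raw data: each entry is the set of sites shown in one impression.
impressions = [
    {"siteA"}, {"siteB"}, {"siteC"},
    {"siteA", "siteB"}, {"siteA", "siteC"},
    {"siteA", "siteB", "siteC"},
]
sites = ["siteA", "siteB", "siteC"]

# What we normally save: per-site counts and pairwise co-occurrence counts.
counts = {s: sum(s in imp for imp in impressions) for s in sites}
pair_counts = {
    (a, b): sum(a in imp and b in imp for imp in impressions)
    for a, b in combinations(sites, 2)
}

# What we cannot normally compute: the true number of impressions
# containing at least one site from the set.
actual = sum(any(s in imp for s in sites) for imp in impressions)

lower, upper = union_bounds(counts, pair_counts)
print(f"lower={lower} actual={actual} upper={upper}")
```

In this toy data the single impression containing all three sites (countABC = 1) is exactly the gap between the lower bound and the actual value.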
Assignee: nobody → mzhilyaev
Iteration: --- → 39.1 - 9 Mar
Points: --- → 13
Theoretical and practical considerations are here: https://docs.google.com/a/mozilla.com/document/d/10CLpuWzlL4nkvzKO2niKP_sOilK6lmC3iueiZ7k8qGo

For a list of N sites, the impression limits are:

  Upper    = SUM(sites count) - MAX(sites co-occurrence)
  Lower    = SUM(sites count) - SUM(sites co-occurrence)
  Expected = Lower + (Upper - Lower) / 2
  Error    = (Upper - Lower) / (2 * Expected)

I ran tests using edrules categories. The results for the 10 most populous categories are below:

  Category      Total   Upper   Lower   Expected   Error (%)
  Politics      33869   31929   27726   29828      13.5476733271
  Sports        26040   24685   22904   23795       9.43475520067
  Movies        24580   24030   23649   23840       3.10402684564
  Television    20378   20038   19877   19958       2.10441928049
  Technology    18693   18064   14151   16108      16.0479264962
  Soccer        13151   11923   11669   11796      11.486944727
  Football      12866   12796   12749   12773       0.72809833242
  Business      10550   10208    8600    9404      12.1863037006
  Video-Games    9641    9219    8418    8819       9.32078466946
  Humor          7017    6902    6646    6775       3.57195571956

Note that the Error values listed in the table equal (Total - Expected) / Expected, i.e., how far Total exceeds Expected, rather than the half-range formula above.

The average per-category error is 6.5%. Some categories have one or two sites that generate most of the impressions, in which case the estimation error is low. Other categories (Technology and Politics) have numerous sites each contributing a significant number of impressions to the total, in which case the estimation error goes up, as those sites also tend to co-occur frequently. It's possible that Technology or Politics sites are highly correlated and tend to be more of case 1, where the same user (newtab) generates impressions containing more than 2 sites in the list, but we have no way of knowing. Under the uniform assumption, the estimation error could be large in this case.
Also note:
- a 15% error could perhaps be acceptable
- the numbers could be severely distorted by the one-user-making-many-newtab-impressions issue
- a frequency cap could change the numbers significantly
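The formulas in this comment can be sketched in a few lines of Python. The function names `impression_bounds` and `summarize` and the toy counts are assumptions made for illustration; only the Politics row (Total 33869, Upper 31929, Lower 27726) comes from the results above. The sketch returns both the half-range error and the relative overcount (Total - Expected) / Expected, since the reported Error values are numerically consistent with the latter.

```python
def impression_bounds(site_counts, pair_counts):
    """Return (total, upper, lower) for a list of sites.

    total = SUM(sites count)
    upper = total - MAX(sites co-occurrence)
    lower = total - SUM(sites co-occurrence)
    """
    total = sum(site_counts.values())
    cooccurrences = list(pair_counts.values())
    upper = total - max(cooccurrences, default=0)
    lower = total - sum(cooccurrences)
    return total, upper, lower


def summarize(total, upper, lower):
    """Return (expected, half_range_error, overcount) as fractions."""
    expected = lower + (upper - lower) / 2
    half_range_error = (upper - lower) / (2 * expected)
    overcount = (total - expected) / expected
    return expected, half_range_error, overcount


# Hypothetical toy counts, not taken from the bug.
site_counts = {"siteA": 4, "siteB": 3, "siteC": 3}
pair_counts = {
    ("siteA", "siteB"): 2,
    ("siteA", "siteC"): 2,
    ("siteB", "siteC"): 1,
}
total, upper, lower = impression_bounds(site_counts, pair_counts)
toy = summarize(total, upper, lower)

# Politics row from the results above.
politics = summarize(33869, 31929, 27726)
print(f"expected={politics[0]:.1f} "
      f"half-range={politics[1]:.2%} overcount={politics[2]:.2%}")
```

With the rounded Expected from the table (29828), (33869 - 29828) / 29828 is about 13.5477%, which is the Error value listed for Politics.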
Resolving this bug as fixed. If further research is needed, please reopen.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Measure amount of overcounting/undercounting of impressions compared to actual → Estimate amount of overcounting/undercounting of impressions for a set of sites
Whiteboard: .? → .006
Blocks: 1136977