Closed Bug 1144790 Opened 9 years ago Closed 1 month ago

Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection

Categories

(Content Services Graveyard :: Tiles, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: mruttley, Unassigned)

Details

(Whiteboard: .?)

In order to perform inventory projection (as well as several other experiments), we need lots of data about newtab usage. 

The holy grail of data that I would like per day is:

{
    'day': '2015-03-17',
    'users': 
        [
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        ]
}

This would solve basically every data need I have right now.
Summary: Anonymized newtab URL+Time stats per day for Inventory Projection → Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection
Iteration: --- → 39.2 - 23 Mar
We don't store the full set of tiles for any given user, for privacy and data-practice reasons. We only store aggregate single-site and co-occurrence site data.
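For context, aggregating single-site and co-occurrence counts while discarding per-user rows can be sketched roughly like this (a minimal illustration only, not the actual Tiles pipeline; all names here are hypothetical):

```python
from collections import Counter
from itertools import combinations

def aggregate(site_lists):
    """Collapse per-user tile site lists into single-site and
    unordered-pair co-occurrence counts; individual user rows
    are discarded after counting."""
    singles = Counter()
    pairs = Counter()
    for sites in site_lists:
        uniq = sorted(set(sites))          # de-duplicate within a user
        singles.update(uniq)
        pairs.update(combinations(uniq, 2))  # unordered site pairs
    return singles, pairs

# Example: two users' tile sites
singles, pairs = aggregate([
    ["a.com", "b.com"],
    ["a.com", "c.com"],
])
```

Only the two Counters would be retained; nothing ties a count back to a user.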
Sorry, got the brackets wrong. This is more correct:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
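Assembling the corrected structure could look something like the sketch below (field names follow the example above; the function and its inputs are hypothetical, not an existing API):

```python
def build_payload(day, per_user_stats):
    """Assemble the proposed daily payload.
    per_user_stats maps an anonymous user id to a tuple of
    (clicks_by_url dict, dwell_time in some unit)."""
    users = {}
    for anon_id, (clicks_by_url, dwell_time) in per_user_stats.items():
        users[anon_id] = {
            "tiles": [
                {"url": url, "clicks_today": clicks}
                for url, clicks in clicks_by_url.items()
            ],
            "new_tab_page_dwell_time": dwell_time,
        }
    return {"day": day, "users": users}

payload = build_payload("2015-03-17", {
    "anonymous_user_123": ({"http://www.domain.com": 4}, 123),
})
```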
Mardak: OK how about this - we ship LICA in telemetry (it is very small) and make this, even more anonymously:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
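The anonymization step proposed here amounts to running each tile URL through a local classifier and reporting only category tallies, so raw URLs never leave the client. A sketch, with `classify_url` standing in for LICA (both the function and the stub mapping are hypothetical):

```python
from collections import Counter

def categorize_tiles(urls, classify_url):
    """Replace per-URL data with category counts; the URLs
    themselves are dropped before anything is reported."""
    counts = Counter(classify_url(u) or "unknown" for u in urls)
    return dict(counts)

# Stub classifier standing in for LICA; unknown URLs map to None
stub = {"http://mlb.com": "sports/baseball"}.get
cats = categorize_tiles(["http://mlb.com", "http://example.com"], stub)
```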
Actually, locale would be great as well: 


{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
                'locale': 'en-US'
            },
            ...etc
        }
}
Locale and country are something we already have in the telemetry data. Having LICA doesn't help in projecting inventory sizes for a set of sites. Do we even know whether using LICA on new tab data even makes sense?

I don't see why this bug is employee only, so opening it up.
Group: mozilla-employee-confidential
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: .?
Iteration: 39.2 - 23 Mar → 39.3 - 30 Mar
This requires shipping UP in a telemetry experiment.
Porting UP into telemetry could be strongly resisted by the telemetry people.
So I am unsure why this is required at this stage, and what immediate business benefits it brings.

Currently AdGroups are just lists of targeted sites, and we already collect sufficient data to predict inventory for that. Need Kevin to provide clearer direction on why we need it now.
Flags: needinfo?(kghim)
I'm suggesting shipping a classifier so that we can further anonymize the data.

I wouldn't know who the data was from, or which sites people are looking at: just the counts from the classifications of their 9 tiles. I don't think that's too bad.

The latest, most accurate LICA would be 35 KB compressed, including payload.

We need it now so I can accurately predict revenue for ad campaigns for our team. It is very high priority.
(In reply to Matthew Ruttley [:mruttley] from comment #7)
> I'm suggesting shipping a classifier so that we can further anonymize the
> data.

We can't do it for this telemetry experiment.  Perhaps in later experiments.

> 
> The latest, most accurate LICA would be 35 KB compressed, including payload.

How was the accuracy of LICA measured?

> We need it now so I can accurately predict revenue for ad campaigns for our
> team. It is very high priority.

AdGroups are currently just lists of sites. For the current incarnation of the product, the data provided by the next telemetry experiment will be sufficient for the purpose of inventory estimation from AdGroups of interest.
(In reply to maxim zhilyaev from comment #6)
> This requires shipping UP in a telemetry experiment.
> Porting UP into telemetry could be strongly resisted by the telemetry people.
> So I am unsure why this is required at this stage, and what immediate
> business benefits it brings.
> 
> Currently AdGroups are just lists of targeted sites, and we already collect
> sufficient data to predict inventory for that. Need Kevin to provide clearer
> direction on why we need it now.

It seems the data we receive needs to be consistent and to accurately reflect the audience to whom we're going to display Suggested Tiles. Per bug 1135738, we need 100% of the sample. The type of analysis, and the eventual tool that needs to be built into Zenko, needs to use Mozcat (or IAB) categories. In the tool, the data will be sliced by criteria that can be packaged for agencies as audiences: projections based on categories, date ranges, and geo at a minimum. I imagine this data would need to be updated at least every 2 weeks to be current and relevant.
Flags: needinfo?(kghim)
(In reply to Kevin Ghim from comment #9)
> I imagine this data would need to be updated at least every 2 weeks
> to be current and relevant.

Our telemetry experiment runs for two weeks.
Since we want to use the full beta population, no other experiments will be able to run while ours runs.
If we run an experiment every 2 weeks, then NO other experiment will ever run :)
We have to be somewhat reasonable with the telemetry channel; perhaps every other month or something.
OR we build something of our own, using the existing Tiles infrastructure.
Iteration: 39.3 - 30 Mar → 40.1 - 13 Apr
This requires some larger discussion around putting a classification engine, its rules, etc. into Firefox before we can implement this and get the data.
Assignee: mzhilyaev → nobody
Iteration: 40.1 - 13 Apr → ---
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → INCOMPLETE