Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection

NEW
Unassigned

Status

Content Services Graveyard
Tiles
P1
normal
3 years ago
3 years ago

People

(Reporter: mruttley, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: .?)

(Reporter)

Description

3 years ago
In order to perform inventory projection (as well as several other experiments), we need lots of data about newtab usage. 

The holy grail of data that I would like per day is:

{
    'day': '2015-03-17',
    'users': 
        [
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        ]
}

This would solve basically every data need I have right now.
(Reporter)

Updated

3 years ago
Summary: Anonymized newtab URL+Time stats per day for Inventory Projection → Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection
(Reporter)

Updated

3 years ago
Iteration: --- → 39.2 - 23 Mar

Comment 1

3 years ago
We don't store the full set of tiles for any given user, for privacy and data-practices reasons. We only store aggregate single-site and co-occurrence site data.
(Reporter)

Comment 2

3 years ago
Sorry, got the brackets wrong. This is more correct:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
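
For illustration, a structure of this shape could be assembled from raw per-user click events roughly as follows. This is only a sketch: `build_daily_report`, the `(user_id, url)` event tuples, and the `dwell_times` mapping are hypothetical names that do not come from any existing tiles or telemetry code, and the unit of the dwell time is left unspecified, as it is above.

```python
from collections import Counter, defaultdict


def build_daily_report(day, click_events, dwell_times):
    """Assemble the per-day structure sketched above.

    click_events: iterable of (user_id, url) pairs, one per click (hypothetical input).
    dwell_times: dict mapping user_id to new tab page dwell time (hypothetical input).
    """
    # Count clicks per user and per URL.
    clicks = defaultdict(Counter)
    for user_id, url in click_events:
        clicks[user_id][url] += 1

    users = {}
    # Include users that have clicks, a dwell time, or both.
    for user_id in sorted(set(clicks) | set(dwell_times)):
        users[user_id] = {
            'tiles': [
                {'url': url, 'clicks_today': count}
                for url, count in sorted(clicks[user_id].items())
            ],
            'new_tab_page_dwell_time': dwell_times.get(user_id, 0),
        }
    return {'day': day, 'users': users}


events = (
    [('anonymous_user_123', 'http://www.domain.com')] * 4
    + [('anonymous_user_123', 'http://www.domain2.com')] * 2
)
report = build_daily_report('2015-03-17', events, {'anonymous_user_123': 123})
```

Keying `users` by the anonymous ID (a dict rather than a list) is what the bracket correction above amounts to.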
(Reporter)

Comment 3

3 years ago
Mardak: OK, how about this: we ship LICA in telemetry (it is very small) and report this instead, which is even more anonymous:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
(Reporter)

Comment 4

3 years ago
Actually, locale would be great as well: 


{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
                'locale': 'en-US'
            },
            ...etc
        }
}
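
To make the idea concrete, here is a rough sketch of how a client could reduce its tiles to category counts so that no URLs are reported. It does not use LICA itself: `DOMAIN_CATEGORIES` is a made-up lookup table standing in for the classifier payload, and `categorize_tiles` and the `.example` domains are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical domain-to-category table standing in for a real classifier payload.
DOMAIN_CATEGORIES = {
    'baseball.example': 'sports/baseball',
    'sports.example': 'sports/general',
}


def categorize_tiles(tile_urls, domain_categories=DOMAIN_CATEGORIES):
    """Return {category: tile_count}; URLs themselves are not part of the result."""
    counts = Counter()
    for url in tile_urls:
        host = (urlsplit(url).hostname or '').lower()
        # Treat "www.domain" and "domain" as the same site.
        if host.startswith('www.'):
            host = host[4:]
        counts[domain_categories.get(host, 'unknown')] += 1
    return dict(counts)


tile_categories = categorize_tiles([
    'http://www.baseball.example/scores',
    'http://sports.example/',
    'http://www.domain.com',
])
```

Anything the table does not recognize falls into 'unknown', which is how a count like 'unknown': 5 would arise.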

Comment 5

3 years ago
Locale and country are something we already have with the telemetry data. Having LICA doesn't help in projecting inventory sizes of a set of sites. Do we even know whether using LICA in the context of new tab data makes sense?

I don't see why this bug is employee only, so opening it up.
Group: mozilla-employee-confidential
OS: Mac OS X → All
Hardware: x86 → All

Updated

3 years ago
Whiteboard: .?

Updated

3 years ago
Iteration: 39.2 - 23 Mar → 39.3 - 30 Mar

Comment 6

3 years ago
This requires shipping UP in a telemetry experiment.
Porting UP into telemetry could be wildly resisted by the telemetry people.
So, I am unsure why this is required at this stage, and which immediate biz benefits it brings.

Currently AdGroups are just lists of targeted sites, and we already collect sufficient data to predict inventory for that. We need Kevin to provide clearer direction on why we need it now.
Flags: needinfo?(kghim)
(Reporter)

Comment 7

3 years ago
I'm suggesting shipping a classifier so then we can further anonymize the data.

I wouldn't know who the data was from, or what sites people are looking at - just the counts from the classifications of their 9 tiles. I don't think that's too bad.  

The latest most accurate LICA would be 35kbs compressed including payload. 

We need it now so I can accurately predict revenue for ad campaigns for our team. It is very high priority.

Comment 8

3 years ago
(In reply to Matthew Ruttley [:mruttley] from comment #7)
> I'm suggesting shipping a classifier so then we can further anonymize the
> data.

We can't do it for this telemetry experiment.  Perhaps in later experiments.

> 
> The latest most accurate LICA would be 35kbs compressed including payload. 

How was accuracy of LICA measured?

> We need it now so I can accurately predict revenue for ad campaigns for our
> team. It is very high priority.

AdGroups are currently just lists of sites. For the current incarnation of the product, the data provided by the next telemetry experiment will be sufficient for inventory estimation from AdGroups of interest.

Comment 9

3 years ago
(In reply to maxim zhilyaev from comment #6)
> This requires shipping UP in telemetry experiment.
> Porting UP into telemetry could be wildly resisted by telemetry people.
> So, I am unsure why this is required at this stage, and which immediate biz
> benefits it brings?
> 
> Currently AdGroups are just lists of targeted sites and we already collect
> sufficient data to predict inventory for that. Need Kevin to provide clearer
> directions on why we need it now.

It seems the data we receive needs to be consistent and to accurately reflect the audience to which we're going to display Suggested Tiles. Per Bug 1135738, we need 100% of the sample. The analysis, and the eventual tool to be built into Zenko, needs to use Mozcat (or IAB) categories. In the tool, the data will be sliced by criteria that can be packaged as audiences for agencies: projections based on categories, date ranges, and geo at a minimum. I imagine this data would need to be updated at least every 2 weeks to be current and relevant.
Flags: needinfo?(kghim)

Comment 10

3 years ago
(In reply to Kevin Ghim from comment #9)
> I imagine this data would need to be updated at least every 2 weeks
> to be current and relevant.

Our telemetry experiment runs for two weeks.
Since we want to use the full beta, no other experiments will be able to run while ours runs.
If we run an experiment every 2 weeks, then NO other experiment will ever run :)
We have to be somewhat reasonable with the telemetry channel: perhaps every other month or something.
OR we build something of our own, using existing tiles infrastructure.

Updated

3 years ago
Iteration: 39.3 - 30 Mar → 40.1 - 13 Apr

Comment 11

3 years ago
This requires a larger discussion about putting a classification engine (which rules, etc.) into Firefox before we can implement it and get the data.
Assignee: mzhilyaev → nobody
Iteration: 40.1 - 13 Apr → ---