Closed Bug 1144790 Opened 9 years ago Closed 1 month ago

Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection

Categories

(Content Services Graveyard :: Tiles, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: mruttley, Unassigned)

Details

(Whiteboard: .?)

In order to perform inventory projection (as well as several other experiments), we need lots of data about newtab usage. 

The holy grail of data that I would like per day is:

{
    'day': '2015-03-17',
    'users': 
        [
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        ]
}

This would solve basically every data need I have right now.
Summary: Anonymized newtab URL+Time stats per day for Inventory Projection → Telemetry Data: Anonymized newtab URL+Time stats per day for Inventory Projection
Iteration: --- → 39.2 - 23 Mar
We don't store the full set of tiles for any given user, for privacy and data-practice reasons. We only store aggregate single-site and co-occurrence site data.
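For context, aggregating single-site and co-occurrence counts while discarding per-user rows can be sketched roughly like this (a minimal illustration only, not the actual Tiles pipeline; all names here are hypothetical):

```python
from collections import Counter
from itertools import combinations

def aggregate(site_lists):
    """Collapse per-user tile site lists into single-site and
    unordered-pair co-occurrence counts; individual user rows
    are discarded after counting."""
    singles = Counter()
    pairs = Counter()
    for sites in site_lists:
        uniq = sorted(set(sites))          # de-duplicate within a user
        singles.update(uniq)
        pairs.update(combinations(uniq, 2))  # unordered site pairs
    return singles, pairs

# Example: two users' tile sites
singles, pairs = aggregate([
    ["a.com", "b.com"],
    ["a.com", "c.com"],
])
```

Only the two Counters would be retained; nothing ties a count back to a user.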
Sorry, got the brackets wrong. This is more correct:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tiles': [
                    {
                        'url': 'http://www.domain.com',
                        'clicks_today': 4
                    },{
                        'url': 'http://www.domain2.com',
                        'clicks_today': 2
                    },
                    ...etc
                ],
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
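Assembling the corrected structure could look something like the sketch below (field names follow the example above; the function and its inputs are hypothetical, not an existing API):

```python
def build_payload(day, per_user_stats):
    """Assemble the proposed daily payload.
    per_user_stats maps an anonymous user id to a tuple of
    (clicks_by_url dict, dwell_time in some unit)."""
    users = {}
    for anon_id, (clicks_by_url, dwell_time) in per_user_stats.items():
        users[anon_id] = {
            "tiles": [
                {"url": url, "clicks_today": clicks}
                for url, clicks in clicks_by_url.items()
            ],
            "new_tab_page_dwell_time": dwell_time,
        }
    return {"day": day, "users": users}

payload = build_payload("2015-03-17", {
    "anonymous_user_123": ({"http://www.domain.com": 4}, 123),
})
```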
Mardak: OK how about this - we ship LICA in telemetry (it is very small) and make this, even more anonymously:

{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
            },
            ...etc
        }
}
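The anonymization step proposed here amounts to running each tile URL through a local classifier and reporting only category tallies, so raw URLs never leave the client. A sketch, with `classify_url` standing in for LICA (both the function and the stub mapping are hypothetical):

```python
from collections import Counter

def categorize_tiles(urls, classify_url):
    """Replace per-URL data with category counts; the URLs
    themselves are dropped before anything is reported."""
    counts = Counter(classify_url(u) or "unknown" for u in urls)
    return dict(counts)

# Stub classifier standing in for LICA; unknown URLs map to None
stub = {"http://mlb.com": "sports/baseball"}.get
cats = categorize_tiles(["http://mlb.com", "http://example.com"], stub)
```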
Actually, locale would be great as well: 


{
    'day': '2015-03-17',
    'users': 
        {
            'anonymous_user_123': {
                'tile_categories': {
                    'sports/baseball': 1,
                    'sports/general': 1,
                    'technology & computing/computer programming': 1,
                    'folklore/astrology': 1,
                    'unknown': 5
                },
                'new_tab_page_dwell_time': 123,
                'locale': 'en-US'
            },
            ...etc
        }
}
Locale and country are something we already have in the telemetry data. Having LICA doesn't help in projecting inventory sizes for a set of sites. Do we even know whether using LICA on new tab data even makes sense?

I don't see why this bug is employee only, so opening it up.
Group: mozilla-employee-confidential
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: .?
Iteration: 39.2 - 23 Mar → 39.3 - 30 Mar
This requires shipping UP in a telemetry experiment.
Porting UP into telemetry could be strongly resisted by the telemetry people.
So I am unsure why this is required at this stage, and what immediate business benefits it brings.

Currently AdGroups are just lists of targeted sites, and we already collect sufficient data to predict inventory for that. Need Kevin to provide clearer direction on why we need it now.
Flags: needinfo?(kghim)
I'm suggesting shipping a classifier so that we can further anonymize the data.

I wouldn't know who the data was from, or which sites people are looking at: just the counts from the classifications of their 9 tiles. I don't think that's too bad.

The latest, most accurate LICA would be 35 KB compressed, including payload.

We need it now so I can accurately predict revenue for ad campaigns for our team. It is very high priority.
(In reply to Matthew Ruttley [:mruttley] from comment #7)
> I'm suggesting shipping a classifier so that we can further anonymize the
> data.

We can't do it for this telemetry experiment.  Perhaps in later experiments.

> 
> The latest, most accurate LICA would be 35 KB compressed, including payload.

How was the accuracy of LICA measured?

> We need it now so I can accurately predict revenue for ad campaigns for our
> team. It is very high priority.

AdGroups are currently just lists of sites. For the current incarnation of the product, the data provided by the next telemetry experiment will be sufficient for the purpose of inventory estimation from AdGroups of interest.
(In reply to maxim zhilyaev from comment #6)
> This requires shipping UP in a telemetry experiment.
> Porting UP into telemetry could be strongly resisted by the telemetry people.
> So I am unsure why this is required at this stage, and what immediate
> business benefits it brings.
> 
> Currently AdGroups are just lists of targeted sites, and we already collect
> sufficient data to predict inventory for that. Need Kevin to provide clearer
> direction on why we need it now.

It seems the data we receive needs to be consistent and to accurately reflect the audience to whom we're going to display Suggested Tiles. Per bug 1135738, we need 100% of the sample. The type of analysis, and the eventual tool that needs to be built into Zenko, needs to use Mozcat (or IAB) categories. In the tool, the data will be sliced by criteria that can be packaged for agencies as audiences: projections based on categories, date ranges, and geo at a minimum. I imagine this data would need to be updated at least every 2 weeks to be current and relevant.
Flags: needinfo?(kghim)
(In reply to Kevin Ghim from comment #9)
> I imagine this data would need to be updated at least every 2 weeks
> to be current and relevant.

Our telemetry experiment runs for two weeks.
Since we want to use the full beta population, no other experiments will be able to run while ours runs.
If we run an experiment every 2 weeks, then NO other experiment will ever run :)
We have to be somewhat reasonable with the telemetry channel; perhaps every other month or something.
OR we build something of our own, using the existing Tiles infrastructure.
Iteration: 39.3 - 30 Mar → 40.1 - 13 Apr
This requires some larger discussion around putting a classification engine, its rules, etc. into Firefox before we can implement this and get the data.
Assignee: mzhilyaev → nobody
Iteration: 40.1 - 13 Apr → ---
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → INCOMPLETE