Closed Bug 1132534 Opened 9 years ago Closed 9 years ago

Decide on how many sites should be in the related list for Related Tiles

Categories

(Content Services Graveyard :: Tiles, defect)

defect
Not set
normal
Points:
8

Tracking

(Not tracked)

RESOLVED FIXED
Iteration:
39.1 - 9 Mar

People

(Reporter: Mardak, Assigned: Mardak)

References

Details

(Whiteboard: .001)

For a Related Tile, it'll be able to specify a list of related sites so that if a particular user has visited one of those sites, Firefox shows the related tile.
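As a rough illustration of the matching described above (a sketch only: `RelatedTile` and `should_show` are hypothetical names, not taken from the Firefox source), a related tile can be modeled as a tile id plus a set of related sites, shown when the user's history overlaps that set:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelatedTile:
    # Hypothetical structure: an id plus the sites that make the tile eligible.
    tile_id: str
    related_sites: frozenset


def should_show(tile, visited_sites):
    """Return True if the user has visited at least one related site."""
    return not tile.related_sites.isdisjoint(visited_sites)


tile = RelatedTile("related-tile1", frozenset({"irs.gov", "taxact.com", "hrblock.com"}))
print(should_show(tile, {"example.com", "taxact.com"}))  # True
print(should_show(tile, {"example.com"}))                # False
```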

One concern is that if the list of related sites is small, e.g., 1 site, a user clicking on the tile potentially reveals having visited that site to both Mozilla (through the click ping identifying the tile by id) and the destination page (if there's some tile-specific url, e.g., foo.com/?src=related-tile1).

We can try to reduce this revealing-ness and protect user privacy by having more sites in the related list. One simple proposal has been to require at least 5 sites. But then why not 10? And even with a minimum number, the list could be gamed, e.g., [realtarget.com, fakesite1.com, fakesite2.com, ...], although initially we'll have people reviewing the list of related sites for business reasons (e.g., we only allow targeting based on user action - doing taxes - and not inferred user demographic - male, 20-40yo, $100k+ income).

Alternatively, there was a hand-wavy proposal based on the expected probability that a given site would trigger the related tile to appear: no one site should have more than 50% probability relative to the other sites in the list. This potentially just complicates things without actually protecting user privacy.
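To make the two proposals concrete, here is a minimal sketch that checks a related list against both a minimum size and the 50% cap. The per-site trigger probabilities are made-up numbers and `check_related_list` is a hypothetical helper. The cap is read here as each site's share of the summed probabilities, which is only one possible interpretation of the proposal.

```python
def check_related_list(trigger_prob, min_sites=5, max_share=0.5):
    """Check a related list against a minimum size and a per-site share cap.

    trigger_prob maps each site to the (assumed) probability that it is the
    site that makes the tile appear for a user.
    """
    total = sum(trigger_prob.values())
    if total <= 0:
        return False, {}
    shares = {site: p / total for site, p in trigger_prob.items()}
    ok = len(trigger_prob) >= min_sites and max(shares.values()) <= max_share
    return ok, shares


# Gamed list: one real target padded with sites almost nobody visits.
gamed = {"realtarget.com": 0.30, "fakesite1.com": 0.001, "fakesite2.com": 0.001,
         "fakesite3.com": 0.001, "fakesite4.com": 0.001}
ok, shares = check_related_list(gamed)
print(ok)  # False: realtarget.com accounts for nearly all of the probability
```

The padded list would pass a count-only rule of 5 sites but fails the share cap, which is the gaming concern raised above.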

mmc, do you have thoughts or know who could help us decide on a number or policy of how many sites should be in the related list?
Flags: needinfo?(mmc)
Sorry, I don't have enough context. What's the proposal for how related tiles are collected for a given site?
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #1)
> Sorry, I don't have enough context. What's the proposal for how related
> tiles are collected for a given site?
We're more likely to approach this as finding a set of related sites for a given tile than the other way around of finding several related tiles for a given site (as it's more difficult to make multiple tiles than selecting multiple related sites).

There are two possible sources of related tiles: 1) from within Mozilla to create Mozilla affiliate tiles (linking to Mozilla properties) or organic tiles (many users probably want to go to site X if they've been to site Y based on telemetry data) and 2) externally paid/sponsored tiles.

In either case, it sounds like we want to avoid revealing the fact that the user has been to a specific site when the user clicks on a tile that was shown based on the user's browsing behaviors.

We're thinking that related tiles requests (either internal or external) will provide suggestions on sites to target or types of sites, e.g., "US tax related" or irs.gov. It would then be up to Mozilla to protect user privacy by adding in more tax/irs.gov related sites to the list such as turbotax.intuit.com, taxact.com, hrblock.com to try to obfuscate / allow for user deniability.

If the related tile request only has 1 site, we could just reply with "please give us more related sites because we require more to run the related tile," or we could do some work in finding appropriate related sites.

I realize this has various overhead, but we can optimize and automate some aspects of this once we see how people try to use related tiles.
Oh, one thing I didn't mention before but is probably pretty important:

The current thinking around reporting tile impressions/clicks is that we will report whether a related tile is shown but not how/why it was shown. I.e., we do not report back relatedTileA was shown because the user visited siteX.
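A minimal sketch of what such a report might look like. The field names and the `build_ping` helper are assumptions for illustration, not the actual ping format; the point is only that the site that triggered the tile never enters the payload.

```python
import json


def build_ping(tile_id, action, matched_site=None):
    """Build a report for an impression or click on a tile.

    matched_site is accepted only to make the point explicit: it is
    deliberately left out of the payload.
    """
    if action not in ("view", "click"):
        raise ValueError("unknown action: " + action)
    return json.dumps({"tile_id": tile_id, "action": action}, sort_keys=True)


print(build_ping("relatedTileA", "click", matched_site="siteX"))
# {"action": "click", "tile_id": "relatedTileA"}
```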
I wonder if the added complexity of this 'click obfuscation scheme' even protects the user at all.  Bad actors don't need 'proof' that a user clicked on a tile (which our system wouldn't provide, even without this scheme), they only need actionable probability.  

The scheme definitely has the disadvantage of making the campaign *more* difficult to optimize, which, I fear, would lead to less engaging and lower performing ads.

Our ad-operations staff will (presumably) be rewarded for enhancing the performance and engagement of the campaigns we run, and this scheme seems to put an obstacle in front of them.
(In reply to Tim Spurway [:tspurway] from comment #4)
> I wonder if the added complexity of this 'click obfuscation scheme' even
> protects the user at all.
Indeed, I think mmc would agree that we don't want to go down a path of doing something that makes things more difficult or unusable for ourselves while not even protecting users.
I chatted a little with merwin about this. I don't think that setting a minimum number of related sites has a noticeable effect on the privacy aspects of this feature. Recommender systems in general are extremely leaky (check out http://33bits.org/2011/05/24/you-might-also-like-privacy-risks-of-collaborative-filtering/ for a cool example), and I think that even with zero information about URLs, category information is enough to give a reasonably good picture about what site the user has visited. For example, Amazon tops the Alexa Shopping sites and so anyone who bought a related tile from us would be able to infer that the user had probably visited amazon.com or ebay.com (http://www.alexa.com/topsites/category/Top/Shopping). The inclusion of sears.com doesn't really obfuscate that much.

However, I also don't think that it makes a lot of sense to promise partners that their tile will only be shown in relation to a specific set of URLs. Just considering correctness, we want to be able to update that list at will in case of error or in case our category information improves.

In the other direction where partners give us a list of related URLs for their tile, I would be very surprised if they didn't give us lots and lots. We are basically introducing a new keyword spam problem (http://en.wikipedia.org/wiki/Keyword_stuffing) so I'd expect partners to over-list related URLs.

One thing that hasn't been mentioned thus far is what happens if a tiles provider is able to accumulate enough interest data over time to uniquely identify a person. I believe the best "fix" for this is to require them to not use this data for other purposes than improving the tiles experience (e.g. not sell it to data brokers, or combine it with data from other services for other purposes) and have short retention periods, similar to what EFF's DNT policy states: https://www.eff.org/dnt-policy

So, my recommendations are:
1) Minimum number of related tiles does not buy us much
2) Map related tiles by category information rather than exposing a full list of related URLs for flexibility reasons
3) Get our partners to agree not to use interest data revealed from tiles for non-tiles purposes
Flags: needinfo?(mmc)
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #6)
> So, my recommendations are:
> 1) Minimum number of related tiles does not buy us much
> 2) Map related tiles by category information rather than exposing a full
> list of related URLs for flexibility reasons
> 3) Get our partners to agree not to use interest data revealed from tiles
> for non-tiles purposes
To be clear, at the end of the day, Firefox is matching a related tile based on sites. So having a category can give us some flexibility in what sites we match, but inspecting what makes up a category (from source code or the server response) will reveal a list of sites.

For example, if we have a "US tax filing" category, there's only so many sites that are practical to include to have decent precision (the site is actually a good indicator of the category) and coverage (how many users will even have the site).


> However, I also don't think that it makes a lot of sense to promise partners
> that their tile will only be shown in relation to a specific set of URLs.
On the business side of things, jterry has mentioned that partners would want to target a single specific site, but I don't have intuition around whether partners would be X% less likely to work with us or pay us Y% less money because we won't guarantee showing on specific sites. But we're okay with leaving money on the table if we can shift behaviors in the advertising market.

But even with that, we can probably still ask partners what types of sites or categories they want to target and be clear with them that it's up to Mozilla to decide when a related tile will actually show up for users. (As this is our product and partners have to deal with our terms.)

> In the other direction where partners give us a list of related URLs for
> their tile, I would be very surprised if they didn't give us lots and lots.
We've discussed this in a separate context of avoiding "run of network" type relatedness, where we would not allow for people to target google, facebook, yahoo, etc. just because they appear in many Firefox users' top sites.


> One thing that hasn't been mentioned thus far is what happens if a tiles
> provider is able to accumulate enough interest data over time to uniquely
> identify a person.
Uniquely identifying a user is somewhat tricky for us to prevent even if we don't use interest data because if the destination site has any cookie-like data on the user either directly or indirectly through 3rd parties, e.g., a script to facebook or iframe to google, that site could probably have enough data to do all sorts of things.

And this is actually somewhat desired. E.g., an amazon related tile showing prime membership promotions probably will show something different for users who have been using the membership already vs haven't signed up.
(In reply to Ed Lee :Mardak from comment #7)
> 
> > In the other direction where partners give us a list of related URLs for
> > their tile, I would be very surprised if they didn't give us lots and lots.
> We've discussed this in a separate context of avoiding "run of network" type
> relatedness, where we would not allow for people to target google, facebook,
> yahoo, etc. just because they appear in many Firefox users' top sites.
> 

We need to be clear on the Related Tiles ad product offering in promoting the relatedness to both the user and the client. RON or Spamdexing (http://en.wikipedia.org/wiki/Spamdexing) type of targeting offers little value for the user and is not aligned with the Consideration/Intent funnel this product is addressing.
(In reply to Kevin Ghim from comment #8)
> We need to be clear on the Related Tiles ad product offering in promoting
> the relatedness to both the user and the client. RON or Spamdexing
> (http://en.wikipedia.org/wiki/Spamdexing) type of targeting offers little
> value for the user and is not aligned with the Consideration/Intent funnel
> this product is addressing.

IIRC, "spamdexing" and "keyword stuffing" are techniques used by an unscrupulous publisher to fool a search engine and are typically not used by advertisers (even unscrupulous ones!).  We exert and maintain control over what sites (and/or keywords) are used by advertisers in the Tiles product, and are obviously going to optimize for quality, performance and engagement. 

In a 'long tail' situation, where we sell Tiles in an unsupervised, self-serve format, we would need to implement automated schemes to reward high quality campaigns.  Google's AdWords product has a "quality score" that is used to control/reward highly performing/engaging campaigns.  I'd recommend we adopt something similar.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #6)
> a noticeable effect on the privacy aspects of this feature
Could you elaborate on what the desired privacy aspects are? What is okay to be revealed about browsing behaviors and what is not?

maksik and I had an interesting chat about what we are actually trying to protect. As you pointed out, recommendation systems are leaky, but user actions are pretty leaky as well. ;)

Having categories or multiple relatedness criteria would still reveal that the user matched that criteria, so we were thinking, would it be possible to obfuscate across users by randomly showing a related tile that didn't actually match the user's behavior? (Somewhat similar to having more people using more encryption helps protect others.)

But even for users who randomly see a "related" TurboTax tile, a user reveals that s/he was interested in taxes by clicking.

Alternatively, we could just randomly have Firefoxes generate fake clicks of the related tile independent of the user's behaviors. But this would incur costs for all parties as well as possible unintended consequences of cookies being set, etc.
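As a back-of-the-envelope check on the "randomly show to non-matching users" idea, the sketch below applies Bayes' rule to estimate how much an impression still reveals. The 10% match rate and the decoy rates are made-up numbers, and the model assumes matching users always see the tile.

```python
def prob_matched_given_shown(match_rate, decoy_rate):
    """P(user really matched | tile was shown), by Bayes' rule.

    Assumes matching users always see the tile and non-matching users see
    it with probability decoy_rate.
    """
    shown = match_rate + (1.0 - match_rate) * decoy_rate
    return match_rate / shown if shown > 0 else 0.0


for decoy_rate in (0.0, 0.05, 0.5):
    print(decoy_rate, round(prob_matched_given_shown(0.10, decoy_rate), 3))
```

With no decoys an impression implies a match with certainty. Showing the tile to 5% of non-matching users lowers that to roughly 0.69. This only covers impressions: as noted above, a click still reveals interest on its own.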
Flags: needinfo?(mmc)
Clearing since we're meeting about this Tuesday. 

(In reply to Ed Lee :Mardak from comment #10)
> (In reply to [:mmc] Monica Chew (please use needinfo) from comment #6)
> > a noticeable effect on the privacy aspects of this feature
> Could you elaborate on what the desired privacy aspects are? What is okay to
> be revealed about browsing behaviors and what is not?

The bar I was aiming at is, if we're going to do this at all, then does imposing requirements on the number of related tiles substantially reduce the amount of information that the user leaks? I think the answer is no.

(In reply to Ed Lee :Mardak from comment #7)
> (In reply to [:mmc] Monica Chew (please use needinfo) from comment #6)
> > One thing that hasn't been mentioned thus far is what happens if a tiles
> > provider is able to accumulate enough interest data over time to uniquely
> > identify a person.
> Uniquely identifying a user is somewhat tricky for us to prevent even if
> we don't use interest data because if the destination site has any
> cookie-like data on the user either directly or indirectly through 3rd
> parties, e.g., a script to facebook or iframe to google, that site could
> probably have enough data to do all sorts of things.
> 
> And this is actually somewhat desired. E.g., an amazon related tile showing
> prime membership promotions probably will show something different for users
> who have been using the membership already vs haven't signed up.

Customization based on past behavior is exactly what DNT is supposed to prevent :) I am not suggesting that partners never customize; only that they don't customize or accumulate history for users with DNT on, which is exactly what the EFF policy states. We can talk more about this on Tuesday.
Flags: needinfo?(mmc)
For the record, I love reading thru this.  Good conversations here.

Let's use Booking.com as an example here as they are a current client.  I'm going to pretend that I'm their Chief Marketing Officer and making a recommendation to target air travelers.  I'll want to target the following sites:

1.  Expedia
2.  Travelocity
3.  Orbitz
4.  Delta.com
5.  AA.com
6.  JetBlue.com
7.  UnitedAirways.com
8.  Lufthansa.com

I'll batch those 8 URLs over to the Content Services team and ask to have my Booking.com tile show whenever one of those 8 appears (or some derivative) in a user's Tiles.

Why do I do this?
Because I believe that based on previous behavior/interest, the user will also want to know about Booking.com.

What will I look for?
I'll want to measure this based on the amount of conversions I'm receiving.

Advanced Measurement:  Ideally, Mozilla would pass me the related site (Delta.com or any of the 8 above) that triggered the tile so I can understand what is working for Booking.com.  This could be a real marketing insight.  If Delta.com converts 3x better than UnitedAirways.com, then I can leverage that.

** Note - Mozilla might not be comfortable providing this, but if we do in aggregate, I don't see why not.  We could give advertiser dashboards to show which sites (maybe over a certain size?) perform best.
(In reply to Darren Herman from comment #12)
> For the record, I love reading thru this.  Good conversations here.
> 
> Let's use Booking.com as an example here as they are a current client.  I'm
> going to pretend that I'm their Chief Marketing Officer and making a
> recommendation to target air travelers.  I'll want to target the following
> sites:
> 
> 1.  Expedia
> 2.  Travelocity
> 3.  Orbitz
> 4.  Delta.com
> 5.  AA.com
> 6.  JetBlue.com
> 7.  UnitedAirways.com
> 8.  Lufthansa.com
> 
> I'll batch those 8 URLs over to the Content Services team and ask to have
> my Booking.com tile show whenever one of those 8 appears (or some derivative)
> in a user's Tiles.

2 thoughts:

1. You will also care very much about the status of Booking.com in the user's history and its frecency relative to the others.

2. Will you care if one of the others shows up in the user's Tiles?  That's pretty much an artefact of how we make the new tab page.  Ideally, you want to show up when the user starts to have intent in your market.  That's the dream.  In the current state, you want to show up before a period of frequent travel site usage - how near to that period of usage is gated by our ability to analyse the user's intent and the permission we have to use their data to perform that analysis.  The current product would allow us to deliver a message to a user of travel sites.
The conclusion is that we definitely want at least 5 and that we will try to reach as much of the desired audience as possible. E.g., if a Client is trying to reach US income tax related sites and gives us an initial site list of hrblock.com and turbotax.intuit.com, we'll expand on that list by adding sites like irs.gov so that we make recommendations to as many users who are likely to be interested.

Something that wouldn't be acceptable would be only adding random.tax.blog.com and small.tax.site.com if those sites don't have significant traffic, and we would end up not addressing the majority of the tax audience.

Perhaps trying to do this visually:
[TTTTTTTTTTTTTTTTTTTT] 100% of potential tax audience

[CCCC................] ~20% from initial list of sites a Client provides - NOT ok
[CCCCMMMMMMMM........] ~60% mozilla-augmented list - OK
[CCCCWW..............] ~30% with a weakly augmented list - NOT ok

Another privacy criterion is that we don't want to select audiences that are too small or too targeted.
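The coverage picture above can be sketched numerically. The visitor sets below are invented to reproduce the ~20% / ~60% / ~30% bars, and `coverage` is a hypothetical helper; real figures would have to come from aggregate data.

```python
def coverage(site_list, visitors_by_site, audience):
    """Fraction of the audience that has visited at least one listed site."""
    reached = set()
    for site in site_list:
        reached |= visitors_by_site.get(site, set())
    return len(reached & audience) / len(audience)


audience = set(range(20))  # 20 hypothetical users interested in taxes
visitors_by_site = {
    "hrblock.com": {0, 1, 2},
    "turbotax.intuit.com": {2, 3},
    "irs.gov": set(range(4, 12)),
    "small.tax.site.com": {12, 13},
}
client = ["hrblock.com", "turbotax.intuit.com"]
print(coverage(client, visitors_by_site, audience))                           # 0.2
print(coverage(client + ["irs.gov"], visitors_by_site, audience))             # 0.6
print(coverage(client + ["small.tax.site.com"], visitors_by_site, audience))  # 0.3
```

Using sets of visitors rather than summing per-site traffic avoids double-counting users who visit more than one listed site.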
Assignee: nobody → edilee
Status: NEW → RESOLVED
Iteration: --- → 39.1 - 9 Mar
Points: --- → 8
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: .? → .001
Blocks: 1136461
For reference, the FB microtargeting attack I mentioned during today's meeting is http://repository.cmu.edu/cgi/viewcontent.cgi?article=1066&context=jpc
Comment 14 was sort of helpful in understanding and parsing this bug, but I'm still struggling to understand this a bit.

In my mind there are three lists of interest.  

List A is a collection of sites from a user's browser history that might be related to a category of interest.

List B is a set of sites that represent a category of interest (similar to likes in the cmu paper?).

List C is the set of sites that we might choose from to send the user.

Is this a good way to think about it?  If so, what are the min and max numbers of sites that we are thinking will come into play as a user interacts with this feature?
hit save too soon.

From comparing A and B we get matches that help to construct C, and out of C we get the particular tile that the user sees and might click on.
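One possible reading of the three lists, as a sketch. The category names, site lists and the `candidate_tiles` helper are all hypothetical, and nothing in this thread confirms that the feature is structured this way.

```python
def candidate_tiles(history_sites, category_sites, tiles_by_category):
    """Return tile destinations (list C) for categories the history matches."""
    candidates = []
    for category, sites in category_sites.items():   # list B, per category
        if history_sites & sites:                     # overlap with list A
            candidates.extend(tiles_by_category.get(category, []))
    return candidates


history = {"delta.com", "news.example", "expedia.com"}
category_sites = {"air travel": {"expedia.com", "delta.com", "aa.com"},
                  "us taxes": {"irs.gov", "hrblock.com"}}
tiles_by_category = {"air travel": ["booking.com"],
                     "us taxes": ["turbotax.intuit.com"]}
print(candidate_tiles(history, category_sites, tiles_by_category))  # ['booking.com']
```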
Blocks: 1155443
No longer blocks: 1155443