Closed Bug 1306457 Opened 9 years ago Closed 9 years ago

Implement the whitelist for Stub Attribution `source` field

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ckprice, Assigned: oremj)

References

Details

We will need a whitelist for the `source` field used in Stub Attribution. oremj has noted that it would be best to implement and maintain the whitelist in the service being built in https://bugzilla.mozilla.org/show_bug.cgi?id=1273940. NI :cmore to supply the initial whitelist.
Flags: needinfo?(chrismore.bugzilla)
I have a proposed whitelist here in the sheet "Sources with business interest": https://docs.google.com/spreadsheets/d/1U-0JHpc3INJnBwTFkdrPqpk7yqvpEn16DJtp6d6TRiY/edit#gid=1389272572

It is based on a few criteria:
* we are doing advertising with the organization
* we have a partnership with the organization
* it is a Mozilla property

Legal is taking this list to outside counsel to get their opinion on whether we can proceed.
Flags: needinfo?(chrismore.bugzilla)
Can that list be public?
Flags: needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #2)
> Can that list be public?

I don't see why not. I don't want to share this specific spreadsheet, because it contains download metrics, but the list can be copy/pasted here. Let me see if legal has an update and I can paste it here.
Flags: needinfo?(chrismore.bugzilla)
NI :cmore to provide a final, public whitelist. Note: for anything that is not in the whitelist, we'll want to send `(other)` in the source field.
Flags: needinfo?(chrismore.bugzilla)
Here's the preliminary list we can use to build out the whitelist pattern matching on the source= field. All of these values will be in the source= value sent via www.mozilla.org to the stub service. I wrote them as regular expressions to capture the patterns these organizations use in their DNS.

Our sites and Mozilla community sites:
^.*mozilla.*.+$
^.*firefox.*.+$

Search engines + SEM:
^.+\.google\..+$
^.+\.bing\..+$
^.+\.yahoo\..+$
^.*yandex\..+$
^.+\.baidu\..+$
^.+\.taobao\..+$

Other advertising:
^.+\.youtube\..+$
^.+\.facebook\..+$
^.*twitter\..+$
^.*united-internet\..+$
^.+\.and1\..+$
^.+\.gmx\..+$
^www\.mail\..+$
^.+\.aol\..+$
^.+\.qwant\..+$
^.+\.seznam\.cz$
^.+\.toshiba\..+$
^.+\.kongregate\..+$
^.+\.ea\..+$

There are also a few utm_source values that need to be whitelisted, as they are our vanity domains and in-browser referral traffic:
^firefox\-com$
^getfirefox\-com$
^firefox\-browser$

Anything that doesn't match the regex patterns above should have a source equal to this string: "(other)"
Flags: needinfo?(chrismore.bugzilla)
NI :oremj -- does this look okay to you?
Flags: needinfo?(oremj)
Is it possible to make the regexes more restrictive? As it is now, it still seems really open. For example, I could send something like badthings.ea.somesiteidontwanttrackingme.com. Do we have a current source list? How many unique sources do we have? I'd rather whitelist everything exclusively than have regexes with ".+" or ".*" in them.
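The over-matching concern can be checked against one of the proposed patterns. This is a minimal sketch in Python, chosen only for illustration; the helper name `matches_ea` is hypothetical and is not taken from the stub attribution service.

```python
import re

# Pattern copied verbatim from the proposed list (the "ea" entry).
EA_PATTERN = re.compile(r"^.+\.ea\..+$")

def matches_ea(source):
    """Return True when the proposed pattern accepts the source value."""
    return EA_PATTERN.match(source) is not None

# A source seen in analytics is accepted...
print(matches_ea("help.ea.com"))  # True
# ...but so is any hostname that merely contains ".ea." in the middle.
print(matches_ea("badthings.ea.somesiteidontwanttrackingme.com"))  # True
```

Because `.+` on both sides accepts arbitrary text, the pattern constrains only that `.ea.` appears somewhere in the middle of the value.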
Flags: needinfo?(oremj)
If that's not possible, a mapping might be a good choice. Example:
^.+\.google\..+$ => "google"
^.+\.bing\..+$ => "bing"
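One way to read the mapping idea is sketched below in Python. The two patterns are the ones given in the example; `SOURCE_MAP`, `normalize_source`, and the use of "(other)" as the fallback are assumptions made for illustration, not the service's actual code.

```python
import re

# Ordered (pattern, normalized name) pairs. Only the two patterns from the
# example are included; the names here are hypothetical.
SOURCE_MAP = [
    (re.compile(r"^.+\.google\..+$"), "google"),
    (re.compile(r"^.+\.bing\..+$"), "bing"),
]
OTHER = "(other)"

def normalize_source(source):
    """Map a raw source to a fixed label, or to OTHER when nothing matches."""
    for pattern, name in SOURCE_MAP:
        if pattern.match(source):
            return name
    return OTHER

print(normalize_source("www.google.com"))  # google
print(normalize_source("cn.bing.com"))     # bing
print(normalize_source("example.org"))     # (other)
```

With this approach a loose pattern can still match an unexpected hostname, but only the fixed label is stored, never the attacker-supplied string itself.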
The regexes above spit out these sources (from Google Analytics): www.google.com www.bing.com firefox-com facebook.com www.yahoo.com taobao.com youtube.com getfirefox-com support.google.com kongregate.com addons.mozilla.org www.google.com mozilla.jp go.mail.ru support.mozilla.org mozilla.de myaccount.google.com www.seznam.cz mozilla.cz firefox.cz www.aol.com photos.google.com testpilot.firefox.com firefox-browser br.search.yahoo.com mail.ru bing.com us.search.yahoo.com mx.search.yahoo.com uk.search.yahoo.com es.search.yahoo.com help.ea.com fr.search.yahoo.com firefox.mozilla.cz messenger.yahoo.com hangouts.google.com firefox.de mozilla.fi it.search.yahoo.com ca.search.yahoo.com in.search.yahoo.com qwant.com de.search.yahoo.com ph.search.yahoo.com developer.mozilla.org id.search.yahoo.com us.yhs4.search.yahoo.com love.mail.ru au.search.yahoo.com plus.url.google.com mozilla.hu suche.gmx.net mozilla.pl hk.search.yahoo.com ar.search.yahoo.com vn.search.yahoo.com plus.google.com nl.search.yahoo.com th.search.yahoo.com mozilla.com www.youtube.com malaysia.search.yahoo.com mozilla.ro my.mail.ru co.search.yahoo.com start.toshiba.com mozilla.lt tw.search.yahoo.com mozilla.sk mozilla.si se.search.yahoo.com mozilla.ch navigator-bs.gmx.fr br.yhs4.search.yahoo.com navigator-bs.gmx.com cto.mail.ru cn.bing.com mozilla.rs pl.search.yahoo.com pe.search.yahoo.com no.search.yahoo.com e.mail.ru ve.search.yahoo.com us-mg6.mail.yahoo.com extensions.aol.com cl.search.yahoo.com at.search.yahoo.com hacks.mozilla.org fi.search.yahoo.com sg.search.yahoo.com lite.qwant.com hello.firefox.com tr.search.yahoo.com mail.google.com images.tanks.mail.ru dk.search.yahoo.com mg.mail.yahoo.com maktoob.search.yahoo.com partnerads.ysm.yahoo.com www.google.de fr.yhs4.search.yahoo.com takeout.google.com mail.de www.google.fr espanol.search.yahoo.com mx.yhs4.search.yahoo.com firefox.no us-mg5.mail.yahoo.com ro.search.yahoo.com www.google.it activations.cdn.mozilla.net tweetdeck.twitter.com otvet.mail.ru
firefox.org ch.search.yahoo.com uk.yhs4.search.yahoo.com scholar.google.com www.google.es encrypted.google.com talkgadget.google.com ru.search.yahoo.com mail.aol.com www.google.pl answers.yahoo.com tanks.mail.ru suche.gmx.at qc.search.yahoo.com nz.search.yahoo.com mozilla.ee www.google.ro help.mail.ru gr.search.yahoo.com www.google.ca fr-mg42.mail.yahoo.com bienvenido.toshiba.com accounts.firefox.com thunderbird.mozilla.cz start.new.toshiba.com search.1and1.com poseidon.navigator-bs.gmx.com malaysia.yhs4.search.yahoo.com in.yhs4.search.yahoo.com id.yhs4.search.yahoo.com id.messenger.yahoo.com hk.messenger.yahoo.com www.google.sr www.google.se www.google.be global.bing.com firefox.si co.yhs4.search.yahoo.com se.yhs4.search.yahoo.com navigator-bs.gmx.es www.google.dz www.google.bg es-mg42.mail.yahoo.com en-maktoob.search.yahoo.com email.seznam.cz br.answers.yahoo.com
Can I use comment 9 as the whitelist?
Updated list (the leading www's were missing): www.google.com www.bing.com firefox-com facebook.com www.yahoo.com taobao.com youtube.com getfirefox-com support.google.com kongregate.com addons.mozilla.org www.google.com mozilla.jp go.mail.ru support.mozilla.org mozilla.de myaccount.google.com www.seznam.cz mozilla.cz firefox.cz www.aol.com photos.google.com testpilot.firefox.com firefox-browser br.search.yahoo.com mail.ru bing.com us.search.yahoo.com mx.search.yahoo.com uk.search.yahoo.com es.search.yahoo.com help.ea.com fr.search.yahoo.com firefox.mozilla.cz messenger.yahoo.com hangouts.google.com firefox.de mozilla.fi it.search.yahoo.com ca.search.yahoo.com in.search.yahoo.com www.qwant.com de.search.yahoo.com ph.search.yahoo.com developer.mozilla.org id.search.yahoo.com us.yhs4.search.yahoo.com love.mail.ru au.search.yahoo.com plus.url.google.com mozilla.hu suche.gmx.net mozilla.pl hk.search.yahoo.com ar.search.yahoo.com vn.search.yahoo.com plus.google.com nl.search.yahoo.com th.search.yahoo.com mozilla.com www.youtube.com malaysia.search.yahoo.com mozilla.ro my.mail.ru co.search.yahoo.com start.toshiba.com mozilla.lt tw.search.yahoo.com mozilla.sk mozilla.si se.search.yahoo.com mozilla.ch navigator-bs.gmx.fr br.yhs4.search.yahoo.com navigator-bs.gmx.com cto.mail.ru cn.bing.com mozilla.rs pl.search.yahoo.com pe.search.yahoo.com no.search.yahoo.com e.mail.ru ve.search.yahoo.com us-mg6.mail.yahoo.com extensions.aol.com cl.search.yahoo.com at.search.yahoo.com hacks.mozilla.org fi.search.yahoo.com sg.search.yahoo.com lite.qwant.com hello.firefox.com tr.search.yahoo.com mail.google.com images.tanks.mail.ru dk.search.yahoo.com mg.mail.yahoo.com maktoob.search.yahoo.com partnerads.ysm.yahoo.com www.google.de fr.yhs4.search.yahoo.com takeout.google.com mail.de www.google.fr espanol.search.yahoo.com mx.yhs4.search.yahoo.com firefox.no us-mg5.mail.yahoo.com ro.search.yahoo.com www.google.it activations.cdn.mozilla.net tweetdeck.twitter.com otvet.mail.ru firefox.org
ch.search.yahoo.com uk.yhs4.search.yahoo.com scholar.google.com www.google.es encrypted.google.com talkgadget.google.com ru.search.yahoo.com mail.aol.com www.google.pl answers.yahoo.com tanks.mail.ru suche.gmx.at qc.search.yahoo.com nz.search.yahoo.com mozilla.ee www.google.ro help.mail.ru gr.search.yahoo.com www.google.ca fr-mg42.mail.yahoo.com bienvenido.toshiba.com accounts.firefox.com thunderbird.mozilla.cz start.new.toshiba.com search.1and1.com poseidon.navigator-bs.gmx.com malaysia.yhs4.search.yahoo.com in.yhs4.search.yahoo.com id.yhs4.search.yahoo.com id.messenger.yahoo.com hk.messenger.yahoo.com www.google.sr www.google.se www.google.be global.bing.com firefox.si co.yhs4.search.yahoo.com se.yhs4.search.yahoo.com navigator-bs.gmx.es www.google.dz www.google.bg es-mg42.mail.yahoo.com en-maktoob.search.yahoo.com email.seznam.cz br.answers.yahoo.com
(In reply to Jeremy Orem [:oremj] from comment #10)
> Can I use comment 9 as the whitelist?

I think it could work for now. My concern is that doing it this way, without regular expressions, means more maintenance of the list, since the subdomains on some of these will probably change over time. For example, with br.yhs4.search.yahoo.com, what if it changes to yhs5? I do get that writing the regexes is more complex, since they must not go so wide that they capture www.mywebsiteismoreawesomethangoogle.com.
In that case, can we do what I suggested in comment 8? That way we can still use regexes, but will normalize the value down.
(In reply to Jeremy Orem [:oremj] from comment #13)
> In that case, can we do what I suggested in comment 8? That way we can still
> use regexes, but will normalize the value down.

Chris ^^^ (I'm just trying to help move this along, so I can continue testing further/deeper - thanks!)
Flags: needinfo?(chrismore.bugzilla)
I will take an action item to map each regex to a string that represents the domain. For now, :oremj will use the static list of domains in comment 11. We can change it later to a regex mapping, but let's not block on moving forward.
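The static-list approach agreed on here amounts to an exact membership test. The Python sketch below uses a handful of entries copied from the static list; `STATIC_WHITELIST` and `is_whitelisted` are hypothetical names, and the real implementation lives in the stubattribution repository.

```python
# A few entries copied from the static list; the full list is much longer.
STATIC_WHITELIST = frozenset({
    "www.google.com",
    "www.bing.com",
    "firefox-com",
    "br.yhs4.search.yahoo.com",
})

def is_whitelisted(source):
    """Exact, case-sensitive membership test; no pattern matching."""
    return source in STATIC_WHITELIST

print(is_whitelisted("br.yhs4.search.yahoo.com"))  # True
print(is_whitelisted("br.yhs5.search.yahoo.com"))  # False
```

The second call shows the maintenance trade-off raised earlier: a renamed subdomain is rejected until someone adds it to the list.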
Flags: needinfo?(chrismore.bugzilla) → needinfo?(oremj)
Assignee: nobody → oremj
Flags: needinfo?(oremj)
:cmore, I've implemented this whitelist in https://github.com/mozilla-services/stubattribution/pull/23 r?
Flags: needinfo?(chrismore.bugzilla)
r+ Looks good to me for an MVP. We can always revise later. Thanks!
Flags: needinfo?(chrismore.bugzilla)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Here are a few extra sources that don't have any TLDs:
google
bing
firefox-com
yahoo
yandex
ask
seznam
aol
I did some more testing today, and it looks like "facebook" as a source shows up without any TLDs, too:

Steps:
1) loaded https://www.facebook.com/Firefox/about/
2) clicked on https://mzl.la/292SfT5, which 301s to https://www.mozilla.org/en-US/firefox/new/?utm_source=facebook&utm_medium=social&utm_content=facebook-about-bio&utm_campaign=firefox
3) following through to the end-user flow on Mozilla.org (on www-demo4), I get the following exception in Sentry[0]:

could not validate attribution_code: source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook-about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist

See http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1 for the above

need-info? :cmore just to triple-check that I'm correct, here, before we add "facebook" as a source.

[0] https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/
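The attribution_code in the Sentry message is a percent-encoded query string, so the rejected value can be read off by decoding it. Here is a minimal sketch using Python's standard library; `decode_attribution_code` is a hypothetical helper, not a function from the service.

```python
from urllib.parse import parse_qs, unquote

def decode_attribution_code(code):
    """Percent-decode the code, then split it into key/value fields."""
    decoded = unquote(code)
    # parse_qs returns a list of values per key; keep the first one.
    return {key: values[0] for key, values in parse_qs(decoded).items()}

# The code exactly as it appears in the exception above.
CODE = (
    "source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox"
    "%26content%3Dfacebook-about-bio%26timestamp%3D1483170532"
)

fields = decode_attribution_code(CODE)
print(fields["source"])  # facebook
print(fields["medium"])  # social
```

The decoded `source` is the bare string "facebook", which is why an exact-match list containing only "facebook.com" rejects it.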
Status: RESOLVED → REOPENED
Flags: needinfo?(chrismore.bugzilla)
Resolution: FIXED → ---
(In reply to Stephen Donner [:stephend] from comment #20)
> I did some more testing today, and it looks like "facebook" as a source
> shows up without any TLDs, too:
>
> Steps:
> 1) loaded https://www.facebook.com/Firefox/about/
> 2) clicked on https://mzl.la/292SfT5, which 301s to
> https://www.mozilla.org/en-US/firefox/new/
> ?utm_source=facebook&utm_medium=social&utm_content=facebook-about-
> bio&utm_campaign=firefox
> 3) following through to the end-user flow on Mozilla.org (on www-demo4), I
> get the following exception in Sentry[0]:
>
> could not validate attribution_code:
> source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook
> -about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist
>
> See
> http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1
> for the above
>
> need-info? :cmore just to triple-check that I'm correct, here, before we add
> "facebook" as a source.
>
> [0]
> https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/

Most of the time it is facebook.com, but it's possible that people sometimes forget to add the TLD. Let's add "facebook" to the whitelist.
Flags: needinfo?(chrismore.bugzilla)
More valid whitelist domains:
r.search.yahoo.com
www.mozilla.org
www.google.com.br
www.google.com.ec
www.google.com.mx

or just whitelist them all:
^www.google.com.\w+$
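Note that the dots in the suggested pattern are unescaped, so each one matches any character rather than a literal ".". The Python sketch below compares it with an escaped variant; `STRICT` is an assumed tightening, not a pattern anyone proposed in this bug.

```python
import re

# Pattern as written in the comment: every "." is a wildcard.
LOOSE = re.compile(r"^www.google.com.\w+$")
# Assumed stricter variant with the dots escaped.
STRICT = re.compile(r"^www\.google\.com\.\w+$")

for source in ("www.google.com.br", "www-google.com.br", "wwwXgoogleXcomXbr"):
    print(source, bool(LOOSE.match(source)), bool(STRICT.match(source)))
# www.google.com.br True True
# www-google.com.br True False
# wwwXgoogleXcomXbr True False
```

Neither form matches plain `www.google.com`, since both require a trailing country segment.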
Another valid whitelist source to be updated: fx36start
oremj: another addition to the whitelist in comment 23.
Flags: needinfo?(oremj)
These are all showing up in the logs. Are they valid?
firefox,89
"contra.pentagames.net",47
vk,47
fx36start,45
"oferta.senasofiaplus.edu.co",37
"www.minijuegos.com",37
bubblewitch3,31
"king.com",31
"int.search.myway.com",25
"update.org",25
browser,24
"www.kongregate.com",24
cdn,22
"contractwarsgame.com",21
ec,21
prod4,21
snippet,17
"www.clickjogos.com.br",15
"pro.rarom.ro",14
com,13
"diggerworld.ru",12
"www.nplay.com",12
"search.seznam.cz",11
"www.macrojuegos.com",11
"www.viamichelin.fr",11
about,10
home,10
"www.miniclip.com",10
"cse.google.com",9
"duckduckgo.com",9
Flags: needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #25)
> These are all showing up in the logs, are they valid:
> firefox,89
valid
> "contra.pentagames.net",47
not valid
> vk,47
not valid (maybe in the future)
> fx36start,45
valid, that's us.
> "oferta.senasofiaplus.edu.co",37
not valid
> "www.minijuegos.com",37
not valid
> bubblewitch3,31
not valid
> "king.com",31
not valid
> "int.search.myway.com",25
not valid
> "update.org",25
not valid
> browser,24
not valid (not sure what this is)
> "www.kongregate.com",24
valid
> cdn,22
not valid. probably generic cdn traffic.
> "contractwarsgame.com",21
not valid
> ec,21
not valid
> prod4,21
not valid
> snippet,17
valid (Firefox's snippets)
> "www.clickjogos.com.br",15
not valid
> "pro.rarom.ro",14
not valid
> com,13
not valid
> "diggerworld.ru",12
not valid
> "www.nplay.com",12
not valid
> "search.seznam.cz",11
valid
> "www.macrojuegos.com",11
not valid
> "www.viamichelin.fr",11
not valid
> about,10
valid (I think this is source=about-home in this one and the next)
> home,10
valid (firefox's about-home)
> "www.miniclip.com",10
not valid
> "cse.google.com",9
valid
> "duckduckgo.com",9
valid, search partner in Firefox
Flags: needinfo?(chrismore.bugzilla)
Also, if we wanted to double-check/add more in-product utm_source values, we could grab from here: https://dxr.mozilla.org/mozilla-central/search?q=utm_source&redirect=false
:oremj: here are two more that are valid:
firefox-dev-tools
directory-tiles
Flags: needinfo?(chrismore.bugzilla)
oremj: here's another source that will need to be whitelisted for a partnership test that we will be running soon: softonic.com
Flags: needinfo?(oremj)
oremj: three more including comment 33:
chip.de
www.chip.de
www.softonic.com

may want to do:
^.*chip\.de$
^.*softonic\.com$

Chip is a partner in Germany. Thanks!
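The suggested `^.*chip\.de$` accepts any hostname that merely ends in "chip.de", not only chip.de and its subdomains. The Python sketch below contrasts it with a tighter form; `TIGHT` is an assumption offered for comparison and was not proposed in this bug.

```python
import re

# Pattern as suggested in the comment above.
SUGGESTED = re.compile(r"^.*chip\.de$")
# Assumed tighter form: the bare domain, or anything ending in ".chip.de".
TIGHT = re.compile(r"^(?:.+\.)?chip\.de$")

for source in ("chip.de", "www.chip.de", "microchip.de"):
    print(source, bool(SUGGESTED.match(source)), bool(TIGHT.match(source)))
# chip.de True True
# www.chip.de True True
# microchip.de True False
```

The optional `(?:.+\.)?` group forces a dot before "chip.de" whenever a prefix is present, which is what excludes unrelated domains such as microchip.de.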
The following list is the current top 20 blocked sources. Would you like me to add any while I'm at it?

oferta.senasofiaplus.edu.co
beinconnect.es
www.macrojuegos.com
contra.pentagames.net
vk
www.minijuegos.com
ok.ru
cdn
contractwarsgame.com
ec
prod4
update.org
browser
int.search.myway.com
yandex.ua
desktop
snippet
watch.nowtv.com
bubblewitch3
king.com
Flags: needinfo?(oremj) → needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #35)
> The following list is the current top 20 blocked sources. Would you like me
> to add any while I'm at it?
>
> oferta.senasofiaplus.edu.co
no
> beinconnect.es
no
> www.macrojuegos.com
no
> contra.pentagames.net
no
> vk
no
> www.minijuegos.com
no
> ok.ru
no
> cdn
hmm. I don't know if this is our CDN or something else. Any other values coming through on source = cdn?
> contractwarsgame.com
no
> ec
no
> prod4
no
> update.org
no
> browser
yes
> int.search.myway.com
no
> yandex.ua
yes
> desktop
yes
> snippet
yes
> watch.nowtv.com
no
> bubblewitch3
no
> king.com
no
Flags: needinfo?(chrismore.bugzilla)
oremj: also add ^.+\.wikipedia\.org$
oremj: new partner to add: toshiba.com www.toshiba.com or ^.+\.toshiba\.com$
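The two alternatives in this request are not equivalent: the regex requires at least one subdomain label, so it does not cover the bare toshiba.com. A quick check in Python, where only the pattern comes from the comment and the variable name is hypothetical:

```python
import re

# Pattern copied from the comment above.
TOSHIBA = re.compile(r"^.+\.toshiba\.com$")

for source in ("toshiba.com", "www.toshiba.com", "start.toshiba.com"):
    print(source, bool(TOSHIBA.match(source)))
# toshiba.com False
# www.toshiba.com True
# start.toshiba.com True
```

If the bare domain is also expected as a source, it would need its own entry alongside the regex.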
Flags: needinfo?(oremj)
Jeremy: in comment 33 and comment 34, I asked for softonic.com and www.softonic.com (or a regex matching both) to be added to the whitelist. We have been running a partner experiment for the past month with source=softonic.com, and I can't find the data in attribution. So I went to our whitelist at https://github.com/mozilla-services/stubattribution/blob/master/attributioncode/sourcewhitelist.go#L211 and noticed that the whitelist entry is not a regex and is just set to www.softonic.com. The current experiment points to this URL: https://www.mozilla.org/firefox/new/?utm_source=softonic.com&utm_campaign=fx-download-baseline&utm_medium=referral&utm_content=fx-download-page which doesn't contain the www, which is why I originally requested both with and without the www to ensure this doesn't happen. That means it is likely that we have been discarding this data. Can you add softonic.com without the www (or make it a regex)? Also, can you check the exception log to confirm that softonic.com (without www) has been blocked? Thanks
Flags: needinfo?(oremj)
Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is showing up in the logs as blocked. Added here: https://github.com/mozilla-services/stubattribution/pull/56 Let's start filing new bugs, github issues or github PRs for whitelist change requests. The length of this bug is starting to make tracking difficult.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Flags: needinfo?(oremj)
Resolution: --- → FIXED
(In reply to Jeremy Orem [:oremj] from comment #41)
> Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is
> showing up in the logs as blocked.
>
> Added here: https://github.com/mozilla-services/stubattribution/pull/56
>
> Let's start filing new bugs, github issues or github PRs for whitelist
> change requests. The length of this bug is starting to make tracking
> difficult.

Thanks, Jeremy. I will file new bugs for any updates to the whitelist.
Hey Jeremy. Sorry for another question; it is related to the above and not really a new bug. Question: for sources that are rejected by the whitelist, do they get attribution values of "unknown", or are they completely thrown away?
Flags: needinfo?(oremj)
They are completely thrown away. In other words, stub attribution service gives them an unmodified stub installer.
Flags: needinfo?(oremj)
(In reply to Jeremy Orem [:oremj] from comment #44)
> They are completely thrown away. In other words, stub attribution service
> gives them an unmodified stub installer.

:oremj: the original requirements stated that non-whitelisted sources should be set to "(other)" instead of null. How big of a change would this be? I can file a new bug if this is a change that can be made.
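The difference between the two behaviors discussed here can be sketched in a few lines of Python. Everything below is hypothetical (names, the two-entry whitelist, the use of `ValueError`); it only illustrates rejecting an unknown source versus rewriting it to "(other)".

```python
OTHER = "(other)"

# Tiny stand-in whitelist; real entries would come from the lists in this bug.
WHITELIST = frozenset({"www.google.com", "softonic.com"})

def validate_strict(source):
    """Behavior described in the quoted reply: unknown sources are rejected."""
    if source not in WHITELIST:
        raise ValueError(f"source: {source} is not in whitelist")
    return source

def validate_with_fallback(source):
    """Behavior from the original requirement: unknown sources become OTHER."""
    return source if source in WHITELIST else OTHER

print(validate_with_fallback("softonic.com"))  # softonic.com
print(validate_with_fallback("king.com"))      # (other)
```

With the fallback variant the download would still be attributed, just without revealing the unlisted source value.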
Flags: needinfo?(oremj)
Blocks: 1827985