Closed
Bug 1306457
Opened 8 years ago
Closed 7 years ago
Implement the whitelist for Stub Attribution `source` field
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ckprice, Assigned: oremj)
We will need a whitelist for the `source` field used in Stub Attribution. oremj has noted that it would be best to implement and maintain the whitelist in the service being built in https://bugzilla.mozilla.org/show_bug.cgi?id=1273940. NI :cmore to supply the initial whitelist.
Flags: needinfo?(chrismore.bugzilla)
Comment 1•8 years ago
I have a proposed white list here in sheet "Sources with business interest": https://docs.google.com/spreadsheets/d/1U-0JHpc3INJnBwTFkdrPqpk7yqvpEn16DJtp6d6TRiY/edit#gid=1389272572

It is based off of a few criteria:
* we are doing advertising with the organization
* we have a partnership with an organization
* it is a Mozilla property

Legal is taking this list to outside counsel to get their opinion if we can proceed.
Flags: needinfo?(chrismore.bugzilla)
Comment 3•8 years ago
(In reply to Jeremy Orem [:oremj] from comment #2)
> Can that list be public?

I don't see why not. I don't want to share this specific spreadsheet, because it contains download metrics, but the list can be copy/pasted here. Let me see if legal has an update and I can paste it here.
Flags: needinfo?(chrismore.bugzilla)
Reporter | ||
Comment 4•8 years ago
NI :cmore to provide a final, public whitelist. Note: for any source that is not in the whitelist, we'll want to send `(other)` in the source field.
Flags: needinfo?(chrismore.bugzilla)
Comment 5•8 years ago
Here's the preliminary list we can use to build out the white list pattern matching on the source= field. All these values will be in the source= value being sent via www.mozilla.org to the stub service.

I wrote these as regular expressions to capture the patterns these organizations use in their DNS.

Our sites and Mozilla community sites:
^.*mozilla.*.+$
^.*firefox.*.+$

Search engines + SEM:
^.+\.google\..+$
^.+\.bing\..+$
^.+\.yahoo\..+$
^.*yandex\..+$
^.+\.baidu\..+$
^.+\.taobao\..+$

Other advertising:
^.+\.youtube\..+$
^.+\.facebook\..+$
^.*twitter\..+$
^.*united-internet\..+$
^.+\.and1\..+$
^.+\.gmx\..+$
^www\.mail\..+$
^.+\.aol\..+$
^.+\.qwant\..+$
^.+\.seznam\.cz$
^.+\.toshiba\..+$
^.+\.kongregate\..+$
^.+\.ea\..+$

There are also a few utm_source values that need to be white listed, as they are our vanity domains and in-browser referral traffic:
^firefox\-com$
^getfirefox\-com$
^firefox\-browser$

Anything that doesn't match the regex patterns above should have a source equal to this string: "(other)"
Flags: needinfo?(chrismore.bugzilla)
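For illustration only, a minimal Go sketch of the pattern matching described in comment 5, using a handful of the listed regexes. The helper name `sanitizeSource` and the variable `sourcePatterns` are hypothetical and are not taken from the stub attribution service.

```go
package main

import (
	"fmt"
	"regexp"
)

// sourcePatterns holds a few of the patterns from comment 5,
// compiled once at startup. (Hypothetical variable name.)
var sourcePatterns = []*regexp.Regexp{
	regexp.MustCompile(`^.+\.google\..+$`),
	regexp.MustCompile(`^.+\.bing\..+$`),
	regexp.MustCompile(`^.*yandex\..+$`),
	regexp.MustCompile(`^firefox\-com$`),
}

// sanitizeSource returns source unchanged when any pattern matches,
// and the placeholder "(other)" otherwise. (Hypothetical helper.)
func sanitizeSource(source string) string {
	for _, re := range sourcePatterns {
		if re.MatchString(source) {
			return source
		}
	}
	return "(other)"
}

func main() {
	for _, s := range []string{"www.google.com", "firefox-com", "example.org"} {
		fmt.Printf("%s => %s\n", s, sanitizeSource(s))
	}
}
```

Note that `^.+\.google\..+$` requires at least one character before `.google`, so a bare `google.com` would fall through to "(other)" under this pattern.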
Assignee | ||
Comment 7•8 years ago
Is it possible to make the regexes more restrictive? As it is now, it still seems really open. For example, I could send something like badthings.ea.somesiteidontwanttrackingme.com. Do we have a current source list? How many unique sources do we have? I'd rather whitelist everything exclusively than have regexes with ".+" or ".*" in them.
Flags: needinfo?(oremj)
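The concern in comment 7 can be reproduced with a short Go sketch. The `ea` pattern is the one from comment 5; the exact-match set `exactSources` is a hypothetical one-entry example, not the real list.

```go
package main

import (
	"fmt"
	"regexp"
)

// looseEA is the ea pattern proposed in comment 5.
var looseEA = regexp.MustCompile(`^.+\.ea\..+$`)

// exactSources is a hypothetical exact-match set with one known-good entry.
var exactSources = map[string]bool{"help.ea.com": true}

func main() {
	candidates := []string{
		"help.ea.com",
		"badthings.ea.somesiteidontwanttrackingme.com",
	}
	for _, s := range candidates {
		// The regex accepts both values; the exact set accepts only the first.
		fmt.Printf("%-46s regex=%v exact=%v\n", s, looseEA.MatchString(s), exactSources[s])
	}
}
```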
Assignee | ||
Comment 8•8 years ago
If that's not possible, a mapping might be a good choice. Example:
^.+\.google\..+$ => "google"
^.+\.bing\..+$ => "bing"
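A rough Go sketch of the mapping idea, assuming a simple ordered list of rules. The type `sourceRule` and the function `normalizeSource` are hypothetical names. The point of the approach is that only the fixed label is recorded, never the attacker-controlled string.

```go
package main

import (
	"fmt"
	"regexp"
)

// sourceRule pairs a pattern with the fixed label to record. (Hypothetical type.)
type sourceRule struct {
	pattern *regexp.Regexp
	name    string
}

// rules mirrors the two example mappings from comment 8.
var rules = []sourceRule{
	{regexp.MustCompile(`^.+\.google\..+$`), "google"},
	{regexp.MustCompile(`^.+\.bing\..+$`), "bing"},
}

// normalizeSource returns the label of the first matching rule.
// The boolean is false when no rule matches.
func normalizeSource(source string) (string, bool) {
	for _, r := range rules {
		if r.pattern.MatchString(source) {
			return r.name, true
		}
	}
	return "", false
}

func main() {
	for _, s := range []string{"www.google.com", "cn.bing.com", "example.org"} {
		name, ok := normalizeSource(s)
		fmt.Printf("%s => %q (matched=%v)\n", s, name, ok)
	}
}
```

Even if a crafted value such as `x.google.example.com` matches the loose pattern, the stored value is just `google`.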
Comment 9•8 years ago
The regex's above spit out these sources (from Google Analytics): www.google.com www.bing.com firefox-com facebook.com www.yahoo.com taobao.com youtube.com getfirefox-com support.google.com kongregate.com addons.mozilla.org www.google.com mozilla.jp go.mail.ru support.mozilla.org mozilla.de myaccount.google.com www.seznam.cz mozilla.cz firefox.cz www.aol.com photos.google.com testpilot.firefox.com firefox-browser br.search.yahoo.com mail.ru bing.com us.search.yahoo.com mx.search.yahoo.com uk.search.yahoo.com es.search.yahoo.com help.ea.com fr.search.yahoo.com firefox.mozilla.cz messenger.yahoo.com hangouts.google.com firefox.de mozilla.fi it.search.yahoo.com ca.search.yahoo.com in.search.yahoo.com qwant.com de.search.yahoo.com ph.search.yahoo.com developer.mozilla.org id.search.yahoo.com us.yhs4.search.yahoo.com love.mail.ru au.search.yahoo.com plus.url.google.com mozilla.hu suche.gmx.net mozilla.pl hk.search.yahoo.com ar.search.yahoo.com vn.search.yahoo.com plus.google.com nl.search.yahoo.com th.search.yahoo.com mozilla.com www.youtube.com malaysia.search.yahoo.com mozilla.ro my.mail.ru co.search.yahoo.com start.toshiba.com mozilla.lt tw.search.yahoo.com mozilla.sk mozilla.si se.search.yahoo.com mozilla.ch navigator-bs.gmx.fr br.yhs4.search.yahoo.com navigator-bs.gmx.com cto.mail.ru cn.bing.com mozilla.rs pl.search.yahoo.com pe.search.yahoo.com no.search.yahoo.com e.mail.ru ve.search.yahoo.com us-mg6.mail.yahoo.com extensions.aol.com cl.search.yahoo.com at.search.yahoo.com hacks.mozilla.org fi.search.yahoo.com sg.search.yahoo.com lite.qwant.com hello.firefox.com tr.search.yahoo.com mail.google.com images.tanks.mail.ru dk.search.yahoo.com mg.mail.yahoo.com maktoob.search.yahoo.com partnerads.ysm.yahoo.com www.google.de fr.yhs4.search.yahoo.com takeout.google.com mail.de www.google.fr espanol.search.yahoo.com mx.yhs4.search.yahoo.com firefox.no us-mg5.mail.yahoo.com ro.search.yahoo.com www.google.it activations.cdn.mozilla.net tweetdeck.twitter.com otvet.mail.ru 
firefox.org ch.search.yahoo.com uk.yhs4.search.yahoo.com scholar.google.com www.google.es encrypted.google.com talkgadget.google.com ru.search.yahoo.com mail.aol.com www.google.pl answers.yahoo.com tanks.mail.ru suche.gmx.at qc.search.yahoo.com nz.search.yahoo.com mozilla.ee www.google.ro help.mail.ru gr.search.yahoo.com www.google.ca fr-mg42.mail.yahoo.com bienvenido.toshiba.com accounts.firefox.com thunderbird.mozilla.cz start.new.toshiba.com search.1and1.com poseidon.navigator-bs.gmx.com malaysia.yhs4.search.yahoo.com in.yhs4.search.yahoo.com id.yhs4.search.yahoo.com id.messenger.yahoo.com hk.messenger.yahoo.com www.google.sr www.google.se www.google.be global.bing.com firefox.si co.yhs4.search.yahoo.com se.yhs4.search.yahoo.com navigator-bs.gmx.es www.google.dz www.google.bg es-mg42.mail.yahoo.com en-maktoob.search.yahoo.com email.seznam.cz br.answers.yahoo.com
Assignee | ||
Comment 10•8 years ago
Can I use comment 9 as the whitelist?
Comment 11•8 years ago
Updated list leading www's missing: www.google.com www.bing.com firefox-com facebook.com www.yahoo.com taobao.com youtube.com getfirefox-com support.google.com kongregate.com addons.mozilla.org www.google.com mozilla.jp go.mail.ru support.mozilla.org mozilla.de myaccount.google.com www.seznam.cz mozilla.cz firefox.cz www.aol.com photos.google.com testpilot.firefox.com firefox-browser br.search.yahoo.com mail.ru bing.com us.search.yahoo.com mx.search.yahoo.com uk.search.yahoo.com es.search.yahoo.com help.ea.com fr.search.yahoo.com firefox.mozilla.cz messenger.yahoo.com hangouts.google.com firefox.de mozilla.fi it.search.yahoo.com ca.search.yahoo.com in.search.yahoo.com www.qwant.com de.search.yahoo.com ph.search.yahoo.com developer.mozilla.org id.search.yahoo.com us.yhs4.search.yahoo.com love.mail.ru au.search.yahoo.com plus.url.google.com mozilla.hu suche.gmx.net mozilla.pl hk.search.yahoo.com ar.search.yahoo.com vn.search.yahoo.com plus.google.com nl.search.yahoo.com th.search.yahoo.com mozilla.com www.youtube.com malaysia.search.yahoo.com mozilla.ro my.mail.ru co.search.yahoo.com start.toshiba.com mozilla.lt tw.search.yahoo.com mozilla.sk mozilla.si se.search.yahoo.com mozilla.ch navigator-bs.gmx.fr br.yhs4.search.yahoo.com navigator-bs.gmx.com cto.mail.ru cn.bing.com mozilla.rs pl.search.yahoo.com pe.search.yahoo.com no.search.yahoo.com e.mail.ru ve.search.yahoo.com us-mg6.mail.yahoo.com extensions.aol.com cl.search.yahoo.com at.search.yahoo.com hacks.mozilla.org fi.search.yahoo.com sg.search.yahoo.com lite.qwant.com hello.firefox.com tr.search.yahoo.com mail.google.com images.tanks.mail.ru dk.search.yahoo.com mg.mail.yahoo.com maktoob.search.yahoo.com partnerads.ysm.yahoo.com www.google.de fr.yhs4.search.yahoo.com takeout.google.com mail.de www.google.fr espanol.search.yahoo.com mx.yhs4.search.yahoo.com firefox.no us-mg5.mail.yahoo.com ro.search.yahoo.com www.google.it activations.cdn.mozilla.net tweetdeck.twitter.com otvet.mail.ru firefox.org 
ch.search.yahoo.com uk.yhs4.search.yahoo.com scholar.google.com www.google.es encrypted.google.com talkgadget.google.com ru.search.yahoo.com mail.aol.com www.google.pl answers.yahoo.com tanks.mail.ru suche.gmx.at qc.search.yahoo.com nz.search.yahoo.com mozilla.ee www.google.ro help.mail.ru gr.search.yahoo.com www.google.ca fr-mg42.mail.yahoo.com bienvenido.toshiba.com accounts.firefox.com thunderbird.mozilla.cz start.new.toshiba.com search.1and1.com poseidon.navigator-bs.gmx.com malaysia.yhs4.search.yahoo.com in.yhs4.search.yahoo.com id.yhs4.search.yahoo.com id.messenger.yahoo.com hk.messenger.yahoo.com www.google.sr www.google.se www.google.be global.bing.com firefox.si co.yhs4.search.yahoo.com se.yhs4.search.yahoo.com navigator-bs.gmx.es www.google.dz www.google.bg es-mg42.mail.yahoo.com en-maktoob.search.yahoo.com email.seznam.cz br.answers.yahoo.com
Comment 12•8 years ago
(In reply to Jeremy Orem [:oremj] from comment #10)
> Can I use comment 9 as the whitelist?

I think it could work for now. I just think that doing it this way, without regular expressions, means more maintenance of the list, since the sub-domains on some of these probably change over time. For example, with br.yhs4.search.yahoo.com, what if it changes to yhs5? I get why the regex approach is harder to get right: the patterns must not go so wide that they capture www.mywebsiteismoreawesomethangoogle.com.
Assignee | ||
Comment 13•8 years ago
In that case, can we do what I suggested in comment 8? That way we can still use regexes, but will normalize the value down.
(In reply to Jeremy Orem [:oremj] from comment #13)
> In that case, can we do what I suggested in comment 8? That way we can still
> use regexes, but will normalize the value down.

Chris ^^^ (I'm just trying to help move this along, so I can continue testing further/deeper - thanks!)
Flags: needinfo?(chrismore.bugzilla)
Comment 15•8 years ago
I will take an action to do the regex mapping to a string to represent the domain. For now, :oremj will use the static list of domains in comment 11. We can change it later to a regex mapping, but let's not block on moving forward.
Flags: needinfo?(chrismore.bugzilla) → needinfo?(oremj)
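As a sketch of the static-list approach agreed on in comment 15, an exact-match set in Go could look like the following. Only three entries from comment 11 are shown, and the names `allowedSources` and `isAllowed` are hypothetical.

```go
package main

import "fmt"

// allowedSources is a small excerpt of the static list from comment 11.
// (Hypothetical variable name; the real list is much longer.)
var allowedSources = map[string]struct{}{
	"www.google.com":           {},
	"br.yhs4.search.yahoo.com": {},
	"firefox-com":              {},
}

// isAllowed reports whether source appears verbatim in the static list.
func isAllowed(source string) bool {
	_, ok := allowedSources[source]
	return ok
}

func main() {
	// The second value illustrates the maintenance concern from comment 12:
	// a renamed sub-domain is rejected until the list is updated.
	for _, s := range []string{"br.yhs4.search.yahoo.com", "br.yhs5.search.yahoo.com"} {
		fmt.Printf("%s allowed=%v\n", s, isAllowed(s))
	}
}
```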
Assignee | ||
Updated•8 years ago
Assignee: nobody → oremj
Flags: needinfo?(oremj)
Assignee | ||
Comment 16•8 years ago
:cmore, I've implemented this whitelist in https://github.com/mozilla-services/stubattribution/pull/23 r?
Flags: needinfo?(chrismore.bugzilla)
Comment 17•8 years ago
r+ Looks good to me for an MVP. We can always revise later. Thanks!
Flags: needinfo?(chrismore.bugzilla)
Assignee | ||
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 18•7 years ago
Here's a few extra sources that don't have any TLDs: google bing firefox-com yahoo yandex ask seznam aol
Assignee | ||
Comment 19•7 years ago
https://github.com/mozilla-services/stubattribution/pull/35
I did some more testing today, and it looks like "facebook" as a source shows up without any TLDs, too:

Steps:
1) loaded https://www.facebook.com/Firefox/about/
2) clicked on https://mzl.la/292SfT5, which 301s to https://www.mozilla.org/en-US/firefox/new/?utm_source=facebook&utm_medium=social&utm_content=facebook-about-bio&utm_campaign=firefox
3) following through to the end-user flow on Mozilla.org (on www-demo4), I get the following exception in Sentry[0]:

could not validate attribution_code: source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook-about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist

See http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1 for the above

need-info? :cmore just to triple-check that I'm correct, here, before we add "facebook" as a source.

[0] https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/
Status: RESOLVED → REOPENED
Flags: needinfo?(chrismore.bugzilla)
Resolution: FIXED → ---
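To make the Sentry message in comment 20 easier to read, here is a minimal Go sketch that decodes the same attribution_code and checks its source field. It assumes the code is a percent-encoded query string, as the logged value suggests; the function `validateCode` and its error wording are hypothetical and only imitate the logged message.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateCode decodes a percent-encoded attribution code and checks that
// its "source" field is in the allowed set. (Hypothetical helper.)
func validateCode(code string, allowed map[string]bool) error {
	decoded, err := url.QueryUnescape(code)
	if err != nil {
		return fmt.Errorf("could not decode attribution_code: %w", err)
	}
	values, err := url.ParseQuery(decoded)
	if err != nil {
		return fmt.Errorf("could not parse attribution_code: %w", err)
	}
	source := values.Get("source")
	if !allowed[source] {
		return fmt.Errorf("source: %s is not in whitelist", source)
	}
	return nil
}

func main() {
	code := "source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox" +
		"%26content%3Dfacebook-about-bio%26timestamp%3D1483170532"

	// With only "facebook.com" listed, the bare "facebook" source is rejected.
	fmt.Println(validateCode(code, map[string]bool{"facebook.com": true}))
	// Once "facebook" is added, the same code validates.
	fmt.Println(validateCode(code, map[string]bool{"facebook.com": true, "facebook": true}))
}
```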
Comment 21•7 years ago
(In reply to Stephen Donner [:stephend] from comment #20)
> I did some more testing today, and it looks like "facebook" as a source
> shows up without any TLDs, too:
>
> Steps:
> 1) loaded https://www.facebook.com/Firefox/about/
> 2) clicked on https://mzl.la/292SfT5, which 301s to
> https://www.mozilla.org/en-US/firefox/new/?utm_source=facebook&utm_medium=social&utm_content=facebook-about-bio&utm_campaign=firefox
> 3) following through to the end-user flow on Mozilla.org (on www-demo4), I
> get the following exception in Sentry[0]:
>
> could not validate attribution_code:
> source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook-about-bio%26timestamp%3D1483170532:
> source: facebook is not in whitelist
>
> See
> http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1
> for the above
>
> need-info? :cmore just to triple-check that I'm correct, here, before we add
> "facebook" as a source.
>
> [0]
> https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/

It looks like most of the time it is facebook.com, but it is possible that people sometimes forget to add the TLD. Let's add "facebook" to the white list.
Flags: needinfo?(chrismore.bugzilla)
Comment 22•7 years ago
More valid whitelist domains:
r.search.yahoo.com
www.mozilla.org
www.google.com.br
www.google.com.ec
www.google.com.mx

or just whitelist them all: ^www.google.com.\w+$
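One thing to watch in the pattern from comment 22: the dots are unescaped, so each one matches any character. A small Go sketch comparing it with an escaped variant follows; the escaped form is a suggestion for illustration, not something proposed in this bug.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// unescaped is the pattern exactly as written in comment 22.
	unescaped = regexp.MustCompile(`^www.google.com.\w+$`)
	// escaped is an illustrative stricter variant with literal dots.
	escaped = regexp.MustCompile(`^www\.google\.com\.\w+$`)
)

func main() {
	for _, s := range []string{"www.google.com.br", "wwwXgoogleYcomZbr"} {
		fmt.Printf("%-20s unescaped=%v escaped=%v\n",
			s, unescaped.MatchString(s), escaped.MatchString(s))
	}
}
```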
Comment 23•7 years ago
Another valid whitelist source to be updated: fx36start
Comment 24•7 years ago
oremj: another addition to the whitelist in comment 23.
Flags: needinfo?(oremj)
Assignee | ||
Comment 25•7 years ago
These are all showing up in the logs, are they valid:
firefox,89
"contra.pentagames.net",47
vk,47
fx36start,45
"oferta.senasofiaplus.edu.co",37
"www.minijuegos.com",37
bubblewitch3,31
"king.com",31
"int.search.myway.com",25
"update.org",25
browser,24
"www.kongregate.com",24
cdn,22
"contractwarsgame.com",21
ec,21
prod4,21
snippet,17
"www.clickjogos.com.br",15
"pro.rarom.ro",14
com,13
"diggerworld.ru",12
"www.nplay.com",12
"search.seznam.cz",11
"www.macrojuegos.com",11
"www.viamichelin.fr",11
about,10
home,10
"www.miniclip.com",10
"cse.google.com",9
"duckduckgo.com",9
Also, firefox-dev-tools: https://sentry.prod.mozaws.net/operations/stub_attribution-prod/issues/381991/events/9615199/
Assignee | ||
Updated•7 years ago
Flags: needinfo?(chrismore.bugzilla)
Comment 27•7 years ago
(In reply to Jeremy Orem [:oremj] from comment #25)
> These are all showing up in the logs, are they valid:
> firefox,89
valid
> "contra.pentagames.net",47
not valid
> vk,47
not valid (maybe in the future)
> fx36start,45
valid, that's us.
> "oferta.senasofiaplus.edu.co",37
not valid
> "www.minijuegos.com",37
not valid
> bubblewitch3,31
not valid
> "king.com",31
not valid
> "int.search.myway.com",25
not valid
> "update.org",25
not valid
> browser,24
not valid (not sure what this is)
> "www.kongregate.com",24
valid
> cdn,22
not valid. probably generic cdn traffic.
> "contractwarsgame.com",21
not valid
> ec,21
not valid
> prod4,21
not valid
> snippet,17
valid (Firefox's snippets)
> "www.clickjogos.com.br",15
not valid
> "pro.rarom.ro",14
not valid
> com,13
not valid
> "diggerworld.ru",12
not valid
> "www.nplay.com",12
not valid
> "search.seznam.cz",11
valid
> "www.macrojuegos.com",11
not valid
> "www.viamichelin.fr",11
not valid
> about,10
valid (I think this is source=about-home in this one and the next)
> home,10
valid (firefox's about-home)
> "www.miniclip.com",10
not valid
> "cse.google.com",9
valid
> "duckduckgo.com",9
valid, search partner in Firefox
Flags: needinfo?(chrismore.bugzilla)
Assignee | ||
Comment 28•7 years ago
https://github.com/mozilla-services/stubattribution/pull/51
Flags: needinfo?(oremj)
Sorry, one more, for firefox-dev-tools, which comes from: https://dxr.mozilla.org/mozilla-central/search?q=firefox-dev-tools&redirect=true

pref("devtools.devedition.promo.url", "https://www.mozilla.org/firefox/developer/?utm_source=firefox-dev-tools&utm_medium=firefox-browser&utm_content=betadoorhanger");
Flags: needinfo?(oremj)
Flags: needinfo?(chrismore.bugzilla)
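For anyone auditing in-product links like the pref above, the utm_source value can be pulled out of a URL with Go's standard `net/url` package. This is a sketch only; the helper name `utmSource` is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// utmSource returns the utm_source query parameter of rawURL,
// or an empty string when the parameter is absent. (Hypothetical helper.)
func utmSource(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Query().Get("utm_source"), nil
}

func main() {
	promo := "https://www.mozilla.org/firefox/developer/" +
		"?utm_source=firefox-dev-tools&utm_medium=firefox-browser&utm_content=betadoorhanger"
	src, err := utmSource(promo)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(src)
}
```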
Also, if we wanted to double-check/add more in-product utm_source values, we could grab from here: https://dxr.mozilla.org/mozilla-central/search?q=utm_source&redirect=false
Comment 31•7 years ago
:oremj: here's two more that are valid: firefox-dev-tools directory-tiles
Flags: needinfo?(chrismore.bugzilla)
Assignee | ||
Comment 32•7 years ago
https://github.com/mozilla-services/stubattribution/commit/50094bb1b30e425be1ca1c09171a8026ab1d66ba
Flags: needinfo?(oremj)
Comment 33•7 years ago
oremj: here's another source that will need to be whitelisted for a partnership test that we will be running soon: softonic.com
Flags: needinfo?(oremj)
Comment 34•7 years ago
oremj: three more including comment 33:
chip.de
www.chip.de
www.softonic.com

may want to do:
^.*chip\.de$
^.*softonic\.com$

Chip is a partner in Germany. Thanks!
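A caveat on the suggested pattern, sketched in Go: `^.*chip\.de$` accepts any host that merely ends in `chip.de`. The stricter alternative below, which allows only the bare domain or a dot-separated sub-domain, is an illustrative suggestion and was not proposed in this bug.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// looseChip is the pattern suggested in comment 34.
	looseChip = regexp.MustCompile(`^.*chip\.de$`)
	// strictChip is an illustrative alternative: chip.de or *.chip.de only.
	strictChip = regexp.MustCompile(`^(.+\.)?chip\.de$`)
)

func main() {
	for _, s := range []string{"chip.de", "www.chip.de", "microchip.de"} {
		fmt.Printf("%-14s loose=%v strict=%v\n",
			s, looseChip.MatchString(s), strictChip.MatchString(s))
	}
}
```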
Assignee | ||
Comment 35•7 years ago
The following list is the current top 20 blocked sources. Would you like me to add any while I'm at it?

oferta.senasofiaplus.edu.co
beinconnect.es
www.macrojuegos.com
contra.pentagames.net
vk
www.minijuegos.com
ok.ru
cdn
contractwarsgame.com
ec
prod4
update.org
browser
int.search.myway.com
yandex.ua
desktop
snippet
watch.nowtv.com
bubblewitch3
king.com
Flags: needinfo?(oremj) → needinfo?(chrismore.bugzilla)
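A "top N blocked sources" list like the one in comment 35 could be produced by tallying rejected values. The Go sketch below assumes the rejected sources are already available as a slice of strings; how they are actually extracted from the service logs is not described in this bug, and `topBlocked` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// topBlocked counts each rejected source and returns the n most frequent,
// breaking ties alphabetically so the output is deterministic.
func topBlocked(rejected []string, n int) []string {
	counts := make(map[string]int)
	for _, s := range rejected {
		counts[s]++
	}
	names := make([]string, 0, len(counts))
	for s := range counts {
		names = append(names, s)
	}
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}

func main() {
	// Made-up sample input for illustration; not real log data.
	sample := []string{"vk", "cdn", "vk", "ok.ru", "cdn", "vk"}
	fmt.Println(topBlocked(sample, 2))
}
```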
Comment 36•7 years ago
(In reply to Jeremy Orem [:oremj] from comment #35)
> The following list is the current top 20 blocked sources. Would you like me
> to add any while I'm at it?
>
> oferta.senasofiaplus.edu.co
no
> beinconnect.es
no
> www.macrojuegos.com
no
> contra.pentagames.net
no
> vk
no
> www.minijuegos.com
no
> ok.ru
no
> cdn
hmm. I don't know if this is our CDN or something else. Any other values coming through on source = cdn?
> contractwarsgame.com
no
> ec
no
> prod4
no
> update.org
no
> browser
yes
> int.search.myway.com
no
> yandex.ua
yes
> desktop
yes
> snippet
yes
> watch.nowtv.com
no
> bubblewitch3
no
> king.com
no
Flags: needinfo?(chrismore.bugzilla)
Comment 37•7 years ago
oremj: also add ^.+\.wikipedia\.org$
Comment 38•7 years ago
oremj: new partner to add: toshiba.com www.toshiba.com or ^.+\.toshiba\.com$
Flags: needinfo?(oremj)
Assignee | ||
Comment 39•7 years ago
https://github.com/mozilla-services/stubattribution/pull/55
Flags: needinfo?(oremj)
Comment 40•7 years ago
Jeremy: in comment 33 and comment 34, I asked for softonic.com and www.softonic.com (or a regex matching both) to be added to the whitelist. We have been running a partner experiment for the past month with source=softonic.com and I can't find the data in attribution. So, I went to our white list at: https://github.com/mozilla-services/stubattribution/blob/master/attributioncode/sourcewhitelist.go#L211 and I noticed that the whitelist entry is not a regex and is just set to www.softonic.com.

The current experiment points to this URL: https://www.mozilla.org/firefox/new/?utm_source=softonic.com&utm_campaign=fx-download-baseline&utm_medium=referral&utm_content=fx-download-page which doesn't contain the www, which is why I originally requested it both with and without the www to ensure this doesn't happen. That means it is likely that we have been discarding this data.

Can you add softonic.com without the www (or make it a regex)? Also, can you check the exception log to confirm that softonic.com (without www) has been blocked? Thanks
Flags: needinfo?(oremj)
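One way to avoid this kind of www/no-www mismatch with an exact-match list is to canonicalize the source before the lookup. The Go sketch below strips a leading "www." first; this is an illustrative option under that assumption, not what the service does, and the names `canonical`, `allowed` and `isAllowed` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed stores each domain once, without a leading "www.".
// (Hypothetical one-entry list for illustration.)
var allowed = map[string]bool{"softonic.com": true}

// canonical removes a single leading "www." so both spellings compare equal.
func canonical(source string) string {
	return strings.TrimPrefix(source, "www.")
}

// isAllowed looks up the canonical form of source in the allowed set.
func isAllowed(source string) bool {
	return allowed[canonical(source)]
}

func main() {
	for _, s := range []string{"softonic.com", "www.softonic.com", "softonic.com.example"} {
		fmt.Printf("%s allowed=%v\n", s, isAllowed(s))
	}
}
```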
Assignee | ||
Comment 41•7 years ago
Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is showing up in the logs as blocked. Added here: https://github.com/mozilla-services/stubattribution/pull/56 Let's start filing new bugs, github issues or github PRs for whitelist change requests. The length of this bug is starting to make tracking difficult.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Flags: needinfo?(oremj)
Resolution: --- → FIXED
Comment 42•7 years ago
(In reply to Jeremy Orem [:oremj] from comment #41)
> Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is
> showing up in the logs as blocked.
>
> Added here: https://github.com/mozilla-services/stubattribution/pull/56
>
> Let's start filing new bugs, github issues or github PRs for whitelist
> change requests. The length of this bug is starting to make tracking
> difficult.

Thanks, Jeremy. I will file new bugs for any updates to the whitelist.
Comment 43•7 years ago
Hey Jeremy. Sorry for another question here, but it is related to the discussion above and not really a new bug. Question: for sources that are rejected by the white list, do they get attribution values of "unknown", or are they completely thrown away?
Flags: needinfo?(oremj)
Assignee | ||
Comment 44•7 years ago
They are completely thrown away. In other words, stub attribution service gives them an unmodified stub installer.
Flags: needinfo?(oremj)
Comment 45•7 years ago
(In reply to Jeremy Orem [:oremj] from comment #44)
> They are completely thrown away. In other words, stub attribution service
> gives them an unmodified stub installer.

:oremj: the original requirements stated that non-white listed sources should be set to "(other)" instead of null. How big of a change would this be? I can file a new bug if this is a change that can be made.
Flags: needinfo?(oremj)
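The two behaviours discussed in comments 44 and 45 differ only in what happens on a miss. A minimal Go sketch of both policies, using hypothetical function names and a made-up one-entry list:

```go
package main

import "fmt"

// rejectUnknown models the behaviour described in comment 44: an unlisted
// source is dropped, signalled here by ok == false. (Hypothetical helper.)
func rejectUnknown(source string, allowed map[string]bool) (value string, ok bool) {
	if allowed[source] {
		return source, true
	}
	return "", false
}

// replaceUnknown models the original requirement from comment 4: an unlisted
// source is recorded as the placeholder "(other)". (Hypothetical helper.)
func replaceUnknown(source string, allowed map[string]bool) string {
	if allowed[source] {
		return source
	}
	return "(other)"
}

func main() {
	allowed := map[string]bool{"www.google.com": true}
	for _, s := range []string{"www.google.com", "king.com"} {
		v, ok := rejectUnknown(s, allowed)
		fmt.Printf("%s: reject=(%q, %v) replace=%q\n", s, v, ok, replaceUnknown(s, allowed))
	}
}
```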
Assignee | ||
Comment 46•7 years ago
I filed https://github.com/mozilla-services/stubattribution/issues/57
Flags: needinfo?(oremj)
Comment 47•7 years ago
Filed this issue for oremj: https://github.com/mozilla-services/stubattribution/issues/64