Closed Bug 1306457 Opened 8 years ago Closed 7 years ago

Implement the whitelist for Stub Attribution `source` field

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ckprice, Assigned: oremj)

We will need a whitelist for the `source` field used in Stub Attribution.

oremj has noted that it would be best to implement and maintain the whitelist in the service being built in https://bugzilla.mozilla.org/show_bug.cgi?id=1273940.

NI :cmore to supply the initial whitelist.
Flags: needinfo?(chrismore.bugzilla)
I have a proposed whitelist here, in the sheet "Sources with business interest":

https://docs.google.com/spreadsheets/d/1U-0JHpc3INJnBwTFkdrPqpk7yqvpEn16DJtp6d6TRiY/edit#gid=1389272572

It is based off of a few criteria:

* we are doing advertising with the organization
* we have a partnership with an organization
* it is a Mozilla property

Legal is taking this list to outside counsel to get their opinion on whether we can proceed.
Flags: needinfo?(chrismore.bugzilla)
Can that list be public?
Flags: needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #2)
> Can that list be public?

I don't see why not. I don't want to share this specific spreadsheet, because it contains download metrics, but the list can be copy/pasted here. Let me see if legal has an update and I can paste it here.
Flags: needinfo?(chrismore.bugzilla)
NI :cmore to provide a final, public whitelist.

Note: for anything that is not in the whitelist, we'll want to send `(other)` in the source field.
Flags: needinfo?(chrismore.bugzilla)
Here's the preliminary list we can use to build out the white list pattern matching on the source= field.

All these values will be in the source= value being sent via www.mozilla.org to the stub service:

I wrote these as regular expressions to capture the patterns these organizations use in their DNS.

Our sites and Mozilla community sites:

^.*mozilla.*.+$
^.*firefox.*.+$

Search engines + SEM:

^.+\.google\..+$
^.+\.bing\..+$
^.+\.yahoo\..+$
^.*yandex\..+$
^.+\.baidu\..+$
^.+\.taobao\..+$

Other advertising:

^.+\.youtube\..+$
^.+\.facebook\..+$
^.*twitter\..+$
^.*united-internet\..+$
^.+\.and1\..+$
^.+\.gmx\..+$
^www\.mail\..+$
^.+\.aol\..+$
^.+\.qwant\..+$
^.+\.seznam\.cz$
^.+\.toshiba\..+$
^.+\.kongregate\..+$
^.+\.ea\..+$

There are also a few utm_source values that need to be whitelisted, as they are our vanity domains and in-browser referral traffic.

^firefox\-com$
^getfirefox\-com$
^firefox\-browser$

Anything that doesn't match the regex patterns above should have a source equal to this string: "(other)"
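
The rule above (keep a source that matches one of the patterns, otherwise report "(other)") might be sketched as follows. This is only an illustration in Python, not the stub attribution service's code: `normalize_source`, `PATTERNS`, and `OTHER` are hypothetical names, and only a handful of the patterns listed above are included.

```python
import re

# A few of the proposed patterns, copied verbatim from the list above.
PATTERNS = [re.compile(p) for p in (
    r"^.+\.google\..+$",
    r"^.+\.bing\..+$",
    r"^.*yandex\..+$",
    r"^.+\.seznam\.cz$",
    r"^firefox\-com$",
    r"^getfirefox\-com$",
)]

# Fallback value for sources that match no pattern.
OTHER = "(other)"


def normalize_source(source: str) -> str:
    """Return the source unchanged if any pattern matches, else "(other)"."""
    if any(p.match(source) for p in PATTERNS):
        return source
    return OTHER
```

Note that, as written, the `google` pattern requires at least one label before `.google.`, so a bare `google.com` would fall through to "(other)" in this sketch.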
Flags: needinfo?(chrismore.bugzilla)
NI :oremj -- does this look okay to you?
Flags: needinfo?(oremj)
Is it possible to make the regexes more restrictive? As it is now, it still seems really open. For example, I could send something like badthings.ea.somesiteidontwanttrackingme.com.

Do we have a current source list? How many unique sources do we have? I'd rather whitelist every source explicitly than have regexes with ".+" or ".*" in them.
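
To make the concern concrete, here is a small Python check of the proposed `ea` pattern against the example above, next to an exact-match lookup. The function names and the one-entry `EXACT_SOURCES` set are hypothetical and exist only for this comparison.

```python
import re

# The proposed pattern for EA, copied from the earlier list.
EA_PATTERN = re.compile(r"^.+\.ea\..+$")

# Hypothetical exact-match list containing a single known-good source.
EXACT_SOURCES = {"help.ea.com"}


def allowed_by_regex(source: str) -> bool:
    """True if the open-ended regex accepts the source."""
    return EA_PATTERN.match(source) is not None


def allowed_by_exact_list(source: str) -> bool:
    """True only if the source is literally in the list."""
    return source in EXACT_SOURCES
```

With these definitions, `allowed_by_regex("badthings.ea.somesiteidontwanttrackingme.com")` is true, while `allowed_by_exact_list` rejects it and still accepts `help.ea.com`.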
Flags: needinfo?(oremj)
If that's not possible, a mapping might be a good choice. Example:

^.+\.google\..+$ => "google"
^.+\.bing\..+$ => "bing"
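
A rough Python sketch of that mapping idea, assuming the service would record the mapped name instead of the raw value. `SOURCE_MAP` and `map_source` are hypothetical names, and returning `None` for unmatched sources is just one possible choice.

```python
import re
from typing import Optional

# Each pattern maps to a short, fixed name, so whatever extra text the
# sender put in the hostname is never stored.
SOURCE_MAP = [
    (re.compile(r"^.+\.google\..+$"), "google"),
    (re.compile(r"^.+\.bing\..+$"), "bing"),
]


def map_source(source: str) -> Optional[str]:
    """Return the normalized name for the first matching pattern, else None."""
    for pattern, name in SOURCE_MAP:
        if pattern.match(source):
            return name
    return None
```

Even a crafted value such as `badthings.google.example.com` would be reduced to `google` here, so the arbitrary part of the string is dropped rather than passed along.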
The regexes above produce these sources (from Google Analytics):

www.google.com
www.bing.com
firefox-com
facebook.com
www.yahoo.com
taobao.com
youtube.com
getfirefox-com
support.google.com
kongregate.com
addons.mozilla.org
www.google.com
mozilla.jp
go.mail.ru
support.mozilla.org
mozilla.de
myaccount.google.com
www.seznam.cz
mozilla.cz
firefox.cz
www.aol.com
photos.google.com
testpilot.firefox.com
firefox-browser
br.search.yahoo.com
mail.ru
bing.com
us.search.yahoo.com
mx.search.yahoo.com
uk.search.yahoo.com
es.search.yahoo.com
help.ea.com
fr.search.yahoo.com
firefox.mozilla.cz
messenger.yahoo.com
hangouts.google.com
firefox.de
mozilla.fi
it.search.yahoo.com
ca.search.yahoo.com
in.search.yahoo.com
qwant.com
de.search.yahoo.com
ph.search.yahoo.com
developer.mozilla.org
id.search.yahoo.com
us.yhs4.search.yahoo.com
love.mail.ru
au.search.yahoo.com
plus.url.google.com
mozilla.hu
suche.gmx.net
mozilla.pl
hk.search.yahoo.com
ar.search.yahoo.com
vn.search.yahoo.com
plus.google.com
nl.search.yahoo.com
th.search.yahoo.com
mozilla.com
www.youtube.com
malaysia.search.yahoo.com
mozilla.ro
my.mail.ru
co.search.yahoo.com
start.toshiba.com
mozilla.lt
tw.search.yahoo.com
mozilla.sk
mozilla.si
se.search.yahoo.com
mozilla.ch
navigator-bs.gmx.fr
br.yhs4.search.yahoo.com
navigator-bs.gmx.com
cto.mail.ru
cn.bing.com
mozilla.rs
pl.search.yahoo.com
pe.search.yahoo.com
no.search.yahoo.com
e.mail.ru
ve.search.yahoo.com
us-mg6.mail.yahoo.com
extensions.aol.com
cl.search.yahoo.com
at.search.yahoo.com
hacks.mozilla.org
fi.search.yahoo.com
sg.search.yahoo.com
lite.qwant.com
hello.firefox.com
tr.search.yahoo.com
mail.google.com
images.tanks.mail.ru
dk.search.yahoo.com
mg.mail.yahoo.com
maktoob.search.yahoo.com
partnerads.ysm.yahoo.com
www.google.de
fr.yhs4.search.yahoo.com
takeout.google.com
mail.de
www.google.fr
espanol.search.yahoo.com
mx.yhs4.search.yahoo.com
firefox.no
us-mg5.mail.yahoo.com
ro.search.yahoo.com
www.google.it
activations.cdn.mozilla.net
tweetdeck.twitter.com
otvet.mail.ru
firefox.org
ch.search.yahoo.com
uk.yhs4.search.yahoo.com
scholar.google.com
www.google.es
encrypted.google.com
talkgadget.google.com
ru.search.yahoo.com
mail.aol.com
www.google.pl
answers.yahoo.com
tanks.mail.ru
suche.gmx.at
qc.search.yahoo.com
nz.search.yahoo.com
mozilla.ee
www.google.ro
help.mail.ru
gr.search.yahoo.com
www.google.ca
fr-mg42.mail.yahoo.com
bienvenido.toshiba.com
accounts.firefox.com
thunderbird.mozilla.cz
start.new.toshiba.com
search.1and1.com
poseidon.navigator-bs.gmx.com
malaysia.yhs4.search.yahoo.com
in.yhs4.search.yahoo.com
id.yhs4.search.yahoo.com
id.messenger.yahoo.com
hk.messenger.yahoo.com
www.google.sr
www.google.se
www.google.be
global.bing.com
firefox.si
co.yhs4.search.yahoo.com
se.yhs4.search.yahoo.com
navigator-bs.gmx.es
www.google.dz
www.google.bg
es-mg42.mail.yahoo.com
en-maktoob.search.yahoo.com
email.seznam.cz
br.answers.yahoo.com
Can I use comment 9 as the whitelist?
Updated list (the leading www's were missing):

www.google.com
www.bing.com
firefox-com
facebook.com
www.yahoo.com
taobao.com
youtube.com
getfirefox-com
support.google.com
kongregate.com
addons.mozilla.org
www.google.com
mozilla.jp
go.mail.ru
support.mozilla.org
mozilla.de
myaccount.google.com
www.seznam.cz
mozilla.cz
firefox.cz
www.aol.com
photos.google.com
testpilot.firefox.com
firefox-browser
br.search.yahoo.com
mail.ru
bing.com
us.search.yahoo.com
mx.search.yahoo.com
uk.search.yahoo.com
es.search.yahoo.com
help.ea.com
fr.search.yahoo.com
firefox.mozilla.cz
messenger.yahoo.com
hangouts.google.com
firefox.de
mozilla.fi
it.search.yahoo.com
ca.search.yahoo.com
in.search.yahoo.com
www.qwant.com
de.search.yahoo.com
ph.search.yahoo.com
developer.mozilla.org
id.search.yahoo.com
us.yhs4.search.yahoo.com
love.mail.ru
au.search.yahoo.com
plus.url.google.com
mozilla.hu
suche.gmx.net
mozilla.pl
hk.search.yahoo.com
ar.search.yahoo.com
vn.search.yahoo.com
plus.google.com
nl.search.yahoo.com
th.search.yahoo.com
mozilla.com
www.youtube.com
malaysia.search.yahoo.com
mozilla.ro
my.mail.ru
co.search.yahoo.com
start.toshiba.com
mozilla.lt
tw.search.yahoo.com
mozilla.sk
mozilla.si
se.search.yahoo.com
mozilla.ch
navigator-bs.gmx.fr
br.yhs4.search.yahoo.com
navigator-bs.gmx.com
cto.mail.ru
cn.bing.com
mozilla.rs
pl.search.yahoo.com
pe.search.yahoo.com
no.search.yahoo.com
e.mail.ru
ve.search.yahoo.com
us-mg6.mail.yahoo.com
extensions.aol.com
cl.search.yahoo.com
at.search.yahoo.com
hacks.mozilla.org
fi.search.yahoo.com
sg.search.yahoo.com
lite.qwant.com
hello.firefox.com
tr.search.yahoo.com
mail.google.com
images.tanks.mail.ru
dk.search.yahoo.com
mg.mail.yahoo.com
maktoob.search.yahoo.com
partnerads.ysm.yahoo.com
www.google.de
fr.yhs4.search.yahoo.com
takeout.google.com
mail.de
www.google.fr
espanol.search.yahoo.com
mx.yhs4.search.yahoo.com
firefox.no
us-mg5.mail.yahoo.com
ro.search.yahoo.com
www.google.it
activations.cdn.mozilla.net
tweetdeck.twitter.com
otvet.mail.ru
firefox.org
ch.search.yahoo.com
uk.yhs4.search.yahoo.com
scholar.google.com
www.google.es
encrypted.google.com
talkgadget.google.com
ru.search.yahoo.com
mail.aol.com
www.google.pl
answers.yahoo.com
tanks.mail.ru
suche.gmx.at
qc.search.yahoo.com
nz.search.yahoo.com
mozilla.ee
www.google.ro
help.mail.ru
gr.search.yahoo.com
www.google.ca
fr-mg42.mail.yahoo.com
bienvenido.toshiba.com
accounts.firefox.com
thunderbird.mozilla.cz
start.new.toshiba.com
search.1and1.com
poseidon.navigator-bs.gmx.com
malaysia.yhs4.search.yahoo.com
in.yhs4.search.yahoo.com
id.yhs4.search.yahoo.com
id.messenger.yahoo.com
hk.messenger.yahoo.com
www.google.sr
www.google.se
www.google.be
global.bing.com
firefox.si
co.yhs4.search.yahoo.com
se.yhs4.search.yahoo.com
navigator-bs.gmx.es
www.google.dz
www.google.bg
es-mg42.mail.yahoo.com
en-maktoob.search.yahoo.com
email.seznam.cz
br.answers.yahoo.com
(In reply to Jeremy Orem [:oremj] from comment #10)
> Can I use comment 9 as the whitelist?

I think it could work for now. Doing it this way, without regular expressions, just means more maintenance of the list, since the subdomains on some of these will probably change over time. For example, with br.yhs4.search.yahoo.com, what if it changes to yhs5? I do get why the regex approach is harder: it must not be so wide that it captures www.mywebsiteismoreawesomethangoogle.com.
In that case, can we do what I suggested in comment 8? That way we can still use regexes, but will normalize the value down.
(In reply to Jeremy Orem [:oremj] from comment #13)
> In that case, can we do what I suggested in comment 8? That way we can still
> use regexes, but will normalize the value down.

Chris ^^^ (I'm just trying to help move this along, so I can continue testing further/deeper - thanks!)
Flags: needinfo?(chrismore.bugzilla)
I will take an action item to create the regex mapping, with each regex mapped to a string that represents the domain.

For now, :oremj will use the static list of domains in comment 11.

We can change it later to a regex mapping, but let's not block on moving forward.
Flags: needinfo?(chrismore.bugzilla) → needinfo?(oremj)
Assignee: nobody → oremj
Flags: needinfo?(oremj)
:cmore, I've implemented this whitelist in https://github.com/mozilla-services/stubattribution/pull/23 r?
Flags: needinfo?(chrismore.bugzilla)
r+

Looks good to me for an MVP. We can always revise later.

Thanks!
Flags: needinfo?(chrismore.bugzilla)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Here are a few extra sources that don't have any TLDs:

google
bing
firefox-com
yahoo
yandex
ask
seznam
aol
I did some more testing today, and it looks like "facebook" as a source shows up without any TLDs, too:

Steps:
1) loaded https://www.facebook.com/Firefox/about/
2) clicked on https://mzl.la/292SfT5, which 301s to https://www.mozilla.org/en-US/firefox/new/?utm_source=facebook&utm_medium=social&utm_content=facebook-about-bio&utm_campaign=firefox
3) following through to the end-user flow on Mozilla.org (on www-demo4), I get the following exception in Sentry[0]:

could not validate attribution_code: source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook-about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist

See http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1 for the above

need-info? :cmore just to triple-check that I'm correct, here, before we add "facebook" as a source.

[0] https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/
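
For anyone reading the Sentry message, the attribution code in it is a URL-encoded query string. The Python sketch below decodes it and repeats the whitelist check; `parse_attribution_code`, `is_whitelisted`, and the two-entry `WHITELIST` (taken from the static list earlier in this bug) are hypothetical stand-ins, not the service's actual code.

```python
from urllib.parse import parse_qs, unquote

# The attribution code exactly as it appears in the Sentry exception above.
CODE = (
    "source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox"
    "%26content%3Dfacebook-about-bio%26timestamp%3D1483170532"
)

# Abbreviated stand-in for the static whitelist from earlier in this bug.
WHITELIST = {"facebook.com", "www.google.com"}


def parse_attribution_code(code: str) -> dict:
    """Percent-decode the code, then split it into key/value pairs."""
    decoded = unquote(code)  # "source=facebook&medium=social&..."
    return {key: values[0] for key, values in parse_qs(decoded).items()}


def is_whitelisted(source: str) -> bool:
    """Exact-match lookup against the static list."""
    return source in WHITELIST


fields = parse_attribution_code(CODE)
```

In this sketch `fields["source"]` is `facebook`, which is not in the list even though `facebook.com` is, consistent with the "not in whitelist" error.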
Status: RESOLVED → REOPENED
Flags: needinfo?(chrismore.bugzilla)
Resolution: FIXED → ---
(In reply to Stephen Donner [:stephend] from comment #20)
> I did some more testing today, and it looks like "facebook" as a source
> shows up without any TLDs, too:
> 
> Steps:
> 1) loaded https://www.facebook.com/Firefox/about/
> 2) clicked on https://mzl.la/292SfT5, which 301s to
> https://www.mozilla.org/en-US/firefox/new/
> ?utm_source=facebook&utm_medium=social&utm_content=facebook-about-
> bio&utm_campaign=firefox
> 3) following through to the end-user flow on Mozilla.org (on www-demo4), I
> get the following exception in Sentry[0]:
> 
> could not validate attribution_code:
> source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook
> -about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist
> 
> See
> http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1
> for the above
> 
> need-info? :cmore just to triple-check that I'm correct, here, before we add
> "facebook" as a source.
> 
> [0]
> https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/

It looks like most of the time it is facebook.com, but it's possible that people sometimes forget to add the TLD. Let's add "facebook" to the whitelist.
Flags: needinfo?(chrismore.bugzilla)
More valid whitelist domains:

r.search.yahoo.com
www.mozilla.org
www.google.com.br
www.google.com.ec
www.google.com.mx
or just whitelist them all:
^www.google.com.\w+$
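
One thing to watch if that pattern is used as written: an unescaped `.` matches any character. The Python comparison below contrasts it with a variant that escapes the dots; the escaped variant is a suggestion for illustration, not something already in the whitelist.

```python
import re

# The pattern as written above: each "." matches any single character.
AS_WRITTEN = re.compile(r"^www.google.com.\w+$")

# Hypothetical stricter variant with the dots escaped.
ESCAPED = re.compile(r"^www\.google\.com\.\w+$")


def matches(pattern: "re.Pattern[str]", source: str) -> bool:
    """True if the pattern accepts the source."""
    return pattern.match(source) is not None
```

Both patterns accept `www.google.com.br`, but only the unescaped one also accepts a string such as `wwwxgoogle-com-br`.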
Another valid whitelist source to be updated:

fx36start
oremj: another addition to the whitelist in comment 23.
Flags: needinfo?(oremj)
These are all showing up in the logs. Are they valid?
firefox,89
"contra.pentagames.net",47
vk,47
fx36start,45
"oferta.senasofiaplus.edu.co",37
"www.minijuegos.com",37
bubblewitch3,31
"king.com",31
"int.search.myway.com",25
"update.org",25
browser,24
"www.kongregate.com",24
cdn,22
"contractwarsgame.com",21
ec,21
prod4,21
snippet,17
"www.clickjogos.com.br",15
"pro.rarom.ro",14
com,13
"diggerworld.ru",12
"www.nplay.com",12
"search.seznam.cz",11
"www.macrojuegos.com",11
"www.viamichelin.fr",11
about,10
home,10
"www.miniclip.com",10
"cse.google.com",9
"duckduckgo.com",9
Flags: needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #25)
> These are all showing up in the logs, are they valid:
> firefox,89

valid

> "contra.pentagames.net",47

not valid

> vk,47

not valid (maybe in the future)

> fx36start,45

valid, that's us.

> "oferta.senasofiaplus.edu.co",37

not valid

> "www.minijuegos.com",37

not valid

> bubblewitch3,31

not valid

> "king.com",31


not valid

> "int.search.myway.com",25


not valid

> "update.org",25

not valid

> browser,24

not valid (not sure what this is)

> "www.kongregate.com",24

valid

> cdn,22

not valid. probably generic cdn traffic.

> "contractwarsgame.com",21

not valid

> ec,21

not valid

> prod4,21

not valid

> snippet,17

valid (Firefox's snippets)

> "www.clickjogos.com.br",15

not valid

> "pro.rarom.ro",14

not valid

> com,13

not valid

> "diggerworld.ru",12

not valid

> "www.nplay.com",12

not valid

> "search.seznam.cz",11


valid

> "www.macrojuegos.com",11

not valid

> "www.viamichelin.fr",11

not valid

> about,10

valid (I think this is source=about-home in this one and the next)

> home,10

valid (firefox's about-home)

> "www.miniclip.com",10


not valid

> "cse.google.com",9

valid

> "duckduckgo.com",9

valid, search partner in Firefox
Flags: needinfo?(chrismore.bugzilla)
Also, if we wanted to double-check/add more in-product utm_source values, we could grab from here:

https://dxr.mozilla.org/mozilla-central/search?q=utm_source&redirect=false
:oremj: here are two more that are valid:

firefox-dev-tools
directory-tiles
Flags: needinfo?(chrismore.bugzilla)
oremj: here's another source that will need to be whitelisted for a partnership test that we will be running soon:

softonic.com
Flags: needinfo?(oremj)
oremj: three more including comment 33:

chip.de
www.chip.de
www.softonic.com

may want to do:

^.*chip\.de$
^.*softonic\.com$

Chip is a partner in Germany.

Thanks!
The following list is the current top 20 blocked sources. Would you like me to add any while I'm at it?

oferta.senasofiaplus.edu.co
beinconnect.es
www.macrojuegos.com
contra.pentagames.net
vk
www.minijuegos.com
ok.ru
cdn
contractwarsgame.com
ec
prod4
update.org
browser
int.search.myway.com
yandex.ua
desktop
snippet
watch.nowtv.com
bubblewitch3
king.com
Flags: needinfo?(oremj) → needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #35)
> The following list is the current top 20 blocked sources. Would you like me
> to add any while I'm at it?
> 
> oferta.senasofiaplus.edu.co

no

> beinconnect.es

no

> www.macrojuegos.com

no

> contra.pentagames.net

no

> vk

no

> www.minijuegos.com

no

> ok.ru

no

> cdn


hmm.
I don't know if this is our CDN or something else. Any other values coming through on source = cdn?

> contractwarsgame.com

no

> ec

no

> prod4

no

> update.org

no

> browser

yes

> int.search.myway.com

no

> yandex.ua

yes

> desktop

yes

> snippet

yes

> watch.nowtv.com

no

> bubblewitch3

no

> king.com

no
Flags: needinfo?(chrismore.bugzilla)
oremj: also add 

^.+\.wikipedia\.org$
oremj: new partner to add:

toshiba.com
www.toshiba.com

or 

^.+\.toshiba\.com$
Flags: needinfo?(oremj)
Jeremy:

In comment 33 and comment 34, I asked for softonic.com and www.softonic.com (or a regex matching both) to be added to the whitelist. We have been running a partner experiment for the past month with source=softonic.com, and I can't find the data in attribution. So, I went to our whitelist at:

https://github.com/mozilla-services/stubattribution/blob/master/attributioncode/sourcewhitelist.go#L211

and I noticed that the whitelist entry is not a regex and is just set to www.softonic.com.

The current experiment points to this URL:

https://www.mozilla.org/firefox/new/?utm_source=softonic.com&utm_campaign=fx-download-baseline&utm_medium=referral&utm_content=fx-download-page

That URL doesn't contain the www, which is why I originally requested entries both with and without the www. It is therefore likely that we have been discarding this data.
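
A short Python sketch of the mismatch, assuming the whitelist is an exact-match set that contains only the www form. The names are hypothetical, and the narrower regex shown last is just one possible alternative to the `^.*softonic\.com$` pattern requested earlier.

```python
import re

# Assumed state of the whitelist: only the "www" form is present.
SOURCE_WHITELIST = {"www.softonic.com"}

# The pattern requested earlier in this bug.
REQUESTED = re.compile(r"^.*softonic\.com$")

# Hypothetical narrower alternative: the bare domain, optionally with "www.".
NARROW = re.compile(r"^(www\.)?softonic\.com$")


def in_exact_list(source: str) -> bool:
    """Exact-match lookup, as with a static list of domains."""
    return source in SOURCE_WHITELIST


def accepted(pattern: "re.Pattern[str]", source: str) -> bool:
    """True if the pattern accepts the source."""
    return pattern.match(source) is not None
```

The exact list rejects `softonic.com`, which is the value the experiment URL sends. Both regexes accept it, though the requested one would also accept an unrelated host such as `notsoftonic.com`.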

Can you add the softonic.com without the www (or make it a regex)?

Also, can you check the exception log to confirm that softonic.com (without www) has been blocked?

Thanks
Flags: needinfo?(oremj)
Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is showing up in the logs as blocked.

Added here: https://github.com/mozilla-services/stubattribution/pull/56

Let's start filing new bugs, github issues or github PRs for whitelist change requests. The length of this bug is starting to make tracking difficult.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Flags: needinfo?(oremj)
Resolution: --- → FIXED
(In reply to Jeremy Orem [:oremj] from comment #41)
> Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is
> showing up in the logs as blocked.
> 
> Added here: https://github.com/mozilla-services/stubattribution/pull/56
> 
> Let's start filing new bugs, github issues or github PRs for whitelist
> change requests. The length of this bug is starting to make tracking
> difficult.

Thanks, Jeremy. I will file new bugs for any updates to the whitelist.
Hey Jeremy. Sorry for another question; it is related to the above and not really a new bug.

Question: For sources that are rejected from the white list, do they get attribution values of "unknown" or are they completely thrown away?
Flags: needinfo?(oremj)
They are completely thrown away. In other words, stub attribution service gives them an unmodified stub installer.
Flags: needinfo?(oremj)
(In reply to Jeremy Orem [:oremj] from comment #44)
> They are completely thrown away. In other words, stub attribution service
> gives them an unmodified stub installer.

:oremj: the original requirements stated that non-white listed sources should be set to "(other)" instead of null.

How big of a change would this be? I can file a new bug if this is a change that can be made.
Flags: needinfo?(oremj)
Blocks: 1827985