Implement the whitelist for Stub Attribution `source` field

RESOLVED FIXED

Status

Cloud Services
Operations
RESOLVED FIXED
2 years ago
10 months ago

People

(Reporter: ckprice, Assigned: oremj)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

We will need a whitelist for the `source` field used in Stub Attribution.

oremj has noted that it would be best to implement and maintain the whitelist in the service being built in https://bugzilla.mozilla.org/show_bug.cgi?id=1273940.

NI :cmore to supply the initial whitelist.
Flags: needinfo?(chrismore.bugzilla)

Comment 1

2 years ago
I have a proposed white list here in sheet "Sources with business interest":

https://docs.google.com/spreadsheets/d/1U-0JHpc3INJnBwTFkdrPqpk7yqvpEn16DJtp6d6TRiY/edit#gid=1389272572

It is based off of a few criteria:

* we are doing advertising with the organization
* we have a partnership with an organization
* it is a Mozilla property

Legal is taking this list to outside counsel to get their opinion on whether we can proceed.
Flags: needinfo?(chrismore.bugzilla)
(Assignee)

Comment 2

2 years ago
Can that list be public?
Flags: needinfo?(chrismore.bugzilla)

Comment 3

2 years ago
(In reply to Jeremy Orem [:oremj] from comment #2)
> Can that list be public?

I don't see why not. I don't want to share this specific spreadsheet, because it contains download metrics, but the list can be copy/pasted here. Let me see if legal has an update and I can paste it here.
Flags: needinfo?(chrismore.bugzilla)
NI :cmore to provide a final, public whitelist.

Note: for anything that is not in the whitelist, we'll want to send `(other)` in the source field.
Flags: needinfo?(chrismore.bugzilla)

Comment 5

2 years ago
Here's the preliminary list we can use to build out the white list pattern matching on the source= field.

All these values will be in the source= value being sent via www.mozilla.org to the stub service:

I wrote these as regular expressions to capture the patterns these organizations use in their DNS.

Our sites and Mozilla community sites:

^.*mozilla.*.+$
^.*firefox.*.+$

Search engines + SEM:

^.+\.google\..+$
^.+\.bing\..+$
^.+\.yahoo\..+$
^.*yandex\..+$
^.+\.baidu\..+$
^.+\.taobao\..+$

Other advertising:

^.+\.youtube\..+$
^.+\.facebook\..+$
^.*twitter\..+$
^.*united-internet\..+$
^.+\.and1\..+$
^.+\.gmx\..+$
^www\.mail\..+$
^.+\.aol\..+$
^.+\.qwant\..+$
^.+\.seznam\.cz$
^.+\.toshiba\..+$
^.+\.kongregate\..+$
^.+\.ea\..+$

There are also a few utm_source values that need to be whitelisted, as they are our vanity domains and in-browser referral traffic.

^firefox\-com$
^getfirefox\-com$
^firefox\-browser$

Anything that doesn't match the regex patterns above should have a source equal to this string: "(other)"
Flags: needinfo?(chrismore.bugzilla)
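A minimal sketch of the rule described in this comment (keep the source if a pattern matches, otherwise send "(other)"), assuming Python's `re` module. The names `SOURCE_PATTERNS` and `normalize_source` are hypothetical, only a subset of the patterns is shown, and this is not the stub attribution service's actual code.

```python
import re

# Subset of the patterns proposed above, copied as written.
SOURCE_PATTERNS = [
    re.compile(r"^.+\.google\..+$"),
    re.compile(r"^.*yandex\..+$"),
    re.compile(r"^.+\.seznam\.cz$"),
    re.compile(r"^firefox\-com$"),
]

OTHER = "(other)"


def normalize_source(source: str) -> str:
    """Return the source unchanged if any pattern matches, else "(other)"."""
    if any(pattern.match(source) for pattern in SOURCE_PATTERNS):
        return source
    return OTHER
```

Note that a pattern such as `^.+\.google\..+$` requires at least one label before `.google`, so a bare `google.com` would fall through to "(other)" under this sketch.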
NI :oremj -- does this look okay to you?
Flags: needinfo?(oremj)
(Assignee)

Comment 7

2 years ago
Is it possible to make the regexes more restrictive? As it is now, it still seems really open. For example, I could send something like badthings.ea.somesiteidontwanttrackingme.com.

Do we have a current source list? How many unique sources do we have? I'd rather whitelist each source explicitly than have regexes with ".+" or ".*" in them.
Flags: needinfo?(oremj)
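The concern in comment 7 can be illustrated with a short sketch, assuming Python's `re` module. The pattern is the `ea` entry from comment 5; `ALLOWED_SOURCES` and `is_allowed_exact` are hypothetical names for the exact-match alternative.

```python
import re

# Pattern for "ea" as proposed in comment 5.
loose = re.compile(r"^.+\.ea\..+$")

# Exact-match alternative: only listed hostnames are accepted.
ALLOWED_SOURCES = {"help.ea.com"}


def is_allowed_exact(source: str) -> bool:
    """Accept a source only if it appears verbatim in the list."""
    return source in ALLOWED_SOURCES


# The loose pattern accepts both the intended host and the unwanted one.
intended = "help.ea.com"
unwanted = "badthings.ea.somesiteidontwanttrackingme.com"
loose_accepts_intended = loose.match(intended) is not None
loose_accepts_unwanted = loose.match(unwanted) is not None
```

With the exact list, `is_allowed_exact(unwanted)` is false while `is_allowed_exact(intended)` is true, at the cost of maintaining every hostname by hand.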
(Assignee)

Comment 8

2 years ago
If that's not possible, a mapping might be a good choice. Example:

^.+\.google\..+$ => "google"
^.+\.bing\..+$ => "bing"
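A rough sketch of the mapping idea, assuming Python's `re` module. `SOURCE_MAP` and `canonical_source` are hypothetical names; the two patterns are the ones from the example above.

```python
import re
from typing import Optional

# Each pattern maps to a fixed, normalized source name.
SOURCE_MAP = [
    (re.compile(r"^.+\.google\..+$"), "google"),
    (re.compile(r"^.+\.bing\..+$"), "bing"),
]


def canonical_source(source: str) -> Optional[str]:
    """Return the normalized name for the first matching pattern, else None."""
    for pattern, name in SOURCE_MAP:
        if pattern.match(source):
            return name
    return None
```

The point of the mapping is that the recorded value always comes from a fixed set of names: even if a pattern matches an unexpected hostname, the arbitrary string itself is never passed along.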

Comment 9

2 years ago
The regexes above spit out these sources (from Google Analytics):

www.google.com
www.bing.com
firefox-com
facebook.com
www.yahoo.com
taobao.com
youtube.com
getfirefox-com
support.google.com
kongregate.com
addons.mozilla.org
www.google.com
mozilla.jp
go.mail.ru
support.mozilla.org
mozilla.de
myaccount.google.com
www.seznam.cz
mozilla.cz
firefox.cz
www.aol.com
photos.google.com
testpilot.firefox.com
firefox-browser
br.search.yahoo.com
mail.ru
bing.com
us.search.yahoo.com
mx.search.yahoo.com
uk.search.yahoo.com
es.search.yahoo.com
help.ea.com
fr.search.yahoo.com
firefox.mozilla.cz
messenger.yahoo.com
hangouts.google.com
firefox.de
mozilla.fi
it.search.yahoo.com
ca.search.yahoo.com
in.search.yahoo.com
qwant.com
de.search.yahoo.com
ph.search.yahoo.com
developer.mozilla.org
id.search.yahoo.com
us.yhs4.search.yahoo.com
love.mail.ru
au.search.yahoo.com
plus.url.google.com
mozilla.hu
suche.gmx.net
mozilla.pl
hk.search.yahoo.com
ar.search.yahoo.com
vn.search.yahoo.com
plus.google.com
nl.search.yahoo.com
th.search.yahoo.com
mozilla.com
www.youtube.com
malaysia.search.yahoo.com
mozilla.ro
my.mail.ru
co.search.yahoo.com
start.toshiba.com
mozilla.lt
tw.search.yahoo.com
mozilla.sk
mozilla.si
se.search.yahoo.com
mozilla.ch
navigator-bs.gmx.fr
br.yhs4.search.yahoo.com
navigator-bs.gmx.com
cto.mail.ru
cn.bing.com
mozilla.rs
pl.search.yahoo.com
pe.search.yahoo.com
no.search.yahoo.com
e.mail.ru
ve.search.yahoo.com
us-mg6.mail.yahoo.com
extensions.aol.com
cl.search.yahoo.com
at.search.yahoo.com
hacks.mozilla.org
fi.search.yahoo.com
sg.search.yahoo.com
lite.qwant.com
hello.firefox.com
tr.search.yahoo.com
mail.google.com
images.tanks.mail.ru
dk.search.yahoo.com
mg.mail.yahoo.com
maktoob.search.yahoo.com
partnerads.ysm.yahoo.com
www.google.de
fr.yhs4.search.yahoo.com
takeout.google.com
mail.de
www.google.fr
espanol.search.yahoo.com
mx.yhs4.search.yahoo.com
firefox.no
us-mg5.mail.yahoo.com
ro.search.yahoo.com
www.google.it
activations.cdn.mozilla.net
tweetdeck.twitter.com
otvet.mail.ru
firefox.org
ch.search.yahoo.com
uk.yhs4.search.yahoo.com
scholar.google.com
www.google.es
encrypted.google.com
talkgadget.google.com
ru.search.yahoo.com
mail.aol.com
www.google.pl
answers.yahoo.com
tanks.mail.ru
suche.gmx.at
qc.search.yahoo.com
nz.search.yahoo.com
mozilla.ee
www.google.ro
help.mail.ru
gr.search.yahoo.com
www.google.ca
fr-mg42.mail.yahoo.com
bienvenido.toshiba.com
accounts.firefox.com
thunderbird.mozilla.cz
start.new.toshiba.com
search.1and1.com
poseidon.navigator-bs.gmx.com
malaysia.yhs4.search.yahoo.com
in.yhs4.search.yahoo.com
id.yhs4.search.yahoo.com
id.messenger.yahoo.com
hk.messenger.yahoo.com
www.google.sr
www.google.se
www.google.be
global.bing.com
firefox.si
co.yhs4.search.yahoo.com
se.yhs4.search.yahoo.com
navigator-bs.gmx.es
www.google.dz
www.google.bg
es-mg42.mail.yahoo.com
en-maktoob.search.yahoo.com
email.seznam.cz
br.answers.yahoo.com
(Assignee)

Comment 10

2 years ago
Can I use comment 9 as the whitelist?

Comment 11

2 years ago
Updated list (some leading www's were missing):

www.google.com
www.bing.com
firefox-com
facebook.com
www.yahoo.com
taobao.com
youtube.com
getfirefox-com
support.google.com
kongregate.com
addons.mozilla.org
www.google.com
mozilla.jp
go.mail.ru
support.mozilla.org
mozilla.de
myaccount.google.com
www.seznam.cz
mozilla.cz
firefox.cz
www.aol.com
photos.google.com
testpilot.firefox.com
firefox-browser
br.search.yahoo.com
mail.ru
bing.com
us.search.yahoo.com
mx.search.yahoo.com
uk.search.yahoo.com
es.search.yahoo.com
help.ea.com
fr.search.yahoo.com
firefox.mozilla.cz
messenger.yahoo.com
hangouts.google.com
firefox.de
mozilla.fi
it.search.yahoo.com
ca.search.yahoo.com
in.search.yahoo.com
www.qwant.com
de.search.yahoo.com
ph.search.yahoo.com
developer.mozilla.org
id.search.yahoo.com
us.yhs4.search.yahoo.com
love.mail.ru
au.search.yahoo.com
plus.url.google.com
mozilla.hu
suche.gmx.net
mozilla.pl
hk.search.yahoo.com
ar.search.yahoo.com
vn.search.yahoo.com
plus.google.com
nl.search.yahoo.com
th.search.yahoo.com
mozilla.com
www.youtube.com
malaysia.search.yahoo.com
mozilla.ro
my.mail.ru
co.search.yahoo.com
start.toshiba.com
mozilla.lt
tw.search.yahoo.com
mozilla.sk
mozilla.si
se.search.yahoo.com
mozilla.ch
navigator-bs.gmx.fr
br.yhs4.search.yahoo.com
navigator-bs.gmx.com
cto.mail.ru
cn.bing.com
mozilla.rs
pl.search.yahoo.com
pe.search.yahoo.com
no.search.yahoo.com
e.mail.ru
ve.search.yahoo.com
us-mg6.mail.yahoo.com
extensions.aol.com
cl.search.yahoo.com
at.search.yahoo.com
hacks.mozilla.org
fi.search.yahoo.com
sg.search.yahoo.com
lite.qwant.com
hello.firefox.com
tr.search.yahoo.com
mail.google.com
images.tanks.mail.ru
dk.search.yahoo.com
mg.mail.yahoo.com
maktoob.search.yahoo.com
partnerads.ysm.yahoo.com
www.google.de
fr.yhs4.search.yahoo.com
takeout.google.com
mail.de
www.google.fr
espanol.search.yahoo.com
mx.yhs4.search.yahoo.com
firefox.no
us-mg5.mail.yahoo.com
ro.search.yahoo.com
www.google.it
activations.cdn.mozilla.net
tweetdeck.twitter.com
otvet.mail.ru
firefox.org
ch.search.yahoo.com
uk.yhs4.search.yahoo.com
scholar.google.com
www.google.es
encrypted.google.com
talkgadget.google.com
ru.search.yahoo.com
mail.aol.com
www.google.pl
answers.yahoo.com
tanks.mail.ru
suche.gmx.at
qc.search.yahoo.com
nz.search.yahoo.com
mozilla.ee
www.google.ro
help.mail.ru
gr.search.yahoo.com
www.google.ca
fr-mg42.mail.yahoo.com
bienvenido.toshiba.com
accounts.firefox.com
thunderbird.mozilla.cz
start.new.toshiba.com
search.1and1.com
poseidon.navigator-bs.gmx.com
malaysia.yhs4.search.yahoo.com
in.yhs4.search.yahoo.com
id.yhs4.search.yahoo.com
id.messenger.yahoo.com
hk.messenger.yahoo.com
www.google.sr
www.google.se
www.google.be
global.bing.com
firefox.si
co.yhs4.search.yahoo.com
se.yhs4.search.yahoo.com
navigator-bs.gmx.es
www.google.dz
www.google.bg
es-mg42.mail.yahoo.com
en-maktoob.search.yahoo.com
email.seznam.cz
br.answers.yahoo.com

Comment 12

2 years ago
(In reply to Jeremy Orem [:oremj] from comment #10)
> Can I use comment 9 as the whitelist?

I think it could work for now. I just think that doing it this way, without regular expressions, means more maintenance of the list, since the subdomains on some of these will probably change over time. For example, with br.yhs4.search.yahoo.com, what if it changes to yhs5? I get that the regex approach is harder to get right without going so wide that you capture www.mywebsiteismoreawesomethangoogle.com.
(Assignee)

Comment 13

2 years ago
In that case, can we do what I suggested in comment 8? That way we can still use regexes, but will normalize the value down.
(In reply to Jeremy Orem [:oremj] from comment #13)
> In that case, can we do what I suggested in comment 8? That way we can still
> use regexes, but will normalize the value down.

Chris ^^^ (I'm just trying to help move this along, so I can continue testing further/deeper - thanks!)
Flags: needinfo?(chrismore.bugzilla)

Comment 15

2 years ago
I will take an action to do the regex mapping to a string to represent the domain.

For now, :oremj will use the static list of domains in comment 11.

We can change it later to a regex mapping, but let's not block on moving forward.
Flags: needinfo?(chrismore.bugzilla) → needinfo?(oremj)
(Assignee)

Updated

2 years ago
Assignee: nobody → oremj
Flags: needinfo?(oremj)
(Assignee)

Comment 16

2 years ago
:cmore, I've implemented this whitelist in https://github.com/mozilla-services/stubattribution/pull/23 r?
Flags: needinfo?(chrismore.bugzilla)
r+

Looks good to me for an MVP. We can always revise later.

Thanks!
Flags: needinfo?(chrismore.bugzilla)
(Assignee)

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Here's a few extra sources that don't have any TLDs:

google
bing
firefox-com
yahoo
yandex
ask
seznam
aol
I did some more testing today, and it looks like "facebook" as a source shows up without any TLDs, too:

Steps:
1) loaded https://www.facebook.com/Firefox/about/
2) clicked on https://mzl.la/292SfT5, which 301s to https://www.mozilla.org/en-US/firefox/new/?utm_source=facebook&utm_medium=social&utm_content=facebook-about-bio&utm_campaign=firefox
3) following through to the end-user flow on Mozilla.org (on www-demo4), I get the following exception in Sentry[0]:

could not validate attribution_code: source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook-about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist

See http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1 for the above

need-info? :cmore just to triple-check that I'm correct, here, before we add "facebook" as a source.

[0] https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/
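For illustration, the failure above can be reproduced with a small sketch. It assumes the attribution code is a percent-encoded query string, which is inferred only from the Sentry message; `WHITELIST` and `validate_attribution_code` are hypothetical names, not the service's actual API.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical whitelist containing only the TLD form of the source.
WHITELIST = {"facebook.com", "www.google.com"}


def validate_attribution_code(code: str, whitelist=WHITELIST) -> dict:
    """Decode the code and reject it if its source is not whitelisted."""
    # Percent-decode, then split "key=value&key=value" into a flat dict.
    fields = {k: v[0] for k, v in parse_qs(unquote(code)).items()}
    source = fields.get("source", "")
    if source not in whitelist:
        raise ValueError(f"source: {source} is not in whitelist")
    return fields


# The attribution code from the Sentry exception above.
CODE = (
    "source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox"
    "%26content%3Dfacebook-about-bio%26timestamp%3D1483170532"
)
```

Calling `validate_attribution_code(CODE)` raises because the decoded source is `facebook`, not `facebook.com`; passing `WHITELIST | {"facebook"}` as the second argument lets the same code through.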
Status: RESOLVED → REOPENED
Flags: needinfo?(chrismore.bugzilla)
Resolution: FIXED → ---
(In reply to Stephen Donner [:stephend] from comment #20)
> I did some more testing today, and it looks like "facebook" as a source
> shows up without any TLDs, too:
> 
> Steps:
> 1) loaded https://www.facebook.com/Firefox/about/
> 2) clicked on https://mzl.la/292SfT5, which 301s to
> https://www.mozilla.org/en-US/firefox/new/
> ?utm_source=facebook&utm_medium=social&utm_content=facebook-about-
> bio&utm_campaign=firefox
> 3) following through to the end-user flow on Mozilla.org (on www-demo4), I
> get the following exception in Sentry[0]:
> 
> could not validate attribution_code:
> source%3Dfacebook%26medium%3Dsocial%26campaign%3Dfirefox%26content%3Dfacebook
> -about-bio%26timestamp%3D1483170532: source: facebook is not in whitelist
> 
> See
> http://www.webpagetest.org/result/161231_D8_A11/1/details/#step1_request1
> for the above
> 
> need-info? :cmore just to triple-check that I'm correct, here, before we add
> "facebook" as a source.
> 
> [0]
> https://sentry.prod.mozaws.net/operations/stub_attribution-dev/issues/376577/

It looks like most of the time it is facebook.com, but it is possible that people sometimes forget to add the TLD. Let's add "facebook" to the whitelist.
Flags: needinfo?(chrismore.bugzilla)
More valid whitelist domains:

r.search.yahoo.com
www.mozilla.org
www.google.com.br
www.google.com.ec
www.google.com.mx
or just whitelist them all:
^www.google.com.\w+$
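One caveat about the pattern above: as written, its dots are unescaped, so each one matches any character. A short sketch, assuming Python's `re` module; the escaped variant and the test string `wwwXgoogleXcomXbr` are illustrative assumptions, not something proposed in this bug.

```python
import re

# Pattern exactly as written above: "." matches any single character.
as_written = re.compile(r"^www.google.com.\w+$")

# Variant with the dots escaped so they only match a literal ".".
escaped = re.compile(r"^www\.google\.com\.\w+$")


def matches(pattern: re.Pattern, source: str) -> bool:
    """Return True if the whole source string matches the pattern."""
    return pattern.match(source) is not None
```

Both patterns accept `www.google.com.br`, but only the unescaped one also accepts a string such as `wwwXgoogleXcomXbr`.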
Another valid whitelist source to be updated:

fx36start
oremj: another addition to the whitelist in comment 23.
Flags: needinfo?(oremj)
(Assignee)

Comment 25

a year ago
These are all showing up in the logs, are they valid:
firefox,89
"contra.pentagames.net",47
vk,47
fx36start,45
"oferta.senasofiaplus.edu.co",37
"www.minijuegos.com",37
bubblewitch3,31
"king.com",31
"int.search.myway.com",25
"update.org",25
browser,24
"www.kongregate.com",24
cdn,22
"contractwarsgame.com",21
ec,21
prod4,21
snippet,17
"www.clickjogos.com.br",15
"pro.rarom.ro",14
com,13
"diggerworld.ru",12
"www.nplay.com",12
"search.seznam.cz",11
"www.macrojuegos.com",11
"www.viamichelin.fr",11
about,10
home,10
"www.miniclip.com",10
"cse.google.com",9
"duckduckgo.com",9
(Assignee)

Updated

a year ago
Flags: needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #25)
> These are all showing up in the logs, are they valid:
> firefox,89

valid

> "contra.pentagames.net",47

not valid

> vk,47

not valid (maybe in the future)

> fx36start,45

valid, that's us.

> "oferta.senasofiaplus.edu.co",37

not valid

> "www.minijuegos.com",37

not valid

> bubblewitch3,31

not valid

> "king.com",31


not valid

> "int.search.myway.com",25


not valid

> "update.org",25

not valid

> browser,24

not valid (not sure what this is)

> "www.kongregate.com",24

valid

> cdn,22

not valid. probably generic cdn traffic.

> "contractwarsgame.com",21

not valid

> ec,21

not valid

> prod4,21

not valid

> snippet,17

valid (Firefox's snippets)

> "www.clickjogos.com.br",15

not valid

> "pro.rarom.ro",14

not valid

> com,13

not valid

> "diggerworld.ru",12

not valid

> "www.nplay.com",12

not valid

> "search.seznam.cz",11


valid

> "www.macrojuegos.com",11

not valid

> "www.viamichelin.fr",11

not valid

> about,10

valid (I think this is source=about-home in this one and the next)

> home,10

valid (firefox's about-home)

> "www.miniclip.com",10


not valid

> "cse.google.com",9

valid

> "duckduckgo.com",9

valid, search partner in Firefox
Flags: needinfo?(chrismore.bugzilla)
Sorry, one more, for firefox-dev-tools, which comes from:

https://dxr.mozilla.org/mozilla-central/search?q=firefox-dev-tools&redirect=true
pref("devtools.devedition.promo.url", "https://www.mozilla.org/firefox/developer/?utm_source=firefox-dev-tools&utm_medium=firefox-browser&utm_content=betadoorhanger");
Flags: needinfo?(oremj)
Flags: needinfo?(chrismore.bugzilla)
Also, if we wanted to double-check/add more in-product utm_source values, we could grab from here:

https://dxr.mozilla.org/mozilla-central/search?q=utm_source&redirect=false
:oremj: here's two more that are valid:

firefox-dev-tools
directory-tiles
Flags: needinfo?(chrismore.bugzilla)
oremj: here's another source that will need to be whitelisted for a partnership test that we will be running soon:

softonic.com
Flags: needinfo?(oremj)
oremj: three more including comment 33:

chip.de
www.chip.de
www.softonic.com

may want to do:

^.*chip\.de$
^.*softonic\.com$

Chip is a partner in Germany.

Thanks!
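A quick check of what the proposed chip.de pattern accepts, assuming Python's `re` module. The `stricter` variant and the hostname `microchip.de` are hypothetical and used only to illustrate the difference.

```python
import re

# Pattern as proposed above: any prefix at all, then "chip.de".
proposed = re.compile(r"^.*chip\.de$")

# Hypothetical stricter variant: bare domain, or a subdomain ending in ".".
stricter = re.compile(r"^(.+\.)?chip\.de$")


def accepted(pattern: re.Pattern, source: str) -> bool:
    """Return True if the source matches the pattern."""
    return pattern.match(source) is not None
```

Both patterns accept `chip.de` and `www.chip.de`. The proposed one also accepts any hostname that merely ends in `chip.de`, such as `microchip.de`, which the stricter variant rejects.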
(Assignee)

Comment 35

a year ago
The following list is the current top 20 blocked sources. Would you like me to add any while I'm at it?

oferta.senasofiaplus.edu.co
beinconnect.es
www.macrojuegos.com
contra.pentagames.net
vk
www.minijuegos.com
ok.ru
cdn
contractwarsgame.com
ec
prod4
update.org
browser
int.search.myway.com
yandex.ua
desktop
snippet
watch.nowtv.com
bubblewitch3
king.com
Flags: needinfo?(oremj) → needinfo?(chrismore.bugzilla)
(In reply to Jeremy Orem [:oremj] from comment #35)
> The following list is the current top 20 blocked sources. Would you like me
> to add any while I'm at it?
> 
> oferta.senasofiaplus.edu.co

no

> beinconnect.es

no

> www.macrojuegos.com

no

> contra.pentagames.net

no

> vk

no

> www.minijuegos.com

no

> ok.ru

no

> cdn


hmm.
I don't know if this is our CDN or something else. Any other values coming through on source = cdn?

> contractwarsgame.com

no

> ec

no

> prod4

no

> update.org

no

> browser

yes

> int.search.myway.com

no

> yandex.ua

yes

> desktop

yes

> snippet

yes

> watch.nowtv.com

no

> bubblewitch3

no

> king.com

no
Flags: needinfo?(chrismore.bugzilla)
oremj: also add 

^.+\.wikipedia\.org$
oremj: new partner to add:

toshiba.com
www.toshiba.com

or 

^.+\.toshiba\.com$
Flags: needinfo?(oremj)
Jeremy:

In comment 33 and comment 34, I asked for softonic.com and www.softonic.com to be added to the whitelist (or a regex matching both). We have been running a partner experiment for the past month with source=softonic.com, and I can't find the data in attribution. So, I went to our whitelist at:

https://github.com/mozilla-services/stubattribution/blob/master/attributioncode/sourcewhitelist.go#L211

and I noticed that the whitelist entry is not a regex and is just set to www.softonic.com.

The current experiment points to this URL:

https://www.mozilla.org/firefox/new/?utm_source=softonic.com&utm_campaign=fx-download-baseline&utm_medium=referral&utm_content=fx-download-page

which doesn't contain the www, which is why I originally requested with and without the www to ensure this doesn't happen. That means it is likely that we have been discarding this data.

Can you add the softonic.com without the www (or make it a regex)?

Also, can you check the exception log to confirm that softonic.com (without www) has been blocked?

Thanks
Flags: needinfo?(oremj)
(Assignee)

Comment 41

a year ago
Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is showing up in the logs as blocked.

Added here: https://github.com/mozilla-services/stubattribution/pull/56

Let's start filing new bugs, github issues or github PRs for whitelist change requests. The length of this bug is starting to make tracking difficult.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Flags: needinfo?(oremj)
Resolution: --- → FIXED
(In reply to Jeremy Orem [:oremj] from comment #41)
> Sorry, I was thrown off by comment 34. I've confirmed that softonic.com is
> showing up in the logs as blocked.
> 
> Added here: https://github.com/mozilla-services/stubattribution/pull/56
> 
> Let's start filing new bugs, github issues or github PRs for whitelist
> change requests. The length of this bug is starting to make tracking
> difficult.

Thanks, Jeremy. I will file new bugs for any updates to the whitelist.
Hey Jeremy. Sorry for another question; it is related to the above and not really a new bug.

Question: For sources that are rejected from the white list, do they get attribution values of "unknown" or are they completely thrown away?
Flags: needinfo?(oremj)
(Assignee)

Comment 44

a year ago
They are completely thrown away. In other words, stub attribution service gives them an unmodified stub installer.
Flags: needinfo?(oremj)
(In reply to Jeremy Orem [:oremj] from comment #44)
> They are completely thrown away. In other words, stub attribution service
> gives them an unmodified stub installer.

:oremj: the original requirements stated that non-white listed sources should be set to "(other)" instead of null.

How big of a change would this be? I can file a new bug if this is a change that can be made.
Flags: needinfo?(oremj)
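To make the difference between the two behaviors concrete, here is a small sketch assuming a plain exact-match whitelist. `WHITELIST`, `source_dropped`, and `source_other` are hypothetical names and do not reflect how the service is actually structured.

```python
from typing import Optional

# Hypothetical exact-match whitelist.
WHITELIST = {"www.google.com", "softonic.com"}


def source_dropped(source: str) -> Optional[str]:
    """Behavior described in comment 44: unknown sources yield no attribution."""
    return source if source in WHITELIST else None


def source_other(source: str) -> str:
    """Behavior from the original requirements: unknown sources become "(other)"."""
    return source if source in WHITELIST else "(other)"
```

For a whitelisted source both functions return the source unchanged; for anything else the first returns `None` (no attribution recorded) and the second returns the "(other)" placeholder.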