Closed Bug 926891 Opened 11 years ago Closed 10 years ago

rehost effective_tld_names.dat for publicsuffix.org?

Categories

(Developer Services :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fubar, Assigned: nmaul)

Details

q.v. bug 924785#c7

http://publicsuffix.org/list/ links directly to effective_tld_names.dat on mxr.mozilla.org. traffic to the direct link appears to be on the order of 5.5M hits/day; blocking that traffic results in very poor client retry behavior. as :gerv says, people shouldn't be coding that address into their products.

1) look at mxr referrer logs to try and determine the worst offenders and pester them to fix their code

2) host a better, canonical URL for publicsuffix.org; mxr.m.o will eventually be replaced by dxr.m.o (in what we hope to be a less than geological time scale)
5.5M hits a day is just insane.

:fubar: Do you know what hardware publicsuffix.org itself is running on? My initial idea was to have a 302 redirect on that site, e.g. at publicsuffix.org/list.dat, so people could code in a canonical URL, and we could 302 redirect it to something beefy. But perhaps the publicsuffix.org server can't even cope with that.

The options for where we point people to, to get the data, seem to me to be:

1) mxr.mozilla.org
2) dxr.mozilla.org
3) hg.mozilla.org
4) github.com
5) direct hosting on publicsuffix.org

5) just means we have to have a copying mechanism, and it's another website to scale. Let's not go there. The use of 4) gives the problem to someone else, but perhaps depends on the official-ness of Mozilla's github.com mirrors. 3), 2) and 1) depend on what IT/Ops would prefer.

Gerv
publicsuffix.org is hosted on our static cluster (static == no db backend). it's fairly robust, with 4 webheads behind zeus. truth be told, I think I'd prefer to host it there and have a simple cronjob update the file. hg.m.o would be my next preference. let me run it by :jakem and see what he thinks.
Chatted with solarce and jakem about this a bit.  First cut, we can host it with publicsuffix.org and have zeus cache it; how often does the file change? We'll also look at what it would take to push it out to the CDN, Just In Case. 

Ideally, though, we'd like to be able to host it without people hot linking to it, but it's not clear how to do that yet without probably breaking extant valid uses; this would also go hand in hand with trying to find the worst offenders and have them fix their code so that clients don't freak out again.
The file changes perhaps once every two weeks; more importantly, it is not urgent that consumers get the updated file immediately. (There's a delay of 12-18 weeks on it shipping in Firefox, for example). A cron job which copies it across daily would be more than adequate.

Do we actually want to stop people doing download on demand? Assuming they are sensible about it (i.e. once a day, max), I don't think it's too bad.

Gerv
The question is: can we reasonably prevent people from baking it into their code while still allowing users browsing publicsuffix.org to download the file? e.g. are there any "normal" use cases where an empty referrer header might appear?

It might not be possible to do without adversely affecting valid uses, and if so then we'll move on, but it's worth looking into.
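
To make the referrer question concrete, here is a rough Python sketch of the check being discussed. It only classifies a request by its Referer value; it does not decide what to do with each class. The `SITE_HOSTS` set and the `classify_referer` name are assumptions made for illustration, not anything deployed on publicsuffix.org. The "empty" class is the ambiguous one: browsers commonly send no Referer for typed or bookmarked URLs, and command-line tools usually send none either, so blocking it would likely hit valid uses.

```python
from urllib.parse import urlsplit

# Hosts treated as "our own site". This set is an assumption for the sketch.
SITE_HOSTS = {"publicsuffix.org", "www.publicsuffix.org"}


def classify_referer(referer):
    """Classify a Referer header value as 'empty', 'site' or 'external'.

    'empty' covers a missing header as well as the "-" placeholder that
    access logs use when no Referer was sent.
    """
    if referer is None or referer.strip() in ("", "-"):
        return "empty"
    # urlsplit().hostname is lowercased, or None if no host could be parsed.
    host = urlsplit(referer.strip()).hostname
    if host in SITE_HOSTS:
        return "site"
    return "external"
```

Only the "external" class clearly identifies hot linking from another page; a hard-coded downloader in an application would normally fall into "empty", alongside legitimate direct downloads.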
fubar: so is the plan to host on publicsuffix.org and update with a cron job?

publicsuffix.org is in SVN. Would we just add the file to .svnignore and manage it separately?

Gerv
I think that's the best plan. Where should we pull the file from (it's actually getting picked up from two different places on mxr)? Do we need to do any formatting or post-processing of the file?
No formatting or post-processing.

You can get the file from whatever mechanism you prefer that pulls that path from mozilla-central. MXR, hg.mozilla.org, whatever.

Gerv
Actually, given how the deploy scripts work there's no need for the .svnignore; cron will now pull directly from hg.m.o/mozilla-central just ahead of the regularly scheduled update/deploy. 

It's currently available at http://publicsuffix.org/list/effective_tld_names.dat
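
For anyone curious what such a cron-driven pull might look like, here is a minimal Python sketch. It is not the actual deploy script: the raw-file URL and in-repo path in `SOURCE_URL`, and the `fetch_source`/`publish` names, are assumptions. The one real design point is that the new copy is written to a temporary file and renamed into place, so clients never see a half-written list.

```python
import os
import tempfile
import urllib.request

# Assumed location of the file in mozilla-central; adjust to the real path.
SOURCE_URL = (
    "https://hg.mozilla.org/mozilla-central/raw-file/tip/"
    "netwerk/dns/effective_tld_names.dat"
)


def fetch_source(url=SOURCE_URL):
    """Download the current list and return it as bytes."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def publish(data, dest):
    """Atomically install data at dest. Return True if the file changed."""
    if not data.strip():
        # A truncated or empty download should never replace a good copy.
        raise ValueError("refusing to publish an empty file")
    try:
        with open(dest, "rb") as f:
            if f.read() == data:
                return False  # nothing new since the last run
    except FileNotFoundError:
        pass
    # Write next to the destination so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the web server must read it
    os.replace(tmp, dest)
    return True
```

A daily cron entry would then call something like `publish(fetch_source(), "list/effective_tld_names.dat")` shortly before the regular deploy.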

Looks like we're up to 11M hits/day on mxr based on yesterday's numbers. Interestingly, 70% of them have referrer headers... from something on the order of 138,000 different domains. >.<
I have updated the website to link to the new URL given in comment #9, and also added a request that people have their apps download the list no more than once per day.

If you can provide a sample of referrer headers, we can try and work out what is doing this frequent downloading.

Gerv
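
For app authors, the "no more than once per day" request can be sketched roughly as follows. This is an illustrative Python example, not an official client: the cache path, the `MAX_AGE_SECONDS` constant and the `is_stale`/`get_list` names are assumptions; only the URL is the one given in this bug.

```python
import os
import time
import urllib.request

# URL as given in this bug; everything else here is an illustrative choice.
LIST_URL = "http://publicsuffix.org/list/effective_tld_names.dat"
MAX_AGE_SECONDS = 24 * 60 * 60  # "no more than once per day"


def is_stale(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the cached copy is missing or at least max_age old."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True
    return (now - mtime) >= max_age


def _download():
    """Fetch the list over HTTP and return it as bytes."""
    with urllib.request.urlopen(LIST_URL, timeout=30) as resp:
        return resp.read()


def get_list(path="effective_tld_names.dat", fetch=_download):
    """Return the list contents, downloading only when the cache is stale."""
    if is_stale(path):
        data = fetch()
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # swap in the new copy in one step
    with open(path, "rb") as f:
        return f.read()
```

Callers use `get_list()` wherever they need the data; repeated calls within a day read the local copy and generate no traffic. Since the file changes roughly every two weeks, a longer `MAX_AGE_SECONDS` would also be reasonable.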
Assignee: server-ops-webops → nmaul
Sadly, it looks like there are almost no usable Referer headers here anymore. I'm not sure what changed from January to now that would cause this. :(

From MXR hits (note the last line, for scale):

    828 "http://www.amazon.com/aan/2009-09-09/static/amazon/iframeproxy-39.html"
   1063 "https://www.google.com/"
   1120 "http://googleads.g.doubleclick.net/pagead/html/r20140520/r20140417/zrt_lookup.html"
   1148 "https://plus.google.com/u/0/_/streamwidgets/canvas"
   1222 "https://s-static.ak.facebook.com/connect/xd_arbiter/V80PAcvrynR.js?version=41"
   1233 "https://www.google.com/blank.html"
   1396 "https://plus.google.com/u/0/_/notifications/frame?sourceid=1&hl=en&origin=https%3A%2F%2Fwww.google.com&jsh=m%3B%2F_%2Fscs%2Fabc-static%2F_%2Fjs%2Fk%3Dgapi.gapi.en.TY07tiUU0tE.O%2Fm%3D__features__%2Frt%3Dj%2Fd%3D1%2Frs%3DAItRSTNfGmB_-do3YO3g20AHt3L6itPzpQ"
   1754 "http://static.ak.facebook.com/connect/xd_arbiter/V80PAcvrynR.js?version=41"
   2612 "https://www.facebook.com/"
 272031 "-"

From publicsuffix.org hits:

      2 "https://publicsuffix.org/list/"
    466 "-"
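
In case anyone wants to repeat this kind of tally, a rough Python equivalent is below. It is a sketch, not the command used to produce the numbers above, and it assumes the access logs are in Apache "combined" format (Referer and User-Agent as the last two quoted fields). The `referer_counts` and `report` names are made up for the example.

```python
import re
from collections import Counter

# Tail of a combined-format line:
#   "request line" status bytes "referer" "user agent"
COMBINED_TAIL = re.compile(
    r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"\s*$'
)


def referer_counts(lines):
    """Count how often each Referer value appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = COMBINED_TAIL.search(line)
        if match:  # lines that do not parse are skipped
            counts[match.group("referer")] += 1
    return counts


def report(counts, top=10):
    """Format the top referers in ascending order, largest count last."""
    rows = reversed(counts.most_common(top))
    return "\n".join(f'{n:7d} "{ref}"' for ref, n in rows)
```

Feeding it an open log file, e.g. `print(report(referer_counts(open(path))))` with `path` pointing at an access log, gives a listing in the same shape as the ones above. User-Agent strings containing escaped quotes would not match this simple pattern.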


Since this redirect is in place and this data was the last request on this bug, I'm going to close this out. Please let us know if there's anything still needed here. Thanks!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services