Stop blocking Googlebot and Google Translate
Categories
(developer.mozilla.org Graveyard :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: atopal, Assigned: rjohnson)
Details
(Keywords: in-triage, Whiteboard: [specification][type:bug][points=1])
What did you do?
1. Looked up the log of IP addresses blocked by Worf in the #mdn-notices Slack channel
2. Ran a reverse DNS lookup with "host [IP]"
3. Noted that we are blocking IPs that resolve to Googlebot and googleusercontent.com
What happened?
Googlebot wasn't able to connect to MDN, and no error was sent. People weren't able to use Google Translate (googleusercontent.com) to see MDN translated into their language of choice.
What should have happened?
We should match Googlebot's crawling speed. If we can't right now, we need to plan to get there. In the meantime we should send a 429 error.
We should also whitelist googleusercontent.com; it's not a crawler, and the page loads are user-initiated.
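The "send a 429 instead of silently dropping" suggestion can be sketched as a minimal, framework-agnostic sliding-window limiter. All names here are hypothetical, not Kuma's actual code; the 400/minute figure is the document-view limit discussed later in this bug:

```python
import time
from collections import defaultdict, deque

LIMIT = 400      # requests per window (Kuma's document-view limit, per comment 8)
WINDOW = 60.0    # window length in seconds

_hits = defaultdict(deque)  # client key -> timestamps of recent requests


def check_rate(client_key, now=None):
    """Return an HTTP status: 200 if the request is allowed, 429 if over the limit.

    A 429 tells well-behaved clients (like Googlebot) to slow down,
    instead of silently dropping their connections.
    """
    now = time.monotonic() if now is None else now
    q = _hits[client_key]
    # Expire timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return 429
    q.append(now)
    return 200
```

A real deployment would also set a Retry-After header on the 429 response so crawlers know when to resume.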
Is there anything else we should know?
Here's Google's explanation of how to identify the Google crawler: https://support.google.com/webmasters/answer/80553?hl=en
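The check Google describes at that link is forward-confirmed reverse DNS: reverse-resolve the IP, check the hostname's domain, then forward-resolve the hostname and confirm it maps back to the same IP. A minimal sketch (the suffix list is an assumption based on the linked page, not Kuma's actual blocker code):

```python
import socket


def is_google_crawler(ip):
    """Forward-confirmed reverse DNS check, per Google's verification docs.

    1. Reverse lookup: the PTR hostname should be under googlebot.com or google.com.
    2. Forward lookup: the hostname must resolve back to the original IP,
       which prevents spoofed PTR records from passing the check.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(hostname) == ip
    except OSError:
        return False
```

In production this result would be cached, since doing two DNS lookups per request is too slow for a hot path.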
Comment 1 • 7 years ago
I have deployed a fix that exempts googlebot.com and googleusercontent.com from blocking.
Comment 2 (Reporter) • 7 years ago
Hey Ed, according to that link, it's not just googlebot.com but also google.com. Both of those are used by the Google crawler. Can you add that?
Comment 3 (Reporter) • 7 years ago
Never mind, I saw the script, and it looks like google.com is included.
Comment 5 (Reporter) • 7 years ago
I looked at the #mdn-notices channel on Slack and haven't seen anything blocked since Wednesday. So, yes, I'd say that this bug is fixed.
That said, it seems odd that nothing was blocked since Wednesday. When I checked last week, only some of the blocked IPs resolved to Google addresses, so it's strange that there are now none at all.
Comment 6 • 7 years ago
We had a traffic spike on Tuesday March 5, of about 15,000 requests per minute. We had a second one on Tuesday March 12, again 15,000 requests per minute for about 10 minutes, from googleusercontent.com. The generic blocker did not block these requests, but it triggered Kuma's limit of 400 document requests per minute, and most requests got a 429 Too Many Requests.
From these two data points, we might see another spike of thousands of requests per minute from googleusercontent.com on March 26.
Comment 7 (Reporter) • 7 years ago
And I guess March 19?
How did we come up with the 400 documents/minute limit? Would it be worth experimenting with that? I'll file a separate bug for it.
Comment 8 • 7 years ago
I think it would make sense to keep working on this bug, so all the context is in one place.
This was added in https://github.com/mozilla/kuma/pull/4591 by the SREs; the PR gives no justification for the choice. It was increased from 200 per minute to 400 per minute while the PR was open.
There are 20 * 8 = 160 web workers, so a limit of 400 requests per minute means the whole system serves a single client at 2.5 requests per minute per web worker. I'm not sure there is value in increasing it; it feels like we should just drop the rate limit for the homepage and document views.
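Spelling out the arithmetic in the comment above (the 20 * 8 breakdown is as stated there; what the two factors count is not specified in this bug):

```python
# Capacity arithmetic from comment 8.
web_workers = 20 * 8          # total web workers, as stated in the comment
rate_limit = 400              # document requests per minute for one client
per_worker = rate_limit / web_workers

assert web_workers == 160
assert per_worker == 2.5      # requests per minute per worker for a single client
```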
Comment 9 (Reporter) • 7 years ago
Okay, let's go ahead and drop the rate limit for those pages then.
Comment 10 • 7 years ago
I've found some more context in the (Mozilla-only) downtime report for Dec 3, 2017. A scraper requested about 2,000 pages per minute, and downtime started 3 minutes later as the scraper continued to follow in-content links, and prevented the homepage from responding.
Google translate requests may come from translate.googleusercontent.com, but googleusercontent.com is also the source domain for Google Cloud projects. I don't think Google is trying to scrape MDN as fast as possible, I think this is someone (ab)using GCP.
Maybe we can log the googleusercontent.com IPs without blocking them, and report abuse as time allows. We can also tune this so fewer requests see a 429, and determine what level is acceptable without downtime. I plan to remove the homepage limit and double the document limit to 800/m.
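The "log without blocking" idea from the comment above might look like the following sketch. `should_block` and `MONITOR_ONLY_SUFFIXES` are hypothetical names for illustration, not Kuma's actual blocker:

```python
import logging
import socket

log = logging.getLogger("rate-monitor")

# Domains we suspect of abuse but don't want to hard-block yet:
# log them so abuse reports can be filed with GCP later.
MONITOR_ONLY_SUFFIXES = (".googleusercontent.com",)


def should_block(ip, over_limit):
    """Block over-limit clients, except monitor-only domains, which are
    only logged instead of blocked."""
    if not over_limit:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        hostname = None
    if hostname and hostname.endswith(MONITOR_ONLY_SUFFIXES):
        log.warning("over limit but monitor-only: %s (%s)", ip, hostname)
        return False
    return True
```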
Comment 11 (Reporter) • 7 years ago
Sounds good to me, John.
Where does the information about translate.googleusercontent.com come from?
As far as I can tell, Google uses that domain for a bunch of projects, e.g. lite pages and page caches. Here's an example:
I find it hard to find information from Google, but Adobe has some docs for their analytics product that address the issue: https://helpx.adobe.com/analytics/kb/googleusercontentcom-instances.html
If we suspect abuse of GCP, we should notify them via https://support.google.com/code/contact/compute_engine_report?visit_id=636882601739208333-3131907561&p=gce_abuse_report&rd=1
Comment 12 • 7 years ago
Sorry, there's no translate.googleusercontent.com. I'm going from the statement in the bug report "People weren't able to use Google translate (googleusercontent.com) to see MDN translated into their language of choice."
Chrome's in-browser translation feature makes a network request to translate.googleapis.com. There may be a different feature that uses googleusercontent.com.
Updated • 7 years ago
Comment 14 (Assignee) • 6 years ago
Comment 15 (Assignee) • 6 years ago
https://github.com/mozilla/kuma/pull/5466 has been deployed to production.
Updated • 6 years ago
Updated • 5 years ago