Closed Bug 1532736 Opened 7 years ago Closed 6 years ago

Stop blocking Googlebot and Google Translate

Categories

(developer.mozilla.org Graveyard :: General, enhancement, P1)

All
Other
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: atopal, Assigned: rjohnson)

Details

(Keywords: in-triage, Whiteboard: [specification][type:bug][points=1])

What did you do?

1. Looked up the log of IP addresses blocked by Worf in the #mdn-notices Slack channel
2. Ran a reverse DNS lookup with "host [IP]" (a rough Python equivalent is sketched after this list)
3. Noted that we are blocking IPs that resolve to Googlebot and googleusercontent.com
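
For reference, this is roughly what that "host [IP]" lookup does in Python (standard library only; the address below is made up for illustration):

    import socket

    ip = "66.249.66.1"  # hypothetical address of the kind seen in the block log
    hostname, aliases, addresses = socket.gethostbyaddr(ip)  # reverse DNS
    print(hostname)  # e.g. "crawl-66-249-66-1.googlebot.com"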

What happened?

Googlebot wasn't able to connect to MDN, and no error response was sent. People weren't able to use Google Translate (googleusercontent.com) to see MDN translated into their language of choice.

What should have happened?

We should match Googlebot's crawling speed. If we can't right now, we need to plan to get there. In the meantime, we should send a 429 error.
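
For illustration, here is a minimal sketch of what "send a 429" could look like in a Django app like Kuma, assuming the django-ratelimit package; the view name and the 400/m rate are illustrative, not Kuma's actual configuration:

    from django.http import HttpResponse
    from ratelimit.decorators import ratelimit
    from ratelimit.exceptions import Ratelimited

    @ratelimit(key="ip", rate="400/m", block=True)
    def document(request, slug):
        # Illustrative document view, capped per client IP.
        ...

    # Ratelimited subclasses PermissionDenied, so a custom 403 handler
    # (wired up via handler403 in the root urls.py) can answer 429 instead.
    def permission_denied(request, exception=None):
        if isinstance(exception, Ratelimited):
            return HttpResponse("Too Many Requests", status=429)
        return HttpResponse("Forbidden", status=403)

Googlebot backs off and retries when it sees a 429, so this degrades much more gracefully than a silent block.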

We should also whitelist googleusercontent.com; it's not a crawler, and the page loads are user-initiated.

Is there anything else we should know?

Here's Google's explanation of how to identify the Google crawler: https://support.google.com/webmasters/answer/80553?hl=en
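
That page boils down to: reverse-resolve the IP, check the domain, then forward-resolve the hostname and confirm it maps back to the same IP, so a spoofed reverse record doesn't pass. A sketch of that check (standard library only):

    import socket

    def is_google_crawler(ip):
        # Step 1: reverse DNS on the client IP.
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        # Step 2: the hostname must be under googlebot.com or google.com,
        # per the support article.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 3: forward-resolve the hostname and confirm it maps back
        # to the original IP.
        try:
            return ip in socket.gethostbyname_ex(hostname)[2]
        except socket.gaierror:
            return False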

I have deployed a fix that ignores googlebot.com and googleusercontent.com.
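
The actual change lives in the blocker script, which isn't public; as a rough guess at its shape, the ignore rule is presumably a reverse-DNS suffix check before an IP becomes eligible for blocking, something like:

    import socket

    NEVER_BLOCK = (".googlebot.com", ".googleusercontent.com")

    def blockable(ip):
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return True  # no reverse record: stays blockable
        return not hostname.endswith(NEVER_BLOCK)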

Hey Ed, according to that link, it's not just googlebot.com but also google.com. Both of those are used by the Google crawler. Can you add that?

Flags: needinfo?(limed)

Never mind, I saw the script, and it looks like google.com is included.

So I'm assuming we're good here? :)

Flags: needinfo?(limed)

I looked at the #mdn-notices channel on Slack and haven't seen anything blocked since Wednesday. So, yes, I'd say that this bug is fixed.

That said, it seems odd that nothing has been blocked since Wednesday. When I checked last week, only some of the blocked IPs resolved to Google addresses; it seems odd that there are none at all now.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

We had a traffic spike on Tuesday, March 5, of about 15,000 requests per minute. We had a second one on Tuesday, March 12, again about 15,000 requests per minute for about 10 minutes, from googleusercontent.com. The generic blocker did not block these requests, but they triggered Kuma's limit of 400 document requests per minute, and most of them got a 429 Too Many Requests.

From these two data points, we might see another spike of thousands of requests per minute from googleusercontent.com on March 26.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

And I guess March 19?

How did we come up with the 400 documents/minute limit? Would it be worth experimenting with that? I'll file a separate bug for it.

I think it would make sense to keep working on this bug, so all the context is in one place.

This was added in https://github.com/mozilla/kuma/pull/4591 by the SREs; there is no justification for the choice in the PR. It was increased from 200 per minute to 400 per minute while the PR was open.

There are 20 * 8 = 160 web workers, so a limit of 400 requests per minute means the whole system serves a single client at 400 / 160 = 2.5 requests per minute per web worker. I'm not sure there is value in increasing it; it feels like we should just drop the rate limit for the homepage and document views.

Okay, let's go ahead and drop the rate limit for those pages then.

I've found some more context in the (Mozilla-only) downtime report for Dec 3, 2017. A scraper requested about 2,000 pages per minute; downtime started 3 minutes later as the scraper continued to follow in-content links and prevented the homepage from responding.

Google Translate requests may come from translate.googleusercontent.com, but googleusercontent.com is also the source domain for Google Cloud projects. I don't think Google is trying to scrape MDN as fast as possible; I think this is someone (ab)using GCP.

Maybe we can log the googleusercontent.com IPs without blocking them, and report abuse as time allows. We can also tune this so fewer requests see a 429, and determine what level is acceptable without downtime. I plan to remove the homepage limit and double the document limit to 800/m.
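
A sketch of the "log, don't block" part (the logger name and function are hypothetical, not actual Kuma code):

    import logging
    import socket

    log = logging.getLogger("kuma.ratelimits")  # hypothetical logger name

    def note_gcp_source(ip):
        # Record googleusercontent.com clients that hit the document limit
        # instead of blocking them, so we can file abuse reports with
        # Google as time allows.
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return
        if hostname.endswith(".googleusercontent.com"):
            log.info("GCP-hosted client over limit: %s (%s)", ip, hostname)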

Sounds good to me, John.

Where does the information about translate.googleusercontent.com come from?

As far as I can tell, Google uses that domain for a bunch of projects, e.g. Lite pages and page caches. Here's an example:

https://icl.googleusercontent.com/?lite_url=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify&re=1&ts=1552663281&sig=ACgcqhpm5j2c8qb7G9NCBImOfZ42zaAIjA

It's hard to find information from Google, but Adobe has some docs for its analytics product that address the issue: https://helpx.adobe.com/analytics/kb/googleusercontentcom-instances.html

If we suspect abuse of GCP, we should notify them via https://support.google.com/code/contact/compute_engine_report?visit_id=636882601739208333-3131907561&p=gce_abuse_report&rd=1

Sorry, there's no translate.googleusercontent.com. I'm going from the statement in the bug report "People weren't able to use Google translate (googleusercontent.com) to see MDN translated into their language of choice."

Chrome's in-browser translation feature makes a network request to translate.googleapis.com. There may be a different feature that uses googleusercontent.com.

Assignee: nobody → rjohnson
Keywords: in-triage
Priority: -- → P1
Whiteboard: [specification][type:bug] → [specification][type:bug][points=1]

https://github.com/mozilla/kuma/pull/5466 has been deployed to production.

Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Product: developer.mozilla.org → developer.mozilla.org Graveyard