Closed Bug 1316610 Opened 8 years ago Closed 5 years ago

Google keeps indexing pages that should not be indexed

Categories

(developer.mozilla.org Graveyard :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: atopal, Assigned: jwhitlock)

References

Details

(Keywords: in-triage)

Attachments

(2 files, 1 obsolete file)

I was looking at our Google Search Console data and was surprised to see that Google had indexed more than 600k MDN pages, when we don't have nearly as many documents, even if we count locales. It turns out that Google still indexes $translate and $edit pages and probably others as well: https://www.google.de/search?biw=1280&bih=1237&q=site:developer.mozilla.org#q=site:developer.mozilla.org&start=553&filter=0

Since this is mostly duplicate content, Google may penalize it in its ranking of results.

Our robots.txt is supposed to filter those out, but there seems to be an issue with that filtering. I'm guessing the $ sign has to be handled in some special way.

Expected Result: none of the pages that are listed in robots.txt should be indexed.
Follow-up: looks like $styles is properly filtered. Weird.
info from Google about blocking: https://support.google.com/webmasters/topic/4598466?rd=1
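As a quick sanity check of the $ handling, here is a rough sketch of a Googlebot-style pattern matcher in Python (the Disallow patterns are assumptions for illustration, not the exact contents of MDN's robots.txt, and Google's real matcher may differ in edge cases):

import re

def rule_matches(pattern, path):
    """Match a robots.txt path pattern roughly the way Googlebot describes it:
    '*' matches any run of characters, a trailing '$' anchors the end of the
    URL, and any other character (including a mid-pattern '$') is literal."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

rules = ["/*$translate", "/*$edit", "/*$history"]  # assumed rules
for path in ["/en-US/Firefox$translate",
             "/en-US/docs/Web/HTML$edit",
             "/en-US/docs/Web/HTML"]:
    blocked = any(rule_matches(rule, path) for rule in rules)
    print(path, "blocked" if blocked else "allowed")

If the patterns match as expected here, the problem may lie more in how Google discovers and lists the URLs than in the rules themselves.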
Severity: normal → major
Keywords: in-triage
Our robots.txt file includes rules for $translate and the other document parameters that are supposed to prevent crawlers from indexing /*$translate links,
such as
https://developer.mozilla.org/en-US/Firefox$translate

This was first fixed in PR 3884 (1), but later we found that the crawlers were indexing the zone-redirected pages (2). Someone from the Caktus team made another patch, PR 3891 (3), which fixed the indexing of zone redirects but introduced a regression: all other documents with $translate parameters were allowed to be indexed again.

So we should find a way to disallow crawlers from crawling all URLs ending with $translate or similar parameters.

1. https://github.com/mozilla/kuma/pull/3884
2. https://github.com/mozilla/kuma/pull/3884#issuecomment-225040977
3. https://github.com/mozilla/kuma/pull/3891
So I have tested this with my local development server using ngrok and Google's robots.txt Tester, and it seems our rules are correct and do disallow crawlers from indexing URLs ending with $edit or similar parameters.

I suspect that Google may not be respecting our robots.txt file.
Attached image Screenshot of robots.txt tester (obsolete) —
Attachment #8811146 - Attachment is obsolete: true
I'm having trouble finding confirmation on this theory, but could it be that Google is adding these URLs to its index because they are found on crawlable URLs?

Links to URLs containing $edit, $history, $locales, etc. are found on many crawlable URLs across MDN. The description for these $ URLs is missing in Google's results ("A description for this result is not available because of this site's robots.txt"), which implies that robots.txt *is* being respected.

I'll try to dig a little deeper on this tomorrow.
These URLs appear in links on crawlable content pages, without a "nofollow" indicator.
(In reply to John Whitlock [:jwhitlock] from comment #8)
> These URLs appear in links on crawlable content pages, without a "nofollow"
> indicator.

Nice catch! Sounds like adding rel="nofollow" to the links in question should be the first order of business. I'll file a bug.

However, I have no idea if this will remove those results from Google's index now that they're in there.
Depends on: 1330336
I think adding a header is a good idea for telling the bot not to index.
(In reply to Safwan Rahman (:safwan) from comment #10)
> I think, adding a header is a good idea for telling the bot to not index.

Though I don't think the pages in question are being indexed (based on research in comment 4 and Google's results saying "A description for this result is not available because of this site's robots.txt"), a quick look shows the uncrawlable pages *do* have a <meta name="robots" content="index, follow"> meta tag. I don't *think* this is causing issues, but it's incorrect nonetheless and would be good to fix. Perhaps we should file a bug?

I only did a cursory check, but I do see an x-robots-tag: "noindex" for the uncrawlable pages. Might be worth an audit to make sure all relevant URLs do indeed have this header though.
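For reference, a minimal Django-style sketch of how such a header can be attached (generic middleware, not kuma's actual implementation; the suffix list is an assumption for illustration):

from django.utils.deprecation import MiddlewareMixin

# Assumed list of "tool" URL suffixes that should not be indexed.
NOINDEX_SUFFIXES = ("$edit", "$translate", "$history", "$locales")

class NoIndexMiddleware(MiddlewareMixin):
    def process_response(self, request, response):
        # Ask crawlers not to index tool pages such as /en-US/Foo$edit.
        if request.path.endswith(NOINDEX_SUFFIXES):
            response["X-Robots-Tag"] = "noindex, nofollow"
        return response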
:jpetto Is it fixed by this patch?
https://github.com/mozilla/kuma/pull/4095
Flags: needinfo?(jon)
(In reply to Safwan Rahman (:safwan) from comment #12)
> :jpetto Is it fixed by this patch?
> https://github.com/mozilla/kuma/pull/4095

I *think* that patch does fix the issue. I manually re-ran the search in comment 0 and did not see any $ URLs listed. 

To be sure, I'll let :atopal weigh in before we close this one out.
Flags: needinfo?(jon) → needinfo?(a.topal)
Thanks for your work on this, Jon, and thanks for looking into this, Safwan.

There is something weird going on with the indexing here. Google Search Console tells me that Google has 631,513 URLs indexed in total for MDN, but it also says that 644,227 URLs are blocked. I'll leave this open for our SEO analyst to look into.

Btw, Google's robots.txt tester says that this line is an error: "Request-rate: 1/5", and it doesn't respect "Crawl-delay: 5". Both are legal as far as I can tell, but they essentially tell crawlers that they can only crawl one page every 5 seconds, which doesn't seem useful. I'd remove both.
Flags: needinfo?(a.topal)
I'd like to hear what :jwhitlock has to say about the request rate. Could be a performance-related setting.

(FWIW, bedrock specifies crawl-delay: 1 and does not use request-rate.)
Flags: needinfo?(jwhitlock)
The setting has been there since robots.txt was added in 2012:

https://github.com/mozilla/kuma/commit/8bab92460cc3f67ecb11e5a6f51dc3962a530c02

Back then, both were considered valid keys:

http://bugs.python.org/issue16099

They were not part of the original specification and never became standardized. I think the consensus is to use Crawl-delay, and to interpret it as "wait 5 seconds between requests".

I'd rather wait until after the SCL3 -> AWS migration.  If you want to do it before, we'll need to talk to webops to determine if there is an automated rule to ban IP addresses that crawl faster than this rate.  It would be a shame to ban Google's IP.

w0ts0n, do the load balancers implement rate-based blocks?
Flags: needinfo?(rwatson)
Flags: needinfo?(jwhitlock)
Flags: needinfo?(rwatson)
Looking at the long tail of Google results, it appears that $translate links are gone entirely. I found one $edit link, which can probably be manually excluded using Webmaster Tools.

Current index status is 432,734 indexed, 346,850 blocked by robots.

One source appears to be profile pages. We have 8880 contributors in 47 languages, which results in 417,360 profile pages, with no canonical. We only link to these pages when someone makes an edit, so I don't think every profile contributes to the index inflation, but they certainly don't help.

We added a <meta name="robots" content="noindex, nofollow"> to profile pages in May 2017, but this hasn't been reflected in the index yet:

https://www.google.com/search?q=site%3Adeveloper.mozilla.org+jwhitlock

Our SEO analyst had some ideas about why this hasn't taken effect yet, but they don't match the site configuration. "nofollow" on links to profile pages could explain it, but we don't use that attribute in the contributor sections on wiki pages, so Google should be noticing the noindex tag, at least for contributors.

I have a plan:

1. Set up one profile page as the canonical page. "English" is a good one, but I think it is not much more work to set it to the user's preferred language, if set. I don't intend to enumerate all the translations.
2. Create a sitemap of the 8880 x 47 profile pages (a rough sketch follows below). It will be huge, and maybe multi-part to fit within the sitemap standard, but machine-generated.
3. Submit this directly to Google using the webmaster tools, so that it will re-crawl the profile pages.
4. Watch for profile pages to drop from the search index.
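A rough sketch of step 2, assuming a /<locale>/profiles/<username> URL pattern (the real generator would read users and locales from the database and split output to stay under the sitemap protocol's 50,000-URL-per-file limit):

from xml.etree.ElementTree import Element, SubElement, tostring

SITE = "https://developer.mozilla.org"
locales = ["en-US", "de", "fr"]        # 47 locales in production
usernames = ["jwhitlock", "atopal"]    # ~8880 contributors in production

def profile_sitemap(locale, names):
    # One <urlset> per locale, listing every profile page in that locale.
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for name in names:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = "%s/%s/profiles/%s" % (SITE, locale, name)
    return tostring(urlset, encoding="unicode")

for locale in locales:
    xml = profile_sitemap(locale, usernames)
    # write each file out and reference it from a sitemap index before submitting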
Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED
Current status is 397,055 indexed, 1,384 blocked. Both graphs appear to have leveled off since mid-August.

User profiles are all but gone from the index, so none of the planned actions will be taken:

https://www.google.com/search?q=site%3Adeveloper.mozilla.org+inurl%3A%2Fprofiles%2F

I see one for Mte90, and many (55,400) "duplicate URLs" for https://developer.mozilla.org/users/github/login/.

This demonstrates the limitations of robots and meta directives:

1. Google will continue to index a page, like https://developer.mozilla.org/en-US/profiles/Mte90, if it is linked from other places.

2. Google will index documents, like the GitHub Login page /users/github/login/, if they are linked internally, but won't show the content, because robots.txt doesn't allow it.

I think we need a "nofollow" attribute on the link to the GitHub login in the header. We might also want to drop the rule "Disallow: /*users/" so that Google can crawl the page and see that it shouldn't be indexed.
Search results, Sep 26 2017, looking for URLs containing /profiles/. One for Mte90, and about 55,000 more for the Github Login page.
Depends on: 1431511
Current status is 237,282 indexed, 120,175 blocked. The big change in blocking is that we're not allowing Google to index the attachments / samples domain. I think this is a mistake, and we should allow indexing again.
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/4ba37df42531589c4276ea2b4c44270ac1c49210
bug 1316610: Drop Crawl-delay, etc from robots.txt

Googlebot doesn't respect Crawl-delay, but instead watches to see how
fast it can index your site without breaking it.

Request-rate isn't known by Googlebot, and doesn't appear to be supported
by many crawlers.

MDN can be accessed faster than 5 pages per second, and we shouldn't
penalize scrapers that respect robots.txt.

https://github.com/mozilla/kuma/commit/d7231772a38e672599c69a24885eaaf3f5a7e70e
bug 1316610: Switch ALLOW_ROBOTS to lists

Instead of a yes/no setting, use lists of hostnames that allow robots,
and deny robots for other hostnames.

For the main website, robots.txt allows robots, but forbids some paths
with dynamic or user-specific content.

For the untrusted attachments and samples domain, allow robots.
Previously, they were not allowed in AWS, but they seem to be an
important part of the content.

https://github.com/mozilla/kuma/commit/748a81391a020752fad330519d80592406c8524f
bug 1316610: Avoid indexing empty, dupe docs

If a document is empty, or is an untranslated page that is just showing
the English document, then ask the search crawlers not to index it.

https://github.com/mozilla/kuma/commit/8ac592a95c1021084b74cc762de1e388f891a275
Merge pull request #4649 from jwhitlock/indexing-801623

bug 801623, 1316610, 1431511: Update indexing and robots.txt
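A simplified sketch of the hostname-list idea from the second commit (the setting names and the Disallow rules shown are placeholders, not kuma's actual configuration):

from django.http import HttpResponse

# Placeholder host lists: the main site gets robots.txt with Disallow rules,
# the attachments/samples domain allows everything, all other hosts deny robots.
ALLOW_ROBOTS_WEB_HOSTS = ["developer.mozilla.org"]
ALLOW_ROBOTS_HOSTS = ["mdn.mozillademos.org"]

ROBOTS_WITH_RULES = "User-agent: *\nDisallow: /*$edit\nDisallow: /*$translate\n"
ROBOTS_ALL_ALLOWED = "User-agent: *\nDisallow:\n"
ROBOTS_NONE_ALLOWED = "User-agent: *\nDisallow: /\n"

def robots_txt(request):
    host = request.get_host()
    if host in ALLOW_ROBOTS_WEB_HOSTS:
        content = ROBOTS_WITH_RULES
    elif host in ALLOW_ROBOTS_HOSTS:
        content = ROBOTS_ALL_ALLOWED
    else:
        content = ROBOTS_NONE_ALLOWED
    return HttpResponse(content, content_type="text/plain")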
Current status is 209,720 indexed, 119,322 blocked. The attachments / samples domain at https://mdn.mozillademos.org allows indexing again.

"Request-rate: 1/5" and "Crawl-delay: 5" are no longer in the main domain's robots.txt.

John, can you report the current indexing state and decide whether this bug should still be open?

Flags: needinfo?(jwhitlock)

The Google Search Console is transitioning to a new implementation, and the counting is different. The new console talks about All Known Pages and All Submitted Pages:

Category              Valid   Valid with Warnings   Error   Excluded
All Known Pages       66.6k   43.4k                 370     1.52M
All Submitted Pages   62.1k   0                     340     353

"Valid with Warnings" includes pages like the $edit pages, which are crawled but blocked by robots.txt. We've been working on this and other issued on bug 1518482. A 10X reduction in indexed page is pretty good, and I think this would be a good time to close this bug and do new work on the newer bug.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(jwhitlock)
Resolution: --- → INCOMPLETE
See Also: → 1518482

Even though it's resolved, some thoughts on this. TL;DR: I wouldn't worry much about it.

If a page is not allowed to be crawled via robots.txt, it can still be indexed if it is found via an external or internal link. Its content is not allowed to be indexed, but the URL itself is part of the content of other pages.

So if you want to get pages out of the index, you need to noindex them.

But this difference is only important in very few cases, like legal ones.

In general, both ways tell crawlers they shouldn't waste their time on these pages, and that's what we are doing here.

The 600k+ "pages" excluded in the Search Console are basically every possible redirect, parameter combination and historically known URL on a domain. The only thing to worry about there is whether URLs end up in the wrong category for why they are excluded. But excluded URLs are handled pretty well and generally ignored by search engines these days. There are exceptions, but I am not aware of a dangerous one on MDN.

Duplicate content is, in my opinion, never a reason to get penalized. A penalty is a very rare thing. Massive problems with duplicate content can make it hard for crawlers to find the original content, but that only happens when, in addition to massive duplicate content, the link and site architecture is also unclear. The reason duplicate content that is reachable by users should be handled via canonicals to the original is that noindex or a disallow via robots.txt takes those pages out of the PageRank equation, and if users can see the content, they might link to those pages.

But here we are, in my opinion, all safe; we send enough understandable signals for crawlers to pick the right content.

Product: developer.mozilla.org → developer.mozilla.org Graveyard