Closed Bug 1353079 Opened 7 years ago Closed 5 years ago

Google crawler hits errors

Categories

(developer.mozilla.org Graveyard :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: atopal, Assigned: jwhitlock)

References

Details

(Keywords: in-triage)

An email from Google came in yesterday notifying us that Googlebot found an increase in authorization permission errors on https://developer.mozilla.org/

Looking into it on the search console it seems like that's triggered by profile pages: https://www.google.com/webmasters/tools/crawl-errors?siteUrl=https://developer.mozilla.org/&utm_source=wnc_652500&utm_medium=gamma&utm_campaign=wnc_652500&utm_content=msg_711002&hl=en#t2=3

Until the end of February we saw fewer than 200 "access denied" errors a day; now we see more than 2,000 a day, and the number is trending upward.

A typical entry on the list is "en-US/profiles/eddingim45ag"

I can see that we have a rule in robots.txt that disallows crawling of the profile edit pages:
Disallow: /*profiles*/edit

but we should also disallow the profile pages themselves.
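
A minimal sketch of the combined rules (the exact pattern is an assumption; Googlebot honors the * wildcard):

    # existing rule: block the profile edit pages
    Disallow: /*profiles*/edit
    # proposed: also block the profile pages themselves
    Disallow: /*profiles*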
Depends on: 1353134
The errors are almost all from the profiles of banned users.

We should:
1) Return a 404 for these profiles. (Bug 1353138)
2) Not link to them from the pages they have edited. (Bug 1353134)
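
A minimal sketch of what (1) could look like in Django (the view name and the bans relation are assumptions, not Kuma's actual code):

    from django.contrib.auth import get_user_model
    from django.http import Http404
    from django.shortcuts import get_object_or_404

    def profile_view(request, username):
        """Render a user profile, but 404 for banned users (sketch)."""
        user = get_object_or_404(get_user_model(), username=username)
        # Assumption: an active ban is a related row with is_active=True;
        # as noted below, a ban does not flip user.is_active itself.
        if user.bans.filter(is_active=True).exists():
            raise Http404("Banned user profiles are not published")
        # ... render the profile as before ...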

Some fun stuff I found poking around:

We do filter the contributor profiles by is_active. Unfortunately, an active ban does not an inactive profile make. User jajang20 was banned, and appears in the contributor credits for pages they edited. mirekczechxmm does not because their profile is inactive.
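
To illustrate the gap (a sketch; the model and relation names are assumptions), filtering on is_active alone keeps banned-but-active accounts, so the query would also need to exclude users with an active ban:

    from django.contrib.auth import get_user_model

    User = get_user_model()

    # Current behavior: a ban does not flip is_active, so banned users
    # like jajang20 still pass this filter.
    contributors = User.objects.filter(is_active=True)

    # Also excluding users with an active ban would drop them.
    contributors = User.objects.filter(is_active=True).exclude(
        bans__is_active=True
    )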

Not all profiles the Google bot is crawling are linked to from MDN pages. (I have not figured out why Google is attempting to crawl mirekczechxmm's profile).

I don't see a specific PR that could be responsible for the sudden start of this problem on Feb 28th. It could be that there were changes made to the Google Bot on that date.
Depends on: 1353138
Links to these profiles seem to be from a mix of:

- The current version of our own pages linking to profiles when they shouldn't.
- Someone else's cached version of our page.
- Posts the spammer made to drive traffic to the spam profile.
- Google's cached versions of our pages.
--> this is weird, and if the other changes don't fix the problem we should follow up on it.
--> might be related to Bug 1316610, since the history pages link to the banned profiles even though they are not linked from the pages Google reports the profiles are linked from.

We should see if the first 2 bugs fix the problem before attempting to address anything else.
According to moz.com, robots.txt keeps a search engine from crawling a page's content, but the page may still end up in search results, just with no content:

https://moz.com/learn/seo/robotstxt

I agree with their suggestion to use noindex meta tags on pages we don't want indexed. And I'm guessing we don't want to index profile pages.  They are currently marked "index, follow".
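
A minimal sketch of how that could look in a Jinja template (Kuma's templates are Jinja; the block name and wiring here are assumptions):

    {# Base template: default to indexing; pages can override the block. #}
    {% block robots %}
    <meta name="robots" content="index, follow">
    {% endblock %}

    {# Profile template: opt out of indexing. #}
    {% block robots %}
    <meta name="robots" content="noindex, nofollow">
    {% endblock %}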

I asked Google what it knows about mirekczechxmm, and I found some archives of MDN pages that link to it. Google will find whatever is published.
Hey John, is there a way to see which pages have noindex meta tags? I'd like to compare that to our robots.txt.
You can see the templates that include noindex tags by searching the repository:

https://github.com/mozilla/kuma/search?utf8=✓&q=noindex+meta&type=

A comprehensive list would require visiting the page types identified last year:

https://docs.google.com/spreadsheets/d/1X-YLmIg8vVvDWlShLF-361Cz2KcUBnlaPXBHLMaYEaQ/edit#gid=0
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/31028cdf1c94a9a20c23d78fb1ffdd0dd5eca202
bug 1353079: Add <meta robots noindex> tag

For pages that we don't want indexed, set a head tag:

<meta name="robots" content="noindex, nofollow">

Remove some Jinja "set meta=" declarations that weren't doing anything.

https://github.com/mozilla/kuma/commit/59043d0f15d7dc3da9981731909210236bf7cc73
bug 1353079: Add robots noindex to ban user page

https://github.com/mozilla/kuma/commit/61291445a2312feb5e6a981b142a3737e412b354
Merge pull request #4228 from jwhitlock/meta-no-index-follow-1353079

bug 1353079: Add meta tag to avoid search indexing

r=stephaniehobson
This has been deployed. You can find templates using <meta name="robots" content="noindex, nofollow"> with:

https://github.com/mozilla/kuma/search?utf8=✓&q=robots_value+noindex&type=
Assignee: nobody → jwhitlock
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1259725
The Google Search Console continues to report a rising number of 403 pages. I had a quick glance at the list and it seems that the remaining pages are $edit and $history pages linked to from external sites.

I'm not sure what, if anything, we need to do about this.

Kadir, could you ask our SEO consultant if a high number of 403s is a bad thing? We do legitimately want to keep the crawler off these pages.
Status: RESOLVED → REOPENED
Flags: needinfo?(a.topal)
Resolution: FIXED → ---
A couple of data points:

Wikipedia has a simple link to their edit pages, available to anonymous visitors as well. The edit page has a <meta name="robots" content="noindex,nofollow">. They don't restrict edit pages (whose URLs start with index.php) in robots.txt.

GitHub doesn't show buttons for actions that require logging in first. For example, logged-in users can edit files on GitHub; as an anonymous user, you see a disabled edit icon with the hover text "You must be signed in to make or propose changes".

Some possibilities:

* Add <meta name="robots" content="noindex,nofollow"> to sign-in page
* Improve performance of $history page
* Improve performance of pages linked from $history page

With these changes, I'd support switching these pages from "Permission denied" / 403 for crawlers to a standard 200, and dealing with the slow pages that bad crawlers then start finding.
Priority: -- → P2
Hey Raphael, can you help us with the prioritization of this? Is a high number of 403s a bad thing that would affect our ranking?
Flags: needinfo?(a.topal) → needinfo?(rraue)
A high number of pages that aren't useful for ranking but still need to be crawled is a bad thing for big websites, as they eat crawl budget without adding value.

But I think we have a different problem here: these 403s should not be there; the pages should just redirect to the login page. So instead of a 403, an externally linked page should return a 301 to the login page, which then returns a 200. That's what I see when I click such a link internally while not logged in.

I guess having a look at the backend there would solve the errors.
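
A minimal sketch of that backend change, assuming Django's stock login_required decorator (the view name is hypothetical; note that Django issues a 302 redirect to settings.LOGIN_URL rather than a 301):

    from django.contrib.auth.decorators import login_required

    # Anonymous visitors are redirected to the login page (302 -> 200)
    # instead of receiving a 403.
    @login_required
    def edit_document(request, document_slug):
        ...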
Flags: needinfo?(rraue)

New work is being done against bug 1518482

Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → INCOMPLETE
See Also: → 1518482
Product: developer.mozilla.org → developer.mozilla.org Graveyard