Closed Bug 1353079 Opened 7 years ago Closed 5 years ago

Google crawler hits errors

Categories

(developer.mozilla.org Graveyard :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: atopal, Assigned: jwhitlock)

References

Details

(Keywords: in-triage)

An email from Google came in yesterday notifying us that Googlebot found an increase in authorization permission errors on https://developer.mozilla.org/

Looking into it on the search console it seems like that's triggered by profile pages: https://www.google.com/webmasters/tools/crawl-errors?siteUrl=https://developer.mozilla.org/&utm_source=wnc_652500&utm_medium=gamma&utm_campaign=wnc_652500&utm_content=msg_711002&hl=en#t2=3

Until the end of February we saw fewer than 200 "access denied" errors a day; now we see more than 2,000 a day, and the number is trending upward.

A typical entry on the list is "en-US/profiles/eddingim45ag"

I can see that we have a rule in robots.txt that disallows crawling of the profile edit pages:
Disallow: /*profiles*/edit

but we should also disallow the profile pages themselves.
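
A minimal sketch of the combined rules (the exact pattern is an assumption; Googlebot honors the * wildcard):

    # existing rule: block the profile edit pages
    Disallow: /*profiles*/edit
    # proposed: also block the profile pages themselves
    Disallow: /*profiles*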
Depends on: 1353134
The errors are almost all from the profiles of banned users.

We should:
1) Return a 404 for these profiles. (Bug 1353138)
2) Not link to them from the pages they have edited. (Bug 1353134)
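
A minimal sketch of what (1) could look like in Django (the view name and the bans relation are assumptions, not Kuma's actual code):

    from django.contrib.auth import get_user_model
    from django.http import Http404
    from django.shortcuts import get_object_or_404

    def profile_view(request, username):
        """Render a user profile, but 404 for banned users (sketch)."""
        user = get_object_or_404(get_user_model(), username=username)
        # Assumption: an active ban is a related row with is_active=True;
        # as noted below, a ban does not flip user.is_active itself.
        if user.bans.filter(is_active=True).exists():
            raise Http404("Banned user profiles are not published")
        # ... render the profile as before ...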

Some fun stuff I found poking around:

We do filter the contributor profiles by is_active. Unfortunately, an active ban does not an inactive profile make. User jajang20 was banned, and appears in the contributor credits for pages they edited. mirekczechxmm does not because their profile is inactive.
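
To illustrate the gap (a sketch; the model and relation names are assumptions), filtering on is_active alone keeps banned-but-active accounts, so the query would also need to exclude users with an active ban:

    from django.contrib.auth import get_user_model

    User = get_user_model()

    # Current behavior: a ban does not flip is_active, so banned users
    # like jajang20 still pass this filter.
    contributors = User.objects.filter(is_active=True)

    # Also excluding users with an active ban would drop them.
    contributors = User.objects.filter(is_active=True).exclude(
        bans__is_active=True
    )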

Not all profiles the Google bot is crawling are linked to from MDN pages. (I have not figured out why Google is attempting to crawl mirekczechxmm's profile).

I don't see a specific PR that could be responsible for the sudden start of this problem on Feb 28th. It could be that there were changes made to the Google Bot on that date.
Depends on: 1353138
Links to these profiles seem to be from a mix of:

- The current version of our own pages linking to profiles when they shouldn't.
- Someone else's cached version of our page.
- Posts the spammer made to drive traffic to the spam profile.
- Google's cached versions of our pages.
--> this is weird, and if the other changes don't fix the problem we should follow up on it.
--> might be related to Bug 1316610, since the history pages link to the banned profiles even though they are not linked from the pages Google reports the profiles are linked from.

We should see if the first 2 bugs fix the problem before attempting to address anything else.
According to moz.com, robots.txt keeps a search engine from crawling a page's content, but the page may still end up in search results, just with no content:

https://moz.com/learn/seo/robotstxt

I agree with their suggestion to use noindex meta tags on pages we don't want indexed. And I'm guessing we don't want to index profile pages.  They are currently marked "index, follow".
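
A minimal sketch of how that could look in a Jinja template (Kuma's templates are Jinja; the block name and wiring here are assumptions):

    {# Base template: default to indexing; pages can override the block. #}
    {% block robots %}
    <meta name="robots" content="index, follow">
    {% endblock %}

    {# Profile template: opt out of indexing. #}
    {% block robots %}
    <meta name="robots" content="noindex, nofollow">
    {% endblock %}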

I asked Google what it knows about mirekczechxmm, and I found some archives of MDN pages that link to it. Google will find whatever is published.
Hey John, is there a way to see which pages have noindex meta tags? I'd like to compare that to our robots.txt.
You can see the templates that include noindex tags by searching the repository:

https://github.com/mozilla/kuma/search?utf8=✓&q=noindex+meta&type=

A comprehensive list would require visiting the page types identified last year:

https://docs.google.com/spreadsheets/d/1X-YLmIg8vVvDWlShLF-361Cz2KcUBnlaPXBHLMaYEaQ/edit#gid=0
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/31028cdf1c94a9a20c23d78fb1ffdd0dd5eca202
bug 1353079: Add <meta robots noindex> tag

For pages that we don't want indexed, set a head tag:

<meta name="robots" content="noindex, nofollow">

Remove some Jinja "set meta=" declarations that weren't doing anything.

https://github.com/mozilla/kuma/commit/59043d0f15d7dc3da9981731909210236bf7cc73
bug 1353079: Add robots noindex to ban user page

https://github.com/mozilla/kuma/commit/61291445a2312feb5e6a981b142a3737e412b354
Merge pull request #4228 from jwhitlock/meta-no-index-follow-1353079

bug 1353079: Add meta tag to avoid search indexing

r=stephaniehobson
This has been deployed. You can find templates using <meta name="robots" content="noindex, nofollow"> with:

https://github.com/mozilla/kuma/search?utf8=✓&q=robots_value+noindex&type=
Assignee: nobody → jwhitlock
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1259725
The Google Search Console continues to report a rising number of 403 pages. I had a quick glance at the list and it seems that the remaining pages are $edit and $history pages linked to from external sites.

I'm not sure what, if anything, we need to do about this.

Kadir, could you ask our SEO consultant if a high number of 403s is a bad thing? We do legitimately want to keep the crawler off these pages.
Status: RESOLVED → REOPENED
Flags: needinfo?(a.topal)
Resolution: FIXED → ---
A couple of data points:

Wikipedia has a simple link to their edit pages, available to anonymous visitors as well. The edit page has a <meta name="robots" content="noindex,nofollow">. They don't restrict edit pages (whose URLs start with index.php) in robots.txt.

GitHub doesn't show buttons for actions that require logging in first. For example, logged-in users can edit files on GitHub; as an anonymous user, you see a disabled edit icon with the hover text "You must be signed in to make or propose changes".

Some possibilities:

* Add <meta name="robots" content="noindex,nofollow"> to sign-in page
* Improve performance of $history page
* Improve performance of pages linked from $history page

With these changes, I'd support switching these pages from "Permission denied" / 403 for crawlers to a standard 200, and dealing with the slow pages that bad crawlers then start finding.
Priority: -- → P2
Hey Raphael, can you help us with the prioritization of this? Is a high number of 403s a bad thing that would affect our ranking?
Flags: needinfo?(a.topal) → needinfo?(rraue)
A high number of pages that aren't useful for ranking but still need to be crawled is a bad thing for big websites, as they eat crawl budget without adding value.

But I think we have a different problem here: these 403s should not be there; the pages should just redirect to the login page. So instead of a 403, an externally linked page should return a 301 to the login page, which then returns a 200. That's what I see when I click such a link internally while not logged in.

I guess having a look at the backend there would solve the errors.
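
A minimal sketch of that backend change, assuming Django's stock login_required decorator (the view name is hypothetical; note that Django issues a 302 redirect to settings.LOGIN_URL rather than a 301):

    from django.contrib.auth.decorators import login_required

    # Anonymous visitors are redirected to the login page (302 -> 200)
    # instead of receiving a 403.
    @login_required
    def edit_document(request, document_slug):
        ...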
Flags: needinfo?(rraue)

New work is being done against bug 1518482

Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → INCOMPLETE
See Also: → 1518482
Product: developer.mozilla.org → developer.mozilla.org Graveyard