Closed Bug 1443188 Opened 7 years ago Closed 5 years ago

Move handling of crawlers from robot.txt to affected pages

Categories

(developer.mozilla.org Graveyard :: General, enhancement, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: atopal, Unassigned)

Details

(Keywords: in-triage, Whiteboard: [points=6+][SEO])

There are three places that contain rules about how to crawl MDN pages:

1. robots.txt
2. The robots meta tag on individual pages
3. The X-Robots-Tag HTTP header

Having rules in all three places makes it difficult to keep them consistent. According to Search Console, we currently have 40,722 elements in the Google index that are marked as blocked in robots.txt. Example: https://www.google.com/search?q=info:https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FJavaScript%2FReference%2FGlobal_Objects%2FArrayBuffer%24edit

Please remove the disallow directives from robots.txt and add equivalent robots meta tags to the pages that don't already have them.
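For illustration only, a minimal sketch of what "move the rule onto the page itself" could look like, assuming a Django view (Kuma is a Django application; the view name and markup here are hypothetical, not Kuma's actual code). The X-Robots-Tag header is the HTTP equivalent of the robots meta tag, and either one replaces the robots.txt disallow for that URL:

from django.http import HttpResponse

def edit_document(request, slug):
    # Hypothetical stand-in for the real $edit view.
    response = HttpResponse("<html>...edit form...</html>")
    # Per-page replacement for "Disallow: /*$edit" in robots.txt:
    # X-Robots-Tag is the header form of <meta name="robots" content="noindex">.
    response["X-Robots-Tag"] = "noindex"
    return response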
Keywords: in-triage
Priority: -- → P1
Whiteboard: [points=6+]
Patterns in robots.txt are there for a mix of reasons.

Some are there because we don't want the content indexed in search engines. This needs to be fixed by using "nofollow" on links to the content, and <meta name="robots" content="noindex"> for the page itself.

Some are in the list because of performance. When a crawler accesses these pages, it puts a greater burden on the site than serving wiki pages, and crawling of them has been associated with downtime in the past. The pages need to be updated so that rapid access does not cause downtime.

Some pages are both non-indexable and have performance issues.

These page changes should be handled one endpoint at a time, since there may be performance repercussions from deploying too many too rapidly (a sketch of the per-endpoint pattern follows the lists below).

Other patterns were used in the past and were not cleaned up when the corresponding pages were removed.

Indexing:
/admin/
/*docs/new
/*docs/get-documents
/*docs/submit_akismet_spam
/*docs/Experiment:*
/*$api
/*$edit
/*$translate
/*$move
/*$quick-review
/*$revert
/*$repair_breadcrumbs
/*$delete
/*$restore
/*$purge
/*$subscribe
/*$subscribe_to_tree
/*docs/ckeditor_config.js
/*/files/
/media
/*preview-wiki-content
/*profiles*/edit
/*type=feed
/*users/

Performance:
/*docs/feeds
/*$samples

Performance and Indexing:
/*/dashboards/* - Revision dashboard and others
/*docs/all
/*docs/tag*
/*docs/needs-review*
/*docs/localization-tag*
/*docs/with-errors
/*docs/without-parent
/*docs/top-level
/*$compare
/*$revision
/*$history
/*$children
/*$locales
/*$json
/*$styles
/*$toc
/*docs.json

Out of date:
/*docs/templates - List of KumaScript macros - moved to /en-US/dashboards/macros
/*docs*Template: - KumaScript macros stored in Kuma - moved to https://github.com/mdn/kumascript/tree/master/macros
/*docs/load* - Unknown purpose
/*$flag - Unknown
/*$vote - Unknown
/*move-requested - Unknown
/skins - Unknown
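To make the per-endpoint work concrete, here is a rough sketch, not actual Kuma code (the `never_index` decorator and `revision_dashboard` view are hypothetical names): an "Indexing" endpoint would only need the noindex header, while a "Performance and Indexing" endpoint would also get caching so crawler traffic stays cheap.

from functools import wraps
from django.views.decorators.cache import cache_page

def never_index(view):
    """Add an X-Robots-Tag: noindex header to every response from the view."""
    @wraps(view)
    def wrapped(request, *args, **kwargs):
        response = view(request, *args, **kwargs)
        response["X-Robots-Tag"] = "noindex"
        return response
    return wrapped

@never_index
@cache_page(60 * 15)  # cache for 15 minutes so rapid crawler access stays cheap
def revision_dashboard(request):
    ...  # hypothetical "Performance and Indexing" endpoint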
Whiteboard: [points=6+] → [points=6+][SEO]
Priority: P1 → P2
Maybe I can add some thoughts here, because it's getting a bit mixed up why these three ways to block crawlers exist and how to use them:

- robots.txt is for pages you don't want crawlers to even see: admin pages, login pages, feeds, things like that. A well-behaved crawler will not visit them (okay, it will, but usually just once or twice when it discovers the URL). It should not be used for folders that hold images, CSS, JS, or other documents needed to render a page. I see some "blocked resources" errors in Search Console for /files/: some pictures there are used in indexed pages, so Google can't render those pages. If we could cache the feeds, it would make sense to allow them as well; they are a great way for the crawler to find updated content quickly and efficiently.

- noindex in the meta head and in the X-Robots-Tag header mean the same thing to a crawler. The two should of course not contradict each other on a single page. Tests show that the header usually wins, but not always, so each page should carry only one of the two. The difference from a page blocked in robots.txt is that a noindex page will still be crawled — not as regularly, but it will be. Another difference is more important: if the directive is only noindex, it is implicitly "noindex, follow", so all links on the page still count for PageRank and indexing. So all pages that carry external links should, where needed, be "noindex, follow" but not blocked by robots.txt; otherwise those links will not give us any PageRank.

- "noindex, nofollow" is only useful in very specific cases, and most of those should be covered by robots.txt instead. Usually we want the crawler to follow all internal links. PageRank sculpting doesn't work anymore (if it ever did), and nofollow is usually only for internal links where the crawler would get into trouble because it can't do anything there or would run into an endless recursion of non-indexable pages — and such folders should be blocked in robots.txt. Tag pages, for example, should be "noindex, follow", even though it's not very important, and links to the tags should be follow as well. Google will not crawl them very often, but it will be able to connect the pages via tags even if the tags themselves are not indexed. And once Google understands that these pages will never be in the index, it will stop crawling them altogether.

- nofollow on internal links is not necessary in 99% of cases. We should block folders or pages where the crawler can get into trouble, and in all other cases let it decide whether to follow. It doesn't hurt to have nofollow links to internal pages that are noindex, but it does hurt if we forget about them and later want some of those pages indexed again. If all internal links are follow, nothing can be harmed.

I hope this helped a bit. In general it was a great move to take a lot of unnecessary pages out of the index. These thoughts should just help to not worry too much about it: block in robots.txt all folders that can harm the crawler or that we know we will never index, use "noindex, follow" on all pages that don't belong in the index, and let all internal links be follow. One method for noindex is enough — having the same directive twice is fine, but contradictory directives need treatment. And resources needed by indexed pages should themselves be indexable.
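Picking up the "noindex, follow" advice above, a minimal sketch of what a tag listing could do, assuming a Django view (the view and template names are hypothetical, not Kuma's actual code):

from django.shortcuts import render

def tag_listing(request, tag):
    # Hypothetical view for a tag page ("/*docs/tag*" in the list above).
    response = render(request, "wiki/tag_listing.html", {"tag": tag})
    # Keep the page out of the index but let crawlers follow its links, so the
    # linked documents keep receiving PageRank. Set the directive in exactly one
    # place (here the header) to avoid contradicting a <meta name="robots"> tag
    # emitted by the template.
    response["X-Robots-Tag"] = "noindex, follow"
    return response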
raue, thank you for explaining the differences between the methods. There are 50 endpoints in robots.txt, which means there are about 50 tasks of different sizes to make this change. Each endpoint will need to be evaluated, categorized, and some will require code and template changes. There are several months of work to complete the task, so we're treating this as a tracking bug, rather than a single effort to fix, and the work is going to compete with other priorities. Do you have advice on which endpoints should be evaluated and fixed first?
Flags: needinfo?(rraue)
Like I said, nothing here is crucial. I would start with the resources needed for rendering indexed pages and then improve the setup step by step.
Flags: needinfo?(rraue)
MDN Web Docs' bug reporting has now moved to GitHub. From now on, please file content bugs at https://github.com/mdn/sprints/issues/ and platform bugs at https://github.com/mdn/kuma/issues/.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Product: developer.mozilla.org → developer.mozilla.org Graveyard