Open Bug 1441250 Opened 7 years ago Updated 7 years ago

XML Sitemap (include all documents)

Categories

(www.mozilla.org :: Pages & Content, defect)

Production
defect
Not set
normal

Tracking

(Not tracked)

People

(Reporter: erenaud, Assigned: raue)

References

(Blocks 1 open bug)

Details

Raphael - please expand on the requirement here to expand the XML sitemap for mozilla.org
- Helps us analyse and keep track of changes not just internally, but the way crawlers see them.
- Webmaster Tools provide more data on documents that are in sitemaps.

Biggest bonus: crawling gets much faster.

Building a complete sitemap:
- All pages that are self-canonicalized are included
- Only noindex pages are left out (but they will not have a self canonical in the future anyway)
- One main sitemapindex in /sitemap.xml including the following:
  - One sitemap for each locale, e.g. /de/sitemap-de.xml
- No subdomains included
- All pages get a <lastmod>, and it gets updated whenever anything changes on that page (code or content)
- Can be cached but needs to be the fastest document on our site. Updates to the sitemap can be delayed through caching; that's not important for the normal sitemaps. If we decide later to have a news sitemap, the update interval there needs to be under 10 minutes, if that's important to know now.
- Once we have this complete sitemap, I will try, with the help of :jmize, to score all included pages based on importance to the download funnel, organic traffic, social shares and overall traffic. With that scoring we should set <priority> so the crawler can differentiate between important and less important pages. There is very little room to sculpt crawling, but it will help us analyse internally as well.
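The layout described in this comment (one sitemapindex at /sitemap.xml pointing to one sitemap per locale, with <lastmod> and <priority> on each page) could be sketched roughly as follows. This is only an illustration, not the site's actual generator: the function names, the sample page data and the per-locale path pattern (taken from the /de/sitemap-de.xml example) are assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
BASE_URL = "https://www.mozilla.org"


def build_sitemap_index(locales):
    """Return the /sitemap.xml index listing one sitemap per locale."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for locale in locales:
        sitemap = ET.SubElement(root, "sitemap")
        # Path pattern assumed from the /de/sitemap-de.xml example.
        loc = f"{BASE_URL}/{locale}/sitemap-{locale}.xml"
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(root, encoding="unicode")


def build_locale_sitemap(locale, pages):
    """Return one locale's sitemap.

    pages: iterable of (path, lastmod, priority) tuples (hypothetical data).
    """
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, lastmod, priority in pages:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = f"{BASE_URL}/{locale}{path}"
        ET.SubElement(url, "lastmod").text = lastmod
        # The sitemap protocol expects priority between 0.0 and 1.0.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(root, encoding="unicode")
```

For example, build_sitemap_index(["de", "en-US"]) would list two locale sitemaps, and build_locale_sitemap("de", [("/firefox/new/", "2018-03-01", 0.8)]) would emit a single <url> entry for the German page.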
(In reply to Raphael Raue [:raue] [:rraue] from comment #1)
> Building a complete sitemap:
> - All pages that are self-canonicalized are included

Do you have a list (or an idea) of which pages are not included in our current sitemap? It's generated automatically on deployment to include every URL on the site outside of an exclusion list. It's possible our automation is skipping something, but I'm not sure what.

> - Only noindex pages are left out (but they will not have a self canonical in the future anyway)
> - One main sitemapindex in /sitemap.xml including the following:
>   - One sitemap for each locale, e.g. /de/sitemap-de.xml

We were going to do this at some point, but it was decided to stick with the <link rel="alternate" hreflang...> tags. Should we have both, or just one or the other?

> - No subdomains included
> - All pages get a <lastmod>, and it gets updated whenever anything changes on that page (code or content)

Is this required or nice-to-have? Google should already get this from our ETags. Unfortunately this is not easy to do, since some pages include some data, some don't, some could change because a base template changed, etc. We could make as good an effort as we can if it's a high priority.

> - Can be cached but needs to be the fastest document on our site. Updates to the sitemap can be delayed through caching; that's not important for the normal sitemaps. If we decide later to have a news sitemap, the update interval there needs to be under 10 minutes, if that's important to know now.

The sitemap is generated on deployment, since that is the only time URLs are added or removed and mostly the only time updates happen.
Right now, none of the locale pages are included. They should be, as the sitemap is a very fast way for crawlers to check what they have to crawl. And the more effectively they crawl, the more often they can crawl the important pages. A crawler's visit to the sitemap does not cost any crawl budget. That's why <lastmod> is not necessary for the sitemap to validate, but is very nice to have. It saves us a lot of crawl budget, which can be put to better use than fetching a page: getting a 304 costs crawl budget (-1), and even just checking the ETag costs crawl budget (-1).

And if something changes only in the base template, that is very important information for the crawler. If you change, for example, only one JS file, the whole rendering of that document could change, and that is exactly what Google wants to crawl. So if it is possible and not a crazy amount of work, we would help the crawler quite a bit.

I will comment here with a list of pages that are not included, besides the locales, when I have it; I first need to solve a problem with the data limits in Search Console.

We should definitely not have hreflang in both the sitemap and the head. One or the other. Let's stick to the head for now. But including the locales is something else: /de/ is an alternate for /en-US/, but for Google's index it is a unique document.

Creating the sitemap on deployment is totally fine for now.
A good example of why we should add lastmod is /en-GB/firefox/new/. We changed the title of the document and it took Google a week to reindex it with the new title. That is a long time for such a well-linked and important page. I am quite sure that lastmod would help us test changes much faster.
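One possible way to keep <lastmod> accurate even when only a base template changes is to hash each page's rendered output at deployment and bump the date only when the hash differs from the previous deployment. The sketch below assumes that approach; it is not something the thread settled on, and update_lastmod and the state dictionary are hypothetical stand-ins for whatever would persist between deployments.

```python
import hashlib
from datetime import datetime, timezone


def update_lastmod(state, url, rendered_html, now=None):
    """Return the lastmod for url, bumping it only if the rendered output changed.

    state: dict mapping url -> (sha256 hex digest, W3C datetime string).
    It is a hypothetical stand-in for storage kept between deployments.
    """
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(rendered_html.encode("utf-8")).hexdigest()
    previous = state.get(url)
    if previous is not None and previous[0] == digest:
        # Unchanged output keeps its old lastmod.
        return previous[1]
    lastmod = now.strftime("%Y-%m-%dT%H:%M:%S+00:00")
    state[url] = (digest, lastmod)
    return lastmod
```

Because the hash covers the rendered HTML, a title change or a base template change would move the date. A change inside an external JS file would only be noticed if the file's URL in the markup changes too (for example through fingerprinted asset names), which is a further assumption.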
I finally have the list of missing entries: https://docs.google.com/spreadsheets/d/1By3rwhTWkcEP1r7X1Yt_sahupDYW3y5c4KmoppmoqNs/edit?usp=sharing

If I see this right, release notes, system requirements, Aurora notes and some stuff in security are missing from the sitemap.
Paul - please see rraue's latest comments above (looks like blockers are removed).
Flags: needinfo?(pmac)
(In reply to Raphael Raue [:raue] [:rraue] from comment #5)
> I finally have the list of missing entries:
> https://docs.google.com/spreadsheets/d/1By3rwhTWkcEP1r7X1Yt_sahupDYW3y5c4KmoppmoqNs/edit?usp=sharing
>
> If I see this right, release notes, system requirements, Aurora notes and some stuff in security are missing from the sitemap.

That makes sense. Those pages are generated based on data in the DB and are not hard-coded URLs like the rest of the site. The sitemap doesn't currently use the DB during generation, but could be made to do so if we want. We're probably also missing a good portion of the /security/ section of the site, specifically all of the pages linked from here: https://www.mozilla.org/en-US/security/advisories/
Flags: needinfo?(pmac)
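To illustrate the idea of letting sitemap generation read DB-backed pages as well, here is a minimal sketch that merges hard-coded paths with release-notes paths read from a database and applies an exclusion list. The table name, column, path pattern and sample paths are all assumptions made for illustration; the real generator and schema are not shown in this bug.

```python
import sqlite3

STATIC_PATHS = ["/firefox/new/", "/about/"]  # hypothetical hard-coded URLs
EXCLUDED_PATHS = {"/about/"}  # hypothetical exclusion list


def collect_paths(conn):
    """Merge hard-coded paths with release-notes paths read from the database."""
    # "releases" and "version" are assumed names, not the real schema.
    rows = conn.execute("SELECT version FROM releases ORDER BY version")
    db_paths = [f"/firefox/{version}/releasenotes/" for (version,) in rows]
    merged = set(STATIC_PATHS + db_paths)
    return sorted(path for path in merged if path not in EXCLUDED_PATHS)
```

The same pattern would apply to other data-driven sections, such as the security advisories, by adding one query per section.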
Sounds great. My goal is to have all indexed pages that return a 200 in the sitemaps, including the locales, and with lastmod.