XML sitemap missing from www.mozilla.org

RESOLVED DUPLICATE of bug 1369738

Status

P2
enhancement
RESOLVED DUPLICATE of bug 1369738
5 years ago
6 months ago

People

(Reporter: cmore, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [kb=1128714] , URL)

Attachments

(5 attachments, 2 obsolete attachments)

(Reporter)

Description

5 years ago
Created attachment 792411 [details]
GSI's examples with lang references

Based on the recommendations of our SEO audit and other research, we should create an XML sitemap that is easily crawlable by search engines and includes references to all languages.
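One way to carry language references in a sitemap is the `xhtml:link rel="alternate" hreflang` annotation that search engines document for localized pages. The following is a minimal sketch under that assumption; it is not the content of the attached example. The `build_localized_urlset` helper and the sample page path are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_localized_urlset(base, path, locales):
    """Return a <urlset> string with one <url> per locale of a single page.

    Every <url> lists all locales (itself included) as hreflang alternates.
    """
    # Namespace declarations are written as plain attributes on the root.
    root = ET.Element("urlset", {"xmlns": SITEMAP_NS, "xmlns:xhtml": XHTML_NS})
    hrefs = {locale: f"{base}/{locale}{path}" for locale in locales}
    for locale in locales:
        url_el = ET.SubElement(root, "url")
        ET.SubElement(url_el, "loc").text = hrefs[locale]
        for alt_locale, href in hrefs.items():
            ET.SubElement(
                url_el,
                "xhtml:link",
                {"rel": "alternate", "hreflang": alt_locale, "href": href},
            )
    return ET.tostring(root, encoding="unicode")


# Hypothetical page and locale list, for illustration only.
print(build_localized_urlset("https://www.mozilla.org", "/firefox/", ["en-US", "ja"]))
```

With N locales, each page contributes N `<url>` entries carrying N alternates each, so a file built this way grows quadratically with the number of locales.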
(Reporter)

Updated

5 years ago
No longer depends on: 906879
Priority: -- → P2
Whiteboard: [kb=1086766]
(Reporter)

Comment 1

5 years ago
:kohei: What are your thoughts on making a dynamic sitemap.xml file for only the URLs/locales that are in Bedrock, and having it expand over time as more pages and locales move into Bedrock?
I think we can have a dynamic sitemap in some way... As more pages are migrated to Bedrock, the sitemap will become more complete. I'll find out how :)

I found a useful script at http://djangosnippets.org/snippets/1434/
Assignee: nobody → kohei.yoshino
Status: NEW → ASSIGNED
Created attachment 803081 [details]
Full sitemap.xml including l10n

The implementation was not difficult. My rough code is here, and an output is attached.
https://github.com/kyoshino/bedrock/commit/d0b485075d343c1e650842395dc637b4ee662a13

Issues:

* It takes about 30 seconds to respond. The translation list of each page is based on the template name, but there's no easy way to get the template name of each URL. I had to send an HTTP request to each page.

* As you can see, the output is redundant. The file size will exceed a 50 MB limit in the future.
https://support.google.com/webmasters/answer/183668?hl=en#1

Possible solution:

* Including only /en-US/ URLs in the sitemap. Search engines can still recognize each page's alternate URLs that we already have implemented in Bug 481550.
Created attachment 803297 [details]
sitemap.xml

Sent PR: https://github.com/mozilla/bedrock/pull/1217

To avoid the issues noted above, I only included English URLs. An output sitemap.xml file is attached.
(Reporter)

Comment 5

5 years ago
(In reply to Kohei Yoshino [:kohei] from comment #3)
> Created attachment 803081 [details]
> Full sitemap.xml including l10n
> 
> The implementation was not difficult. My rough code is here, and an output
> is attached.
> https://github.com/kyoshino/bedrock/commit/
> d0b485075d343c1e650842395dc637b4ee662a13
> 
> Issues:
> 
> * It takes about 30 seconds to respond. The translation list of each page is
> based on the template name, but there's no easy way to get the template name
> of each URL. I had to send an HTTP request to each page.
> 
> * As you can see, the output is redundant. The file size will exceed a 50 MB
> limit in the future.
> https://support.google.com/webmasters/answer/183668?hl=en#1
> 
> Possible solution:
> 
> * Including only /en-US/ URLs in the sitemap. Search engines can still
> recognize each page's alternate URLs that we already have implemented in Bug
> 481550.

Do you have a link or can you attach an example complete sitemap.xml file that would include all locales? over 50MB? wow.

en-US only sitemap.xml doesn't help SEO much at all.

jgmize had an idea: Use sitemap pagination and use the Django pagination feature. 

jgmize and kohei, can you two sync up?
(Reporter)

Updated

5 years ago
Flags: needinfo?(kohei.yoshino)
(Reporter)

Updated

5 years ago
Flags: needinfo?(jmize)
(In reply to Chris More [:cmore] from comment #5)
> Do you have a link or can you attach an example complete sitemap.xml file
> that would include all locales? over 50MB? wow.

Attachment 803081 [details] in my Comment 3 is a complete sitemap. It's still only 1.63 MB, but more and more pages are being migrated to and translated on Bedrock...

> en-US only sitemap.xml doesn't help SEO much at all.

Canonical URLs on each page might help, but of course, a complete sitemap would be helpful.

> jgmize had an idea: Use sitemap pagination and use the Django pagination
> feature. 

I'll check it out this afternoon!
Flags: needinfo?(kohei.yoshino)
I just regenerated a complete sitemap. Now it's 3.1 MB with 859 URLs. Next I will try to:

* Use a cron job to retrieve URLs, including localized pages
* Split the complete URL list by locale or by a fixed number of URLs, using a Sitemap index file
https://support.google.com/webmasters/answer/71453
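
The split-by-locale idea could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the code in the pull request: the helpers `split_by_locale` and `build_index`, and the child sitemap path pattern `/<locale>/sitemap.xml`, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_by_locale(urls):
    """Group URLs by their first path segment, e.g. 'en-US' in /en-US/firefox/."""
    groups = defaultdict(list)
    for url in urls:
        locale = urlparse(url).path.strip("/").split("/")[0]
        groups[locale].append(url)
    return dict(groups)


def build_index(base, locales):
    """Return a <sitemapindex> string pointing at one child sitemap per locale.

    The child path pattern /<locale>/sitemap.xml is a hypothetical choice.
    """
    root = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for locale in sorted(locales):
        sitemap_el = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap_el, "loc").text = f"{base}/{locale}/sitemap.xml"
    return ET.tostring(root, encoding="unicode")


# Hypothetical URL list, for illustration only.
urls = [
    "https://www.mozilla.org/en-US/firefox/",
    "https://www.mozilla.org/ja/firefox/",
    "https://www.mozilla.org/en-US/about/",
]
groups = split_by_locale(urls)
print(build_index("https://www.mozilla.org", groups))
```

Each child sitemap then only needs to list the URLs in its own group, which keeps every individual file small even as the total number of localized pages grows.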
Attachment #821100 - Attachment description: pull reques → Pull Request on GitHub
Created attachment 822357 [details]
Sitemap Index
Attachment #803081 - Attachment is obsolete: true
Flags: needinfo?(jmize)
Whiteboard: [kb=1086766] → [kb=1128714]
Severity: normal → enhancement
Summary: Create a dynamic XML sitemap of top-level URLs in [Bedrock] → Create a dynamic XML sitemap of all indexable URLs in [Bedrock]
(Reporter)

Comment 13

5 years ago
Any update on the sitemap bug?
(Reporter)

Comment 15

5 years ago
All: Given everything else we are working on now, let's put this on hold until later in Q2. I still think it will help, but we have bigger priorities now.
Status: ASSIGNED → NEW
(Reporter)

Comment 16

2 years ago
Here's a good example of an XML sitemap of a website that has a lot of sub-sites with their own sub-navigation: https://www.apple.com/sitemap.xml
Now is a great time for us to resurrect this effort. It's high on the list of marketing priorities[0]. 

An optimal approach would
* generate this sitemap from a more authoritative source than http crawls (e.g. from bedrock itself)
* give us an opportunity to choose the priority of certain elements in the sitemap (e.g. firefox marketing pages) in an effort to shape search results.

[0] https://docs.google.com/spreadsheets/d/1fizrZ92kNr6sJSMizxl343OF7F-BCJHEUG5TfWj1Gs8/edit#gid=466760365
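
As an illustration of the priority idea above, here is a minimal sketch that emits the sitemap protocol's `<priority>` element (a value from 0.0 to 1.0) for each URL. The `build_prioritized_urlset` helper and the page list with its priority values are hypothetical; in practice the list would come from bedrock itself rather than being hard-coded.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_prioritized_urlset(entries):
    """Return a <urlset> string from (url, priority) pairs.

    The sitemap protocol defines priority as a value from 0.0 to 1.0.
    """
    root = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url, priority in entries:
        if not 0.0 <= priority <= 1.0:
            raise ValueError(f"priority out of range for {url}: {priority}")
        url_el = ET.SubElement(root, "url")
        ET.SubElement(url_el, "loc").text = url
        ET.SubElement(url_el, "priority").text = f"{priority:.1f}"
    return ET.tostring(root, encoding="unicode")


# Hypothetical pages and priorities, for illustration only.
PAGES = [
    ("https://www.mozilla.org/en-US/firefox/", 1.0),
    ("https://www.mozilla.org/en-US/about/", 0.5),
]
print(build_prioritized_urlset(PAGES))
```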
(Reporter)

Comment 19

2 years ago
One more thing here: we also need to make sure that sitemap.xml is linked from robots.txt:

https://www.mozilla.org/robots.txt

i.e.

"Sitemap: https://www.mozilla.org/sitemap.xml"

Please note that the sitemap URL in robots.txt should be the full absolute URL, not a relative one like the rest of the URLs in the file. See examples at https://www.apple.com/robots.txt (bottom) and https://www.google.com/robots.txt (bottom).
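
A quick way to check this requirement is sketched below with Python's standard `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The helper names `sitemap_urls` and `is_absolute` and the sample robots.txt content are hypothetical.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt):
    """Return the Sitemap entries declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


def is_absolute(url):
    """True when the URL carries both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.mozilla.org/sitemap.xml
"""

for url in sitemap_urls(SAMPLE):
    print(url, "absolute" if is_absolute(url) else "NOT absolute")
```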
Doh, I totally missed the Django sitemap framework the last time I baked my pull request ;) I'm happy to work on this again but my question now is: will the sitemap include all pages on Bedrock or only major pages? The purpose of Bug 1369738 is the latter, I guess...
(Reporter)

Comment 21

2 years ago
(In reply to Kohei Yoshino [:kohei] from comment #20)
> Doh, I totally missed the Django sitemap framework the last time I baked my
> pull request ;) I'm happy to work on this again but my question now is: will
> the sitemap include all pages on Bedrock or only major pages? The purpose of
> Bug 1369738 is the latter, I guess...

Peter German has worked on a spreadsheet to capture all of the URLs to be included in the v1.0 of this sitemap: https://docs.google.com/spreadsheets/d/1Sq-o-R9XjO9VPKaOL-aprOIiWNgvFttZ8XHeSquAWH4/edit#gid=1400086798

Peter: what is the difference between this bug and bug 1369738? If there is no difference, we should keep this bug for historical context; if the bugs are different but related, they should be linked together and given distinct titles.
Flags: needinfo?(pgerman)

Comment 22

2 years ago
I was asked to create a new bug for this. I'll reference this for context.
Flags: needinfo?(pgerman)
new bug 1369738
(Reporter)

Updated

2 years ago
Summary: Create a dynamic XML sitemap of all indexable URLs in [Bedrock] → XML sitemap missing from www.mozilla.org
(Reporter)

Updated

2 years ago
Depends on: 1369738
I think Bug 1369738 has covered this.
Assignee: kohei.yoshino → nobody
No longer blocks: 629786
Status: NEW → RESOLVED
Last Resolved: 6 months ago
No longer depends on: 1369738
Resolution: --- → DUPLICATE
Duplicate of bug: 1369738