Closed Bug 599071 Opened 14 years ago Closed 14 years ago

Archive Mozilla Service site

Categories

(Websites :: website-archive.mozilla.org, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davidwboswell, Assigned: ryansnyder)

Details

(Whiteboard: [privacy] )

Following up a discussion at the web site task force, we want to archive the Mozilla Service Week site and use that as a model for archiving other sites.

Specifically, we want to add some sort of header to the site pages letting people know this site is no longer active but still contains historically interesting information.  We can also provide a wrap-up saying what the site accomplished; that could be either a link to a blog post in the header or an updated home page or about page.

In addition, we should make the site read-only so that people can no longer submit information or create accounts, if there are still ways to do so.
In order to alleviate the need to continue hosting this site and database (which require security updates by Ops if we keep the site up and running) I think we're going to need to look at a proper archiving method for the retired sites, a la the Wayback Machine.

http://web.archive.org/web/20050130093517/www.mozilla.org/

I'm going to propose that we write (or find an existing) script that scrapes all of the pages from the site we are trying to archive and places them on a site where all of the retired Mozilla sites are archived.

In initial testing, Wget does a good job of archiving an entire site.  For example, running the following command:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains mozillaservice.org http://mozillaservice.org

Results in the following archival of the site:
http://mozilla-archive.ryansnyder.me/mozillaservice.org/

Once properly archived on the Mozilla archive collection, the original site and database can be retired.  All requests to the original site can be redirected to the archived site.

The next thing we'll need to do is create a script that will automatically add a notice to the top of each page on the site, much like http://www-archive.mozilla.org/contact/ .  Within the notice, we could link to a wiki page giving more details about the site, such as when it was created, when it was retired and why it was retired.
To confirm that I understand the suggestion in comment #1, does this mean that we would create one site (archive.mozilla.org maybe) that would host all of the sites that have been archived (ex. archive.mozilla.org/mozillaservice.org)?  

People may or may not go directly to the top level archive.mozilla.org page since there are redirects from the URLs of the original sites, but we could create some sort of intro page that acts as a curated guide for what's going on?

That sounds fine to me.

BTW, I got a 'This Connection Is Untrusted' notice when I tried viewing one of the localized pages linked to from the language picker at the top.  I didn't come across that in limited surfing on the English version.
Thanks David.

Yes, sorry I didn't clarify that in comment #1.  

I think we'll want to create a website called website-archive.mozilla.org (or something like that) to host archived versions of our websites.

Then we can host all of the archived websites under one subdomain of mozilla.org.  The landing page for this subdomain could include information about site archiving and links to each of the archived websites. The structure could look something like:
http://website-archive.mozilla.org/ 
http://website-archive.mozilla.org/mozillaservice.org/
http://website-archive.mozilla.org/firefoxccstudio.org/
http://website-archive.mozilla.org/operationfirefox.com/

Doing this would allow us to offload server and database resources, as well as eliminate website maintenance for sites that we no longer need.  The new archived sites would not have databases and would only be .html, .js and .css pages that would be nearly exact representations of their last known state before retirement.  Some site functionality would not work for these archived sites, but the information on each page would remain intact.

If we move forward with fully archiving websites, another thing we may want to define is the acceptable representation of the archived site.  For example, for a localized website, will we need to archive the website only in English, or do we need to archive every single language that is available on that site?  What is acceptable in the case of an AJAX application, which cannot be completely scraped by a script?  How thoroughly do we need to QA an archived website before its retirement?  I suppose most of these questions can be answered on a site by site basis, but I wanted to pose a few that came up while I was in prototyping mode.

The Untrusted Connection notice will appear if the site is viewed over https.  I created a quick subdomain on my personal site to demonstrate how archiving might work and didn't set up an SSL certificate for it.
Everything in comment #3 sounds good to me and I agree that we'll only really be able to answer these open questions by going through this with a few sites.

What are next steps?  Do we need to open a new bug to set up website-archive.mozilla.org before moving forward with this one?
Great.  

I'll work on the next step, which is finding a way to automatically insert text at the top of each page on the archived site, much like:
http://www-archive.mozilla.org/contact/

What do we want the text atop the archived mozillaservice.org to say?

Once I can automate the text insertion, we can move on to the next step, which will be to create a ticket to request the creation of the site website-archive.mozilla.org and the archiving of mozillaservice.org.  I can take care of this step when we're ready.
Mary, could you post some text for the header for the archived mozillaservice.org?
Morgamic also mentioned that we will need to ensure that all of the forms that collect personal information on an archived site are disabled and tested to verify.
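A minimal sketch of how that check could be automated, in Python. It assumes the archived pages are static HTML files and uses a regular expression rather than a real parser, so it is a rough check and not a substitute for manual testing. `strip_forms` and `has_forms` are hypothetical helper names, not part of any existing archive script.

```python
import re

# Matches a whole <form ...> ... </form> block, across lines, any letter case.
FORM_BLOCK = re.compile(r"<form\b.*?</form\s*>", re.IGNORECASE | re.DOTALL)

# Matches just an opening <form> tag; used to verify nothing was missed.
FORM_OPEN = re.compile(r"<form\b", re.IGNORECASE)


def strip_forms(html, placeholder=""):
    """Replace every complete <form> block with the placeholder text."""
    return FORM_BLOCK.sub(placeholder, html)


def has_forms(html):
    """Return True if any opening <form> tag is still present."""
    return FORM_OPEN.search(html) is not None
```

Running `has_forms` over every file after `strip_forms` would flag pages where a form was left unclosed in the source and so was not removed.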
Status: NEW → ASSIGNED
If you want to do more advanced HTML rewriting, lxml has a lot of options for that and I can help with it.  (Probably simple text substitution will work?)
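As a rough illustration of the simple text substitution option, the Python sketch below inserts a notice right after the opening `<body>` tag. The banner markup, the `archive-banner` id and the `add_banner` name are placeholders made up for this example; pages without a `<body>` tag are returned unchanged.

```python
import re

# Placeholder banner; the real wording and markup are still to be decided.
BANNER = '<div id="archive-banner">You are currently viewing an archived site.</div>'

# Matches the opening <body> tag, with or without attributes.
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def add_banner(html, banner=BANNER):
    """Insert the banner immediately after the first opening <body> tag."""
    if banner in html:
        return html  # already inserted; keeps repeated runs from stacking banners
    return BODY_OPEN.sub(lambda m: m.group(0) + "\n" + banner, html, count=1)
```

Because the function returns already-processed pages untouched, it can be rerun over the whole mirror after a partial scrape.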

Two obvious issues I found when running a mirror:

1. Adding .html changes the URL.  One way to handle that would be:

RewriteCond %{PATH_INFO}.html -f
RewriteRule (.*) $1.html

(You might want an internal or external redirect, I don't know that it matters)

2. Some URLs include a query string.  It may or may not be meaningful, but either way I believe the mirror won't really work.  Another rewrite rule might be:

RewriteCond %{PATH_INFO}?%{QUERY_STRING} -f
RewriteRule (.*) $1?%{QUERY_STRING}

You might need to %-encode the ? on one of those lines, I'm not sure.

In some cases you could simply strip the query string from the filename, and Apache will ignore it.  But if you are archiving something like a blog with URLs like /?p=30 then those query strings are significant.

These should be able to go in .htaccess files, but they will have to be subtly different and I have never understood exactly how.
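To make the lookup order behind those two rules explicit, here is a small Python sketch that lists the file names a mirror might contain for a given URL. It rests on assumptions that should be checked against the actual mirror: that directory URLs are saved as `index.html`, that `.html` is appended to names that lack it, and that `@` replaces `?` in file names when `--restrict-file-names=windows` is used. `mirror_candidates` is a hypothetical name, and the real deployment would still express this in Apache configuration.

```python
from urllib.parse import urlsplit


def mirror_candidates(url):
    """Return relative file names to look for, most specific first."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed default name for directory URLs
    names = []
    if parts.query:
        with_query = path + "@" + parts.query  # assumed: "@" stands in for "?"
        names += [with_query + ".html", with_query]
    if not path.endswith((".html", ".htm")):
        names.append(path + ".html")  # assumed: extension added by the mirror
    names.append(path)
    return names
```

For a blog-style URL such as `/?p=30`, the query-bearing names come first, so a significant query string is not silently dropped in favour of the plain `index.html`.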

Another option with Apache configuration might be a type map: http://httpd.apache.org/docs/2.2/mod/mod_negotiation.html#typemaps -- this lets you put a file in place with headers intact, e.g., a file without an extension.  This probably isn't needed here, though.

You might also want to be careful with encodings, either to make sure the encodings are represented inline in the pages (meta http-equiv), or that you've normalized everything to the same encoding that Apache sends out (e.g., text/html; charset=UTF-8).
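One way to spot pages that need attention is sketched below in Python: it reads the charset declared in a `<meta>` tag near the top of a page and compares it with what the server is assumed to send. The 4096-byte window, the `utf-8` default and both function names are assumptions for illustration only.

```python
import re

# Finds charset=... inside a <meta> tag, covering both the http-equiv form
# and the shorter <meta charset="..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def declared_charset(raw):
    """Return the lowercased charset declared inline, or None if absent."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def conflicts_with_server(raw, served="utf-8"):
    """Return True if the page declares a charset different from the served one."""
    declared = declared_charset(raw)
    return declared is not None and declared != served.lower()
```

Pages where `declared_charset` returns `None` depend entirely on the server header, so those are the ones to normalize or spot-check by hand.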
(In reply to comment #6)
> Mary, could you post some text for the header for the archived
> mozillaservice.org?

Yes, I am still a bit swamped in email.  Sorry for dragging my bum on this!
So late with this...here is the copy:

You are currently viewing an archived site.  Mozilla Service Week took place from September 14 - 21, 2009.  Over 11,000 service hours were donated by our awesome community to help organizations and local communities around the world. Thanks for making a difference!
Component: mozillaservice.org → website-archive.mozilla.org
QA Contact: mozillaservice-org → website-archive-mozilla-org
Blocks: 607187
Whiteboard: [privacy]
I am in the process of downloading this site and archiving it.  It's taking a long time because of all of the 500 errors the script is encountering randomly across the site.  I'm running the 6th scrape of the site now, and hope by tomorrow that I will have enough pages that I can proceed with the remaining steps of the archive process.
I've created Bug 612692 to deploy the Mozilla Website Archive.  This will contain the archived version of mozillaservice.org.  

Once this site is online, I'll ask Mary and David (and whoever else would like to pitch in) to take a look at the site and make sure it works for everybody.  If there are issues, let me know and I can take a look at them and resolve them.

After approval, we can take the next step which will be to take down mozillaservice.org and ensure that website visitors are redirected to the archived site on the Mozilla Website Archive.
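The redirect target for any old URL can be sketched as a simple mapping, shown here in Python. It assumes the archive keeps each site under a directory named after its domain, as in the structure proposed earlier in this bug; stripping a leading `www.` and dropping query strings are choices made for this example, and `archive_url` is a hypothetical name. The actual redirect would be configured on the web server.

```python
from urllib.parse import urlsplit

# Archive host as proposed in this bug.
ARCHIVE_ROOT = "http://website-archive.mozilla.org"


def archive_url(original_url):
    """Map a URL on a retired site to its location in the archive."""
    parts = urlsplit(original_url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assumed: archive directories omit the www prefix
    path = parts.path or "/"
    return ARCHIVE_ROOT + "/" + host + path
```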

The first draft for the process for archiving a website is available here:
https://wiki.mozilla.org/Websites/Taskforce/Proposals/Abandoned_Sites/Archive

FYI - There were numerous issues (mainly 500 errors) with the localized pages.  Instead of scraping all of the localized pages individually, I just archived all of the English pages.  If that's an issue, let me know, but it would take a while to work through all the errors I encountered.
The Mozilla Website Archive is now available at:
http://website-archive.mozilla.org/

The Mozilla Website Archive home page is bare, and Bug 612694 has been filed to create a homepage for this site.

The archived version of mozillaservice.org is now available at:
http://website-archive.mozilla.org/mozillaservice.org/

All English pages have been scraped and are available on the archived mozillaservice.org.  Non-English and https:// pages were not scraped due to numerous 500 errors on the production site.  A banner was added to the top of the site and includes Mary's text.  Forms and personal information have been removed from the site.

Mary, please confirm that the site looks good to you.  Once we receive this confirmation we will proceed with shutting down the existing mozillaservice.org and redirecting website visitors to the archived version of the site.
Hi there: Took a look and it looks great, but want to confirm that the forms will be removed.  For instance, right now when I click on add hours/register it takes me to the live site.  What will the user experience be like when it's completely archived?

Also, thinking it might be best to left justify the banner on the home page so that it doesn't overlap with the plane.

Thanks!
Thanks Mary!  Okay, I'll take care of the banner.

The user experience of an archived site is not intended to be perfect.  Users visiting the login page will hit a 404 Not Found error.
Gotcha - we're good to go once the banner is fixed :)
I've updated the banner.  The fix will be pushed to production in Bug 614698, which will also be used to retire the existing production site.
Assignee: nobody → ryan
The Mozilla Service Week website has been archived, the existing site taken down, and website visitors are now being directed to the archived version of the site.

http://website-archive.mozilla.org/mozillaservice.org
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED