Websites/Taskforce/Proposals/Abandoned Sites/Archive

From MozillaWiki

Follow the steps below to archive a Mozilla website that has been abandoned or is being retired.

Archived Mozilla websites will be made available at http://website-archive.mozilla.org and listed in the Retired Sites list.

Web Dev Actions

The following actions will be performed by a member of the web development team.

Subversion

The Subversion repository for the Mozilla Website Archive is available at http://svn.mozilla.org/projects/website-archive.mozilla.org

Follow the svn instructions for Mozilla Subversion access. Once you have access, you may check out the website-archive repository.

 svn checkout svn+ssh://svn.mozilla.org/projects/website-archive.mozilla.org  

Initial Archive

The initial archive can be performed using wget. This will scrape the entire site and save it as HTML, JavaScript and CSS files. It will also save each index file with an .html extension.

Change into the root of the checkout directory and execute the following wget command.

 cd website-archive.mozilla.org
 wget -rpEkH -nc -t 25 -w 2 --random-wait --retry-connrefused --no-check-certificate -R '*.pdf' -R '*.bz2' -R '*.gz' -R '*.mov' -R '*.fla' -R '*.xml' -R '*.json' -R '*.rss' -D mozillaservice.org http://mozillaservice.org

This method scrapes and archives most of the website. It excludes file types that we don't want to download because of space constraints, such as PDF files and compressed archives (the list may vary on a site-by-site basis).

For a site of approximately 1,500 pages, this process took about 2 hours to finish. If you're not concerned about server load for this particular site, you can remove the -w 2 (--wait) and --random-wait flags to be more aggressive towards the server.
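Because the list of excluded file types varies from site to site, it can help to build the reject list from arguments instead of editing the long command by hand. The following is a minimal sketch, not part of the established process: build_reject_list and archive_site are hypothetical helper names, and it assumes wget's -R option is given a single comma-separated list of patterns.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of file extensions into one
# comma-separated pattern list suitable for wget's -R option.
build_reject_list() {
    list=""
    for ext in "$@"; do
        # Append "*.ext", separating entries with commas.
        list="${list:+$list,}*.${ext}"
    done
    printf '%s\n' "$list"
}

# Hypothetical wrapper around the wget command shown above.
# $1 is the domain to archive; the remaining arguments are extensions to skip.
archive_site() {
    domain=$1
    shift
    wget -rpEkH -nc -t 25 -w 2 --random-wait --retry-connrefused \
         --no-check-certificate -R "$(build_reject_list "$@")" \
         -D "$domain" "http://$domain"
}
```

For the example site, this would be called as archive_site mozillaservice.org pdf bz2 gz mov fla xml json rss from the root of the checkout directory.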

Privacy Actions

Once the site has been downloaded locally in its entirety, you will need to remove all code that refers to or collects user-identifiable information.

Forms

Forms that request user information like email addresses and passwords will need to be removed from the codebase. Currently, we are handling this process manually by grepping for the forms.

 grep -rn form * | grep action | grep -v svn
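The grep pipeline above only finds forms whose opening tag and action attribute share a line. As a complementary check, the sketch below lists every archived .html file that still contains an opening form tag at all, so each can be reviewed by hand. The function name list_form_files is hypothetical, and the --include and --exclude-dir options assume GNU or BSD grep.

```shell
#!/bin/sh
# Hypothetical helper: list the HTML files under a directory (default: the
# current directory) that contain an opening <form tag, skipping .svn metadata.
list_form_files() {
    root=${1:-.}
    # -r recurse, -l print file names only, -i ignore case.
    grep -rli --include='*.html' --exclude-dir='.svn' '<form' "$root"
}
```

Running list_form_files mozillaservice.org after the privacy clean-up should print nothing if every form has been removed.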

Email Addresses

All user-identifiable information, such as email addresses, will need to be removed from the code. To locate email addresses in the code base, you may use the following egrep statement.

 egrep -rn "\w+([._-]\w+)*@\w+([._-]\w+)*\.\w{2,4}" * | grep -v svn | grep -v "mozilla.org"
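When many pages repeat the same address, a de-duplicated list is easier to review than line-by-line matches. The sketch below prints each distinct address once, leaving out addresses at mozilla.org and its subdomains. The function name list_email_addresses and the simplified address pattern are assumptions made for illustration; the pattern is not a complete email grammar.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct non-mozilla.org email address
# found under a directory (default: the current directory).
list_email_addresses() {
    root=${1:-.}
    # -o prints only the matched text, -h suppresses file names.
    grep -rhoE --exclude-dir='.svn' \
        '[[:alnum:]_.+-]+@[[:alnum:]-]+(\.[[:alnum:]-]+)+' "$root" \
        | grep -vE '[@.]mozilla\.org$' \
        | sort -u
}
```

Each address in the output can then be searched for individually to find the files that need editing.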

Resolving Redirects

After all files have been downloaded, you will need to test URLs to ensure that they are redirecting properly. More than likely, you will find that you need to alter the .htaccess file to append .html to extension-less URLs.

 Options  None +FollowSymLinks
 Order    Allow,Deny
 Allow    from all
 <IfModule mod_rewrite.c>
   RewriteEngine On
   RewriteBase /mozillaservice.org
   RewriteCond %{REQUEST_FILENAME} !-f
   RewriteCond %{REQUEST_FILENAME} !-d
   RewriteRule (.*) $1.html [L]
 </IfModule>
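Before the archive is deployed, the effect of the rewrite rule can be approximated locally. The sketch below mirrors the two RewriteCond checks: a request path that is already a file or directory is served as-is, and anything else is tried with .html appended. The function name resolve_archived_path is hypothetical, and the sketch only imitates the rule on the local file system; it does not consult Apache.

```shell
#!/bin/sh
# Hypothetical helper mirroring the rewrite rule above: print the file that
# would be served for a request path, or return non-zero if nothing matches.
# $1 is the archived site's directory, $2 is the request path.
resolve_archived_path() {
    docroot=$1
    path=${2#/}    # strip one leading slash
    if [ -f "$docroot/$path" ] || [ -d "$docroot/$path" ]; then
        # Existing files and directories are not rewritten.
        printf '%s\n' "$docroot/$path"
    elif [ -f "$docroot/$path.html" ]; then
        # Everything else gets .html appended by the RewriteRule.
        printf '%s\n' "$docroot/$path.html"
    else
        return 1
    fi
}
```

For example, resolve_archived_path mozillaservice.org /activity/stories/en_US reports whether a matching en_US.html file was saved by wget.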

Archived Header

Finally, you will need to add a note to the top of each page making the user aware that this is an archived website. We're doing this by including a JavaScript file on each page that prints the archive statement at the top of the page.

Create a new file, js/archive.js. This file will allow us to add text to each of the pages, and will allow us to easily edit one file if there is a text change in the future. Use something resembling the following text:

(The div ids used here match the #archive and #archive_text selectors in the CSS further down.)

 document.write('<div id="archive"><div id="archive_text">You are viewing an archived site in the <a href="/">Mozilla Archive</a>. Mozilla Service Week took place from September 14 - 21, 2009. Over 11,000 service hours were donated by our awesome community to help organizations and local communities around the world. Thanks for making a difference!</div></div>');

Use sed to insert the JavaScript include before the closing head tag of each of the site's .html files:

 find . -name "*.html" -print | xargs sed -i "s/<\/head/<script type=\"text\/javascript\" src=\"\/mozillaservice.org\/js\/archive.js\"><\/script><\/head/g"
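Running the sed command above a second time would insert the include twice. The sketch below is one way to make the step safe to repeat: it only edits files that do not yet reference js/archive.js. The function name add_archive_include is hypothetical, and sed -i without a suffix argument assumes GNU sed, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: add the archive.js include to every .html file under
# a directory (default: the current directory) that does not already have it.
add_archive_include() {
    root=${1:-.}
    include='<script type="text/javascript" src="/mozillaservice.org/js/archive.js"></script>'
    # grep -L lists only the files that do NOT contain the pattern.
    find "$root" -name '*.html' -exec grep -L 'js/archive.js' {} + |
    while IFS= read -r file; do
        # Use | as the sed delimiter so the slashes need no escaping.
        sed -i "s|</head|${include}</head|" "$file"
    done
}
```

The src path is specific to the example site and would need to change for other archived sites.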

Now, update the site's global CSS file accordingly, ensuring that the changes are consistent with the site's design, e.g.:

 #archive { margin: 0; position: relative; text-align: center; padding: 14px 10px 15px 10px; color: #f5f3ed; background-color: #4d5151; }
 #archive_text { margin-left: auto ; margin-right: auto ; width: 740px; text-align: center; font: bold 1.143em/1 Arial, Calibri, Helvetica, "Helvetica Neue", sans-serif; line-height: 1.2em;}
 #archive_text a, #archive_text a:hover { color: #fff; text-decoration: underline; }

Commit

Once the site has been downloaded and all privacy concerns have been handled, you will need to commit the site to Subversion. Then you will need to file an IT request in Bugzilla to have this code pushed to production.

Bugzilla

  • File a bug to decommission all related staging servers.
  • File a bug to move all open bugs to Website Graveyard or Webtools Graveyard.

Systems Operations Actions

A member of the Mozilla Systems Operations team will need to perform the following actions.

Subversion

Update the Subversion checkout for http://website-archive.mozilla.org.

Ensure that the archived site is accessible by appending the site's domain name to the URI. Using mozillaservice.org as an example, the archived site will be available at http://website-archive.mozilla.org/mozillaservice.org

GitHub

  • Make a note on the GitHub repo that the website is retired.

Website

First, expunge all user-specific data from the database, namely email addresses.

Second, back up the database for the existing website.

Next, configure Apache to issue a permanent (301) redirect from every page of the retired website to the corresponding archived page on website-archive.mozilla.org.

For example, a user accessing this page on the retired website: http://mozillaservice.org/activity/stories/en_US

should be redirected to: http://website-archive.mozilla.org/mozillaservice.org/activity/stories/en_US
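Once the redirect is in place, it can be spot-checked from the command line. The sketch below derives the expected archive URL from a retired URL using the mapping in the example above, then compares it with what the server returns. The function names archive_url_for and check_redirect are hypothetical, and the check assumes the server's Location header matches the expected URL exactly, with no added trailing slash or scheme change.

```shell
#!/bin/sh
# Hypothetical helper: compute the archive URL for a URL on a retired site.
archive_url_for() {
    url=${1#http://}     # drop the scheme, if present
    url=${url#https://}
    printf 'http://website-archive.mozilla.org/%s\n' "$url"
}

# Hypothetical helper: succeed only if the retired URL answers with a 301
# that points at the expected archive URL.
check_redirect() {
    expected=$(archive_url_for "$1")
    # -I sends a HEAD request; %{http_code} and %{redirect_url} are curl
    # --write-out variables reporting the status and the redirect target.
    actual=$(curl -s -o /dev/null -I -w '%{http_code} %{redirect_url}' "$1")
    [ "$actual" = "301 $expected" ]
}
```

For example, check_redirect http://mozillaservice.org/activity/stories/en_US exits with status 0 when the redirect behaves as described.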

Finally, remove the website code from the server.