Websites/Taskforce/Proposals/Abandoned Sites/Archive
The following steps may be followed in order to archive a Mozilla website that has been abandoned or is being retired.
Archived Mozilla websites will be made available at http://website-archive.mozilla.org
Web Dev Actions
The following actions will be performed by a member of the web development team.
Subversion
The Subversion repository for the Mozilla Website Archive is available at http://svn.mozilla.org/projects/website-archive.mozilla.org
Follow the svn instructions for Mozilla Subversion access. Once you have access, you can check out the website-archive repository.
svn checkout svn+ssh://svn.mozilla.org/projects/website-archive.mozilla.org
Initial Archive
The initial archive can be performed using wget. This will scrape the entire site into HTML, JavaScript and CSS files. It will also save each index file with an .html extension.
cd website-archive.mozilla.org
wget -rpEkH -nc --no-check-certificate \
  -R '*.pdf' -R '*.bz2' -R '*.gz' -R '*.mov' -R '*.fla' -R '*.xml' -R '*.json' -R '*.rss' \
  -D mozillaservice.org http://mozillaservice.org
This method scrapes and archives most of the website. The -R options exclude file types that we don't want to download because of the disk space they use, such as PDF files and compressed archives. The list may vary on a site-by-site basis.
For a site that is approximately 1,200 pages in size, this process took 1 minute 30 seconds and downloaded 22MB of data. If you're concerned about server usage for this particular site, you can use --wait=n and --random-wait to be less aggressive towards the server.
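As a sketch of the less aggressive variant mentioned above, the same crawl can be wrapped in a small shell function that adds --wait and --random-wait. The function name archive_site, the DRY_RUN switch and the 2-second default are illustrative assumptions, not part of the established process.

```shell
#!/bin/sh
# archive_site DOMAIN [WAIT_SECONDS]
# Hypothetical helper: runs the crawl from this page with a pause between requests.
# Set DRY_RUN=1 to print the wget command instead of running it.
archive_site() {
    domain=$1
    wait_secs=${2:-2}   # assumed default; tune it per site
    # Build the command as the function's argument list so it can be
    # either printed or executed unchanged.
    set -- wget -rpEkH -nc --no-check-certificate \
        --wait="$wait_secs" --random-wait \
        -R '*.pdf,*.bz2,*.gz,*.mov,*.fla,*.xml,*.json,*.rss' \
        -D "$domain" "http://$domain"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

For example, running archive_site mozillaservice.org 5 from inside the checkout would wait roughly five seconds between requests; wget's --random-wait varies each pause around that value.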
Privacy Actions
Once the site has been downloaded locally in its entirety, you will need to remove all code that refers to or collects user-identifiable information.
Forms
Forms that request user information like email addresses and passwords will need to be removed from the codebase. Currently, we are handling this process manually.
[Instructions forthcoming]
If you would like to help automate this process, feel free to document that process below.
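One possible first step toward automation is simply listing the pages that contain a form, so the manual review has a checklist. The sketch below assumes GNU grep; the function name list_form_pages is hypothetical, and it only finds forms, it does not remove them.

```shell
#!/bin/sh
# list_form_pages DIR
# Prints every .html file under DIR that contains an opening <form> tag,
# skipping Subversion metadata directories.
# Assumes GNU grep (for -r, --include and --exclude-dir).
list_form_pages() {
    grep -rliE --include='*.html' --exclude-dir=.svn '<form([[:space:]]|>)' "$1"
}
```

Running list_form_pages mozillaservice.org from inside the checkout prints one file path per line; each listed file still has to be edited by hand.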
Email Addresses
All user-identifiable information, such as email addresses, will need to be removed from the code. To locate email addresses in the code base, you may use the following egrep statement.
egrep -rn "\w+([._-]\w+)*@\w+([._-]\w+)*\.\w{2,4}" * | grep -v svn | grep -v "mozilla\.org"
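A variant of the search above, sketched as a shell function, prints only the matching address instead of the whole line, which makes the results easier to review. It assumes GNU grep; the function name find_emails is hypothetical, and the pattern is a loose heuristic rather than a full email validator.

```shell
#!/bin/sh
# find_emails DIR
# Prints path:line:address for each email-like string under DIR.
# Skips .svn metadata directories and addresses at mozilla.org or its subdomains.
# Assumes GNU grep (for -r, -o and --exclude-dir).
find_emails() {
    grep -rnoE --exclude-dir=.svn \
        '[[:alnum:]_]+([._-][[:alnum:]_]+)*@[[:alnum:]_]+([._-][[:alnum:]_]+)*\.[[:alpha:]]{2,}' \
        "$1" | grep -vE '@([[:alnum:]_-]+\.)*mozilla\.org$'
}
```

Because the second grep filters on the address rather than on the whole line, a line that mentions both a mozilla.org address and a third-party address is still reported for the third-party one.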
Resolving Redirects
For all files that have been downloaded, you will need to test the URLs to ensure that they redirect properly.
[Instructions forthcoming]
Commit
Once this site has been downloaded and all privacy concerns have been handled, you will need to commit the site to subversion. Then you will need to file an IT request in Bugzilla to have this code pushed to production.
Archived Header
Finally, you will need to add a note to the top of each page making the user aware that this is an archived website.
[Instructions forthcoming]
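Until those instructions are written, one way this could be done is with sed, inserting a notice directly after each page's opening body tag. This is only a sketch: the function name add_archive_banner, the banner wording and the archived-notice class are all placeholders, and it assumes the opening body tag sits on a single line.

```shell
#!/bin/sh
# add_archive_banner FILE
# Inserts a notice right after the opening <body> tag of FILE.
# The banner markup is a placeholder, not an approved design.
add_archive_banner() {
    banner='<div class="archived-notice">This is an archived website and is no longer maintained.</div>'
    # Skip files that already carry the banner so the function is safe to re-run.
    if grep -q 'class="archived-notice"' "$1"; then
        return 0
    fi
    tmp_file=$(mktemp) || return 1
    # "&" in the replacement stands for the matched <body ...> tag.
    if sed "s|<[Bb][Oo][Dd][Yy][^>]*>|&$banner|" "$1" > "$tmp_file"; then
        cat "$tmp_file" > "$1"   # overwrite in place, keeping file permissions
    fi
    rm -f "$tmp_file"
}
```

It could be applied to a whole site with a loop such as: find mozillaservice.org -name '*.html' | while read -r f; do add_archive_banner "$f"; done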
Systems Operations Actions
A member of the Mozilla Systems Operations team will need to perform the following actions.
Website
Back up the database for the existing website and take the website offline.
In Apache, configure a permanent (301) redirect from every page of the retired website to the corresponding archived page on website-archive.mozilla.org.
For example, a user accessing this page on the retired website: http://mozillaservice.org/activity/stories/en_US
should be redirected to: http://website-archive.mozilla.org/mozillaservice.org/activity/stories/en_US
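The mapping in this example can be expressed as a small shell helper, which is handy for spot-checking redirects once they are live. This is a sketch under assumptions: the function names archive_url and check_redirect are hypothetical, and check_redirect needs curl and network access.

```shell
#!/bin/sh
ARCHIVE_HOST=website-archive.mozilla.org

# archive_url URL
# Prints where a retired URL should land, following the example on this page:
# http://<site>/<path> -> http://website-archive.mozilla.org/<site>/<path>
archive_url() {
    rest=${1#*://}            # strip the scheme, leaving host/path
    echo "http://$ARCHIVE_HOST/$rest"
}

# check_redirect URL
# Requests URL without following redirects and compares the server's
# status code and redirect target with the expected 301 and archive URL.
check_redirect() {
    expected="301 $(archive_url "$1")"
    actual=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "$1")
    if [ "$actual" = "$expected" ]; then
        echo "OK   $1"
    else
        echo "FAIL $1 (got: $actual)"
        return 1
    fi
}
```

For example, check_redirect http://mozillaservice.org/activity/stories/en_US reports OK only if the server answers with a 301 pointing at the archived copy of that page.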
Subversion
Update the Subversion checkout that serves http://website-archive.mozilla.org