MozillaQualityAssurance:Web Page Set

Goals

  • To create a baseline set of 'clean' web pages for use in performance testing
    • Pages should be collected from the top 50-100 ranked sites
    • Pages need to be collected in their entirety; if a page relies on another web site's content, that content should also be downloaded
    • For completeness, the web pages should be scrubbed of any dangling links to non-local web pages

Status

  • getpages.sh
    • getpages.sh sitelist.txt indexlist.txt
      • sitelist.txt - a list of web sites, one per line in the form http://site
      • indexlist.txt - the location of each site's index file is written to this file, one per line

#! /bin/bash
#takes two inputs, $1 = file containing list of web pages of form http://pagename
#                  $2 = output file where the list of index files is dumped - useful for placing the list of links into scripts
#
# web pages are dropped in directories named for their urls
if [ $# != 2 ]; then
        echo 'missing command line arguments'
        echo
        echo 'usage: getpages.sh inputfile outputfile'
        echo '  inputfile: file containing one url per line of the form http://url'
        echo '  outputfile: file to be created during execution, contains a list of index files one per url'
        exit 1
fi
for URL in $(cat "$1"); do
        #strip the leading http:// from the url
        CLEANURL=$(echo "$URL" | sed -e 's/http:\/\/\(.*\)/\1/')
        #create a directory with the cleaned url as the name
        mkdir -p "$CLEANURL"
        cd "$CLEANURL" || continue
        wget -p -k -H -E -erobots=off --no-check-certificate -U "Mozilla/5.0 (firefox)" --restrict-file-names=windows "$URL" -o outputlog.txt
        #figure out where/what the index file for this page is from the wget output log
        FILENAME=$(grep "saved" outputlog.txt | head -1 | sed -e "s/.*\`\(.*\)\'.*/\1/")
        rm outputlog.txt
        cd ..
        #add the index file link to the list of index files being generated
        echo "$CLEANURL/$FILENAME" >> "$2"
done
#call the cleanse.sh script to disable any dangling non-local urls found in the web pages
./cleanse.sh "$1"
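
An example run, using a hypothetical one-line site list (the exact index file name recorded in indexlist.txt depends on what wget saves for each site):

echo 'http://www.example.com' > sitelist.txt
./getpages.sh sitelist.txt indexlist.txt
cat indexlist.txt
#   www.example.com/index.html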

  • cleanse.sh (provided by Darin Fisher)
    • modified to take a sitelist.txt file as input

#!/bin/sh
if [ $# != 1 ]; then
  echo 'usage: cleanse.sh sitelist'
  echo '  sitelist: file containing one url per line of the form http://url'
  exit 1
fi
# generates the list of files to be cleansed (exclude image files)
get_files() {
  find "$1" -type f -a ! -iname \*.jpg -a ! -iname \*.gif -a ! -iname \*.png -a ! -name \*.bak
}
# TODO(darin): make this more performant
cleanse_file() {
  echo "______cleansing $1"
  perl -pi -e 's/[a-zA-Z0-9_]*\.writeln/void/g' "$1"
  perl -pi -e 's/[a-zA-Z0-9_]*\.write/void/g' "$1"
  perl -pi -e 's/[a-zA-Z0-9_]*\.open/void/g' "$1"
  perl -pi -e 's/"https/"httpsdisabled/gi' "$1"
  perl -pi -e 's/"http/"httpdisabled/gi' "$1"
  perl -pi -e 's/<object/<objectdisabled/gi' "$1"
  perl -pi -e 's/<embed/<embeddisabled/gi' "$1"
  perl -pi -e 's/load/loaddisabled/g' "$1"
}
for URL in $(cat "$1"); do
  #strip the leading http:// from the url to get the directory created by getpages.sh
  CLEANURL=$(echo "$URL" | sed -e 's/http:\/\/\(.*\)/\1/')
  files=$(get_files "$CLEANURL")
  for f in $files; do
    cleanse_file "$f"
  done
done
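
To illustrate the effect of the cleansing, here is a quick check of two of the substitutions on a hypothetical throwaway file (the file name and contents are made up):

cat > sample.html <<'EOF'
<script>document.write("<a href=#>ad</a>");</script>
<a href="http://example.com/out">external link</a>
EOF
perl -pi -e 's/[a-zA-Z0-9_]*\.write/void/g' sample.html
perl -pi -e 's/"http/"httpdisabled/gi' sample.html
cat sample.html
#   <script>void("<a href=#>ad</a>");</script>
#   <a href="httpdisabled://example.com/out">external link</a>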


  • Should also consider creating a generic way to iterate over these web pages and collect their loading time - this is a basic operation that could be used in multiple kinds of tests (see the sketch below)
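
A minimal sketch of such an iteration, assuming a hypothetical measure_load_time command that loads a single local page in the browser and prints its load time in milliseconds (the real timing mechanism still needs to be chosen):

#! /bin/bash
# iterate over the index files produced by getpages.sh and record a load time for each page
if [ $# != 2 ]; then
        echo 'usage: timepages.sh indexlist resultfile'
        exit 1
fi
while read -r INDEXFILE; do
        #skip blank lines in the index list
        [ -z "$INDEXFILE" ] && continue
        #measure_load_time is hypothetical - substitute the actual timing harness here
        LOADTIME=$(measure_load_time "file://$(pwd)/$INDEXFILE")
        echo "$INDEXFILE $LOADTIME" >> "$2"
done < "$1"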

Issues

  • For double protection, should we be running these pages on a local web server that is firewalled so that it cannot call out to the live web? (see the sketch after this list)
  • The best results (the most accurate reproduction of the live web page) are obtained with the version of wget described at http://www.mail-archive.com/wget@sunsite.dk/msg09142.html
    • Should we host binaries of this version of wget for the major operating systems? This would allow people to get better web pages without having to build their own wget
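
One possible setup for the firewalled local server, as a sketch only (the pages directory, port, and firewall tool are assumptions; the iptables rules require root and should be removed after testing):

#serve the collected page directories on localhost only
cd pages && python3 -m http.server 8000 --bind 127.0.0.1 &
#allow loopback traffic, then reject outbound HTTP/HTTPS so stray links cannot reach the live web
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -p tcp --dport 80 -j REJECT
iptables -A OUTPUT -p tcp --dport 443 -j REJECT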

To Be Resolved

  • How do we account for the latency involved in contacting the server for these pages, since we are only interested in how long it takes Firefox to lay out and display the page?