Firefox/Projects/Places DB Creation Scripts: Difference between revisions



;Description
:Create python scripts to generate Places DBs with various characteristics such as "many visits within the same domain", "visits across many domains", "many tags", "many bookmarks", etc. Also, collect data from real-world users to inform the profiles of our generated DBs.
 
= Status =
 
Sprint's been on the back burner while we're getting Firefox 3.5b4 out the door.
 
We have been collecting stats from the Mozilla community at https://places-stats.mozilla.com/ since early March. The stats will inform our database generation script.
 
The database generation script (Python) is in progress; patches are up on {{bug|480340}}. If you are feeling adventurous, please download the script and try it out. The bootstrapping steps still need better documentation, so feel free to ping ddahl in #places.
 
Relevant links:
 
* [http://forums.mozillazine.org/viewtopic.php?f=23&t=1172765 Mozillazine forum posting] about stats collection portion
* [http://daviddahl.blogspot.com/2009/03/places-database-generator-stats.html ddahl's blog post] about database generation script
* [http://blog.mozilla.com/adw/2009/03/25/places-stats/ adw's blog post] about stats collection implementation and initial results


= Goals / Use Cases =


'''The chief goal''' is to be able to automate the generation of these sample sqlite databases for a continuous test to run on Places. We want to be able to reliably set some benchmarks and see what code changes either slow down or speed up queries in Places.
 
The sample data set should actually be quite huge (according to Beltzner and Shaver). We should collect stats from users so that our sample databases reflect real-world use.
 
Next step: take as input to the generation script the data we gather from the stats web page.
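As a rough illustration of that next step, the sketch below takes a few summary numbers (standing in for what the stats page reports) and fills a SQLite file with synthetic places and visits. The table layout is a heavily simplified stand-in for the real Places schema, and <code>generate_db</code>, <code>num_places</code>, <code>visits_per_place</code>, and <code>num_domains</code> are names invented for this example; none of this is taken from the patches on the bug.

```python
import random
import sqlite3

# Simplified stand-in for the Places schema (assumption: the real tables
# have more columns, indexes, and triggers than shown here).
SCHEMA = """
CREATE TABLE moz_places (
    id INTEGER PRIMARY KEY,
    url TEXT,
    visit_count INTEGER DEFAULT 0
);
CREATE TABLE moz_historyvisits (
    id INTEGER PRIMARY KEY,
    place_id INTEGER,
    visit_date INTEGER,
    visit_type INTEGER
);
"""


def generate_db(path, num_places, visits_per_place, num_domains, seed=0):
    """Fill a new SQLite database with synthetic places and visits.

    A small num_domains gives "many visits within the same domain";
    a large one gives "visits across many domains". The path must not
    already contain these tables.
    """
    rng = random.Random(seed)  # fixed seed so benchmark runs are repeatable
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    for place_id in range(1, num_places + 1):
        domain = "site%d.example.com" % rng.randrange(num_domains)
        url = "http://%s/page/%d" % (domain, place_id)
        conn.execute(
            "INSERT INTO moz_places (id, url, visit_count) VALUES (?, ?, ?)",
            (place_id, url, visits_per_place),
        )
        # Placeholder timestamps and a placeholder visit type of 1.
        visits = [
            (place_id, rng.randrange(10**15, 2 * 10**15), 1)
            for _ in range(visits_per_place)
        ]
        conn.executemany(
            "INSERT INTO moz_historyvisits (place_id, visit_date, visit_type) "
            "VALUES (?, ?, ?)",
            visits,
        )
    conn.commit()
    return conn
```

For example, <code>generate_db("one-domain.sqlite", num_places=1000, visits_per_place=50, num_domains=1)</code> would produce a database where every visit lands on the same host.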


= Non Goals =
 
Creating a sample database for every little niche use case.  If at some point it becomes important to test a little niche use case, fine, our generator script should be able to handle it, but we will not be doing so at the outset.
 
Going out of our way to collect data that would help other teams/sprinters at Mozilla. If we can share our results with others because it would help them, fantastic. But time is short and we need to get going, so we can't accommodate everyone. Maybe later.


= Design =
* Keywords


Shawn says:
We can come up with different data points in each dimension, then take the Cartesian product across all dimensions to get a full suite of databases. A user of our script should be able to specify a point in each dimension, and the script generates the corresponding database.
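A minimal sketch of the product idea, using <code>itertools.product</code>. The dimension names and data points here are made up for illustration; the real ones would come from the dimensions listed above and from the collected stats.

```python
import itertools

# Hypothetical dimensions and data points, for illustration only.
DIMENSIONS = {
    "places": [1000, 100000],
    "visits_per_place": [1, 50],
    "bookmarks": [0, 5000],
}


def full_suite(dimensions):
    """Yield one dict per database: one point chosen in every dimension."""
    names = sorted(dimensions)
    for combo in itertools.product(*(dimensions[n] for n in names)):
        yield dict(zip(names, combo))


suite = list(full_suite(DIMENSIONS))
print(len(suite))  # 2 * 2 * 2 = 8 configurations
```

Each dict in <code>suite</code> is one point in the space and could be handed to the generator as its parameters. Note that the suite grows multiplicatively with every added dimension, so the number of points per dimension should stay small.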
 
= Implementation =
 
=== Database generator ===
 
Set up Django 1.0.2. Download the tarball:

http://www.djangoproject.com/download/1.0.2/tarball/

Uncompress it and run:

 sudo python setup.py install

Add the Django bin directory to your path:

 export PATH=$PATH:~/code/python/django/bin:~/code/python

Create the project and the app. Run startapp from inside the project directory so that builddb/ ends up where the inspectdb step below expects it:

 cd ~/code/python
 django-admin.py startproject places
 cd places
 django-admin.py startapp builddb

Copy a places.sqlite file to ~/code/python/places, then set up the environment:

 export PLACES_DB_PATH=~/code/python/places/places.sqlite
 export DJANGO_SETTINGS_MODULE=places.settings
 export PYTHONPATH=$PYTHONPATH:~/code/python

Edit places/settings.py:

 import os
 DATABASE_ENGINE = 'sqlite3'
 DATABASE_NAME = os.environ['PLACES_DB_PATH']

Reverse engineer the Django models from the schema:

 cd ~/code/python/places
 python manage.py inspectdb >> builddb/models.py

Now we need to clean up the foreign keys.
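One way to find the columns that need attention is to scan the schema for columns whose names look like references to other tables, since inspectdb may emit them as plain integer fields. The sketch below is only a heuristic built on SQLite's <code>PRAGMA table_info</code>; the <code>_id</code> naming convention and the two-table demo schema are assumptions made for the example, and <code>candidate_foreign_keys</code> is a hypothetical helper.

```python
import sqlite3


def candidate_foreign_keys(conn):
    """Return (table, column) pairs whose column name ends in '_id'."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    found = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        # Table names come from sqlite_master itself, so simple quoting is enough.
        for _cid, name, _type, _notnull, _default, _pk in conn.execute(
            'PRAGMA table_info("%s")' % table
        ):
            if name.endswith("_id"):
                found.append((table, name))
    return found


# Demo on a tiny, simplified schema (not the full Places schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, favicon_id INTEGER);
CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY, place_id INTEGER,
                                visit_date INTEGER);
""")
print(candidate_foreign_keys(conn))
```

Each reported pair is a column to check by hand in builddb/models.py and, where it really is a reference, to redeclare as a foreign key to the matching model.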
 
=== Stats collector ===
 
https://places-stats.mozilla.com/
 
The stats collector is a CGI script written in Ruby, located at the above address. Visitors are presented with instructions on how to submit statistics about their Places databases. They copy a small piece of JavaScript (located at https://places-stats.mozilla.com/places.js and embedded in the page), paste it into Firefox's JavaScript console, and evaluate it. The JavaScript computes numerous statistics from their Places database, presents them to the user, and allows the user to submit them to the site. Once submitted, the stats are inserted into a MySQL database, from which they are presented to all visitors to the site.
 
We will publicize the site via blogs, forums, and wherever else to solicit submissions from the community.
 
= Bugs =
* {{bug|480340}}
* https://places-stats.mozilla.com/
 
= Misc notes for ddahl and adw =
 
=== Awesomebar autocomplete ===
 
How should AutoComplete be stressed?  Shawn says:


* http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp
</pre>


AutoComplete is definitely important, but we'd like our database construction scripts/methodology to be general enough to generate places databases for any kind of testing context.
 


=== Frecency ===


* [https://developer.mozilla.org/en/The_Places_frecency_algorithm Algorithm description], though sdwilsh says this may be out of date
* Actual frecency calculation at nsNavHistory::CalculateFrecencyInternal(), http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp#7275


=== Stats we should have collected but did not ===


For each data point:


* Distribution of moz_historyvisits.visit_type. This value is one of the nsINavHistoryService.TRANSITION_* constants.
 
* Distribution of moz_places.typed
* Distribution of moz_places.frecency
 
* Nested folder stats (ddahl)
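
If a later round of collection picks these up, each of the distributions above is a single GROUP BY over the relevant column. A minimal sketch, assuming direct read access to a copy of places.sqlite rather than the JavaScript the stats page uses; <code>distribution</code> is a hypothetical helper, and the demo table holds only the one column of interest.

```python
import sqlite3


def distribution(conn, table, column):
    """Count how many rows hold each distinct value of table.column."""
    # table and column are trusted identifiers chosen by the caller,
    # not user input, so they are interpolated directly.
    query = "SELECT %s, COUNT(*) FROM %s GROUP BY %s" % (column, table, column)
    return dict(conn.execute(query).fetchall())


# Demo on an in-memory table with made-up visit types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY, visit_type INTEGER)"
)
conn.executemany(
    "INSERT INTO moz_historyvisits (visit_type) VALUES (?)",
    [(1,), (1,), (2,), (5,)],
)
print(distribution(conn, "moz_historyvisits", "visit_type"))
```

The same call works for moz_places.typed. For moz_places.frecency, which takes many distinct values, the counts would probably need to be bucketed into ranges before they are useful.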