* The install.{js/rdf} file contains a GUID that (hopefully) uniquely identifies the add-on; we may be able to use it as the primary index
* [https://addons.mozilla.org/en-US/firefox/pages/appversions TargetApplication IDs and versions]
== Crawling ==
Crawling and parsing will probably be an intensive and time-consuming process.
* Google search results (filetype:xpi) are limited. For example, a Google search for (filetype:xpi site:addons.mozilla.org) returns only 62 hits. Search results are probably best used to supplement our data rather than as the primary source.
* How much do we crawl? How deep?
* Aggregate sites
** Two kinds
**# Hosting (AMI, AMO)
**# Linking (FoxieWire)
** Site-specific scope. Maybe crawl only specific subdomains (e.g. addons.mozilla.org/* instead of all of mozilla.org). Add-on authors sometimes link from their add-on page to a personal website that hosts a more up-to-date version.
** Mozdev/others/..
* Individual sites
** WordPress/Blogspot (can extensions be uploaded here?)
** Google/Yahoo search
*** Rich sources of information, but possibly too much information, or information of poor quality
* Bouncer
** What kind of information does Bouncer collect?
** Probably does not give context/rating/URL
** Good/bad source?
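The scoping idea above (follow pages only on one host, but still record XPI links that point elsewhere) could be sketched roughly as follows. This is only an illustration using the Python standard library; the names <code>LinkCollector</code> and <code>classify_links</code> are made up for this page, and a real crawler would also need fetching, a depth limit, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html, allowed_host):
    """Split a page's links into XPI downloads and in-scope pages to follow.

    XPI links are kept even when they point off-site, because authors
    sometimes host a newer build on a personal website. Ordinary pages are
    followed only when their host equals allowed_host (an assumption about
    how "site-specific" crawling might be defined).
    """
    collector = LinkCollector()
    collector.feed(html)
    xpi_links, follow_links = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.path.lower().endswith(".xpi"):
            xpi_links.append(absolute)
        elif parsed.hostname == allowed_host:
            follow_links.append(absolute)
    return xpi_links, follow_links
```

Matching on the exact host is the strictest reading of "addons.mozilla.org/* instead of all of mozilla.org"; a looser rule, such as a URL path prefix, would be a one-line change.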
== GUID Collisions ==
* Same extension, different version
* Same extension, same version, different website (hash comparisons?)
* Different extension, possibly malicious or coincidence
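One possible way to triage these cases with a hash comparison is sketched below. The record layout (dicts with <code>guid</code>, <code>version</code> and <code>sha256</code> keys) and the label strings are assumptions made for this example. A hash mismatch alone cannot tell a malicious repackaging from a harmless rebuild, so that case is only flagged for review.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a downloaded file's bytes."""
    return hashlib.sha256(data).hexdigest()


def classify_collision(known, candidate):
    """Compare a newly found add-on against one already in the index.

    Both arguments are dicts with 'guid', 'version' and 'sha256' keys
    (a hypothetical record layout, not an agreed format).
    """
    if known["guid"] != candidate["guid"]:
        return "no-collision"
    if known["version"] != candidate["version"]:
        # Most likely the same extension at a different version.
        return "different-version"
    if known["sha256"] == candidate["sha256"]:
        # Same extension, same version, found on a different website.
        return "same-file"
    # Same GUID and version but different bytes: a repackaged build,
    # a different extension reusing the GUID, or something malicious.
    return "needs-review"
```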
== What to Track ==
* Add-on URL (where did we find it?)
* Filename
* Supported applications and versions
* Locales it supports
* Context (entire paragraph)
* Ratings? (Site-specific)
* Categories (How?)
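If the data ends up in a database, the fields above might map onto tables roughly like this. It is a sketch using SQLite from the Python standard library; every table, column and function name here is hypothetical, and categories are left out because it is still open how to derive them.

```python
import sqlite3

# Hypothetical schema: one row per place an add-on file was found.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sighting (
    id       INTEGER PRIMARY KEY,
    guid     TEXT NOT NULL,
    version  TEXT,
    url      TEXT NOT NULL,   -- where did we find it?
    filename TEXT,
    context  TEXT,            -- surrounding paragraph
    rating   TEXT,            -- site-specific, so stored as free text
    UNIQUE (guid, version, url)
);
CREATE TABLE IF NOT EXISTS target_application (
    sighting_id INTEGER NOT NULL REFERENCES sighting(id),
    app_id      TEXT NOT NULL,
    min_version TEXT,
    max_version TEXT
);
CREATE TABLE IF NOT EXISTS locale (
    sighting_id INTEGER NOT NULL REFERENCES sighting(id),
    code        TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_sighting(conn, guid, version, url, filename,
                    apps=(), locales=(), context=None, rating=None):
    """Insert one sighting; apps is a sequence of (app_id, min, max) tuples."""
    cur = conn.execute(
        "INSERT INTO sighting (guid, version, url, filename, context, rating) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (guid, version, url, filename, context, rating),
    )
    sighting_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO target_application VALUES (?, ?, ?, ?)",
        [(sighting_id, app_id, lo, hi) for app_id, lo, hi in apps],
    )
    conn.executemany(
        "INSERT INTO locale VALUES (?, ?)",
        [(sighting_id, code) for code in locales],
    )
    conn.commit()
    return sighting_id
```

Keying sightings on (guid, version, url) rather than on the GUID alone keeps the collision cases visible instead of overwriting them.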
== Tools ==
* Something to extract a zip archive (an XPI is a zip file)
** Look for chrome.manifest
** Look for install.{rdf|js}
** Parse those files (install.rdf is XML, chrome.manifest should be simple, but what about install.js?)
* Something to crawl
* Something to store (database for better querying?)
* List of websites to crawl
* Crawler's settings (e.g. how deep)
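The extraction and parsing step could look something like the sketch below, which uses only the Python standard library. The function name <code>inspect_xpi</code> and the shape of the returned dict are made up for this example. It assumes install.rdf sits at the archive root and writes its properties as child elements (<code>&lt;em:id&gt;</code>, <code>&lt;em:version&gt;</code>, ...); manifests that use the RDF attribute form would need extra handling, and install.js is only detected, not parsed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by install.rdf
EM = "http://www.mozilla.org/2004/em-rdf#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def inspect_xpi(data: bytes) -> dict:
    """Open an XPI (a zip archive) held in memory and read install.rdf."""
    info = {
        "has_chrome_manifest": False,
        "has_install_js": False,
        "id": None,
        "version": None,
        "targets": [],
    }
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        info["has_chrome_manifest"] = "chrome.manifest" in names
        info["has_install_js"] = "install.js" in names
        if "install.rdf" not in names:
            return info
        root = ET.fromstring(archive.read("install.rdf"))

    # Assumption: the first top-level Description is the install manifest.
    manifest = root.find(f"{{{RDF}}}Description")
    if manifest is None:
        return info
    info["id"] = manifest.findtext(f"{{{EM}}}id")
    info["version"] = manifest.findtext(f"{{{EM}}}version")
    for target in manifest.findall(f"{{{EM}}}targetApplication"):
        desc = target.find(f"{{{RDF}}}Description")
        if desc is None:
            continue
        info["targets"].append({
            "id": desc.findtext(f"{{{EM}}}id"),
            "min_version": desc.findtext(f"{{{EM}}}minVersion"),
            "max_version": desc.findtext(f"{{{EM}}}maxVersion"),
        })
    return info
```

Working from bytes in memory means the crawler does not have to unpack anything to disk before deciding whether a download is worth keeping.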
= Technical Resources =
* [http://www.robotstxt.org Writing a robot/crawler]
* [http://www.silfreed.net/blog/2008/04/XUL-extension-parsing XUL Extension Parsing]
= Manual Extensions =
Extensions that are bundled with another product's installer, and therefore must be added manually
* http://free.grisoft.com/ww.faq.num-1241#faq_1241
* [http://service1.symantec.com/SUPPORT/norton360.nsf/0/e1be9e4560c11b466525728900757836?OpenDocument Symantec noting their poor add-on]