User:Bhashem/WildOnAddons
* The install.{js/rdf} contains a GUID which uniquely identifies the add-on (hopefully) - we may be able to use this as the primary index
* [https://addons.mozilla.org/en-US/firefox/pages/appversions TargetApplication id's and versions]
== Crawling ==
Crawling and parsing would probably be an intensive, time-consuming process.
* Google search results (filetype:xpi) are limited. For example, a Google search for (filetype:xpi site:addons.mozilla.org) returns only 62 hits. Probably best used to supplement our data rather than as the primary source.
* How much do we crawl? How deep?
* Aggregate Sites
** Two Kinds
**# Hosting (AMI, AMO)
**# Linking (FoxieWire)
** Site-specific crawling. Maybe only second-level domains (e.g. addons.mozilla.org/* instead of all of mozilla.org). Add-on authors sometimes link from their add-on page to a personal website hosting a more up-to-date version.
** Mozdev/others/..
* Individual Sites
** Wordpress/Blogspot (can extensions be uploaded here?)
** Google/Yahoo search
*** Rich sources of information, but either too much of it or of low quality
* Bouncer
** What kind of information does bouncer collect?
** Probably does not provide context, rating, or URL
** Good/Bad source?
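Whatever we crawl, we should obey each site's robots.txt (see the robotstxt.org link under Technical Resources). A minimal sketch of the politeness check, using Python's stdlib parser; the bot name is a made-up placeholder:

```python
from urllib import robotparser

def allowed_to_fetch(robots_txt, url, user_agent="WildOnAddonsBot"):
    """Return True if the given robots.txt text permits user_agent
    to fetch url. ("WildOnAddonsBot" is a hypothetical bot name.)"""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

A real crawler would fetch each site's /robots.txt once, cache the parsed rules, and layer the depth limit discussed above on top of this check.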
== GUID Collisions ==
* Same extension different version
* Same extension, same version, different website (hash comparisons?)
* Different extension, possibly malicious or coincidence
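The "same extension, same version, different website" case above can be settled by comparing file hashes: identical hashes mean identical packages, differing hashes mean the copies diverge (repack, newer build, or tampering). A sketch:

```python
import hashlib

def file_sha256(path):
    """SHA-256 of an .xpi file, read in chunks to handle large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def same_package(path_a, path_b):
    """True if two downloaded .xpi files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```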
== What to Track ==
* Add-on URL (where did we find it?)
* Filename
* Supported Applications and versions
* Locales it supports
* context (entire paragraph)
* Ratings? (Site-specific)
* Categories (How?)
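One way to hold these fields would be a small relational table; a hypothetical SQLite schema (column names are illustrative), keyed on (GUID, version, URL) so the collision cases above can coexist as separate rows:

```python
import sqlite3

# Illustrative schema for the fields listed above; names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS addons (
    guid      TEXT NOT NULL,   -- add-on GUID from install.rdf/install.js
    version   TEXT NOT NULL,
    url       TEXT NOT NULL,   -- where we found the .xpi
    filename  TEXT,
    apps      TEXT,            -- supported applications and versions
    locales   TEXT,            -- locales it supports
    context   TEXT,            -- surrounding paragraph from the page
    rating    REAL,            -- site-specific, may be NULL
    category  TEXT,
    PRIMARY KEY (guid, version, url)
);
"""

def open_db(path=":memory:"):
    """Open (or create) the tracking database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```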
== Tools ==
* Something to extract the XPI (it is just a ZIP archive)
** Look for chrome.manifest
** Look for install.{rdf|js}
** Parse those files (RDF is XML, chrome.manifest should be simple, but what about install.js?)
* Something to crawl
* Something to store (database for better querying?)
* List of websites to crawl
* Crawler settings (e.g. how deep to crawl)
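The extract-and-parse steps above can be sketched as follows. An .xpi is an ordinary ZIP archive, and install.rdf keys its fields to the em: (extension manifest) namespace; this sketch assumes the common layout where em:id and em:version appear as child elements (some manifests use attributes instead, and install.js, being JavaScript, would need separate handling):

```python
import zipfile
import xml.etree.ElementTree as ET

# Mozilla's extension-manifest RDF namespace, in ElementTree's {uri}tag form.
EM = "{http://www.mozilla.org/2004/em-rdf#}"

def read_install_rdf(xpi_path):
    """Pull install.rdf out of an .xpi (a ZIP archive) and return the
    add-on's (guid, version). Assumes em:id/em:version child elements."""
    with zipfile.ZipFile(xpi_path) as zf:
        rdf = zf.read("install.rdf")
    root = ET.fromstring(rdf)
    guid = root.find(".//" + EM + "id")
    version = root.find(".//" + EM + "version")
    return (guid.text if guid is not None else None,
            version.text if version is not None else None)
```

The GUID and version this returns would feed directly into the primary index and the collision checks discussed above.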


= Technical Resources =
* [http://www.robotstxt.org Writing a robot/crawler]
* [http://www.silfreed.net/blog/2008/04/XUL-extension-parsing XUL Extension Parsing]
= Manual Extensions =
Extensions that are bundled with an application installer, and therefore must be added manually:
* http://free.grisoft.com/ww.faq.num-1241#faq_1241
* [http://service1.symantec.com/SUPPORT/norton360.nsf/0/e1be9e4560c11b466525728900757836?OpenDocument Symantec noting their poor addon]