User:Bhashem/WildOnAddons

Overview

One of the important questions for the Mozilla and Firefox platform is:

  • How many add-ons make up the Mozilla add-ons eco-system?

It's important to get the answer to this so that:

  • We can understand how pervasive Mozilla add-ons are
  • We can help users find ALL the add-ons available on the net (not just those on AMO)
  • We can use this number to show a groundswell of support for the platform to encourage others to develop to it
  • We can start to index this information in a central location (AddonSearch)

This actually turns out to be quite hard to answer. AMO is one of the main distribution points for add-ons, but it's certainly not the only one. The goal of this project is to gather and index information about add-ons "in the wild".

Here are a few ideas about where add-ons can be hiding.

Aggregation Sources

  • Mozilla AMO (public & sandboxed)
  • Mozilla AMO Update Service (some authors don't include an update URL, which means Firefox attempts to get updates from AMO and the add-on's GUID is logged)
  • AMO-like sites: AMI, Sociz, China, Mozilla Japan Addons, Addons.pl, other locale-specific sites?
  • Source Repos: MozDev projects, Google Code & SourceForge
  • Search results: Google ("filetype:xpi", "firefox add-ons", "firefox extensions"), Yahoo, etc...
  • Those mentioned in Google Alerts (blogs & news) on a regular basis
  • Blog aggregators: Foxiewire
  • Addon-specific sites for XUL Apps (Songbird Nest, Flock Extensions, ...)

Individual Sources

  • Corporations (Google Toolbar, Google Labs)
  • Inside of Installers (Symantec Anti-Virus, McAfee, Skype, Java)
  • Individual authors' blogs and websites

Project Definition

  • Write a crawler that gathers info from some of the sources named above
  • Index the collected info and try to extract metadata from page context and the install.{js/rdf}
  • Allow "manual entries" to be entered into the index (e.g. for add-ons bundled in Installers)
  • Build a search/advanced search UI on top of the index
  • Initial focus should be on Firefox, Thunderbird, SeaMonkey, Flock, Songbird and Nvu only

Tech Notes

  • Thankfully most add-ons have a .xpi file extension, so they might be easier to identify
  • .xpi files are ZIP archives and usually contain an install.js or install.rdf with information about what the add-on does
  • The install.{js/rdf} contains a GUID which (hopefully) uniquely identifies the add-on - we may be able to use this as the primary index
  • It also lists targetApplication IDs and versions
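As a sketch of what metadata extraction from install.rdf could look like, using only the Python standard library. Real manifests vary (RDF/XML allows em: properties as either child elements or attributes, and install.js has no fixed structure at all), so every field is treated as optional and the function name is illustrative:

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
EM = "{http://www.mozilla.org/2004/em-rdf#}"

def parse_install_rdf(rdf_text):
    """Extract GUID, version, and targetApplication IDs from the
    install.rdf text read out of an .xpi ZIP."""
    root = ET.fromstring(rdf_text)
    manifest = None
    for desc in root.iter(RDF + "Description"):
        # rdf:about may be written with or without the namespace prefix
        about = desc.get(RDF + "about") or desc.get("about")
        if about == "urn:mozilla:install-manifest":
            manifest = desc
            break
    if manifest is None:
        return None

    def prop(elem, name):
        # RDF/XML allows em: properties as child elements or as attributes
        child = elem.find(EM + name)
        if child is not None and child.text:
            return child.text.strip()
        return elem.get(EM + name)

    targets = []
    for ta in manifest.iter(EM + "targetApplication"):
        for desc in ta.iter(RDF + "Description"):
            app_id = prop(desc, "id")
            if app_id:
                targets.append(app_id)
    return {"guid": prop(manifest, "id"),
            "version": prop(manifest, "version"),
            "targets": targets}
```

The returned GUID could serve as the primary index key discussed above, with the targetApplication IDs used to filter by application.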

Crawling

Crawling and parsing would probably be an intensive and time-consuming process.

  • Google search results for filetype:xpi are limited. For example, a Google search for (filetype:xpi site:addons.mozilla.org) returns only 62 hits. Probably best as a supplement to our data rather than a primary source.
  • How much do we crawl? How deep?
  • Aggregate Sites
    • Two Kinds
      1. Hosting (AMI, AMO)
      2. Linking (FoxieWire)
    • Site-specific. Maybe only crawl second-level domains (e.g. addons.mozilla.org/* instead of all of mozilla.org). Add-on authors sometimes link from their add-on page to a personal website with a more up-to-date version.
    • Mozdev/others/..
  • Individual Sites
    • Wordpress/Blogspot (can extensions be uploaded here?)
    • Google/Yahoo search
      • Rich sources of information, but either too much of it or of uneven quality
  • Bouncer
    • What kind of information does bouncer collect?
    • Probably does not give context/rating/URL
    • Good/Bad source?
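Whatever the per-site crawl depth ends up being, the basic step of spotting .xpi links on a fetched page can be sketched with the standard library. This assumes interesting links end in .xpi; it would miss download URLs that serve an .xpi without the extension, so a real crawler would also want to check the Content-Type header of each response:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class XpiLinkCollector(HTMLParser):
    """Collect href values that point at .xpi files, resolved
    against the URL of the page they appear on."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".xpi"):
                self.links.append(urljoin(self.base_url, value))

def find_xpi_links(html, base_url):
    collector = XpiLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

For example, feeding it a page from a hosting site would return absolute .xpi URLs ready to be queued for download and indexing.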

GUID Collisions

  • Same extension different version
  • Same extension, same version, different website (hash comparisons?)
  • Different extension, possibly malicious or coincidence
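A content hash helps separate these cases: two copies of the same build hash identically no matter which site served them. A minimal triage sketch (function names and return strings are illustrative):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    """SHA-1 of the raw .xpi bytes; identical packages hash the
    same regardless of where they were downloaded from."""
    return hashlib.sha1(data).hexdigest()

def classify_collision(a, b):
    """Rough triage for two index entries sharing a GUID.
    a and b are (version, fingerprint) pairs."""
    if a == b:
        return "same build, different mirror"
    if a[0] != b[0]:
        return "same extension, different version"
    return "same version, different bytes - repack or tampering?"
```

The last case is the one worth flagging for manual review, since it could be a coincidence of GUIDs or a modified copy of someone else's extension.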

What to Track

  • Addon url (where did we find it?)
  • Filename
  • GUID
  • Supported Applications and versions
  • Locales it supports
  • Context (the entire surrounding paragraph)
  • Ratings? (Site-specific)
  • Categories (How?)
  • Addon version
  • Operating system (from install.rdf targetPlatform; may be null if unknown)
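The fields above could be sketched as one record type per indexed add-on. The names and types here are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AddonRecord:
    """One entry in the index; every field past the first two is
    optional because crawled sources rarely provide all of them."""
    url: str                               # where did we find it?
    filename: str
    guid: Optional[str] = None             # from install.{js/rdf}
    version: Optional[str] = None
    target_apps: List[str] = field(default_factory=list)
    locales: List[str] = field(default_factory=list)
    context: str = ""                      # surrounding paragraph
    rating: Optional[float] = None         # site-specific scale
    categories: List[str] = field(default_factory=list)
    target_platform: Optional[str] = None  # null if unknown
```

Keeping GUID optional matters because a record may be created from a bare search hit before the .xpi itself has been fetched and parsed.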

Tools

  • Something to extract ZIP archives (.xpi files)
    • Look for chrome.manifest
    • Look for install.{rdf|js}
    • Parse those files (rdf is xml, chrome.manifest should be simple, but what about install.js?)
  • Something to crawl
  • Something to store (database for better querying?)
  • List of websites to crawl
  • Crawler settings (e.g. how deep to go)
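Since an .xpi is a plain ZIP, the "extract and look for manifests" steps can be sketched with Python's zipfile module. The function name is illustrative; parsing the files it returns (install.rdf as XML, chrome.manifest as line-based text) would be a separate step, and install.js remains an open question since it is arbitrary JavaScript:

```python
import io
import zipfile

MANIFEST_NAMES = ("install.rdf", "install.js", "chrome.manifest")

def inspect_xpi(data: bytes):
    """Open an .xpi (a plain ZIP) entirely in memory and return the
    raw bytes of each manifest file it carries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        return {name: zf.read(name)
                for name in MANIFEST_NAMES if name in names}
```

An .xpi that yields none of the three names is probably not a valid add-on package and could be dropped from the index early.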

Technical Resources

Manual Extensions

Extensions that are bundled inside an installer, and therefore must be added to the index manually