Marketplace/MetricsServer

The Marketplace has been placed into maintenance mode. It is no longer under active development. You can read complete details here.

Motivation

The Marketplace requires a metrics server to produce its metrics. These metrics cover multiple use cases and multiple consumers.

Some example requirements: https://docs.google.com/document/d/1tlNQqgsCCGC3B4S1lstKx5sbWbTdGOhzxekL6i1XimM/edit

Why not in zamboni?

Some of this data should not be going through zamboni, and zamboni is too big already. Having the production server speak to a whole pile of other servers to gather data for internal users seems like a bad idea.

Why not graphite?

(Or Heka or similar.) Because we are really looking for OLAP-style processing of the data, where we can filter by pretty much anything. Graphite and friends are focused on time-series data, with little support for filtering or cross-referencing. It would be nice to re-use Graphite, but I don't think we can get the queries we need out of it.

Amount of data

Not huge; we can probably just store and manipulate most of it in a database. If it's getting really large then we are probably doing something wrong: the data source providers should mostly be doing this for us (e.g. Webtrends rolling up the data for us).

Overview

A server that consumes data and provides a REST API as output.

Sources

  • Webtrends and/or Google Analytics
  • Bango and any other payment provider or aggregator we might add
  • Solitude, our own internal payment server
  • Marketplace itself, for example installs
  • AiTC (maybe)

Incoming

We should be able to take data from multiple sources and push it in. We should have at least two ways to do this:

  • Cron jobs will read data from sources and put it into the server. The cron jobs will need to be relatively atomic and assume that the sources we are reading from will go down.
  • Parsing batch data from a source like the Marketplace. This could be log files, or reading from a queue.

At this time we shouldn't be trying to respond to UDP, HTTP POSTs or other incoming pushes of data; otherwise we run the risk of having a service that is coupled to site load.
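
As a very rough sketch of the cron-driven pull described above (the source URL, table name and columns here are made up for illustration, not an existing API):

  # A minimal sketch of one cron job: pull a day of data from a source and store
  # it idempotently, assuming the source may be down at any time.
  import datetime
  import MySQLdb
  import requests

  SOURCE_URL = "https://payments.example.com/api/transactions"  # hypothetical source

  def pull_day(day):
      """Fetch one day of data from the source; return [] if the source is down."""
      try:
          resp = requests.get(SOURCE_URL, params={"date": day.isoformat()}, timeout=30)
          resp.raise_for_status()
          return resp.json()
      except (requests.RequestException, ValueError):
          # Sources will go down; do nothing and let the next cron run catch up.
          return []

  def store_day(day, rows):
      """Replace the whole day in one go so re-running the job doesn't double count."""
      db = MySQLdb.connect(db="metrics")
      cur = db.cursor()
      cur.execute("DELETE FROM payment_counts WHERE day = %s", (day,))
      cur.executemany(
          "INSERT INTO payment_counts (day, app_id, amount) VALUES (%s, %s, %s)",
          [(day, r["app_id"], r["amount"]) for r in rows])
      db.commit()

  if __name__ == "__main__":
      yesterday = datetime.date.today() - datetime.timedelta(days=1)
      rows = pull_day(yesterday)
      if rows:
          store_day(yesterday, rows)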

Uptime, load etc

We should assume that the Metrics Server will not have a high SLA. We should also avoid, as much as possible, anything that puts a large load on the server. We will assume the service does not have to be multi-homed, but will be accepting metrics from multiple different places.

Aggregation

Once the data is in, we'll need to aggregate it in ways that make sense for the views. We either have to keep the logs in raw form, or store the data in the db. Then we can regenerate aggregated data as needed.

In the case of Marketplace installs, for example, we track every install and then re-aggregate data as needed into the custom data sets for views.

Currently the Marketplace does this partially in Elastic Search and partially in MySQL. It might make sense to have all the data stored in MySQL and then aggregated into Elastic Search, and let Elastic Search deal with the data aggregation.
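
A rough sketch of that re-aggregation step, assuming raw install events live in a MySQL table (installs_raw and its columns are made-up names):

  # Roll raw install events up into per-app daily counts. Because the raw events
  # are kept, these roll-ups can be thrown away and regenerated for new views.
  import MySQLdb

  def aggregate_installs(day):
      db = MySQLdb.connect(db="metrics")
      cur = db.cursor()
      cur.execute(
          "SELECT app_id, COUNT(*) FROM installs_raw "
          "WHERE DATE(created) = %s GROUP BY app_id",
          (day,))
      return [{"day": str(day), "app_id": app_id, "count": count}
              for app_id, count in cur.fetchall()]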

Outgoing

A REST service only.


Format

A JSON body that describes the query. Pass in keys (something like Graphite), filters and date ranges in the JSON. It could just be a pass-through to Elastic Search. Everything should come back as multiple results; a single result is just a list of one.
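
As a sketch only (the endpoint, field names and response shape are guesses, not a defined API), a query might look something like:

  import json
  import requests

  # Hypothetical query body, passed more or less straight through to Elastic Search.
  query = {
      "keys": ["installs"],                          # graphite-style key(s)
      "filters": {"app_id": 12345, "region": "us"},
      "start": "2013-01-01",
      "end": "2013-01-31",
      "interval": "day",
  }
  resp = requests.post(
      "https://metrics.example.com/api/v1/query/",   # hypothetical endpoint
      data=json.dumps(query),
      headers={"Content-Type": "application/json"})

  # Everything is a list of results, even when there is only one row.
  for row in resp.json()["objects"]:
      print(row["date"], row["count"])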

How will we do key discovery (e.g. installs, downloads, etc.)? Ideally, adding new charts would just be a matter of adding new keys in the back end.

Marketplace

The Marketplace and Developer Hub will need to be able to make REST API calls to the server to get data and then cache it, depending upon the data needed. Also, since the Marketplace Metrics server will likely have a lower SLA than the pages viewing the data, any interface that talks to it should fail gracefully.
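
A sketch of how a Marketplace view might consume this, caching the result and degrading gracefully when the metrics server is unavailable (the URL, cache key and response shape are illustrative):

  import json
  import requests
  from django.core.cache import cache

  METRICS_URL = "https://metrics.example.com/api/v1/query/"  # hypothetical

  def installs_for_app(app_id, cache_for=15 * 60):
      key = "metrics:installs:%s" % app_id
      data = cache.get(key)
      if data is not None:
          return data
      try:
          resp = requests.post(
              METRICS_URL,
              data=json.dumps({"keys": ["installs"], "filters": {"app_id": app_id}}),
              headers={"Content-Type": "application/json"},
              timeout=5)
          resp.raise_for_status()
          data = resp.json()["objects"]
      except (requests.RequestException, ValueError, KeyError):
          # The metrics server has a lower SLA than this page: show an empty
          # chart rather than break the page.
          return []
      cache.set(key, data, cache_for)
      return data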

Internal Dashboards

Internal dashboards will need to speak to the Metrics Server and pull data out. That data might be more sensitive.

Security

We'll need, at the very least, to prevent internal data from being disclosed publicly. We probably don't need a full Authorization/Authentication/Groups/Users/Roles stack, but we might at some point.
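
One very light option, assuming a shared secret for internal consumers rather than a full auth stack (the key names, header and setting are all made up for illustration):

  from django.conf import settings
  from django.http import HttpResponseForbidden

  INTERNAL_KEYS = set(["revenue", "refunds"])  # hypothetical internal-only keys

  def check_access(request, requested_keys):
      """Refuse internal-only keys unless the caller presents the shared token."""
      if INTERNAL_KEYS & set(requested_keys):
          token = request.META.get("HTTP_X_METRICS_TOKEN", "")
          if token != getattr(settings, "METRICS_INTERNAL_TOKEN", None):
              return HttpResponseForbidden("Internal metrics require a token.")
      return None  # the view goes on to run the query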

Proposed plan

  • Django site with a MySQL backend
  • Dashboards etc interact with Tastypie
  • That calls out to Elastic Search
  • Data is pulled into MySQL and aggregated into Elastic Search

Other things I looked at: http://packages.python.org/cubes/index.html
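
A sketch of the MySQL-to-Elastic-Search step of that plan, talking to Elastic Search's HTTP API directly (the index/type path is illustrative and assumes an Elastic Search version of that era; the Tastypie layer on top is omitted):

  # Push daily roll-ups (e.g. the output of aggregate_installs above) into an
  # Elastic Search index so the REST layer can query ES rather than MySQL.
  import json
  import requests

  ES_URL = "http://localhost:9200/metrics/installs_daily"  # hypothetical index/type

  def index_rollups(rollups):
      for doc in rollups:
          resp = requests.post(ES_URL, data=json.dumps(doc),
                               headers={"Content-Type": "application/json"})
          resp.raise_for_status()

Running index_rollups(aggregate_installs(day)) from the same cron jobs described under Incoming would keep MySQL as the system of record and Elastic Search as the query layer.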

Existing Data

The following stats data currently exists in the Marketplace and zamboni:

  • global_stats
  • client_data

AMO specific:

  • stats_addons_collections_counts
  • stats_collections_counts
  • stats_collections
  • download_counts
  • update_counts
  • stats_share_counts
  • stats_share_counts_totals
  • stats_collection_share_counts_totals