Update:Remora Stats

From MozillaWiki
Jump to: navigation, search

This is a proposal for handling add-on historical statistics and graphs in Remora.

Status: Proposal/brainstorming
Summary: Use rrdtool with time-lapsed data collected by cron jobs
Requirements: rrdtool binary package (certainly), pcntl PHP module (probably) — software requirements beyond the Remora default
Origin: IRC conversation between Mike Morgan and Bogdan Stancescu on Oct 13/14, 2006 (Friday the 13th)

Data

This is the data we're tentatively planning to store/graph with this solution, and how strongly storing/graphing each data type is desired:

  • add-on ratings (strong)
  • add-on downloads (strong)
  • add-on "weekly downloads" (strong)
  • add-on page hits (medium)
  • add-on comment count (weak)

Assumptions

It is assumed that, regardless of how each data type is collected (parsing httpd logs, collating data from temporary tables, or storing real-time data in the database), we can expect all data in the previous section to be available reliably and accurately during a predefined 30 minute window within every hour in the database. (Note: we don't care about DST quirks, shifts and jumps here, we're only talking about whole hours.)

It is assumed that we have enough CPU power and disk speed to update all RRD files associated to all add-ons during the 30-minute window, as long as we distribute the load across that window, as to avoid hampering other routine activities on the server. Updating one RRD is neither CPU-, nor HDD-intensive, but depending on server load and performance, this might be a factor when updating thousands of RRD files. To be evaluated.

It is assumed that losing statistical data older than a few weeks at one-hour resolution, and data older than one or two years at one-day resolution is planned for and accepted.

RRD primer

Feel free to skip this section if you know about RRD, it really is a primer on the topic

RRD stands for Round Robin Database. Conceptually, an RRD is truly a one-dimensional database (a single row of values), but instead of being linear, it's circular: the first entry is stored in position 1, 2nd entry in position 2, Nth in N, Nth+1 in position 1 again, then Nth+2 in 2, 2*Nth in N, 2*Nth+1 in 1, and so on (as long as N is the pre-designed maximum number of values stored in the database; N is user-defined; wrapping around is completely transparent; since this is one-dimensional, there's no primary key to talk about — but see next section for associated timestamps).

RRDs are, by design, useless for storing arbitrary numbers of records. But they handle historical data superbly, because of a feature which is built in by design: constant size. No matter how much new data is pushed in an RRD, it will wrap around, and its size will remain constant. That inherently means that old data is lost, but there are two distinct ways to address that:

  1. If the RRD is properly designed, the data being lost is irrelevant by the time it gets overwritten;
  2. If all data is important, all data can be trivially archived at predetermined moments without loss or significant costs in computing power (it's a matter of copying the RRD file as it's about to wrap around, and not a matter of extracting data via SQL or other language).

rrdtool primer

Feel free to skip this section if you know about rrdtool, it really is a primer on the topic

Hard data

rrdtool is arguably the de facto industry standard for RRD management and graphing. It's licensed under GPL, and it works under the usual gamut of platforms.

There is no known native PHP API for rrdtool1. The de facto standard GUI for rrdtool is Cacti (separate project), but that involves manual intervention, which is not applicable for Remora.

1: by the authors of this paper, at the time of this writing. That doesn't mean there aren't any at all. Feel free to fill in the blanks.

Specifics

In regard to data storage, rrdtool offers the following capabilities as compared to the theoretical RRD model described above:

  • each piece of data is associated with a timestamp, in a mandatory fashion;
  • multiple RRDs can be stored in a single RRA file (Round Robin Archive);
  • a single RRA can store historical data for multiple sources (e.g. temperature over time and CPU load over time; not limited to two data sources);
  • a single RRA can store multiple aggregation variations for the same data source with different steps and periods (e.g. temperature over time for one day in one-second measurements, and temperature over time for one year in one-day averages; not limited to two aggregation variations);
  • aggregation is performed automatically, based on rules defined at RRA creation time;
  • a single RRA can store multiple data sources and multiple aggregation variations, with the only limitation/feature that the same aggregation variations are automatically and necessarily stored for all individual data sources in a single RRA.

IMPORTANT! Given rrdtool's intrinsic smartness in regard to data aggregation, everything it stores features two very significant traits: (1) all data is internally stored in per-second increments, and (2) all data is stored as a delta from the previous measurement, not as an absolute value. These points will come into play later in this paper.
N.B.: these are not limitations of rrdtool, they are direct results of how it manages and aggregates data internally. That topic is out of this paper's scope, but it's worth remembering that nothing can realistically be done about any of these two traits.

Example

Here's an example of a set of graphs generated out of a single RRA for a Firefox extension: Converter graphs. The data is parsed hourly from the extension's homepage on addons.mozilla.org. The page shows three data sources (total downloads, weekly downloads and rating) with two aggregation variations (hourly over one week and daily over one year).

The size of the source RRA file on Oct 14, 2006 was less than 14 KB.

Note: I'm fiddling with the example page, trying to find the best aggregation variations. The basic data sources (total dl, weekly dl, rating) and granularity (1h) stay constant, I'm only working on aggregation periods, graph number and graph sizes. Therefore those specific elements may differ in the actual example from what's described here.

Implementation

This section describes the specifics of this proposal's implementation.

Architecture

As stated in the Assumptions section above, this proposal is based on the expectation that reliable and accurate readings can be made from the database in a pre-determined time window during each hour about all data types to be stored and graphed, regardless of how that is accomplished technically. This section assumes that has been accomplished, and that the time window and database data sources are known.

The two sub-sections here describe the asynchronous and synchronous processes as related to user interactions: asynchronous as in cron jobs happening every hour regardless of any page being accessed via HTTP, and synchronous as in "whenever an add-on's page is accessed".

Asynchronous

  1. An hourly cron job runs at the beginning of the time window; this is the Controller Process (CP); the CP is aware of all prerequisites via configuration, and it has read access to the database;
  2. The CP counts the aA — active add-ons — in the database ("active" in this context meaning add-ons which need to have their stats stored);
  3. The CP divides aA to the length of the time window in seconds1, thus determining the dW — window delta;
  4. The CP spawns2 new Targeted Processes (TP) every dW seconds; each TP is associated with a single add-on;
  5. Each TP updates its associated add-on's RRA.
1: A pre-determined time period must be subtracted from the real time window, in order to account for the last few processes
2: It's irrelevant for the scope and time of this paper whether the new processes are actual children of the CP or completely new processes. It would probably be desirable to spawn child processes via fork() in order to achieve better process control and debugging data, but only marginally so. TBD.

Synchronous

  1. Whenever an add-on's page is rendered, a blocking1 external script is called — the Add-on Graph Render Process (AGRP); this process is aware of the add-on it is associated to;
  2. The AGRP checks whether the timestamp of the associated add-on's RRA is newer than the associated add-on's witness file2;
  3. Depending on the witness file's timestamp,
    • If the witness file's timestamp is newer than the RRA, exit;
    • If the witness file's timestamp is older than the RRA, the AGRP generates new graphs based on the RRA and then touches3 the witness file;
  4. The script at (1) proceeds, serving the graph images, as available on the storage device.
1: "Blocking" in that subsequent code is not executed until the external script is finished.
2: Future developments may change the currently proposed set of graphs, and/or the order in which they are generated — therefore it's good practice to avoid checking against the timestamp of a particular graph file; an independent witness file can be guaranteed to be available and have a meaningful timestamp regardless of any graphs being generated, or their order.
3: Updates its time to the current timestamp.

Implementation specifics

This section describes some specifics of this implementation. The following notation is used: xx:30 means "half past" any hour in any day, xx:45 means "quarter to" any hour in any day, etc. Similarly, xx:xx:15 means "15 seconds after any minute in any hour in any day", etc.

As explained in the relevant section, rrdtool has the particularity of only storing delta values, and only in one-second increments. As a result, if one data source is stored once at xx:xx:03 and the next time at xx:xx:05, it will be effectively1 stored as a float regardless of whether the actual data's type is integer (e.g. downloads per hour).

The difference between the assumed2 and the actual3 hourly averages is somewhat cumulative4 and non-trivial: the differences can reach 25%5. While the actual values would be true to reality, they are not what users expect based on the assumed hourly averages — and we plan to avoid that by explicitly feeding timestamps of the form xx:00:00 to rrdtool6.

The mechanism proposed here avoids all of this complexity and still yields proper results7 by assuming that all data came in at xx:00:00, regardless of when it was cached, and regardless of when it was actually parsed into RRAs.

1: in practice, all values are stored as double by rrdtool. The difference being highlighted here is between doubles of the form 1.50000000000e+0 and 1.5013248762345e+0 — effectively the difference between a float and an integer value
2: "assumed" as in "what the average person would assume" — e.g. that hourly downloads are whole numbers
3: "actual" as in "what the code can actually measure" — the code can't measure hourly downloads by breaking time into hours with Planck time precision (see the next footnote for details)
4: imagine you're plotting a graph of hourly downloads (i.e. integer numbers) on millimetric paper, with x (horizontal) being time, and y (vertical) being number of downloads. In this context, you consider "real" values to be the ones you measure at the intersections of your graph and whole hours on the x scale (xx:00). As a result, all "real" values are going to be non-integer values, except when you happen to have two successive measurements with the same value. Since all measurements are made either before or after xx:00, the delta between the measurements at xx:00 and (xx+1):00 can accumulate, depending on which side of xx:00 each of your actual measurements end up on.
5: given the experiment described in the previous footnote, a sharp decrease over a few hours combined with two measurements inside the same hour (e.g. xx:00:53 and xx:59:08) can result in dramatically erroneous values when measuring the intersections with the assumed, whole hours (xx:00:00 and (xx+1):00:00).
6: in reality, we can assume the data was properly collected at xx:00, and feed the respective timestamp into rrdtool explicitly, regardless of how many seconds or minutes data is actually collected or stored after xx:00:00. That avoids all of the "Planck time" complexity above, and rrdtool ends up providing integer values as a result.
7: the resulting data is real for all intents and purposes, because humans are not really interested if a particular download occurred one or two seconds before or after xx:00 — and on the other hand, all data is accounted for; in essence, all of the compromise is about minor time shifts related to when measurements are made, and not about the quantities being measured.

Parameter recommendations

This section documents the data types and aggregation variations being considered for storage/graphing. References are made to the Converter graphs example introduced above.

Data types

The reasons for including ratings and download counts for storage/graphing, as proposed in the Data section, are obvious and self-explanatory. The other data proposed for graphing in that section is discussed here.

Weekly downloads
Weekly downloads are essentially a derivation of the total downloads. The same way weekly downloads are useful in a static way in the add-on details page, they are also useful for graphs. If you look at the example graphs, the weekly download graph is actually more meaningful than the total downloads graph. Unfortunately there's not enough data in the example at the time of this writing in order to make a meaningful comparison between the weekly and the total yearly data.
Page hits
Unfortunately the current version of addons.mozilla.org doesn't provide page hits, therefore the example doesn't contain that particular data. It is expected however that monitoring this metric would be significant to developers, as to assess how many potential users fail to download their add-on because of poor descriptions/screenshots, or negative comments, etc.
Comment count
We assume that all data will be stored as either GAUGE (static measurement between predetermined values — think add-on rating) or COUNTER (ever-increasing measurement — think page hits). GAUGE data can and should be rendered in absolute values. COUNTER data can't and shouldn't -- for GAUGE data, the only meaningful abstract data type is "speed" — i.e. variation over time. The comment count is GAUGE data, because it's ever-increasing. This particular data type is only a weak candidate for storage/graphing on account of its typically slow variation over time: the graph for most projects would be pretty flat against the X axis, with only a few one-unit spikes here and there over time. To be determined whether it would be worth the trouble for yearly aggregation in an average project.

Aggregation variations

The following aggregation variations are currently being suggested for all data types (compare to the example graphs):

  • two-week data per one-hour measurements, with a single week being shown by default (the previous week can be shown on demand, for comparison with the current week);
  • one-year data per one-day measurements; same data shown.

Optionally, either or both can be archived on-site or sent to the developers just before data wrapping is about to occur (given the nature of RRD, those moments can be predetermined).

Additional information about this document

This document has been designed to be hard to read because it was hard to write in the first place. Not really. This document was designed for completeness: anyone with decent programming skills and a decent level of interest in this topic should be able to implement these features without significant research regarding possible architecture variations (syntax and actual implementation research are assumed to come with the job description though).