Labs/Test Pilot/Dataviz Server

Goal

Allow Mozilla community members (and especially UX team members) to easily explore the collected Test Pilot data.

They should be able to find answers to their own questions *without* requiring anyone to manually write custom data-processing scripts for them each time. They should be able to easily create and share their own visualizations to answer the questions they care about.

For more detailed use cases, please check: https://etherpad.mozilla.org/oEePuVKUn0

Authors

Authors include: Jinghua Zhang, Jono, Gregg.Lind, and others.

Gregg.lind 08:24, 30 January 2012 (PST)


Acceptance Criteria / Sniff Test For Pentaho Solution

Pentaho solves some problems:

  • well-defined notions of different variable types, with annotations
  • code exists for cut / aggregate / time-series queries and the like
  • the visualization layer is modular, JS-based, and client-side; the JSON content that needs to go to the front end has already been researched and implemented

Those acknowledged, Pentaho is part of a very large ("enterprise grade") stack/chain with all of the usual (potential) risks and fears:

  • tight coupling?
  • non-repeatable builds
  • non-editable code (with spooky action at a distance)
  • outsourced dependencies and expertise. We need some in-house expertise to make this viable.

So, to allay those fears, I propose these 'sniff tests':

  1. Where is the source code for our Pentaho stack?
  2. What are the build instructions? Can I build it on a VM somewhere, such that it's editable without destroying the running instance?
  3. Given some CSV data, how do I actually annotate it? Some example broken-extensions data is at admin4.generic.metrics.sjc1.mozilla.com:/home/glind/sample_viz_data
  4. How would I load data like that into the system, or make it available for research, starting from the JSON lines? (A minimal loading sketch follows this list.)
  5. Where are the URLs for seeing the JSON that goes out to the front end on existing studies?
  6. Where are the URLs for the other 'admin' pieces?
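
As a quick illustration of question 4, here is a minimal Python sketch for reading the sample data, assuming it is in JSON-lines form (one JSON object per line); the file name below is a hypothetical placeholder, not the real layout of sample_viz_data.

  import json

  def load_json_lines(path):
      """Read one JSON object per line, skipping blank or corrupt lines."""
      records = []
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.strip()
              if not line:
                  continue
              try:
                  records.append(json.loads(line))
              except ValueError:
                  continue  # tolerate the odd malformed line in raw dumps
      return records

  # e.g. records = load_json_lines("sample_viz_data/broken_extensions.jsonl")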


The answers to these tests have the added benefit of forming the foundation of a "Guide To Hack on DataViz Server" document.

(Updated) Expected Deliverables (via Pentaho Route)

  • method / script / skeleton for loading / linking data into the system
  • method / script / skeleton for annotation types / index hierarchies
  • 'meta-widget' that allows univariate and bivariate general statistics, using existing Pentaho framework parts.

Nice to have:

  • multiple data sets / aspects / views 'linked' onto one "page"


NOT EXPECTED:

  • auto-annotation and discovery of data types / hierarchies. Setting up a new data set will involve some manual work (say, a few hours).
  • Ponies. Advanced analyses will require downloading the (entire | sampled) dataset. We can give instructions for doing more using R/Python/Excel, but the user will have to do the actual work.


Timeline

Q1/Q2 2012

Access Control

Protecting User Anonymity

  • Raw collected data goes into tables marked as "sensitive", meaning nobody has yet sanitized or aggregated it or scanned it for potentially identifying information. Only internal users (say, those with an LDAP account) can do queries or make visualizations based on sensitive tables.
  • Visualizations based on data pulled from sensitive tables are also marked as sensitive, and can't be viewed except by internal users.
  • An internal user needs to write queries that pull from the sensitive tables to create new aggregated tables that don't contain anything potentially identifying. These new tables will then be marked "non-sensitive". The queries that create non-sensitive tables will have to be hand-written for each new data set, but hopefully most of their parts will be reusable, since the things we'll be looking to remove or aggregate will be similar each time. (A minimal sketch follows this list.)
  • Any user (community members, etc) can create and view visualizations based on non-sensitive tables.
  • To do: clear this plan with PR and legal, make sure they don't object to releasing the non-sensitive tables' data to the wild.
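
To make the sensitive-to-non-sensitive step concrete, here is a minimal Python/SQLite sketch; the table and column names (raw_events, agg_event_counts, and so on) are hypothetical, not a committed schema.

  import sqlite3

  # Hypothetical schema: raw_events(user_id, locale, event, ts) is a
  # "sensitive" table. The query aggregates it into a table with no
  # per-user rows, which could then be marked "non-sensitive".
  conn = sqlite3.connect("testpilot.db")
  conn.executescript("""
  CREATE TABLE IF NOT EXISTS agg_event_counts AS
  SELECT event,
         locale,
         COUNT(DISTINCT user_id) AS n_users,  -- how many users, never who
         COUNT(*)                AS n_events
  FROM raw_events
  GROUP BY event, locale;
  """)
  conn.commit()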

User Roles

(orthogonal to level of access permission)

Visualization creator

(advanced user)

Will create original visualizations by choosing a data source, choosing one or more variables from that source, and choosing a visualization method (see "Types of visualizations available" below) to apply to them. They can see previews while creating the visualization, and once they have decided what to make they can save it and share it. Designing a usable but powerful interface for visualization creators is an exciting challenge.
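
One way to make "choose a data source, variables, and method" concrete is a small declarative spec that the creator UI saves and the viewer later re-renders. A minimal sketch in Python; every field name here is illustrative, not a committed schema:

  # Hypothetical shape for a saved visualization.
  viz_spec = {
      "source": "agg_event_counts",    # must be a non-sensitive table
      "variables": ["event", "n_users"],
      "method": "bar",                 # one of the types in the next section
      "filters": {"locale": "en-US"},
      "created_by": "community-user-42",
  }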

Visualization viewer

Won't create visualizations but will view those created by others, either by following a shared link or by coming to the site and doing a search.

Types of visualizations available

Based on the kinds of questions that the UX team and others have asked me in the past, I think the following types of visualizations will be a good starting point:

  • "How much is feature X used". Choose to see this as: % of users who used feature at least once; or % of total actions which were feature X; or % of actions weighted by user activity
  • "How many X does a typical user have". See a plot of number-of-X vs. how many users have that many, along with median/mean/quartile data.
  • "What do people do after they X". Specify an event X (such as "open a new blank tab") and a time window (5 secs, 10 secs, etc.) and get back a graph of frequency of actions done within that time window after occurences of action X.
  • "Are variables X and Y correlated?". Plot users as data points on scatter plot with X on one axis, Y on the other, perhaps with regression lines and estimation of the statistical significance of the correlation.
  • "How is variable X changing over time?" Where available, show them a comparative visualization using data sets collected months apart
  • "How does any of the above break down by user locale / OS / self-reported tech level / etc". Allow any of the other graph types to be sliced up according to the metadata present in each data submission.

Full-text search

Ideally someone should be able to search for "bookmarks" and find:

  • Any created visualizations that include "bookmarks" as one of the data inputs
  • Any studies that provide bookmarks data, e.g. the "week in the life" study (which counted bookmarks) and the menu bar study (which measured how often someone picked a bookmark from the menu). So we'd need annotations on the studies, describing the data with as many keywords as possible, so the search function has something to grab onto. (A toy sketch follows this list.)
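
A toy Python sketch of such keyword annotations and a naive search over them; the study names and keywords are hypothetical examples:

  # Hypothetical per-study annotations: each study gets a keyword list
  # so that full-text search has something to match against.
  studies = {
      "week-in-the-life": {"keywords": ["bookmarks", "tabs", "history"]},
      "menu-bar": {"keywords": ["bookmarks", "menu", "clicks"]},
  }

  def search_studies(term):
      term = term.lower()
      return [name for name, meta in studies.items()
              if any(term in kw for kw in meta["keywords"])]

  print(search_studies("bookmarks"))  # -> ['week-in-the-life', 'menu-bar']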

Performance

Graphing the entire data set will probably take a while, so to keep the UI responsive, the server should respond to user input by showing a preview visualization built from a small random subset of the full data set. Only after the user has finalized the visualization they want will the server do a run over the full data set. (A sampling sketch follows.)
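
A minimal Python sketch of the preview-sampling idea, assuming the rows already fit in memory (for a database-backed store the equivalent would be a sampled query):

  import random

  def preview_sample(rows, k=1000, seed=None):
      """Return a small random subset of the data for fast previews.

      Pass a seed if the preview should stay stable while the user
      tweaks the visualization settings.
      """
      rng = random.Random(seed)
      if len(rows) <= k:
          return list(rows)
      return rng.sample(rows, k)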

Anurag suggests that for performance reasons we should marshal the data set into a bit-string representation, to get the size down so it can fit in server memory all at once; then the chart-plotting front end can avoid hitting the database at all. A nightly job could query the database and append the latest data to the marshaled version. Since we'll have to write custom queries anyway as part of aggregating/sanitizing data and preparing it for the visualizations listed under "Types of visualizations available", we can make those queries output the marshaled format.
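
The bit-string idea might look roughly like the following Python/NumPy sketch; the encoding (one bit per user per feature, meaning "used at least once") is a guess at what the marshaled representation would hold, offered only as an illustration:

  import numpy as np

  # Hypothetical: used[i, j] is True if user i used feature j at least
  # once. Packing to bits cuts memory 8x versus one byte per flag.
  n_users, n_features = 100_000, 64
  used = np.zeros((n_users, n_features), dtype=bool)

  packed = np.packbits(used, axis=1)   # the nightly job would write this out
  packed.tofile("usage_flags.bin")

  # The chart server reloads the packed matrix and answers questions like
  # "% of users who used feature j" without touching the database.
  raw = np.fromfile("usage_flags.bin", dtype=np.uint8)
  flags = np.unpackbits(raw.reshape(n_users, n_features // 8), axis=1)
  pct_used = flags.mean(axis=0) * 100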

Sharing

After a user creates a visualization by choosing the data set, visualization type, and variables to be used, they get a permanent link to that visualization which they can share with others. The linked visualization should be "live", i.e. when someone else visits it they will see it updated with any new data that has come in.
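
One plausible way to mint the permanent link is to store the spec server-side and derive a short id from its content, as in this Python sketch (the URL shape is hypothetical). Because the link stores the recipe rather than a rendered image, revisiting it re-runs the query and shows new data, which gives the "live" behaviour described above:

  import hashlib
  import json

  def permalink_id(viz_spec):
      """Derive a stable short id from a saved visualization spec."""
      canonical = json.dumps(viz_spec, sort_keys=True)
      return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:10]

  # e.g. https://dataviz.example.org/viz/<permalink_id(viz_spec)>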

Developing New Studies

If the user doesn't find the data they're looking for, the server should guide them towards the Test Pilot study development tutorial and the place where they can submit a new Test Pilot study to the review queue (hint hint).