Test Pilot/Metrics

From MozillaWiki

The Test Pilot program ended in January 2019.

To read more, please see: Adios, Amigos. The information below is being retained for posterity. Thanks to all who traveled along with us!

Test Pilot is an opt-in platform that allows us to perform controlled tests of new high-visibility product concepts in the general release channel of Firefox. It is not intended to replace trains for most features, nor is it a test bed for concepts we do not believe have a strong chance of shipping in general release. Rather, it is reserved for features that require user feedback, testing, and tuning before they ship with the browser.

Metrics for individual tests

Each metric should meet two strategic purposes:

  • It should provide an actionable data-point to help you improve your test.
  • It should provide product owners and the Test Pilot Council with the data necessary to make informed decisions about the outcome of each test. For example: should this test result in the product shipping in Release or AMO? Should this test be used as a springboard for future UX iteration?

Core and Secondary Metrics

Because Test Pilot tests will vary in nature, core metrics may shift between individual tests. All tests will be instrumented to measure each of the following. Test Pilot Mission Control will work with each test initiator to determine the most appropriate core metrics for each test.

Net Promoter Score

Answers the Question: Is this the quality of experience users are looking for?

How we measure it: NPS is measured using surveys managed by the Test Pilot add-on that will fire at intervals over the life of your test.
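The arithmetic behind the score is the standard NPS formula. As a sketch (the 9–10 promoter and 0–6 detractor buckets are the conventional NPS cutoffs, not anything Test Pilot-specific):

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses.

    Promoters answer 9-10, detractors 0-6; the score is the percentage
    of promoters minus the percentage of detractors (-100 to 100).
    """
    if not scores:
        raise ValueError("no survey responses yet")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)
```

For example, `nps([10, 9, 8, 7, 6, 0])` has two promoters and two detractors out of six responses, so the score is 0.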

Usage Rate per User

Answers the Question: Is this the type of experience people are looking for?

How we measure it: Measuring usage across different tests is challenging, since a healthy usage rate on one service might not look like a healthy usage rate on another. To measure usage for your idea, the Test Pilot Council will help you determine a single event in each usage cycle of your test to be captured (e.g., for a screenshotting test, the key usage event might fire when a screenshot is taken). Usage is defined as the rate at which this event fires over time. Usage is recorded from the time a user starts a test until the test concludes, so a user who quits a test early drags down the aggregate usage rate.
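A minimal sketch of that definition, assuming timestamps in seconds since epoch (the function names are illustrative, not part of any Test Pilot API). The denominator runs from when the user joined to when the test ends, which is why an early quitter drags the aggregate rate down:

```python
DAY = 86400.0  # seconds per day

def user_usage_rate(event_times, user_start, test_end):
    """Per-user usage rate, in key usage events per day.

    event_times: timestamps at which the key usage event fired
    user_start:  when this user enabled the test
    test_end:    when the test concludes (shared by all users)
    """
    days_enrolled = (test_end - user_start) / DAY
    events = sum(1 for t in event_times if user_start <= t <= test_end)
    return events / days_enrolled

def aggregate_usage_rate(per_user_rates):
    """Aggregate usage is the mean of the per-user rates."""
    return sum(per_user_rates) / len(per_user_rates)
```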

The Test Pilot Council will define a target usage rate at the outset of your test.

Taking action based on usage and NPS

High Usage, Low NPS: This scenario likely indicates you’re building something users want to like, but the quality and polish of your experience is lacking. Recommendations:
  • Focus on quality-of-experience issues.
  • Dive into feedback with a specific focus on reported bugs and other gripes.
  • Engage with the UX team for directed feedback about your test.

High Usage, High NPS: Congrats, your idea is awesome and you are awesome. Let’s ship this thing!

Low Usage, Low NPS: The good news is that there are a lot of options to try here. Start by looking at retention metrics. Are people joining and leaving your test, or do they simply not join at all? In the former case, dive into user feedback to determine issues with build quality and feature set. In the latter case, start by engaging with the Product team to help repackage your test to make it more intelligible and appealing to potential users.

Low Usage, High NPS: This scenario likely means you’ve got a high-polish product, but one that may not be addressing broad user problems. Recommendations:
  • Focus on feedback addressing feature requests.
  • Engage with the Product team to make sure you are describing your feature accurately to potential users.
  • Consider simplifying your offering or making it more accessible.
  • Consider shipping in AMO.


Engagement

Answers the Question: Do people believe this is the kind of thing they want?

How we measure it: Engagement is measured as the % of all Test Pilot users who have your test enabled.
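In other words, a single ratio (a trivial sketch; both counts would come from Test Pilot telemetry):

```python
def engagement_rate(users_with_test_enabled, total_test_pilot_users):
    """Engagement: the % of all Test Pilot users with this test enabled."""
    if total_test_pilot_users == 0:
        raise ValueError("no Test Pilot users to measure against")
    return 100.0 * users_with_test_enabled / total_test_pilot_users
```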

How to deal with low engagement: Consider whether your feature set is meeting user needs by looking at feedback around feature set. Do users understand what you are offering? Consider whether you are clearly explaining the value proposition of your feature.

Outcomes of low engagement: Engagement speaks to the mass appeal of your experiment. Low engagement at the end of a test likely means your product will not ship in Firefox. Depending on NPS and usage, it may be a target for AMO.

Note: Some tests may begin with set installation caps. In this case, engagement isn’t really a useful metric, since you will be testing on a fixed-size cohort.

Usage Retention & Churn & Installation Retention & Churn

Answers the question: Is this idea sticky? Does it work for users?

How we measure it: Retention is the % of users who keep using an idea over time; Churn is the rate at which users stop using it.
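As a sketch over one measurement period, using sets of user IDs (an assumed representation; the wiki does not specify one):

```python
def retention_and_churn(active_at_start, active_at_end):
    """Retention and churn for one period, as complementary percentages.

    active_at_start / active_at_end: sets of user IDs active at the
    period boundaries. Retention is the share of starters still active
    at the end; churn is the share who stopped.
    """
    if not active_at_start:
        raise ValueError("no users at start of period")
    retained = len(active_at_start & active_at_end)
    retention = 100.0 * retained / len(active_at_start)
    return retention, 100.0 - retention
```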

How to deal with low retention & high churn: Dive into feedback. Is your idea getting in people’s way? Is build quality an issue? Retention may be high if your idea sucks but is unobtrusive. On the other hand, retention may be very low if your idea is pretty good but constantly barks at people or interrupts their primary task. If low retention correlates with high usage, this is likely the case. Use retention and churn to understand how the experience of your idea fits into users’ workflows.

Outcomes of Low retention: Your idea will be taken out to the woodshed.

Why we measure two types of retention and churn: People don’t uninstall/hide/disable things they don’t use, as long as they aren’t in the way. Usage churn gives us a signal about the utility of a test regardless of user action.

Churn on installation doesn’t say much about the overall utility of the idea being tested, but it does indicate that the UX is deficient, annoying, or blocking on core browsing tasks.

Time to Churn

How we measure it: For each user who leaves your test, we measure the elapsed time between when they joined the test and when they left it.

Answers the question: Is there something fundamentally user-hostile about your product, or is it merely annoying over time?

What to do with fast churn: Your product is likely failing at some fundamental level. Is it taxing browser performance or flagrantly getting in the way of users’ normal workflows? Dive into data to find out.
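One way to sketch the measurement, assuming we log per-user join and leave timestamps in seconds (the function names are illustrative). A median near zero points at the fundamentally user-hostile case; a long median points at slow-burn annoyance:

```python
import statistics

DAY = 86400.0  # seconds per day

def times_to_churn(join_leave_pairs):
    """Days each churned user spent in the test before leaving.

    join_leave_pairs: (join_ts, leave_ts) tuples for users who left.
    """
    return [(leave - join) / DAY for join, leave in join_leave_pairs]

def median_time_to_churn(join_leave_pairs):
    """A robust single-number summary of how fast users churn."""
    return statistics.median(times_to_churn(join_leave_pairs))
```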

Churn Rationale Percentages

How we measure it: When users leave a test, we ask them to provide a reason from a list of four (“This thing is broken”, “I don’t like this feature”, “This isn’t useful for me”, “Something else”). Users engaged in a test can provide feedback any time they like (“Something seems broken”, “Request a feature”, “This is cool”, “Something else”). Users can optionally provide a more detailed description for each.
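Tallying those four exit reasons into percentages is straightforward (a sketch; the strings below are straight-quoted stand-ins for the survey options):

```python
from collections import Counter

EXIT_REASONS = (
    "This thing is broken",
    "I don't like this feature",
    "This isn't useful for me",
    "Something else",
)

def churn_rationale_percentages(reasons):
    """% share of each exit reason among all users who left the test.

    reasons: one EXIT_REASONS entry per departing user.
    """
    if not reasons:
        raise ValueError("no exit surveys recorded")
    counts = Counter(reasons)
    return {r: 100.0 * counts[r] / len(reasons) for r in EXIT_REASONS}
```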

Answers the question: Why do users report they are leaving my test?

What to do with this data: Use it to help you make informed decisions about high-impact changes to make to your test.

Bounce rate from Idea Details page

Answers the question: Is the description you give of your test enticing to users? Do users identify with your test?

How we measure it: A user bounces if they come to your test details page without the test installed and then leave the page without installing it.
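A sketch of the ratio, assuming we count qualifying page views (visitors without the test installed) and the installs those views produce:

```python
def bounce_rate(qualifying_visits, installs_from_page):
    """Bounce rate for a test details page, as a percentage.

    qualifying_visits:  page views by users without the test installed
    installs_from_page: how many of those views ended in an install
    """
    if qualifying_visits == 0:
        raise ValueError("no qualifying visits")
    bounces = qualifying_visits - installs_from_page
    return 100.0 * bounces / qualifying_visits
```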

Adoption Curve

Answers the question: Do specific events in Test Pilot (promotions, etc.) or changes to the packaging of my add-on move the needle? This metric is useful to the Test Pilot team to help us understand how to better draw users to your idea (and to future ideas).

General for Test Pilot

Core Metrics

% of Users Engaged in 0,1,2,3...n tests over time

Answers the question: Does Test Pilot promote engagement? Is the current batch of ideas useful overall? Are some ideas more useful than others?
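A sketch of how such a distribution could be tallied at a point in time (the input representation, one enabled-test count per user, is an assumption):

```python
from collections import Counter

def engagement_distribution(tests_per_user):
    """% of Test Pilot users enrolled in 0, 1, 2, ... n tests.

    tests_per_user: the number of enabled tests for each user.
    """
    if not tests_per_user:
        raise ValueError("no users")
    counts = Counter(tests_per_user)
    total = len(tests_per_user)
    return {n: 100.0 * counts[n] / total for n in range(max(counts) + 1)}
```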

Average NPS response score across active tests

Answers the question: Does Test Pilot make Firefox a more recommendable product? Is the current batch of ideas attractive overall?

Health Metrics

Answers the question: How is the site code and infrastructure performing? This is standard stuff for a website, e.g., response times, percentage of errors, etc.

Secondary Metrics

Average Idea Duration

Answers the question: How long should ideas plan to be in Test Pilot?

Overall NPS response rate across active tests

Answers the question: Does the current batch of tests promote engagement?

Average Engaged Retention & Churn

Answers the question: Are current tests working for users?

How we measure it: This is aggregate churn data of all active tests.

Retention & Churn

Answers the question: Is Test Pilot working for users? Is the service itself healthy?

How we measure it: We define a churned user in this case as a user who has uninstalled Test Pilot or who has not interacted with Test Pilot UI in one month.
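That definition reduces to a simple predicate (a sketch; the wiki says “one month”, so the 30-day constant here is an assumption):

```python
DAY = 86400.0         # seconds per day
ONE_MONTH = 30 * DAY  # assumed reading of "one month"

def is_churned(uninstalled, last_interaction_ts, now):
    """True if the user uninstalled Test Pilot or hasn't interacted
    with its UI within the last month."""
    return uninstalled or (now - last_interaction_ts) > ONE_MONTH
```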

Bounce Rate through Sign Up and Install

Answers the question: Is the Test Pilot flow effective?

MAUs v. Total Sign Ups

This may be a vanity metric, but it will help give us a sense of the scale of our population and give us insight into the statistical significance of our data.

Average number of ideas installed

Detail concerning engagement. May not have a clear value from launch, except that it might help us define user typologies down the road, and gives a bit more insight into the ways in which users are making use of Test Pilot.

Test Pilot Lifecycle Events

Since most metrics will be visualized as time-based histograms, we will need to log several types of events in order to see how they affect each metric:

  • Shipping a new test
  • Shipping a train in an existing test
  • Shipping trains in Test Pilot
  • Marketing/promotion push for Test Pilot or an individual test
  • Shipping new versions of Firefox