Data Collection: Difference between revisions

From MozillaWiki
(Add Chenxia's bugzilla alias)
(Revised documentation to reflect the new review process as well as links to review assets.)
Line 1: Line 1:
Firefox sends various data back to Mozilla. This data keeps the browser up to date, powers various features, provides user support, and helps improve the product itself. This page documents the policy for how and why we add new data collection metrics. The owner and peers of the Firefox Data Collection policy module are responsible for making decisions about data collection systems and measurements.
 
At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Firefox data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products.
 
Recently, we’ve revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:
 
*  Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
*  Privacy - We give users meaningful choices and control over their own data
*  Transparency - We make our decisions about data collection public and accessible
*  Accountability - We assign accountability for the design, approval, and implementation of data collection 


Owner: [https://mozillians.org/en-US/u/rweiss/ Rebecca Weiss]
Line 11: Line 19:
''Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared [http://docs.telemetry.mozilla.org/ data documentation] which can help!''


== Data Collection Categories ==
Most assets involved in data review can be found [https://github.com/mozilla/data-review in this repository].  The documentation below explains who fills out each form, and when.
 
 
= Key Roles for Data Collection =
 
While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:
 
* Data requester - the person requesting data to be collected
* Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies
 
In some cases a data steward may escalate concerns to the Trust and Legal teams. These teams are responsible for defining Firefox data collection policies and can field questions about internal policy and the laws governing user privacy.
 
= Requesting Data Collection =
== Step 1: Submit Request ==
Data requesters start the process by creating a bug/issue and completing a [https://github.com/mozilla/data-review/blob/master/request.md request form], which asks questions such as why Mozilla needs to collect the data, how much data is necessary, and how long it will be collected. The request is publicly available, as are any comments and approvals.  The detailed steps in this process are:
 
* Create a bug using the data review bug form
* Clone the “Request for data collection” form and answer all 9 questions.
* Add a file to the bug that contains the measurement code to be landed in Firefox
* Flag a data steward peer for review with r? on the file that contains your completed form
 
== Step 2: Request is reviewed ==
Data stewards review each request to ensure that it is fully documented and to assign the data collection to one of our four privacy categories (see Data Collection Categories below).  The detailed steps in this process are:
 
* Data stewards receive an r? on a file in a bug
* Data stewards complete the [https://github.com/mozilla/data-review/blob/master/review.md data review form] based on the information provided in the data collection request. They ensure that:
** The request follows Lean Data Practices & Guidelines.
** The basic mechanics of what is being measured are documented publicly.
** Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to the questions on the request form.
** The request aligns with the user consent and control mechanisms outlined in the data collection categories listed below.
 
Data stewards document the outcome of their review in the bug with an r+ or r- and their completed form. Typical outcomes include:
* Unapproved requests are returned to data requesters for changes or clarification.
* Simple requests that fall within Category 1 or 2 are often approved quickly.
* Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)
== Step 3: (Optional) Escalated Response ==
More complex requests, like those that call for a new data collection mechanism or require changes to the privacy notice, often require one or more of the following additional reviews:
* Privacy analysis: Feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
* Policy compliance review: An assessment from the Mozilla data compliance team to determine if the request matches the Mozilla data compliance policies and documents.
* Legal review: An assessment from Mozilla’s legal team.
 
Data stewards participate in these discussions and will document the outcome in the same bug used for the collection request.
 
= Data Collection Categories =


There are four "categories" of data collection that apply to Firefox:
Line 38: Line 90:
** Pre-Release: Default off.  May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
** Release: Default off.  May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
== Requirements ==
'''Requirements For All Data Collection From Firefox'''
* Specifics about the collected data must be documented using the in-tree histogram definitions or the in-tree documentation system (.rst files). This documentation should be detailed enough that people don't need to read the code implementation to understand what data is being collected.
* Any changes to data collection must be approved by the data collection module owner or peers by requesting review on the patch which updates the in-tree documentation.
* The bug or documentation must publicly identify the problem statement that will be solved by collecting data.
* There must be a person who takes responsibility for the correctness of the data.
* There must be a concrete plan for using the data, and a person who takes responsibility for this.
* The data must be included in the Firefox privacy notice. Much of the time, data collection requires no changes, but when changes are required the data stewards will work with Marshall Erwin and the Mozilla legal team to make sure that the privacy notice accurately reflects the collected data.
''Note: the data stewards do not typically verify that the patch collects the data correctly according to the documentation. That is the responsibility of the code reviewer.''
== Requesting Approval ==
It is our intention to review every new data collection within Firefox, but to do so quickly and with minimal overhead. For every new measurement, even a simple new Telemetry probe, please request approval by setting the feedback flag for the data collection module owner or a peer. Simple requests should be handled within a day.
More complex requests, and especially requests which add a new kind of data collection mechanism or require changes to the privacy notice, will require more extensive review. Please consider pinging the team about these as they are being designed! Additional discussions/review may include:
* Privacy analysis: This may involve requesting feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
* Data compliance review: a review with the Mozilla data compliance team to ensure that changes match the Mozilla data compliance policies and documents.
* Legal review: If necessary, the module owner will request a legal review from Mozilla's legal team. A legal review will be necessary for any changes to the privacy policies/notices.
* Data quality/statistical review: In cases where data analysis and quality is uncertain, the module owner will request additional feedback from the Mozilla metrics team and other experts to validate data analysis plans.
* UX review: We may request/require feedback from the Firefox UX team on any proposed privacy/data-control UI.
== Common Problems ==
'''histogram descriptions'''
* Histogram descriptions should record *what* is being collected, in detail.
* It is important to say *when* a value is recorded, because this is often a confusing point when constructing analysis.
* Include units: for example, indicate whether a time duration is measured in seconds, milliseconds, or microseconds.
* When counting, be sure to indicate how repeat usage works. For example, when counting decoding errors, are multiple issues counted for the same video, or only the first one?
'''enumerated histograms'''
Enumerated histograms should either list all the possible enumeration values in the histogram description, or reference a declared enumeration in the tree by name.
'''keyed histograms'''
Keyed histograms contain arbitrary strings in the key, so they get extra attention. Please be careful of:
* Don't use a keyed histogram if you don't need it! Many times a simpler format such as a count or enumeration histogram can solve the same problem.
* The key should not contain user-input data, or other data that can be used to identify particular users.
* In general, keys should be a limited set of values. If you expect more than tens of values, the default aggregations for this histogram will blow up. If this is still required, you should file a bug to have the default aggregations disabled.
* The histogram description should describe exactly what the key contains, and the format.
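One way to keep keys within a limited, non-identifying set is to map every raw value onto a fixed allowlist before recording it. The following is a minimal sketch in Python rather than Firefox code: the names <code>ALLOWED_KEYS</code>, <code>OTHER_KEY</code>, and <code>bucket_key</code> are hypothetical and chosen only for illustration, as are the example key values.

```python
# Hypothetical allowlist of keys; a real histogram would list its own
# values and document them in the histogram description.
ALLOWED_KEYS = frozenset({"h264", "vp9", "av1"})

# Catch-all bucket for anything outside the allowlist.
OTHER_KEY = "other"


def bucket_key(raw_key: str) -> str:
    """Return an allowlisted key, or OTHER_KEY for any unexpected value.

    Unexpected values (including user-input strings) are never passed
    through, so the set of recorded keys stays small and fixed.
    """
    key = raw_key.strip().lower()
    return key if key in ALLOWED_KEYS else OTHER_KEY
```

With this approach the number of distinct keys is bounded by the allowlist plus one, and arbitrary strings never reach the recorded data.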
'''JS exceptions'''
It is a common request to record information about JS exceptions in certain contexts. In the general case, it is very difficult to prove that this information cannot contain personal data. If you have specific types of errors which are thrown at known locations, you can record information about those. This is an important but unsolved problem.
== Data Collection Properties ==
When proposing a new measurement or data system, please consider your requirements and the necessary data properties:
Function:
* Is the data collection necessary for Firefox to function properly? For example, the automatic update check must be sent in order to keep Firefox up to date.
* Is there a specific user-visible function planned for the data?
* Population: Is it necessary to take a measurement from all users? Or is it sufficient to measure only prerelease users?
* Sampling: is it necessary to get data from all users, or is it sufficient to collect data from a smaller sample?
* Will data submission be automatic, or will there be opt-in UI?
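If a smaller sample is sufficient, one common technique is deterministic hash-based sampling, where the same client always gets the same in-or-out decision for a given measurement. The sketch below is an illustration under stated assumptions, not a description of how Firefox samples: the function <code>in_sample</code> and its parameters (<code>client_id</code>, <code>measurement</code>, <code>rate</code>) are hypothetical names.

```python
import hashlib


def in_sample(client_id: str, measurement: str, rate: float) -> bool:
    """Decide deterministically whether a client falls in the sample.

    `rate` is the fraction of clients to include, from 0.0 to 1.0.
    Including the measurement name in the hash input means different
    measurements select different subsets of clients.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(f"{measurement}:{client_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the top of the range.
    return bucket < int(rate * 2**64)
```

A decision made this way needs no stored state, and it can be recomputed during analysis to confirm which clients were expected to report.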
Analysis and Reporting:
* Who will be analyzing the data?
* Will the data that's being collected answer the questions we have?
* Will it be a single or periodic report?
* Is it desirable to track data changes over time? With what frequency? With what latency?
* Will the data reporting be private or public?
* Will the raw data being collected be private or public?
* Is it necessary to keep the measurement forever, or is it sufficient to run a short-term experiment/single report?
Privacy (and Legal):
* Does the data contain sensitive or personal information?
* Can the data be used in combination with other measurements to identify a particular person?
* What kinds of user controls will be exposed to control data submission?
* Will users be able to see their own data before or after it has been submitted, either within Firefox or from the server?
* Does the data conform to the existing Mozilla [https://www.mozilla.org/en-US/privacy/principles/ privacy principles], the [https://www.mozilla.org/en-US/privacy/ Mozilla Privacy Policy], and the [https://www.mozilla.org/en-US/privacy/firefox/ Firefox privacy notice]?
* Does this data collection represent any unusual privacy or legal risk to users or Mozilla?





Revision as of 21:06, 1 November 2017

At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Firefox data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products.

Recently, we’ve revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:

  • Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
  • Privacy - We give users meaningful choices and control over their own data
  • Transparency - We make our decisions about data collection public and accessible
  • Accountability - We assign accountability for the design, approval, and implementation of data collection

Owner: Rebecca Weiss

Peers:

Group email: fx-datastewards@mozilla.com

Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared data documentation which can help!

Most assets involved in data review can be found in this repository. The documentation below explains who fills out each form, and when.


Key Roles for Data Collection

While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:

  • Data requester - the person requesting data to be collected
  • Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies

In some cases a data steward may escalate concerns to the Trust and Legal teams. These teams are responsible for defining Firefox data collection policies and can field questions about internal policy and the laws governing user privacy.

Requesting Data Collection

Step 1: Submit Request

Data requesters start the process by creating a bug/issue and completing a request form, which asks questions such as why Mozilla needs to collect the data, how much data is necessary, and how long it will be collected. The request is publicly available, as are any comments and approvals. The detailed steps in this process are:

  • Create a bug using the data review bug form
  • Clone the “Request for data collection” form and answer all 9 questions.
  • Add a file to the bug that contains the measurement code to be landed in Firefox
  • Flag a data steward peer for review with r? on the file that contains your completed form

Step 2: Request is reviewed

Data stewards review each request to ensure that it is fully documented and to assign the data collection to one of our four privacy categories (see Data Collection Categories below). The detailed steps in this process are:

  • Data stewards receive an r? on a file in a bug
  • Data stewards complete the data review form based on the information provided in the data collection request. They ensure that:
    • The request follows Lean Data Practices & Guidelines.
    • The basic mechanics of what is being measured are documented publicly.
    • Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to the questions on the request form.
    • The request aligns with the user consent and control mechanisms outlined in the data collection categories listed below.

Data stewards document the outcome of their review in the bug with an r+ or r- and their completed form. Typical outcomes include:

  • Unapproved requests are returned to data requesters for changes or clarification.
  • Simple requests that fall within Category 1 or 2 are often approved quickly.
  • Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)

Step 3: (Optional) Escalated Response

More complex requests, like those that call for a new data collection mechanism or require changes to the privacy notice, often require one or more of the following additional reviews:

  • Privacy analysis: Feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
  • Policy compliance review: An assessment from the Mozilla data compliance team to determine if the request matches the Mozilla data compliance policies and documents.
  • Legal review: An assessment from Mozilla’s legal team.

Data stewards participate in these discussions and will document the outcome in the same bug used for the collection request.

Data Collection Categories

There are four "categories" of data collection that apply to Firefox:

Category 1 “Technical data”
Information about the machine or Firefox itself. Examples include OS, available memory, crashes and errors, outcome of automated processes like updates, safebrowsing, activation, version #s, and buildid. This also includes compatibility information about features and APIs used by websites, addons, and other 3rd-party software that interact with Firefox during usage.
Category 2 “Interaction data”
Information about the user’s direct engagement with Firefox. Examples include how many tabs, addons, or windows a user has open; uses of specific Firefox features; session length, scrolls and clicks; and the status of discrete user preferences.
Category 3 “Web activity data”
Information about user web browsing that could be considered sensitive. Examples include users’ specific web browsing history; general information about their web browsing history (such as TLDs or categories of webpages visited over time); and potentially certain types of interaction data about specific webpages visited.
Category 4 “Highly sensitive data”
Information that directly identifies a person, or if combined with other data could identify a person. Examples include e-mail, usernames, identifiers such as google ad id, apple id, fxaccount, or certain cookies. It may be embedded within specific website content, such as memory contents, dumps, captures of screen data, or DOM data.

Eligibility for Default on Data Collection

  • Categories 1 & 2 (Technical & Interaction data)
    • Pre-Release & Release: Data may default on, provided the data is exclusively in these categories (it cannot be in any other category). In Release, an opt-out must be available for most types of Technical and Interaction data. Teams may limit data collection to pre-release populations if appropriate for testing/validation, cost reduction, or risk mitigation.
  • Category 3 (Web activity data)
    • Pre-Release: May be eligible for default on data collection, provided there is an opt-out.
    • Release: Default off. May be eligible for opt-out on a case-by-case basis if mitigations are identified. Mitigations may include UX changes that make users aware of additional risk, technical mechanisms that remove the risk, or a risk assessment done on a case-by-case basis that determines the risk is limited.
  • Category 4 (Highly Sensitive data)
    • Pre-Release: Default off. May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
    • Release: Default off. May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
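
The eligibility rules above can be read as a lookup from category and release channel to a starting default. The Python sketch below is only an illustration of that reading, not an official tool: the names Channel, Eligibility, RULES, and eligibility_for are hypothetical. It also assumes that data spanning several categories is governed by the highest-numbered one, which is an interpretation of the requirement that default-on data be exclusively in Categories 1 and 2.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class Channel(Enum):
    PRE_RELEASE = "pre-release"
    RELEASE = "release"


class Eligibility(NamedTuple):
    default_on: bool  # starting default before any case-by-case exception
    condition: str    # condition paraphrased from the policy text


_CAT_1_2 = Eligibility(True, "data must be exclusively Category 1 or 2")
_CAT_4 = Eligibility(False, "opt-in only, with advance notice, consent, and an opt-out")

RULES = {
    (1, Channel.PRE_RELEASE): _CAT_1_2,
    (1, Channel.RELEASE): _CAT_1_2,
    (2, Channel.PRE_RELEASE): _CAT_1_2,
    (2, Channel.RELEASE): _CAT_1_2,
    (3, Channel.PRE_RELEASE): Eligibility(True, "an opt-out must be provided"),
    (3, Channel.RELEASE): Eligibility(False, "case-by-case exception if mitigations are identified"),
    (4, Channel.PRE_RELEASE): _CAT_4,
    (4, Channel.RELEASE): _CAT_4,
}


def eligibility_for(categories: Iterable[int], channel: Channel) -> Eligibility:
    """Look up the rule for data that falls into one or more categories.

    Assumption: the highest-numbered category present decides the rule.
    """
    cats = set(categories)
    if not cats or not cats <= {1, 2, 3, 4}:
        raise ValueError("categories must be a non-empty subset of {1, 2, 3, 4}")
    return RULES[(max(cats), channel)]
```

For example, eligibility_for({1, 2}, Channel.RELEASE) reports a default-on starting point, while adding Category 3 to the same set on Release reports default off. The final decision in every case still rests with the data stewards.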


Other Practices

Every year, the data collection owner and peers will survey all of the existing data collection systems within Firefox. This survey has the following goals:

  • To ensure that it is still necessary and useful to collect a piece of data.
  • To re-identify who is responsible for the collection, monitoring, and reporting of collected data.