Data Publishing: Difference between revisions

Add table of contents, linkable headers
(added the descriptions for aggregation levels.)
(Add table of contents, linkable headers)
 
Line 1: Line 1:
<big>'''Introduction'''</big>
== Introduction ==


Mozilla’s history is steeped in openness and transparency; it’s simply core to what we do and how we see ourselves in the world. We are always looking for ways to bring our mission to life that help create a healthy internet and support the Mozilla Manifesto. One of our commitments says: “We are committed to an internet that elevates critical thinking, reasoned argument, shared knowledge, and verifiable facts”.
Line 5: Line 5:
To this end, we have spent a good amount of time considering how we can publicly share our Mozilla telemetry data sets. Sharing them is one of the simplest and most effective ways we can enable collaboration and share knowledge, but only if it can be done safely and in a privacy-protecting, principled way. We believe we’ve designed a way to do this, and we are excited to outline our approach here.


<big>'''Dataset Publishing Process'''</big>
== Dataset Publishing Process ==


We want our data publishing review process, as well as our review decisions, to be public and understandable, similar to our [[Firefox/Data_Collection|Mozilla Data Collection]] program. To that end, our full dataset publishing policy and the considerations we weigh before determining what is safe to publish can be found below, including a summary of the critical pieces of that process.
Line 23: Line 23:
*  How we characterize the levels of aggregation


<big>'''How we characterize the levels of aggregation'''</big>
== How we characterize the levels of aggregation ==


The table below describes the aggregation levels we define.
Line 57: Line 57:
|}
|}


<big>'''How we characterize the sensitivity of dimensions'''</big>
== How we characterize the sensitivity of dimensions ==


Based on the [https://wiki.mozilla.org/Firefox/Data_Collection#Data_Collection_Categories Data Collection Categories], most Telemetry data naturally falls within categories 1 (technical data) and 2 (interaction data), which are not considered sensitive. A notable exception, however, is geolocation: we geocode IP addresses to extract City / Region / Country, but only include cities with a population > 15,000 (according to the GeoNames database).
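The city population rule above can be illustrated with a minimal Python sketch. Only the 15,000 threshold comes from the text; the function name, the record type, and the choice to keep Region and Country while dropping the city are assumptions made for illustration, not a description of the actual geocoding pipeline.

```python
from typing import NamedTuple, Optional

# Threshold taken from the text: only cities with population > 15,000 are included.
CITY_POPULATION_THRESHOLD = 15_000


class GeoRecord(NamedTuple):
    """Hypothetical geocoding result; field names are illustrative only."""
    city: Optional[str]
    region: str
    country: str


def coarsen_geo(city: Optional[str], region: str, country: str,
                city_population: Optional[int]) -> GeoRecord:
    """Keep the city only when its population exceeds the threshold.

    Assumption: for smaller (or unknown-size) cities the city field is
    dropped while region and country are retained.
    """
    if (city is not None and city_population is not None
            and city_population > CITY_POPULATION_THRESHOLD):
        return GeoRecord(city, region, country)
    return GeoRecord(None, region, country)
```

For instance, under these assumptions a city of 9,000 people would be reported with its region and country only.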
Line 74: Line 74:
| '''Category 4 (Highly Sensitive Data)''' || '''1, 2''' || Technically, category 4 often involves highly sensitive data, such as explicit identifiers, that will be removed in the process of aggregation. We include it here for the sake of completeness.
|}
|}
<big>'''How do we characterize the sensitivity of metrics?'''</big>
== How do we characterize the sensitivity of metrics? ==


Most metrics are not sensitive information, per se. That said, if a metric indicates or directly implies something about revenue, it is “sensitive”. Search counts are one example.


<big>'''Dataset Publishing Process'''</big>
== Dataset Publishing Process ==


The goal of this process is to (1) make the “easy” (that is, safe) data publishing requests relatively frictionless, (2) have guard rails in place so we don’t publish something that exposes us or our users to risk in some way, and (3) ensure that the dataset publishing request process closely matches other processes that are familiar to the data stewards.
Line 97: Line 97:
*  Once the dataset has been published, it will be announced on the [https://blog.mozilla.org/data/ Data @ Mozilla blog]. Accessing the public data is described on the [https://docs.telemetry.mozilla.org/cookbooks/public_data.html data documentation page].


<big>'''Definitions'''</big>
== Definitions ==


'''Metric''' - A metric is anything we want to measure.
Line 111: Line 111:
'''Tabular Data''' - Data that consists of rows (or records) and columns (or fields). Each row has the same number of columns, and each column represents a dimension or metric for that row. Think of a spreadsheet or CSV file as examples of this type of data.


<big>'''What's Been Published So Far?'''</big>
== What's Been Published So Far? ==


Our publicly available datasets are [https://public-data.telemetry.mozilla.org/all-datasets.json here].
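As a rough sketch of how the listing might be read programmatically, the Python below downloads the JSON file linked above and lists its top-level keys. The URL is the one given in the text; the helper names are hypothetical, and treating the top-level keys as dataset names is an assumption about the file's layout that should be checked against the data documentation page.

```python
import json
from urllib.request import urlopen

# URL taken from the text above.
ALL_DATASETS_URL = "https://public-data.telemetry.mozilla.org/all-datasets.json"


def fetch_listing(url: str = ALL_DATASETS_URL, timeout: float = 30.0):
    """Download and decode the public dataset listing (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def dataset_names(listing) -> list:
    """Return the sorted top-level keys of the listing.

    Assumption: the listing is a JSON object keyed by dataset name.
    Any other shape yields an empty list rather than an error.
    """
    if isinstance(listing, dict):
        return sorted(listing)
    return []
```

Calling `dataset_names(fetch_listing())` would then give the names to look up in the documentation.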