Data Publishing

3,154 bytes added, 22:56, 18 September 2020
|-
|}
 
'''How do we characterize the sensitivity of dimensions?'''
 
Based on the Data Collection Categories, most Telemetry data naturally falls within category 1 (technical data) or category 2 (interaction data), neither of which is considered sensitive. A notable exception is geolocation: we geocode IP addresses to extract city, region, and country, but only include cities with a population greater than 15,000 (according to the GeoNames database).
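
As a rough illustration of the population threshold, the sketch below suppresses the city value unless its population exceeds the cutoff. The function name and the small population table are hypothetical stand-ins; the actual pipeline geocodes from IP addresses against the GeoNames database, which is not shown here. Dropping unknown or small cities to an empty value is also an assumption about how exclusion is applied.

```python
from typing import Optional

MIN_CITY_POPULATION = 15_000  # threshold stated in the text above

# Hypothetical stand-in for a GeoNames-derived population lookup.
CITY_POPULATION = {"Springfield": 116_000, "Smallville": 4_200}


def publishable_city(city: Optional[str]) -> Optional[str]:
    """Return the city only if its known population exceeds the threshold."""
    if city is None:
        return None
    population = CITY_POPULATION.get(city)
    if population is None or population <= MIN_CITY_POPULATION:
        # Assumption: unknown or small cities are suppressed entirely.
        return None
    return city
```

Region and country would pass through unchanged in this sketch; only the city field is filtered.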
 
Category 3 (web activity) or 4 (highly-sensitive) data should be excluded from the set of “safe” dimensions.
 
Matrix of aggregation safety vs. dimension sensitivity:
{| class="wikitable"
|-
! Category !! Aggregation Level !! Notes
|-
| '''Category 1 (Technical) and 2 (Interaction)''' || '''1, 2, 3''' || For low-sensitivity data, we may not require a minimum bucket size for aggregation.
|-
| '''Category 3 (Web Activity)''' || '''1, 2''' || As sensitivity increases, minimum bucket size becomes increasingly important.
|-
| '''Category 4 (Highly Sensitive Data)''' || '''1, 2''' || Technically, category 4 often involves highly sensitive data, such as explicit identifiers, that will be removed in the process of aggregation. We include it here for the sake of completeness.
|}
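
The matrix can also be expressed as a simple lookup. This is a minimal sketch that only encodes the rows of the table above; the function name is hypothetical, and the meaning of each aggregation level is taken as defined elsewhere on this page.

```python
# Aggregation levels listed in the matrix for each Data Collection Category.
LEVELS_BY_CATEGORY = {
    1: {1, 2, 3},  # Technical
    2: {1, 2, 3},  # Interaction
    3: {1, 2},     # Web Activity
    4: {1, 2},     # Highly Sensitive Data
}


def level_listed_for_category(category: int, aggregation_level: int) -> bool:
    """Return True if the matrix lists this aggregation level for the category."""
    if category not in LEVELS_BY_CATEGORY:
        raise ValueError(f"unknown Data Collection Category: {category}")
    return aggregation_level in LEVELS_BY_CATEGORY[category]
```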
'''How do we characterize the sensitivity of metrics?'''

Most metrics are not sensitive in themselves. However, a metric that indicates or directly implies something about revenue is considered “sensitive”. Example: search counts.
 
'''Dataset Publishing Process'''

The goal of this process is to (1) make the “easy” (that is, safe) data publishing requests relatively frictionless, (2) have guard rails in place so we don’t publish something that exposes us or our users to risk, and (3) ensure that the dataset publishing request process closely matches other processes that are familiar to the data stewards.
 
Having a dataset published requires filing a bug. Use the nomenclature defined in the preceding sections to answer the following four questions. If the answer to all of them is “no”, you may publish. A “yes” to any of them means extra review is required.
 
* Is the level of aggregation 3 or higher?
* Are there any Data Collection Category 3 (web activity) or 4 (highly-sensitive) dimensions?
* Do any of the dimensions or metrics include sensitive data?
* Are there any data included that do not have a corresponding data review for collection? Please link to relevant data review(s).
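
The four questions above amount to a simple any-yes check. The sketch below is illustrative only: the `PublishingRequest` type and its field names are hypothetical and do not correspond to fields in the actual bug form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PublishingRequest:
    """Hypothetical summary of the answers given in a publishing bug."""
    aggregation_level: int
    dimension_categories: List[int]  # Data Collection Category per dimension
    has_sensitive_data: bool         # any sensitive dimension or metric
    has_unreviewed_data: bool        # any data lacking a collection data review


def needs_extra_review(request: PublishingRequest) -> bool:
    """Return True if any of the four questions is answered "yes"."""
    answers = [
        request.aggregation_level >= 3,
        any(c in (3, 4) for c in request.dimension_categories),
        request.has_sensitive_data,
        request.has_unreviewed_data,
    ]
    return any(answers)
```

For example, a request at aggregation level 2 with only category 1 and 2 dimensions, no sensitive data, and complete data reviews would return `False`, meaning no extra review is triggered by the checklist itself.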
 
A data steward will then be assigned to the bug, either by the requester or as part of bug triage, to double-check that the questions have been answered correctly and that publishing the data introduces no confounding factors.
 
Once the request is approved, data engineering will do the implementation work:
* Write (or review) the query
* Schedule it to update at the desired frequency
* Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the review bug described above

Once the dataset has been published, it will be announced on the new Data @ Mozilla blog. It will also be added to https://docs.telemetry.mozilla.org/datasets/.