De-Identification and Open Data Wiki

Mozilla is an open company, and we publish data for a variety of reasons. In some cases, we publish our own data from surveys and metrics. In other cases, researchers ask us for access to our data. Publishing this data is important, but doing so must not compromise any individual's privacy: individuals must not be uniquely identifiable or re-identifiable from what we or others publish. Please do not release any data until you've completed all of the steps on this page and have been through a review with Privacy and Legal about your proposed release.

Checklist for Releasing Data

If your data relates to a product (like Firefox), check that the release works for Product Management and Product Marketing for that product, if you haven't already done so.
Remove the things in the "Always Remove" section.
Determine whether any of the issues in the "What Could Go Wrong?" section apply to your data.
Look at the "Tips to Avoid Things Going Wrong" and "More Control Knobs" sections and try to apply them to your data.
Once you think you have safe data, follow the steps in the "How to Request a Review" section.
Always document your de-identification processes.

Always Remove

Any unique or personal identifier:
- Name, address, email address, phone number, SSN
- Credit card number, other payment information
- IP address, device ID, MAC address
- Any truncated version of the foregoing (such as last 4 digits of SSN or credit card numbers)
- Geolocation data tied to an individual
- Any other information that could be used to identify an individual
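
As a first pass, the removal step above could be sketched as follows. This is a minimal illustration, not a complete de-identification tool: the column names in `ALWAYS_REMOVE` and the helper `drop_identifier_columns` are hypothetical and would need to match your own dataset. Dropping these columns is only the starting point; the remaining columns can still be identifying.

```python
# Hypothetical column names; adjust them to match your own dataset.
ALWAYS_REMOVE = {
    "name", "address", "email", "phone", "ssn", "ssn_last4",
    "credit_card", "card_last4", "ip_address", "device_id",
    "mac_address", "latitude", "longitude",
}


def drop_identifier_columns(rows, blocked=ALWAYS_REMOVE):
    """Return copies of the rows without any blocked column (case-insensitive)."""
    blocked_lower = {name.lower() for name in blocked}
    return [
        {key: value for key, value in row.items()
         if key.lower() not in blocked_lower}
        for row in rows
    ]


rows = [
    {"email": "a@example.com", "ip_address": "203.0.113.7", "os": "Linux", "rating": 4},
    {"email": "b@example.com", "ip_address": "203.0.113.9", "os": "macOS", "rating": 5},
]
cleaned = drop_identifier_columns(rows)
```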

What Could Go Wrong?

Things that are identifiable that may not be obvious:

Free text fields
A combination of data points that each seem non-identifiable on their own can become identifiable when combined
- If any row or column of your data has only a few users in it, that data could be identifiable.
- Be wary of how your dataset could be combined with another dataset (either inside or outside Mozilla) to create identifiability. Set a minimum floor on how many individuals fall into each data point to minimize this.
- Be wary of small intersections between large datasets (ex: brown-eyed blond women).
- Be wary of granular location or temporal information (ex: total salary paid to all employees published every day, with only one new hire on a particular day).
- Be wary of small tails (ex: higher ages may be populated by small numbers of individuals).
- Any data that turns out to have any analytic density is likely to be at risk of re-identification using external complementary data.
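
One way to look for small intersections is to count how many rows share each combination of values and flag the combinations that fall below a floor. The sketch below assumes the data is a list of dictionaries; the function name `small_groups`, the column names, and the floor of 3 are illustrative choices, not values taken from this page.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, floor):
    """Return each combination of values shared by fewer than `floor` rows."""
    counts = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return {combo: n for combo, n in counts.items() if n < floor}


# Illustrative data: each attribute is common on its own,
# but one intersection contains a single person.
rows = (
    [{"eye_color": "brown", "hair_color": "brown"}] * 5
    + [{"eye_color": "blue", "hair_color": "blond"}] * 4
    + [{"eye_color": "brown", "hair_color": "blond"}] * 1
)
risky = small_groups(rows, ["eye_color", "hair_color"], floor=3)
```

Here `risky` contains only the brown-eyed, blond-haired combination, even though brown eyes and blond hair each appear in at least five rows. A check like this only covers the columns you pass in; it cannot account for external datasets that someone might join against yours.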

Tips to Avoid Things Going Wrong

Don't just remove the unique identifier column and call the data de-identified - consider the notes in the "What Could Go Wrong" section above.
Minimize the privacy impact of your released dataset by balancing the need for open access against individual privacy and expectations:
- Consider whether you really need to release the data or if we can do the analysis internally, releasing conclusions and results, rather than raw data.
- Know whether you plan to release individual records and if so, have your survey questions and answers reviewed in advance.
- If you want to release data, try to gather a large set when you collect it.
  - In general, try to have data points consisting of 100 people or more.
- If your data set has free text fields, don't release the set without reviewing each field. Consider pulling out some representative, non-identifiable fields rather than releasing the entire set.
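
A simple pattern scan can help prioritize which free text responses to look at first. The sketch below is an assumption-laden aid, not a substitute for reading each field: the regular expressions are deliberately rough, the function name `flag_free_text` is hypothetical, and a response that passes the scan can still name or describe an individual.

```python
import re

# Rough patterns for illustration only; they will miss many real cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def flag_free_text(responses):
    """Return (index, reasons) for each response that looks like it holds contact details."""
    flagged = []
    for index, text in enumerate(responses):
        reasons = []
        if EMAIL.search(text):
            reasons.append("email")
        if PHONE.search(text):
            reasons.append("phone")
        if reasons:
            flagged.append((index, reasons))
    return flagged


responses = [
    "Love the new tabs",
    "Contact me at jane@example.com",
    "Call 555-123-4567 about the crash in Firefox 57",
]
flagged = flag_free_text(responses)
```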

More Control Knobs

Release fewer variables.
Adjust the bins in each variable such that the intersections have higher counts.
Make the time window larger or less recent.
Generate a simulation of your data that has the same analytic properties.
For demographics, include an appropriate "decline to state" option.
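
To illustrate adjusting bins, the sketch below groups ages into fixed-width bins and folds the sparse upper tail into a single open-ended bin. The function name `age_bin`, the bin width of 10, and the cutoff of 60 are assumptions chosen for the example; suitable values depend on how your data is distributed.

```python
from collections import Counter


def age_bin(age, width=10, top=60):
    """Map an age to a bin label; ages at or above `top` share one open-ended bin."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = [23, 27, 31, 34, 38, 45, 61, 72, 88]
counts = Counter(age_bin(age) for age in ages)
```

In this toy data the three oldest people end up sharing the "60+" bin instead of each sitting alone, but the "40-49" bin still holds a single person, so wider bins or fewer variables would be needed before release.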

How to Request a Review

Use the Project Kickoff Form to file a privacy bug (https://bugzilla.mozilla.org/form.moz-project-review).
Attach the data file (raw data) to the request and describe the data by answering the following questions:
- Where does this data come from?
- Who and what is in it?
- What fraction of the data do you want to release and have approved?
- Why are we releasing it?
- What are we hoping people will do with it?
- What is the process you're using for de-identification?
Expect to discover that you've missed a couple of things, so plan for a couple of weeks to get them corrected.

Definitions

De-identification = Filtering a data set to minimize the risk of re-identification.
Fingerprinting = Selecting a group of attributes that together make up a very distinctive data point.
Identification or re-identification = Identifying some particular thing in the real world based on the data in the data set. Sometimes this is a person. Sometimes it's a device that's closely tied to a person.
