Privacy/HowTo/Deidentify: Difference between revisions
Revision as of 23:37, 12 June 2012
De-Identification
Mozilla is an open company, and we publish data for a variety of good reasons. In some cases, we publish our own data from surveys and metrics. In other cases, researchers ask us for access to our data. While publishing this data is important, we need to make sure that doing so doesn't compromise any individual's privacy: individuals must not be uniquely identifiable or re-identifiable from what we or others publish. Please do not release any data until you've completed all of the steps on this page and confirmed that none of the problems described below apply to your data.
Checklist for Releasing Data
- Remove the things in the always remove section.
- Look at the things that can go wrong.
- Look to see whether any apply to your data.
- Look at the tips and control knobs sections.
- Try to apply them to your data.
- Once you think you have safe data, follow the steps in the How to Request a Review section.
Always Remove
- Name, address, email address, phone number, IP address, credit card number, Social Security number (SSN), the last four digits of a credit card number or SSN, and any other unique identifier.
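As a minimal sketch of this first step, the snippet below drops direct-identifier fields from a list of records. The field names in `DIRECT_IDENTIFIERS` and the sample record are hypothetical; adjust them to match your own data set. Passing this step does not make the data anonymous, because the remaining fields can still combine into an identifier.

```python
# Hypothetical field names; adjust to match the columns in your own data set.
DIRECT_IDENTIFIERS = {
    "name", "address", "email", "phone", "ip_address",
    "credit_card", "ssn", "cc_last4", "ssn_last4", "user_id",
}

def strip_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without any direct-identifier fields."""
    return [
        {key: value for key, value in record.items()
         if key.lower() not in identifiers}
        for record in records
    ]

# Illustrative record only, not real data.
rows = [
    {"name": "A. Example", "email": "a@example.com",
     "country": "CA", "browser": "Firefox"},
]
cleaned = strip_direct_identifiers(rows)
print(cleaned)  # [{'country': 'CA', 'browser': 'Firefox'}]
```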
Things That Can Go Wrong
- Your survey questions can be flawed.
- Free text can be identifiable.
- Several separate pieces of information which alone don't look identifying may add up to an identifier.
- If your data set has any buckets that only contain a few users, that is a lot like a unique identifier.
- Be careful of small intersections between otherwise large buckets (ex: brown-eyed blond women).
- Data that looks aggregate may not really be aggregate (ex: if the total salary paid to all employees is published every day and only one person is hired on a particular day, the change in the total reveals that person's salary).
- The time window can be too granular, depending on the data set (recency and extent dimensions).
- Outside information can make otherwise opaque information clear.
- Any data that turns out to have any analytic density is likely to be at risk of re-identification using external complementary data.
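Two of these failure modes are easy to see with made-up numbers. The sketch below is illustrative only: the records, counts, and payroll totals are invented. It first shows two large buckets whose intersection is tiny, then shows how subtracting two consecutive "aggregate" totals exposes one person's value.

```python
from collections import Counter

# Hypothetical records: each attribute alone falls into a large bucket.
people = (
    [{"eyes": "brown", "hair": "brown"}] * 500
    + [{"eyes": "blue", "hair": "blond"}] * 300
    + [{"eyes": "brown", "hair": "blond"}] * 2
)

eyes = Counter(p["eyes"] for p in people)
hair = Counter(p["hair"] for p in people)
both = Counter((p["eyes"], p["hair"]) for p in people)

print(eyes["brown"])             # 502 people: looks safe
print(hair["blond"])             # 302 people: looks safe
print(both[("brown", "blond")])  # 2 people: close to a unique identifier

# Hypothetical daily payroll totals with exactly one new hire on Tuesday.
total_monday = 1_000_000
total_tuesday = 1_085_000
new_hires_tuesday = 1

if new_hires_tuesday == 1:
    # The difference between two aggregates is a single person's salary.
    leaked_salary = total_tuesday - total_monday
    print(leaked_salary)         # 85000
```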
Tips to Avoid Things Going Wrong
- Know whether you plan to release individual records and if so, have your survey questions reviewed in advance.
- If you want to release data, then when you gather it, try to gather a really big set.
- Don't just remove the unique identifier column and call it anonymous.
- In general, if you try to have buckets of 100 people or more, you're probably not making a terrible mistake.
- Consider doing more analysis ourselves so that we can release conclusions and results, rather than raw data.
- It's scary to release an entire data set that contains free text fields, but OK to pull some out that are not identifiable and release them separately.
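The bucket-size tip can be checked mechanically before a review is requested. The following is a rough sketch, not a complete privacy test: `small_buckets`, the field names, and the sample survey are hypothetical, and the threshold of 100 is the rule of thumb from the tip, not a guarantee of safety.

```python
from collections import Counter

MIN_BUCKET_SIZE = 100  # rule of thumb from the tips, not a guarantee

def small_buckets(records, fields, minimum=MIN_BUCKET_SIZE):
    """Return {combination: count} for each combination of `fields`
    that appears fewer than `minimum` times in `records`."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return {combo: n for combo, n in counts.items() if n < minimum}

# Invented survey data for illustration.
survey = (
    [{"country": "US", "os": "Windows"}] * 250
    + [{"country": "US", "os": "Linux"}] * 120
    + [{"country": "IS", "os": "Linux"}] * 3
)

risky = small_buckets(survey, ["country", "os"])
print(risky)  # {('IS', 'Linux'): 3}
```

An empty result only means that the chosen fields meet the threshold; other field combinations and outside information still need to be considered.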
Control Knobs
- Release fewer variables.
- Adjust the bins in each variable such that the intersections have higher counts.
- Make the time window larger or less recent.
- Generate a simulation of your data that has the same analytic properties.
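The bin and time-window knobs can be sketched as follows. The helper names (`age_bin`, `month_of`), the bin widths, and the three records are assumptions made for illustration. Widening the age bins from 5 to 10 years and reporting months instead of days merges three singleton buckets into one bucket of three.

```python
from collections import Counter
from datetime import date

def age_bin(age, width):
    """Map an age to a label such as '30-39' for the given bin width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def month_of(day):
    """Coarsen a date to year and month, e.g. '2012-06'."""
    return day.strftime("%Y-%m")

# Invented records for illustration.
records = [
    {"age": 31, "seen": date(2012, 6, 1)},
    {"age": 34, "seen": date(2012, 6, 9)},
    {"age": 38, "seen": date(2012, 6, 20)},
]

# Fine-grained release: 5-year age bins and exact dates.
fine = Counter((age_bin(r["age"], 5), r["seen"].isoformat()) for r in records)
# Coarser release: 10-year age bins and month-level dates.
coarse = Counter((age_bin(r["age"], 10), month_of(r["seen"])) for r in records)

print(min(fine.values()))  # 1
print(coarse)              # Counter({('30-39', '2012-06'): 3})
```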
How to Request a Review
- File a legal bug with request type "privacy" (https://bugzilla.mozilla.org/enter_bug.cgi?product=Legal&format=legal) and cc both Tom Lowenthal and Gilbert FitzGerald.
- Attach the data file (raw data) to the request and describe the data by answering the following questions:
- Where does this data come from?
- Who and what is in it?
- What fraction do you want us to approve?
- Why are we releasing it?
- What are we hoping people will do with it?
- Expect to discover that you've missed a couple of things, so plan for a couple of weeks to get them corrected.
Definitions
- Anonymization = Filtering a data set such that re-identification is impossible.
- Fingerprinting = Selecting a set of attributes which together are very distinctive, even if we don't yet know what they connect to in the real world. Ex: If we know a person's height, weight, body mass, hair color, facial measurements, etc., we can build a machine that will identify that person if it sees them once. Fingerprinting is the basis of cold cases.
- Identification or re-identification = Identifying some particular thing in the real world based on the data in the data set. Normally this is a person. Sometimes it's a device.