MetricsDataPing

Measure adoption, retention, and aggregated search counts by engine. Record possible explanatory dimensions using a statistically unbiased and sound approach. Comparable projects that collect user data are TestPilot and Telemetry. Participants in those programs are self-selected, and it has been demonstrated that data retrieved from TestPilot is biased and not representative of the Firefox population.


'''Note''': What is described below is the current proposal from the Metrics Team, but it has serious privacy problems, which are discussed at the bottom of this page. There is hope that the necessary data can be gathered entirely anonymously. The information below should therefore be considered subject to change.

= Data Elements =
 
 


A directory of elements collected by the various data collection pings (Metrics Data Collection Ping, Blocklist, AUS Ping, Version Check Ping, Services AMO, Telemetry) can be found here: [https://metrics.etherpad.mozilla.org/ep/pad/view/ro.9e6LG/latest Data Collection Paths]<br>
The list and definitions of data elements in the Metrics Ping are here: [https://metrics.etherpad.mozilla.org/ep/pad/view/ro.9$yFtH/latest MDP Data Point Descriptions]


== Submission ID ==
Under the current proposal, when it is time to submit data, the client collects the latest data and appends it to the cumulative view of the data stored locally in the Profile directory. The client then generates a new document ID for this submission and posts it to the data.mozilla.com service along with a header indicating the previously submitted document ID. On the server side, the previous document ID is used to delete that document, and the new document is stored under the new ID. If the server returns a success response, the client saves the ID of the document just submitted as the previous document ID; if it does not receive a success response, it will retry later using the same two document IDs. This ensures that stale data for the installation is not left on the server.
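A minimal sketch of this exchange from the client's point of view, written in TypeScript. The endpoint path, the "X-Previous-Document-ID" header name, and the storage structure are illustrative assumptions, not the actual wire format.

<pre>
// Sketch of the document-ID rotation described above (names are illustrative).
import { randomUUID } from "crypto";

interface LocalStore {
  previousDocumentId: string | null;          // ID of the last successfully stored document
  cumulativeData: Record<string, unknown>;    // cumulative metrics kept in the profile directory
}

async function submitMetrics(store: LocalStore, latest: Record<string, unknown>): Promise<void> {
  // Append the latest data to the locally stored cumulative view.
  store.cumulativeData = { ...store.cumulativeData, ...latest };

  // Generate a fresh document ID for this submission.
  const newId = randomUUID();

  const response = await fetch(`https://data.mozilla.com/submit/${newId}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Hypothetical header telling the server which previous document to delete.
      "X-Previous-Document-ID": store.previousDocumentId ?? "",
    },
    body: JSON.stringify(store.cumulativeData),
  });

  if (response.ok) {
    // Only on success does the new ID become the "previous" ID; otherwise the
    // same pair of IDs is reused on the next attempt, so no stale document is
    // left behind on the server.
    store.previousDocumentId = newId;
  }
}
</pre>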


There is a section in the discussion page for the topic of the old UUID strategy originally proposed:<br>[[Talk:MetricsDataPing#Update_from_User:DEinspanjer]]<br>[[Talk:MetricsDataPing#Opinions_from_User:BenB_2]]<br>
That opinion and the related discussion are also included below.
= Client-side =
 


The meta bug for the client-side measurement system can be found here: [https://bugzilla.mozilla.org/show_bug.cgi?id=718066 https://bugzilla.mozilla.org/show_bug.cgi?id=718066]


= Server-side =


*Clients will POST data to the configured URL not more than once every 24 hours (a minimal throttle check is sketched after this list).
*''There will be UI elements inside of Firefox that allow users to delete all their data (remote and local).''
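A minimal sketch of how a client might enforce the 24-hour submission interval from the first bullet above, in TypeScript; how the last-submission timestamp is persisted is an assumption.

<pre>
// Sketch: only submit if at least 24 hours have passed since the last POST.
const SUBMIT_INTERVAL_MS = 24 * 60 * 60 * 1000;

function shouldSubmit(lastSubmittedAtMs: number | null, nowMs: number = Date.now()): boolean {
  if (lastSubmittedAtMs === null) {
    return true;                                    // never submitted before
  }
  return nowMs - lastSubmittedAtMs >= SUBMIT_INTERVAL_MS;
}

// Example: a profile that last submitted 23 hours ago is skipped.
console.log(shouldSubmit(Date.now() - 23 * 60 * 60 * 1000));  // false
</pre>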


= Data Access Policies =


Members of the Metrics Team can access this data for strategic advisory, business operations, and analytical research purposes. A more comprehensive set of policies applicable to data at Mozilla in general needs to be determined, presumably in conjunction with the UDC.<br>
* Must have MPT-VPN access


= User Data =


Complete transparency in the data we collect and how it is used is achieved through blog posts and easy access to UI elements to turn data collection on or off. As a first step:<br>
In subsequent versions, compare user installation data to segments of the population by dimensions (e.g. OS, hardware, version). For example, given a user's OS and number of extensions, how does their startup time compare to a 'peer group' sharing similar characteristics?<br>
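One hedged sketch of the kind of 'peer group' comparison described above, in TypeScript. The segmentation key and the extension-count buckets are illustrative assumptions, not the actual analysis.

<pre>
// Sketch: where a user's startup time falls within a peer group of similar installations.
interface Installation {
  os: string;
  extensionCount: number;
  startupTimeMs: number;
}

function peerGroupKey(i: Installation): string {
  // Bucket extension counts so that peer groups stay reasonably large (illustrative bucketing).
  const bucket = i.extensionCount < 5 ? "0-4" : i.extensionCount < 15 ? "5-14" : "15+";
  return `${i.os}|${bucket}`;
}

function startupPercentile(user: Installation, population: Installation[]): number {
  const peers = population.filter(p => peerGroupKey(p) === peerGroupKey(user));
  if (peers.length === 0) {
    return NaN;
  }
  // Fraction of peers with a slower startup: higher means this installation is comparatively fast.
  const slower = peers.filter(p => p.startupTimeMs > user.startupTimeMs).length;
  return (slower / peers.length) * 100;
}
</pre>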


= UI Implementation =


The Metrics Team is consulting with UX to determine the proper UI implementation. Given the opt-out requirement, UX proposes an opt-out check box in the preferences pane and notifying users through non-modal, non-chrome channels (blog posts, privacy policies, download pages).
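A hedged sketch of how such an opt-out could gate data submission: a single boolean preference, checked before any ping is sent. The preference name and the interface are hypothetical, for illustration only.

<pre>
// Sketch: data submission gated by a hypothetical opt-out preference.
interface Preferences {
  getBool(name: string, defaultValue: boolean): boolean;
}

function isDataReportingEnabled(prefs: Preferences): boolean {
  // Hypothetical pref name; under an opt-out scheme the default is "enabled",
  // and unchecking the box in the preferences pane would set it to false.
  return prefs.getBool("metrics.dataping.enabled", true);
}
</pre>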
see: https://bugzilla.mozilla.org/show_bug.cgi?id=707970


= Security Reviews =


Review for Bagheera, the back-end server that receives and stores user data: [https://bugzilla.mozilla.org/show_bug.cgi?id=655746 https://bugzilla.mozilla.org/show_bug.cgi?id=655746]


= Privacy =
 
== Opt-in vs. Opt-out ==
 
=== Layman's Explanation ===
 
Opt-in: assumes that, by default, each user is not in the data collection sample; the user is invited to join via UX elements, so the user must take action to opt in to the data collection process for some minimal period.
 
Opt-out: assumes that all users are included in the sample, for maximum data coverage and full representativeness; users can remove themselves via an opt-out mechanism.
 
=== How are they different? ===
 
*Superficially the consequences may seem the same, but the validity of any conclusions drawn from the data under the two approaches is quite different.
*''The attempt to reach a representative set of user data'' is the key differentiator between the approaches. In standard surveys, the responses of those who volunteer to take a survey are likely to be quite different from those of a rigorously administered survey. Such self-selection bias is also a key weakness of online data collection.
 
=== What difference does it make for the statistic? ===
 
*We want to acquire representative data and analyze it for the ‘de-averaged’ benefit of multiple, but still large, sub-populations of users.
*Each sub-population requires insights and actions that are not of the ‘one size fits all’ variety.
 
There is a section in the discussion page for the topic of opt-out:<br>[[Talk:MetricsDataPing#Opinions_from_User:BenB]]<br>That opinion and the related discussion are also included below.
 
=== What difference does it make for the user? ===
 
* The argument "if they don't want to, they can opt out" is a fallacy, because most users will not know about this data gathering. They cannot opt out of something they do not know about, because they have never been asked.
 
* The difference between opt-in and opt-out is that opt-out includes many users who do not wish to participate, thereby violating their wishes and rights.
 
* So, if the argument is that the resulting data will be different, then yes, it will be different, because it includes users who do not wish to be included but are included anyway under an opt-out scheme. In fact, if those users actually do opt out, the data changes again, so the same "the statistic is biased" argument applies. If the argument is that opt-out yields average data, that is only because many users' wishes are violated.
 
* This is why European and German law ''requires'' opt-in for any gathering of data about the user.
 
== User identification ==
 
=== UUID is PII ===
 
Definition:
 
"Personally Identifiable Information (PII), as used in information security, is information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual."
 
A stable UUID for a user or user device is by definition always PII and never anonymous.
 
It is therefore regulated by European and German data protection laws and normally forbidden.
 
From a user's standpoint, it is irrelevant whether and how Mozilla uses the data; what matters is that the data is sent at all. There can be:
* interceptions during transmission
* other logging server components before the server component discussed here
* legal requests by various governments
* server break-ins, or
* policy changes on the Mozilla side.
 
Having a UUID would allow, for example, tracking all of my dynamic IP addresses over time and building a profile of me when combined with access logs. If I use a notebook or mobile browser, it would even allow tracking the places I go, based on IP geolocation / whois data.
 
The user has no way to verify whether any of the above (break-in, interception, intended or lawful or not) is happening, and that in itself is already a privacy violation. So it is irrelevant what the intended usage is; only what is theoretically possible matters. The above must be impossible - not just "We won't do it, we promise!", but impossible.
 
=== Google Chrome ===
 
Google Chrome used a UUID for each browser installation. It was perceived as a serious privacy threat and became a topic in the mainstream press in Germany, including the largest newspapers. Eventually, Google dropped the UUID because of the PR problems it caused.
 
The question of whether Firefox uses a UUID ''will'' be picked up by the press, and the result will be negative for Firefox. This is not a guess, as history shows.
 
=== Perception ===
 
Germany and Europe are very privacy-aware, much more so than the US. Firefox has a big and loyal following there, in large part because Firefox claims to do what users want and to be privacy-aware. A UUID will be considered highly offensive in these countries and will cost Firefox market share.
 
=== Alternative ===
 
Instead of building the history on the server, the client should build the history and submit only the results. For example, if you need to know whether things improved, the client can keep some old data locally and submit "12 crashes last week. One week before: 12% more. One year before: 50% less."
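A minimal sketch of this client-side alternative in TypeScript: the client keeps a small local history and reports only the current figure plus relative changes, never the raw history itself. The field names and period choices are illustrative assumptions.

<pre>
// Sketch: report the current week's crash count plus relative changes computed locally.
interface CrashHistory {
  lastWeek: number;
  weekBefore: number;
  sameWeekLastYear: number;
}

function relativeChange(current: number, past: number): string {
  if (past === 0) {
    return "n/a";
  }
  const pct = Math.round(((current - past) / past) * 100);
  return `${pct >= 0 ? "+" : ""}${pct}%`;
}

function buildReport(h: CrashHistory): { crashesLastWeek: number; vsWeekBefore: string; vsLastYear: string } {
  return {
    crashesLastWeek: h.lastWeek,
    vsWeekBefore: relativeChange(h.lastWeek, h.weekBefore),
    vsLastYear: relativeChange(h.lastWeek, h.sameWeekLastYear),
  };
}

// Example: 12 crashes last week, 14 the week before, 24 in the same week last year
// -> { crashesLastWeek: 12, vsWeekBefore: "-14%", vsLastYear: "-50%" }
console.log(buildReport({ lastWeek: 12, weekBefore: 14, sameWeekLastYear: 24 }));
</pre>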
 
=== What to avoid ===
The submission should not include exact historic numbers either, because those, too, would make it possible to piece submissions together and again build a history of IP addresses for a given user. Similarly, the exact time of the previous submission would allow submissions to be linked and must not be sent; only the day should be submitted (2011-02-12, not minutes or seconds). It must be impossible to match two submissions together, even when considering several parameters in combination; see http://panopticlick.eff.org/ .
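A small sketch of this kind of coarsening in TypeScript: timestamps truncated to the calendar day and counts rounded into coarse buckets so that individual submissions are harder to link. The bucket size is an illustrative assumption.

<pre>
// Sketch: coarsen values before submission so they cannot be used to link submissions together.
function dayOnly(date: Date): string {
  // Keep only the calendar day, e.g. "2011-02-12" - never minutes or seconds.
  return date.toISOString().slice(0, 10);
}

function roundCount(n: number, bucketSize: number = 5): number {
  // Report counts in coarse buckets rather than exact historic numbers.
  return Math.round(n / bucketSize) * bucketSize;
}

console.log(dayOnly(new Date("2011-02-12T14:23:05Z")));  // "2011-02-12"
console.log(roundCount(12));                             // 10
</pre>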
 
=== Submission ID ===
 
The current proposal replaced the stable per-profile UUID with a per-submission ID. However, the previous submission ID is also transferred, which allows the server to trivially match submissions together and still build a unique identifier on the server side. (Again, whether the server actually does that is immaterial.) So the submission ID proposal has the same privacy consequences discussed above.