MetricsDataPing
Revision as of 17:21, 1 February 2012
Description
Measure adoption, retention, and aggregated search counts by engine. Record possible explanatory dimensions using a statistically unbiased and sound approach. Comparable projects that collect user data are TestPilot and Telemetry. Participants in these programs are self-selected. It has been demonstrated that data retrieved from TestPilot is biased and not representative of the Firefox population.
Note: The description below is the current proposal from the metrics team, but it has serious privacy problems, which are discussed at the bottom. There is hope that the necessary data can be gathered entirely anonymously. The information below should therefore be considered subject to change.
Data Elements
A directory of elements collected by the various data collection pings (Metrics Data Collection Ping, Blocklist, AUS Ping, Version Check Ping, Services AMO, Telemetry) can be found here: Data Collection Paths
The list and definitions of data elements in the Metrics Ping is here MDP Data Point Descriptions
Submission ID
Under the current proposal, when it is time to submit data, the client will collect the latest data and append it to the cumulative view of the data stored locally in the Profile directory. The client will then generate a new document ID for this submission and post it to the data.mozilla.com service along with a header indicating the previously submitted document ID. On the server side, the previous document ID is used to delete that document, and the new document is stored with the new ID. If the server returns a success response to the client, the client saves the ID of the document just submitted as the previous document ID. If the client does not receive a success response, it will attempt to submit later using the same two document IDs. This makes sure that we don't leave old data for the installation hanging around.
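The exchange described above can be sketched as follows. This is a minimal in-memory illustration, not the actual client or server code: the class names `Client` and `Server`, the `pending_id` field, and the use of a plain dictionary for the cumulative data are all assumptions made for the example.

```python
import uuid


class Server:
    """In-memory stand-in for the storage service (hypothetical)."""

    def __init__(self):
        self.documents = {}

    def submit(self, new_id, previous_id, payload):
        # Delete the document stored under the previous ID, if there is one.
        if previous_id is not None:
            self.documents.pop(previous_id, None)
        # Store the new document under the new ID.
        self.documents[new_id] = payload
        return True  # stands in for a success response


class Client:
    """Sketch of the client-side bookkeeping (hypothetical)."""

    def __init__(self, server):
        self.server = server
        self.previous_id = None  # ID of the last successfully submitted document
        self.pending_id = None   # ID generated for a submission not yet confirmed
        self.cumulative = {}     # cumulative view kept in the profile directory

    def submit(self, latest):
        # Fold the latest data into the cumulative view (simplified to a dict update).
        self.cumulative.update(latest)
        # On a retry, reuse the same new ID so both IDs stay the same.
        if self.pending_id is None:
            self.pending_id = str(uuid.uuid4())
        ok = self.server.submit(self.pending_id, self.previous_id, dict(self.cumulative))
        if ok:
            # Only after success does the new ID become the previous ID.
            self.previous_id = self.pending_id
            self.pending_id = None
        return ok
```

After two successful calls to `Client.submit`, the server in this sketch holds exactly one document for the installation, stored under the most recent ID.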
There is a section in the discussion page for the topic of the old UUID strategy originally proposed:
Talk:MetricsDataPing#Update_from_User:DEinspanjer
Talk:MetricsDataPing#Opinions_from_User:BenB_2
Client-side
The meta bug for the client-side measurement system can be found here: https://bugzilla.mozilla.org/show_bug.cgi?id=718066
Sample JSON output as received on the Mozilla server side:
2011/11/04:
{
  "ver": 1,
  "uuid": "e8a583fe-98ec-45be-9e44-96a23759067a",
  "lastPingTime": 1320340265,
  "thisPingTime": "2011-11-04T19:30:11.948Z",
  "currentTime": "2011-11-04T19:30:11.962Z",
  "env": {
    "reason": "idle-daily",
    "OS": "Linux",
    "appID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
    "appVersion": "10.0a1",
    "appVendor": "Mozilla",
    "appName": "Firefox",
    "appBuildID": "20111104162615",
    "appABI": "x86_64-gcc3",
    "appUpdateChannel": "default",
    "appDistribution": "default",
    "appDistributionVersion": "default",
    "platformBuildID": "20111103103700",
    "platformVersion": "10.0a1",
    "locale": "en-US",
    "name": "Linux",
    "version": "2.6.38-12-generic",
    "cpucount": 4,
    "memsize": 7889,
    "arch": "x86-64"
  },
  "simpleMeasurements": {
    "uptime": 0,
    "main": 3,
    "firstPaint": 629,
    "sessionRestored": 502,
    "isDefaultBrowser": false,
    "crashCountSubmitted": 1,
    "profileAge": 31,
    "addonCount": 2,
    "addons": [
      {
        "id": "crashme@ted.mielczarek.org",
        "appDisabled": false,
        "version": "0.3",
        "installDate": "2011-10-25T15:02:03.000Z",
        "updateDate": "2011-10-25T15:02:03.000Z"
      },
      {
        "id": "mozmetrics@mozilla.org",
        "appDisabled": false,
        "version": "0.1",
        "installDate": "2011-10-11T14:59:08.000Z",
        "updateDate": "2011-10-26T13:26:45.000Z"
      }
    ]
  },
  "events": {
    "search": {
      "abouthome": { "Google": 1 },
      "searchbar": { "Google": 3, "Amazon.com": 1, "Other": 1 },
      "urlbar": { "Google": 1 }
    },
    "sessions": {
      "completedSessions": 16,
      "completedSessionTime": 829,
      "completedSessionActiveTime": 535,
      "abortedSessions": 2,
      "abortedSessionTime": 7,
      "abortedSessionActiveTime": 15,
      "abortedSessionAvg": 4,
      "abortedSessionMed": 4,
      "currentSessionActiveTime": 10,
      "currentSessionTime": 20,
      "aboutSessionRestoreStarts": 0
    },
    "corruptedEvents": 0
  }
}
Server-side
- Clients will POST data to the configured URL not more than once every 24 hours.
- The first timer check should be one minute after startup.
- The POST data will consist of a JSON document containing a document ID and all the metrics that were collected since the last submission.
- The server side will receive the POST request and perform GeoIP location on the IP address. The raw IP will never be stored. The GeoIP data and submission timestamp will be added to the JSON document.
- The server will store the JSON document into a daily staging collection with all other documents received during that date, UTC.
- The server will return an HTTP response to the client indicating success of the storage and a document ID. For the initial feature release, this ID will be the same as the one passed in (i.e. an installation GUID). It can easily be changed to be new each time (i.e. a document GUID). If the ID is new, the client should store it to be returned on the next submission.
- In the future this response might also include instructions to the client for things such as changing timing or MetricsDataPing configuration.
- Asynchronously, the server will retrieve a document with the same document ID from the "latest" bucket if one exists and will insert/update the "latest" bucket with a merged document that does not include any metrics we wish to avoid collecting longitudinally per installation, such as GeoIP. This "latest" bucket is used to perform retention analysis since it will have the last submitted data for any installation even if it is no longer in use. We will set a retention policy for when these inactive installation documents shall be deleted from the "latest" bucket.
- Longitudinal data for 6 months (e.g. intensity of use) is stored cumulatively in the JSON objects indexed by GUID. Anything older than 6 months is deleted.
- At the end of the day, UTC, the server will aggregate all the documents submitted on that date and store the aggregate data (with no installation ID) in aggregate history tables in our data warehouse.
- There will be UI elements inside of Firefox that allow users to delete all their data (both remote and local).
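The server-side steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Bagheera implementation: `lookup_geo` is a hypothetical placeholder for a real GeoIP lookup, the `staging` and `latest` dictionaries stand in for the storage buckets, the field names `geo` and `submitted` are invented for the example, and the merge runs synchronously here although the proposal describes it as asynchronous.

```python
from datetime import datetime, timezone

staging = {}  # UTC date string -> list of documents received on that date
latest = {}   # document ID -> merged document without per-installation GeoIP

# Fields that should not be kept longitudinally per installation (assumed name).
LONGITUDINAL_EXCLUDED = {"geo"}


def lookup_geo(ip):
    """Hypothetical GeoIP lookup; a real service would query a GeoIP database."""
    return {"country": "ZZ"}  # placeholder result


def receive(doc_id, payload, ip, now=None):
    """Handle one POSTed document and return the response for the client."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    doc = dict(payload)
    doc["geo"] = lookup_geo(ip)  # the raw IP is used only here and never stored
    doc["submitted"] = now.isoformat()
    # Store the document in the staging collection for the current UTC date.
    staging.setdefault(now.date().isoformat(), []).append(doc)
    # Merge into the "latest" bucket, leaving out the excluded fields.
    merged = dict(latest.get(doc_id, {}))
    merged.update({k: v for k, v in doc.items() if k not in LONGITUDINAL_EXCLUDED})
    latest[doc_id] = merged
    return {"status": "ok", "id": doc_id}


def aggregate(day):
    """Aggregate one day's staging documents without any installation ID."""
    docs = staging.get(day, [])
    by_country = {}
    for doc in docs:
        country = doc["geo"]["country"]
        by_country[country] = by_country.get(country, 0) + 1
    return {"day": day, "submissions": len(docs), "by_country": by_country}
```

In this sketch the staging documents carry GeoIP data but no document ID, while the "latest" bucket is keyed by document ID but carries no GeoIP data.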
Data Access Policies
Members of the Metrics Team can access this data for strategic advisory, business operations, and analytical research purposes. A more comprehensive set of policies applicable to data at Mozilla in general needs to be determined, presumably in conjunction with the UDC.
Access control is currently based on the following criteria:
- Must be a member of the metrics team
- Must have an SSH account with LDAP integrated key
- Must have MPT-VPN access
User Data
Complete transparency in the data we collect and how it is used is achieved through blog posts and easy access to UI elements to turn off/on data collection. As a first step:
- about:metrics that gives raw dumps of the locally collected data and a history of data pings. Bug Page: https://bugzilla.mozilla.org/show_bug.cgi?id=719484
- Clear UI for opting out of data collection and getting to know more about data collected
In subsequent versions, compare user installation data to segments of the population by dimensions (e.g. OS, hardware, version). For example, given a user's OS and number of extensions, how does his/her startup time compare to a 'peer group' sharing similar characteristics?
UI Implementation
The Metrics Team is consulting with UX to determine the proper UI implementation. Given the opt-out requirement, UX proposes a check box to opt-out in the preferences pane and notifying users through non-modal and non-chrome channels (blog posts, privacy policies, download pages).
see: https://bugzilla.mozilla.org/show_bug.cgi?id=707970
Security Reviews
Review for Bagheera, the back end server that receives and stores user data: https://bugzilla.mozilla.org/show_bug.cgi?id=655746
Privacy
Opt-in vs. Opt-out
Layman's Explanation
Opt-in: assumes each user is not in the data collection sample as default position, the user is requested to join via UX elements – thus the user action is to opt-in to the data collection process for some minimal period
Opt-out: assumes that all users are included in the sample for maximum data coverage and full representativeness and thus we achieve full comprehensiveness – users are able to opt-out via an opt-out mechanism
How are they different?
- Superficially the consequences may seem the same, but the validity of any conclusions drawn from data under the two alternative approaches is quite different
- The attempt to reach a representative set of user data is the key differentiator between the approaches. In standard surveys one can easily see that the responses of those who offer or volunteer to take a survey are likely to be quite different from those of a rigorously administered survey. Such self-selection bias is also a key weakness of online data collection.
What difference does it make for the statistic?
- We want to acquire representative data and analyze it for the ‘de-averaged’ benefit of multiple but still large sub-populations of users
- Each subpopulation requires insights and actions that are not of the ‘one size fits all’ variety
There is a section in the discussion page for the topic of opt-out:
Talk:MetricsDataPing#Opinions_from_User:BenB
That opinion and discussion is also included below.
What difference does it make for the user?
- The argument "if they don't want to, they can opt-out" is a fallacy, because most users will not know about this data gathering. They cannot opt-out, if they don't know about it, because they have never been asked.
- The difference between opt-in and opt-out is that opt-out includes many of those users who do not wish to participate, therefore violating their wishes and rights.
- So, if the argument is that the result data will be different, then yes, it will be different, because it includes those users who do not wish to be included, but are included anyway in an opt-out scheme. In fact, if they actually do opt-out, then the data would be different again, therefore the same argument of "statistic is biased" applies. If the argument is that opt-out has average data, then only because many user wishes are violated.
- This is why European and German law *requires* opt-in for any gathering of data about the user.
User identification
UUID is PII
Definition:
"Personally Identifiable Information (PII), as used in information security, is information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual."
A stable UUID for a user or user device is by definition always PII and never anonymous.
It is therefore regulated by European and German data protection laws and normally forbidden.
Impact for user
From a user standpoint, it is irrelevant whether and how Mozilla uses the data, only that the data is sent. There can be
- interceptions during transmission
- other logging server components before the server component discussed here
- legal requests by various governments
- server break-ins, or
- policy changes on the Mozilla side.
Having a UUID would make it possible, for example, to track all my dynamic IP addresses over time and, when combined with access logs, to build a profile. If I have a notebook or mobile browser, it would even make it possible to track the places where I go based on IP geolocation / whois data.
The user has no way to verify whether any of the above (break-in, intercept, intended or lawful or not) is happening or not, and that already is a privacy violation. So, it's irrelevant what the intended usage was, only what is theoretically possible. The above must be impossible - not just "We won't do it, we promise!", but impossible.
Google Chrome
Google Chrome did use a UUID for each browser, and it was perceived as a serious privacy threat and was covered in the mainstream press in Germany, including the largest newspapers. Eventually, Google dropped the UUID because of the PR problems it caused.
This question of whether a UUID is used by Firefox /will/ be picked up by the press, and the result will be negative for Firefox. This is not a guess, as history shows.
Perception
Germany and Europe are very privacy-aware, much more so than the US. Firefox has a big and loyal following there, in large part because Firefox claims to do what users want and is privacy-aware. A UUID will be considered highly offensive in these countries and will cost Firefox market share.
Alternative
Instead of building the history on the server, the client should build the history and only submit results. E.g. if you need to know whether things improved, you can let the client keep some old data and submit "12 crashes last week. One week before: 12% more. One year before: 50% less."
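A client-side summary of this kind could look like the sketch below. The function names, the `history` keys, and the crash counts are made up for illustration; only the idea of sending relative changes instead of a server-side history comes from the text above.

```python
def describe(current, earlier):
    """Describe an earlier count relative to the current one, in whole percent."""
    if current == 0:
        return "n/a"  # no meaningful ratio against zero
    pct = round((earlier - current) / current * 100)
    if pct > 0:
        return f"{pct}% more"
    if pct < 0:
        return f"{-pct}% less"
    return "same"


def build_report(history):
    """Build the document to submit from history kept only on the client."""
    current = history["last_week"]
    return {
        "crashes_last_week": current,
        "one_week_before": describe(current, history["one_week_before"]),
        "one_year_before": describe(current, history["one_year_before"]),
    }


# Hypothetical local history; these numbers never leave the client as-is.
local_history = {"last_week": 12, "one_week_before": 15, "one_year_before": 6}
report = build_report(local_history)
```

With the hypothetical history above, the report states 12 crashes last week, "25% more" for the week before, and "50% less" for one year before. Rounding to whole percent also makes it harder to recover the exact earlier counts.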
What to avoid
The submission should not include exact historic numbers either, because they, too, would make it possible to piece the numbers together and again build a history of IP addresses for a given user. Similarly, the exact time of the previous submission would make it possible to piece submissions together and must not be submitted; only the day should be sent (2011-02-12, not minutes or seconds). It must be impossible to match two submissions together, even when considering several parameters as a collection; see http://panopticlick.eff.org/ .
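Reducing a timestamp to the day, as suggested above, could be done along these lines before anything is submitted. The function names are assumptions for this sketch; the epoch value is the `lastPingTime` from the sample JSON earlier on this page.

```python
from datetime import datetime, timezone


def coarsen_to_day(ts):
    """Reduce a timezone-aware timestamp to its UTC calendar day (YYYY-MM-DD)."""
    return ts.astimezone(timezone.utc).date().isoformat()


def coarsen_epoch_to_day(seconds):
    """Same reduction for a Unix timestamp given in seconds."""
    return coarsen_to_day(datetime.fromtimestamp(seconds, tz=timezone.utc))


# Only the day would be sent, never the hours, minutes or seconds.
day = coarsen_epoch_to_day(1320340265)
```

Note that coarsening the time alone does not make submissions unlinkable; it only removes one of the parameters that could be used to match them.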
Submission ID
The current proposal changed a stable UUID for a profile to a submission ID. However, the previous submission ID is also transferred, which allows the server to trivially match them together and still build a unique ID on the server. (Again, whether the server does that or not is immaterial.) So, the submission ID proposal has the same privacy consequences discussed above.
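How little effort the matching takes can be shown with a short sketch. The function name and the log format are assumptions for the example; the point is only that any party who sees the (new ID, previous ID) pairs can reconstruct a stable per-installation identifier.

```python
def link_submissions(log):
    """Group submissions into chains using the previous-ID header.

    `log` is a list of (new_id, previous_id) pairs in arrival order.
    Returns a dict mapping each new_id to a chain label, which acts
    as a stable identifier for the installation.
    """
    chain_of = {}
    next_label = 0
    for new_id, previous_id in log:
        if previous_id in chain_of:
            # Same installation as the earlier submission.
            chain_of[new_id] = chain_of[previous_id]
        else:
            # First submission seen for this installation.
            chain_of[new_id] = next_label
            next_label += 1
    return chain_of


# Hypothetical log of two installations submitting over several days.
log = [("a", None), ("x", None), ("b", "a"), ("c", "b"), ("y", "x")]
chains = link_submissions(log)
```

In this example the submissions `a`, `b` and `c` end up in one chain and `x` and `y` in another, even though no ID is ever reused.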