Talk:MetricsDataPing

From MozillaWiki

DEPRECATED: This proposal has been updated and the official project name is "Firefox Health Report". Please see the following links for further discussion.

Post on dev.platform

Firefox Health Report blog post

Firefox Health Report FAQ



Discussion of validity of opt-out approach

Opinions from User:BenB

What difference does it make for the user

  • The argument "if they don't want to, they can opt-out" is a fallacy, because most users will not know about this data gathering. They cannot opt-out, if they don't know about it, because they have never been asked.
  • The difference between opt-in and opt-out is that opt-out includes many of those users who do not wish to participate, therefore violating their wishes and rights.
  • So, if the argument is that the result data will be different, then yes, it will be different, because it includes those users who do not wish to be included, but are included anyway in an opt-out scheme. In fact, if they actually do opt out, then the data would be different again, so the same "statistic is biased" argument applies. If the argument is that opt-out yields average data, then that is only because many users' wishes are violated.
  • This is why European and German law *requires* opt-in for any gathering of data about the user.

Discussion of using random sampling

Comments from DEinspanjer 20:10, 2 February 2012 (PST)

During the security review meeting, User:BenB brought up the idea of using random sampling to enroll installations into the data submission vs. enrolling all installations by default. This had been previously discussed by the Metrics team. It is a viable option with some possibly moderate drawbacks. Anyone manually opting in to the system must be flagged as such so their self-selection bias does not skew analysis. The current proposed system generates aggregate views of the data which roll up any high-cardinality groups to an acceptable level (the initial threshold was set at 1000). It is reasonable to assume there will be a lot of long-tail groups sitting at the minimum aggregation threshold. Heavy sampling is likely to make that long tail of little use for comparative analysis. For example, it is very likely that even a 10% sample might not allow Mozilla, or an individual user doing local analysis, to compare the performance of their installation with other installations that have a particular add-on installed.
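
The interaction between the aggregation threshold and sampling can be illustrated with a small sketch. The threshold of 1000 comes from the comment above; the function names, the "(other)" bucket, and the add-on counts are made up for illustration and are not part of the proposal.

```python
from collections import Counter

MIN_GROUP_SIZE = 1000  # initial threshold mentioned in the comment above


def roll_up(counts, threshold=MIN_GROUP_SIZE, other_label="(other)"):
    """Fold every group smaller than `threshold` into a single bucket."""
    rolled = Counter()
    for group, n in counts.items():
        key = group if n >= threshold else other_label
        rolled[key] += n
    return dict(rolled)


def sample_counts(counts, percent):
    """Expected group sizes if only `percent` of installations report."""
    return {group: n * percent // 100 for group, n in counts.items()}


# Hypothetical installation counts per add-on (illustrative numbers only).
full = {"addon-a": 250_000, "addon-b": 6_000, "addon-c": 900}

full_view = roll_up(full)
sampled_view = roll_up(sample_counts(full, 10))
print(full_view)
print(sampled_view)
```

With the full population, "addon-b" is large enough to be reported on its own; in the 10% sample it drops below the threshold and disappears into the catch-all bucket, which is the loss of long-tail comparisons described above.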

It is not something that I would consider to be a closed topic by any means, but my personal preference is to make sure the system has adequate privacy controls and can handle the load of the full installation-base and avoid potential issues with sampling errors or reduced analytic capability for both the user and for Mozilla.

Discussion of old UUID method for collecting longitudinal data

Update from User:DEinspanjer

Our original spec called for a UUID to be used per Firefox profile/installation to allow longitudinal collection and analysis of the data. We had some worries about the possible loss of privacy in the event of a user sharing this UUID (discussion here).

This method allows us to implement the cumulative installation data as a write-only service which is better for user privacy. The previous document ID can be used by the client to issue the delete request. When that document is deleted, we will have no further data associated with that ID or Firefox profile/installation. If a document ID is ever made known to other people, they cannot use that ID to retrieve data from our system, and the ID would become orphaned within a day of subsequent use of that Firefox profile which means it would not be useful to anyone as a long term identifier.
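
A minimal sketch of how such a rotating document ID could behave, under stated assumptions: the class and method names are hypothetical, the store is an in-memory dictionary standing in for the real service, and the order of operations (submit under a new ID, then delete the previous one) is a guess rather than the documented protocol.

```python
import uuid


class WriteOnlyStore:
    """Hypothetical server side: accepts submissions and deletes, no read API."""

    def __init__(self):
        self._docs = {}

    def submit(self, doc_id, payload):
        self._docs[doc_id] = payload

    def delete(self, doc_id):
        # Deleting an unknown ID is a no-op, so a leaked or stale ID is harmless.
        self._docs.pop(doc_id, None)

    def __len__(self):
        return len(self._docs)


class Client:
    """Hypothetical client that rotates its document ID on every submission."""

    def __init__(self, store):
        self.store = store
        self.prev_id = None

    def daily_submit(self, cumulative_payload):
        new_id = str(uuid.uuid4())  # fresh ID for this submission
        self.store.submit(new_id, cumulative_payload)
        if self.prev_id is not None:
            self.store.delete(self.prev_id)  # orphan the previous document
        self.prev_id = new_id
        return new_id


store = WriteOnlyStore()
client = Client(store)
first_id = client.daily_submit({"crashes": 1})
second_id = client.daily_submit({"crashes": 2})
print(len(store))
```

After the second submission only one document remains, and the first ID no longer points at anything, which is the "orphaned within a day" property described above.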

Old content

Document Identifier Strategy

Each profile will generate a UUID to be used as the document key. Each day's submission will use that UUID, and this will also be the key for that profile's cumulative data on the server. When each submission is received, the server merges it on the fly with the cumulative data, not persisting the individual documents.

Opinions from User:BenB

Privacy

A UUID is PII. Definition:

"Personally Identifiable Information (PII), as used in information security, is information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual."

A stable UUID for a user or user device is by definition always PII and never anonymous.

It is therefore regulated by European and German data protection laws and normally forbidden.

From a user standpoint, it is irrelevant whether and how Mozilla uses the data, only that the data is sent. There can be

  • interceptions during transmission
    • If SSL is compromised, we have much bigger problems because someone would be looking at banking information rather than how many crashes a Firefox installation had. DEinspanjer 15:18, 3 February 2012 (PST)
  • other logging server components before the server component discussed here
  • legal requests by various governments
    • The information they could get from the proposed dataset would be things like how many crashes a Firefox installation had, what add-ons were installed, etc. In order to request that information, they would have had to acquire the latest document ID from the installation and request the data before the next update, when the document ID is deleted. Of course, if they confiscated the hardware, they'd already have the information from the local profile directory. They couldn't request it based on IP address or user name or anything else because we don't keep that information in the data. If they were able to somehow compel us to modify the code to retain and associate that data and keep it secret, then that same (impossible) scenario would apply to making changes to the product itself, and I don't believe it is a scenario worth entertaining. DEinspanjer 15:18, 3 February 2012 (PST)
  • server break-ins, or
    • Same basic answer as the previous.
  • policy changes on the Mozilla side.
    • I agree that this is a possibility that must always be evaluated. Even without this project, Mozilla could decide to change our policies, to start trying to track users instead of installations, to try to deceive people by displaying source code but secretly modifying it in the product we compile and deliver. Or, we could just cease to be open source altogether. We could state that we will do one thing in our privacy policies and lie (hence breaking many laws) and do something else instead. While I consider all of these things improbable, others might evaluate that risk differently. The risk is not isolated to this project. It exists with AUS, Blocklist, SAMO, and VAMO to focus on just the services that have concrete user facing features and which are opt-out. DEinspanjer 15:18, 3 February 2012 (PST)

Having a UUID would make it possible, for example, to track all my dynamic IP addresses over time and, when combined with access logs, to build a profile. If I have a notebook or mobile browser, it would even make it possible to track the places I go, based on IP geolocation / whois data.

The user has no way to verify whether any of the above (break-in, intercept, intended or lawful or not) is happening or not, and that already is a privacy violation. So, it's irrelevant what the intended usage was, only what is theoretically possible. The above must be impossible - not just "We won't do it, we promise!", but impossible.

I cannot refute that possibility, and that is why the project was designed with a user-facing feature to allow greater visibility and control of the data. We feel that the data we are proposing to collect is vital to the organization's ability to continue providing Firefox as a first tier browser choice. We are committed not only to using the data in a way that respects the user's privacy but also to demonstrating that commitment by implementing this data collection in a markedly different way than other organizations, with transparency, control, and features that allow the user to directly benefit from the data. DEinspanjer 15:18, 3 February 2012 (PST)

Google Chrome

Google Chrome did use a UUID for each browser, and it was perceived as a serious privacy threat and a topic going through mainstream press, including the largest newspapers, in Germany. Eventually, Google dropped the UUID because of the PR problems it caused.

This question of whether a UUID is used by Firefox /will/ be picked up by the press, and the result will be negative for Firefox. This is not a guess, as history shows.

Google has a unique identifier that is associated with any requests to any Google property. This includes things as simple and un-obvious as the Chrome malware site detection or even typing an address or words into the address bar (unless you turn off the omnibar completion). This sort of fingerprinting and profiling is why it was easy for Google to remove an installation identifier while still getting the information they want. I don't believe it is useful to compare this project to Chrome in that regard, especially given the difference in purpose and quantity of data being collected in the two cases. DEinspanjer 15:18, 3 February 2012 (PST)

Perception

People in Germany and the rest of Europe are very privacy-aware, much more so than people in the US. Firefox has a big and loyal following there, in large part because Firefox claims to do what users want and is privacy-aware. A UUID will be considered highly offensive in these countries and will cost Firefox market share.

As far as trying to mitigate loss of Firefox market-share, StatsCounter shows a marked increase in Chrome growth in Europe. There is growth in Germany although not to such a significant degree. Looking at those trends, you can see that a portion of that growth is coming from Firefox. You can't attribute that loss to this potential privacy issue since it hasn't happened yet, and Google has a much clearer picture of why it is happening that they can capitalize upon since they have significant data not only about Chrome but also more data about Firefox users than Mozilla has (or wants). Mozilla must do something to better understand our browser usage if we hope to remain a first tier browser offering. The tools and data we have had at our disposal to date have not helped enough. This project attempts to provide that understanding. DEinspanjer 15:18, 3 February 2012 (PST)

Alternative

Instead of building the history on the server, the client should build the history and only submit results. E.g. if you need to know whether things improved, you can let the client keep some old data and submit "12 crashes last week. One week before: 12% more. One year before: 50% less."
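
A rough sketch of such a client-side summary follows. The field names, the local history layout, and the choice to express earlier periods as a percentage relative to the most recent week are all assumptions made for illustration; the comment above does not specify them.

```python
def relative_to_current(current, earlier):
    """Percent by which the earlier value was larger (+) or smaller (-)
    than the current one. Returns None when there is no baseline."""
    if current == 0:
        return None
    return round(100 * (earlier - current) / current)


def build_summary(history):
    """Reduce locally kept history to the relative figures that get submitted."""
    current = history["last_week"]
    return {
        "crashes_last_week": current,
        "one_week_before_pct": relative_to_current(current, history["one_week_before"]),
        "one_year_before_pct": relative_to_current(current, history["one_year_before"]),
    }


# Hypothetical history kept only in the local profile, never uploaded as-is.
local_history = {"last_week": 12, "one_week_before": 15, "one_year_before": 6}

summary = build_summary(local_history)
print(summary)
```

Here the server learns "12 crashes last week, 25% more the week before, 50% fewer a year before" without receiving the raw history. Rounding the percentages more coarsely would reveal even less.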

What to avoid

It should not include exact historic numbers either, because they, too, would make it possible to piece the numbers together and again build a history of IP addresses for a given user. Similarly, the exact time of the previous submission would allow submissions to be pieced together and must not be submitted; only the day should be sent (2011-02-12, not minutes or seconds). It must be impossible to match two submissions to each other, even when several parameters are considered together; see http://panopticlick.eff.org/ .
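
As a small illustration of the day-only rule, a client could strip the time component before anything leaves the machine. The function name is hypothetical.

```python
from datetime import datetime


def submission_day(timestamp):
    """Keep only the calendar day; hours, minutes and seconds are dropped."""
    return timestamp.date().isoformat()


# The precise local time stays on the client; only the day is submitted.
day = submission_day(datetime(2011, 2, 12, 14, 37, 5))
print(day)
```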


NonIdAlternative

For simplicity, I will take the number of crashes (e.g. in the last week or overall) as the data point that you want to gather. The data itself is anonymous and cannot (apart from fingerprinting, more on that later) identify a single user.

Avoiding UUID

You wanted to know which profiles are not used anymore (dormant, retention problem) and which characteristics they have. This is inherently difficult without tracking individual users (installations), but it is possible with the following algorithm:

The client submits:

  • Date of last submission - e.g. 2012-01-18
  • Current date (from client perspective) - only date, not time - e.g. 2012-01-20
  • Age of profile (Firefox installation) in days - e.g. 500
  • (Last submitted age is implied or explicit - e.g. 498 )
  • Number of crashes - e.g. 15
  • Number of crashes submitted last time - e.g. 10

Then, on the server, you write that information in a database, as such:

Date of submission | Age of installation | Crash count | Number of users
2012-01-20         | 500                 | 15          | 100000

Any additional user also submitting today the same combination "age 500, crash count 15" increases the "number of users" column by 1, new value is 100001. Also, you look up the row for the last submission, namely

2012-01-18         | 498                 | 10          | 20000

and decrease the number of users by 1, new value is 19999.

If the user later that day decides that there were too many crashes and switches to Chrome, he will now be stranded on the row

2012-01-20         | 500                 | 15          | 5000

while other users who have continued to use FF have been subtracted after a while. So, you can say with certainty that there were 5000 users who used Firefox the last time on 2012-01-20, after having used Firefox for 500 days, and they had 15 crashes (per day/week/total, whatever you submit) when they stopped using Firefox.

That is exactly the information you are so desperately seeking. There, you have it. Without tracking any individual user: it's completely anonymous.
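
The bookkeeping described above can be sketched in a few lines. This is only an illustration: the class name and the in-memory counter are assumptions, and a real server would keep these rows in a database.

```python
from collections import Counter


class CrashCohorts:
    """Counts users per (submission date, profile age, crash count) row."""

    def __init__(self):
        self.users = Counter()

    def record(self, prev, curr):
        """Move one anonymous user from the `prev` row to the `curr` row.

        Both arguments are (date, age_in_days, crash_count) tuples;
        `prev` is None for a profile's first submission.
        """
        if prev is not None and self.users[prev] > 0:
            self.users[prev] -= 1
        self.users[curr] += 1


table = CrashCohorts()
# Starting state taken from the example rows above.
table.users[("2012-01-18", 498, 10)] = 20000
table.users[("2012-01-20", 500, 15)] = 100000

table.record(prev=("2012-01-18", 498, 10), curr=("2012-01-20", 500, 15))
print(table.users[("2012-01-18", 498, 10)], table.users[("2012-01-20", 500, 15)])
```

Rows whose count never drains to zero represent users who stopped submitting from that state, which is the dormancy signal the text describes.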

Avoiding Fingerprinting

Now, what about all the other information that you need: startup times, add-ons, etc.? If we just add all that information to the same table and row, it would allow fingerprinting. But that is not necessary. You merely make one table per atomic piece of information, i.e.

Table A
Date of submission | Age of installation | Crash count | Number of users
Table B
Date of submission | Age of installation | Startup time | Number of users

or of course whatever other database schema you want, as long as each value is separate. That takes care of the fingerprinting.

At least on the server side, not on the submission side. I would have to trust you, and anything between you and me. It would be possible to separate the calls and submit each value separately, but I think that would be overdoing it.
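
A sketch of the "one table per value" idea on the server side, assuming a single request carries several metrics. The metric names, the millisecond unit, and the in-memory tables are illustrative only.

```python
from collections import Counter, defaultdict

# One independent counter table per metric; rows are never linked across tables.
tables = defaultdict(Counter)


def store_submission(date, age_days, metrics):
    """Split one incoming submission into per-metric rows and discard the rest."""
    for name, value in metrics.items():
        tables[name][(date, age_days, value)] += 1


store_submission("2012-01-20", 500, {"crash_count": 15, "startup_ms": 3200})
store_submission("2012-01-20", 500, {"crash_count": 15, "startup_ms": 4100})

print(dict(tables["crash_count"]))
print(dict(tables["startup_ms"]))
```

Once stored, nothing records which startup time belonged to which crash count. As noted above, this only protects the data at rest; the request itself still carries all values together.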

Response from DEinspanjer 20:10, 2 February 2012 (PST)

The example above initially only walks through a couple of measures (age of installation and crash count) and a single dimension (date), but it states that it attempts to handle the requirement of having several data points and dimensions that are needed.

First, I would like to state the opinion that it does not matter whether the data points are sent in multiple requests or a single request. The only reason for using multiple calls would be to attempt to obfuscate the fingerprint of the request from the recipient. Unless the calls were made over a long period of time (which would greatly inhibit useful analysis) and the IP address was changing frequently and there was no fingerprintable user agent string or other HTTP headers (not a reasonable assumption), the sender must still ultimately trust Mozilla not to attempt to reconstruct the request chain and attempt to use the reconstructed fingerprint to identify the user by permanently linking it to PII such as IP or personal information.

Second, I believe the same reasoning holds for the concept of breaking up the data into multiple tables to avoid fingerprinting. Ultimately, the user needs to trust that the company will follow the practice they commit to because the work is taking place outside of their view. There can be mitigating factors to increase the trust such as a simple and sound privacy policy, releasing the server source code, and sharing the aggregated data, but it still comes down to the community reviewing the data that is being sent, agreeing that the data itself isn't harmful to the user's privacy (i.e. it doesn't contain their e-mail address, sites visited, demographic data, etc.) and then trusting that the transport mechanism is not likely to be compromised and that the company will do what they say they are going to do. If the system does not keep PII such as IP and does not longitudinally track sensitive information such as GeoLocation (both of which are strictly forbidden in the current proposal) then even an external party such as a government agent would not be able to get anything more interesting from requesting the data than what they would already have access to. They would have to force the company to modify the source code of either the client or the server in order to collect information useful to them, and if they could actually force that, all bets are already off.

So I want to walk through a more complete example with 16 data points:

 DataName          Prev    Curr
 SubmissionDate    00      01
 AgeOfProfile      10      11
 NumberOfCrashes   20      21
 MainStartup       30      31
 FirstPaint        40      41
 OSNameVersion     50      51
 AppNameVersion    60      61
 AppBuildNumber    70      71
 AppABI            80      81
 AppUpdateChannel  90      91
 Locale            A0      A1
 NumberOfSearches  B0      B1
 NumberOfSessions  C0      C1
 ActiveSessionT    D0      D1
 AddonCount        E0      E1
 SystemMemory      F0      F1

Some of these data points would likely be fairly constant, some of them would have very low cardinality which means large groups of records with the same values. That said, some of them would change a lot and/or have very high cardinality.

If you string together those previous and current tokens, they represent lots of bits that can be used for fingerprinting. To me, it looks equivalent to the following:

 Header: prevID=00102030-4050-6070-8090-A0B0C0D0E0F0&currID=01112131-4151-6171-8191-A1B1C1D1E1F1
 Payload: {expanded data}
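
The equivalence claimed above can be checked mechanically. The sketch below joins the sixteen placeholder tokens from the example table and lays them out in the usual 8-4-4-4-12 UUID grouping; the function name is made up for illustration.

```python
def tokens_to_id(tokens):
    """Join sixteen two-character tokens and group them like a UUID."""
    raw = "".join(tokens)
    if len(raw) != 32:
        raise ValueError("expected 16 two-character tokens")
    return "-".join([raw[0:8], raw[8:12], raw[12:16], raw[16:20], raw[20:32]])


# Placeholder tokens from the table: 00, 10, ..., F0 and 01, 11, ..., F1.
prev_tokens = [f"{i:X}0" for i in range(16)]
curr_tokens = [f"{i:X}1" for i in range(16)]

prev_id = tokens_to_id(prev_tokens)
curr_id = tokens_to_id(curr_tokens)
print(f"prevID={prev_id}&currID={curr_id}")
```

The placeholder tokens are two characters each only for readability; real values would carry a different number of bits per field, but the point stands that the combined values behave like an identifier pair.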

In my opinion, the two proposals are functionally equivalent on the transport layer. At the data storage layer, they both require the user to trust that the company will not attempt to subvert the data collected by linking it to PII. If this system is acceptable given those requirements, the first should be as well. DEinspanjer 20:10, 2 February 2012 (PST)


Even if we decide that fingerprinting at the transmission level is not an issue, the method does not allow for sufficient covariate analysis. If we hope to find meaningful results in factors that contribute to performance problems, stability issues, or abandonment, we need joint distributions -- the patterns involving simultaneous recordings of different covariates. DEinspanjer 15:18, 3 February 2012 (PST)
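
A toy example of why per-value tables lose the joint distribution: the two made-up populations below produce identical per-metric counts, yet only in the first one is the add-on associated with crashes. All names and values are illustrative.

```python
from collections import Counter


def marginals(records):
    """What separate per-metric tables would retain."""
    crashes = Counter(r["crashes"] for r in records)
    addons = Counter(r["addon"] for r in records)
    return crashes, addons


def joint(records):
    """What is needed for covariate analysis: values recorded together."""
    return Counter((r["addon"], r["crashes"]) for r in records)


# In population A, every installation with add-on X crashes a lot.
population_a = [
    {"addon": "X", "crashes": "high"},
    {"addon": "X", "crashes": "high"},
    {"addon": "none", "crashes": "low"},
    {"addon": "none", "crashes": "low"},
]
# In population B, the add-on and crashing are unrelated.
population_b = [
    {"addon": "X", "crashes": "high"},
    {"addon": "X", "crashes": "low"},
    {"addon": "none", "crashes": "high"},
    {"addon": "none", "crashes": "low"},
]

print(marginals(population_a) == marginals(population_b))
print(joint(population_a) == joint(population_b))
```

The separate tables cannot tell the two populations apart, while the joint counts can.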

Points were made in the meta bug about not needing installation and update dates for the product or add-ons. Using longitudinal data to analyze how time changes the dynamics of the installation (e.g. what makes the installation be used less often) is a requirement. Currently, we can't even answer simple questions such as whether the rapid release schedule has had a negative impact on retention. We can only listen to opinions and (possibly wild) guesses. DEinspanjer 15:18, 3 February 2012 (PST)