Mozilla Network Outages Data Project

Introduction

The Network Outages Data Project was initiated at Mozilla in 2019 to explore mission-aligned uses of Mozilla data in support of a healthier internet worldwide, by:

  • (a) creating an anonymized, privacy-preserving dataset of signals, based on existing Mozilla telemetry, that correlate with network outages in countries and cities worldwide;
  • (b) validating the reliability of the outage signals with the help of internal network experts and renowned external experts;
  • (c) disseminating our findings and publicly releasing the dataset.

Dataset Release and Access

Mozilla offers public access to a telemetry dataset that enables researchers to explore signals of internet outages and shutdowns around the world. To gain access to the dataset, which is licensed under the Creative Commons Public Domain Dedication (CC0) and contains data from January 2020 onward, researchers can apply via this Google Form. We look forward to seeing the exciting work that internet outage researchers will produce with this dataset and hope to inspire more use of aggregated datasets for the public good.

Project Background

Validation

To validate the dataset prior to its release, we signed legal agreements with the following organizations before sharing it with them and hosted a series of group meetings (starting July 2020) to discuss their findings and feedback. All data in the dataset was (and is) anonymized and aggregated to ensure user privacy, with no data included for any unknown location or any city with a population of less than 15,000 people. All included locations also had at least 50 active users in a given hour.

Mozilla members of the Internet Outages Data (IOD) working group then conducted individual interviews over Zoom with researchers from these organizations:

  • The Open Observatory of Network Interference (OONI)
  • RIPE NCC / RIPE ATLAS
  • Measurement Lab (M-Lab)
  • The Internet Outage Detection and Analysis (IODA) project of the Center for Applied Internet Data Analysis (CAIDA) at the San Diego Supercomputer Center
  • Internews
  • Access Now #KeepitOn Coalition

The interviews were conducted over the course of June and July 2021, by at least two members of the relevant Mozilla working group, with one technical member from the Data team always present. Each interview lasted 45 minutes and was followed by a 10-15 minute debrief session for the Mozilla team to revise notes, reflect on learnings, and add any potential action items.

We directed all participating organizations to compare and contrast data from the following countries, each of which experienced documented internet outages with very different characteristics:

  • Belarus in August 2020
  • Uganda in January 2021
  • Myanmar in February 2021

Validation Outcome

Researchers from all six organizations unequivocally confirmed that they found Mozilla's dataset useful for detecting internet outages. They validated the insights gained from Mozilla's data by matching them against their own sources as well as other publicly available data (such as the IODA repository). There was a broad but clear correlation between their sources and Mozilla's data.

Advantages

Some of the advantages of Mozilla's dataset identified by the organizations were:

  • corroborative and cross-validation value
  • diversity and breadth of clients (sending data)
  • ability to see the nuance of aggregated user behaviour and internet outages

Disadvantages

Some of the disadvantages of Mozilla's dataset identified by the organizations were:

  • lacking granularity (aggregated pings vs. actual information)
  • inability to segregate data at levels other than the city (e.g. other governance districts)
  • pings do not record timestamps of when data was recorded locally and when the client attempted to send it (including multiple attempts), making it harder to detect smaller outages

Overview of the dataset

OONI's report from November 2021 is a good resource for researchers who want to use Mozilla's dataset to detect internet outages.

How is the dataset created?

The dataset is produced by an ETL job that lives here. The input data is a set of telemetry histograms (e.g. the time it takes to receive a successful response from a DNS server) sent via the Firefox main ping and the telemetry health ping, the latter measuring the health of the Firefox telemetry submission platform.

The ETL job exclusively uses, at this time, data that is already being collected by Firefox for other purposes (no new data is collected specifically to detect outages). After the pings hit our pipeline, they are processed like any other ping and end up in their respective intended tables.

Note that there is no client IP address in any of these tables and no IP address is accessed at any point during our processing: the raw data for our ETL job is the processed main and health ping data already accessible to Mozilla employees.

From this point on, what the ETL does is:

  1. Get the list of locations to publish data for, using the location data computed by the pipeline from the geoip database and the clients_daily table.
  2. Drop any unknown location and any city with a population of less than 15,000 people.
  3. Keep only places that have at least 50 active users in a given hour.
  4. Aggregate the histograms of interest from the main ping for a given hour (e.g. DNS_FAILURE_TIME).
  5. Count the presence or absence of a histogram of interest within a session for a given hour (e.g. active sessions that don't perform a DNS lookup in a given hour).
  6. Compute the ratio of errors as reported by the health ping (i.e. the proportion of times users reported a timeout when attempting to upload telemetry).
  7. Finally, merge the data to produce a table with a row for each (Country, City, Hour) tuple that we want to disclose data for (given the initial restrictions), containing the data documented here.
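
As an illustration of steps 2, 3, and 6, here is a minimal SQL sketch in the spirit of the actual ETL job (linked above). The table and column names (ping_rows, city_population, send_failure_type, client_id) are hypothetical stand-ins, not the pipeline's real schema:

-- Hypothetical input tables; the real ETL reads Mozilla's internal ping tables.
WITH hourly AS (
  SELECT
    country,
    city,
    TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS datetime,
    COUNT(DISTINCT client_id) AS active_users,
    -- step 6: proportion of users that reported an upload timeout
    COUNT(DISTINCT IF(send_failure_type = 'timeout', client_id, NULL))
      / COUNT(DISTINCT client_id) AS proportion_timeout
  FROM ping_rows
  GROUP BY country, city, datetime
)
SELECT h.country, h.city, h.datetime, h.proportion_timeout
FROM hourly AS h
JOIN city_population AS p USING (country, city)
WHERE p.population >= 15000   -- step 2: drop unknown or small cities
  AND h.active_users >= 50    -- step 3: require 50+ active users per hour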

How to query the data once access is granted?

To access public datasets in BigQuery, a Google Cloud Platform (GCP) account is required. GCP offers a free tier that provides free credits to run queries in BigQuery, and the BigQuery sandbox lets users try BigQuery for free without providing payment information.

Go to the BigQuery website, log in with the account that was granted access, and try to run this query:

select * from `moz-fx-data-shared-prod.internet_outages.global_outages_v1` limit 0

This is a dry run of a simple query against the dataset. It should not return any data, but it will error out if there are permission problems.
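
Once access is confirmed, a query along the following lines pulls hourly signals for one of the outages discussed above (Myanmar, February 2021). Note that 'MM' assumes ISO 3166-1 alpha-2 country codes; the field documentation below only says "country code", so verify the format against the data:

SELECT
  datetime,
  city,
  proportion_timeout,
  proportion_unreachable,
  avg_dns_failure_time
FROM `moz-fx-data-shared-prod.internet_outages.global_outages_v1`
WHERE country = 'MM'  -- assumed ISO 3166-1 code for Myanmar
  AND datetime BETWEEN '2021-01-25' AND '2021-02-15'
ORDER BY datetime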

What does the dataset contain?

The dataset contains a set of aggregated metrics that correlate with internet outages in different countries around the world. It contains the following fields (source):

  • `country`: the country code of the client.
  • `city`: the city name (only for cities with a population >= 15,000; 'unknown' otherwise).
  • `datetime`: the date and time (truncated to the hour) the data was submitted by the client.
  • `proportion_undefined`: the proportion of users who failed to send telemetry for a reason that was not listed in the other cases.
  • `proportion_timeout`: the proportion of users whose connection timed out while uploading telemetry (after 90s, in Firefox Desktop).
  • `proportion_abort`: the proportion of users that had their connection terminated by the client (for example, terminating open connections before shutting down).
  • `proportion_unreachable`: the proportion of users that failed to upload telemetry because the server was not reachable (e.g. because the host was not reachable, proxy problems or OS waking up after a suspension).
  • `proportion_terminated`: the proportion of users that had their connection terminated internally by the networking code.
  • `proportion_channel_open`: the proportion of users for which the upload request was terminated immediately, by the client, because of a Necko internal error.
  • `avg_dns_success_time`: the average time it takes for a successful DNS resolution, in milliseconds.
  • `missing_dns_success`: counts how many sessions did not report the `DNS_LOOKUP_TIME` histogram.
  • `avg_dns_failure_time`: the average time it takes for an unsuccessful DNS resolution, in milliseconds.
  • `missing_dns_failure`: counts how many sessions did not report the `DNS_FAILED_LOOKUP_TIME` histogram.
  • `count_dns_failure`: the average count of unsuccessful DNS resolutions reported.
  • `ssl_error_prop`: the proportion of users that reported an error through the `SSL_CERT_VERIFICATION_ERRORS` histogram.
  • `avg_tls_handshake_time`: the average time after the TCP SYN to ready for HTTP, in milliseconds.
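
As a sketch of how these fields might be combined, the query below computes per-country daily averages of two of the upload-failure proportions and surfaces the largest values; the choice of fields and the LIMIT are illustrative, not part of the dataset:

SELECT
  country,
  DATE(datetime) AS day,
  AVG(proportion_timeout) AS avg_timeout,
  AVG(proportion_unreachable) AS avg_unreachable
FROM `moz-fx-data-shared-prod.internet_outages.global_outages_v1`
GROUP BY country, day
ORDER BY avg_timeout DESC
LIMIT 20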

Caveats with the data

As with any observational data, there are many caveats and interpretation must be done carefully. Below is a list of issues we have considered, but it is not exhaustive.

  • Firefox users are not representative of the general population in their region.
  • Users can experience multiple types of failures, so the proportions are not summable. For example, if 2.4% of clients had a timeout and 2.6% of clients found the server unreachable, that does not necessarily mean that 5.0% of clients had a timeout or were unreachable (see the bounds sketch after this list).
  • Geographical data is based on IPGeo databases. These databases are imperfect, so some activity may be attributed to the wrong location. Further, proxy and VPN usage can create geo-attribution errors.
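
One way to respect the non-summability caveat is to treat the individual proportions as bounds on the combined fraction of affected clients: the largest single proportion is a lower bound, and their sum (capped at 1) is an upper bound. The query below is a hedged illustration of that reasoning, not an official metric:

SELECT
  country, city, datetime,
  -- lower bound: at least this fraction of clients hit some failure
  GREATEST(proportion_timeout, proportion_unreachable) AS affected_lower_bound,
  -- upper bound: the same clients may appear in several proportions
  LEAST(1.0, proportion_timeout + proportion_unreachable) AS affected_upper_bound
FROM `moz-fx-data-shared-prod.internet_outages.global_outages_v1`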

Contact

The team can be contacted at outages@mozilla.com.

Team leaders

Alessio Placitelli, Mozilla Data Organization
Udbhav Tiwari, Mozilla Policy
Solana Larsen, Mozilla Foundation Insights