CloudServices/DataPipeline/Metadata

From MozillaWiki

Overview

We need a mechanism for handling server-side data changes, including changes to format or layout, changes of storage location, back-processing, and data migration.

The general goal is to be able to move data from one location to another with minimal impact on consumers of that data.

Implementation

Metadata describing each data source is currently stored in Amazon S3. Consumers of the data are encouraged to consult it to determine where the most up-to-date data resides.

Metadata can be found at s3://net-mozaws-prod-us-west-2-pipeline-metadata. At the top level, you will find a file called sources.json, containing a map of data source name -> metadata. For example, you will find an entry for telemetry.
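Reading the map can be sketched as follows. This is a minimal sketch, not the pipeline's own tooling: the function name load_sources is hypothetical, and it assumes the caller passes in an S3 client that behaves like a boto3 S3 client (get_object returns a dict whose "Body" has a read() method) and that the file lives at the key sources.json.

```python
import json

# Metadata bucket named in the text above.
METADATA_BUCKET = "net-mozaws-prod-us-west-2-pipeline-metadata"


def load_sources(s3_client, bucket=METADATA_BUCKET, key="sources.json"):
    """Fetch and parse the top-level map of data source name -> metadata.

    Assumption: s3_client exposes get_object(Bucket=..., Key=...) in the
    same way a boto3 S3 client does.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read().decode("utf-8"))
```

With boto3 installed and suitable credentials, a call such as load_sources(boto3.client("s3"))["telemetry"] would return the entry for telemetry.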

Each source has several pieces of information:

  • description: A human-readable description of the data source
  • bucket: The name of the S3 bucket where the data is located
  • prefix: The prefix within the data bucket where the data can be found
  • doclink: A URL where further documentation may be found (optional)
  • metadata_prefix: The prefix within the metadata bucket where any extended metadata may be found

As we add more data sources, it is possible that more keys and metadata will be added, but this set is expected for S3-based data sets.
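A consumer may want to check that an entry carries the keys listed above before relying on it. The sketch below is illustrative only: the helper name missing_keys is hypothetical, and treating every key except doclink as required is an assumption based on doclink being the only key marked optional.

```python
# Assumption: only doclink is optional, as it is the only key marked so above.
REQUIRED_KEYS = {"description", "bucket", "prefix", "metadata_prefix"}


def missing_keys(entry):
    """Return the sorted list of expected keys that are absent from an entry."""
    return sorted(REQUIRED_KEYS - set(entry))
```

An empty list means the entry has the full expected set. Unknown extra keys are ignored, since more keys may be added over time.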

For data that is stored in S3 using the "schema" approach to storage layout, the schema can be found at s3://{metadata_bucket}/{metadata_prefix}/schema.json, where {metadata_bucket} is the metadata bucket named above. The schema shows the set of dimensions used for data storage.

The data itself for a given data source is located at s3://{bucket}/{prefix}/....

Example

An example sources.json would look like:

   {
       "telemetry": {
           "description": "Unified Telemetry v4 data",
           "doclink": "https://ci.mozilla.org/job/mozilla-central-docs/Tree_Documentation/toolkit/components/telemetry/telemetry/pings.html",
           "bucket": "example-data",
           "prefix": "telemetry-2",
           "metadata_prefix": "telemetry-meta-2"
       },
       "telemetry-errors": {
           "description": "Error stream for Telemetry data",
           "doclink": "https://bugzilla.mozilla.org/show_bug.cgi?id=1137747",
           "bucket": "example-data",
           "prefix": "telemetry-errors-2",
           "metadata_prefix": "telemetry-errors-2"
       },
       "example 3": {
           "description": "An example to live in the docs (so meta)",
           "doclink": "https://wiki.mozilla.org/CloudServices/DataPipeline/Metadata",
           "bucket": "example-data",
           "prefix": "example_2",
           "metadata_prefix": "example_2"
       }
   }


In the above example, the telemetry data would be located at s3://example-data/telemetry-2/..., while the storage schema would be located at s3://net-mozaws-prod-us-west-2-pipeline-metadata/telemetry-meta-2/schema.json.
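The two locations can be derived from an entry mechanically. The following is a minimal sketch; the function names data_location and schema_location are hypothetical, and the metadata bucket is taken from the Implementation section.

```python
METADATA_BUCKET = "net-mozaws-prod-us-west-2-pipeline-metadata"


def data_location(entry):
    """Build the s3://{bucket}/{prefix}/ location of a source's data."""
    return "s3://{}/{}/".format(entry["bucket"], entry["prefix"])


def schema_location(entry, metadata_bucket=METADATA_BUCKET):
    """Build the location of a source's storage schema."""
    return "s3://{}/{}/schema.json".format(metadata_bucket, entry["metadata_prefix"])


telemetry = {
    "bucket": "example-data",
    "prefix": "telemetry-2",
    "metadata_prefix": "telemetry-meta-2",
}
print(data_location(telemetry))
print(schema_location(telemetry))
```

Because consumers build both locations from the metadata rather than hard-coding them, a change of prefix only requires re-reading sources.json.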


Deploying a server-side change

Deploying a server-side data change will generally use the following workflow:

  1. Reserve a new prefix for the desired data source name(s)
  2. Populate the historic data under the new prefix (by backprocessing "landfill" data, for example)
  3. Once the bulk of the work has been done, cut over the live Data Store loader to populate data using the new prefix / configuration
  4. Populate the gap between the backfill and the cutover
  5. Update the metadata for the data source name to point at the new location
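The final step amounts to editing one entry of the sources.json map. The sketch below shows only that in-memory edit; the function name repoint_source is hypothetical, and how the updated file is reviewed and uploaded is not covered here.

```python
import copy


def repoint_source(sources, name, new_prefix, new_metadata_prefix=None):
    """Return a copy of the sources map with one entry pointing at a new prefix.

    The input map is left untouched so the old and new versions can be
    compared before anything is published.
    """
    if name not in sources:
        raise KeyError("unknown data source: {}".format(name))
    updated = copy.deepcopy(sources)
    updated[name]["prefix"] = new_prefix
    if new_metadata_prefix is not None:
        updated[name]["metadata_prefix"] = new_metadata_prefix
    return updated
```

For instance, repoint_source(sources, "telemetry", "telemetry-3") would leave every other entry as it was and change only the telemetry prefix (the value telemetry-3 is made up for illustration).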