Glean/Adding or changing Glean metric types

From MozillaWiki

Background

Glean is Mozilla’s modern product analytics and telemetry solution that provides data for our new products. It aims to be easy to integrate, reliable and transparent by providing an SDK and integrated tools.

One of the Glean principles is to provide higher-level metric types that map semantically to what users want to measure: for example, it is helpful for both validation and analysis to know that something is a counter rather than just a more general "integer", as this implies that its value cannot be less than or equal to 0.
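As a minimal sketch of why a semantic type helps validation, the hypothetical `Counter` class below rejects increments that a plain integer would silently accept. The class name, the `add` method and the `errors` field are illustrative assumptions, not the Glean SDK API.

```python
class Counter:
    """Hypothetical counter metric: it only grows, by strictly positive amounts."""

    def __init__(self) -> None:
        self.value = 0   # 0 here means "nothing recorded yet"
        self.errors = 0  # number of rejected (invalid) increments

    def add(self, amount: int = 1) -> None:
        # A plain integer would accept any value; a counter can reject
        # increments that make no sense for "how many times did X happen".
        if amount <= 0:
            self.errors += 1
            return
        self.value += amount


clicks = Counter()
clicks.add()
clicks.add(3)
clicks.add(-2)  # rejected: counted as an error, value unchanged
```

Because the type carries the rule, both the recording side and the analysis side can rely on any recorded value being positive.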

The currently offered metric types were designed to cover the majority of Mozilla use-cases, but we know that new use-cases will come up. Some already have (UrlMetricType, StringList vs StringSet, dropping labelled booleans, coarse timing distributions, error stacks, changes to quantity/counters, enumerations, ratios).

Motivation

The base set of metric types offered by Glean, from our initial design document, was designed by going through the pings sent by our mobile products and identifying the higher-level metric types required to reach feature parity with the existing telemetry system. The design document was reviewed by data engineering, and that process helped smooth out some of the rough edges of the existing legacy system. However, no metric type was fundamentally new, and this meant we did not have to answer questions such as:

  • can the type answer the business questions we’re adding instrumentation for?
  • can the type be used to leak user data?
  • does the metric type require custom processing when ingested?

While the Glean/Telemetry team has experience and historical knowledge that could inform the answer to these questions, other teams have a much deeper expertise on these topics. Their opinions and recommendations are vital for the process of adding new metric types.

For this reason, the Glean/Telemetry team alone cannot make a call about whether or not a request for a new metric type is reasonable. The Glean end-to-end tech lead must be responsible for that and must base their decision on the feedback from the consulted stakeholders.

The committee

As mentioned, adding a new metric type to the Glean ecosystem has implications beyond the Glean SDK. All the teams involved in the Glean ecosystem need to be consulted, in addition to the team or individuals requesting the new metric type or the change to an existing one. (Note: teams or individuals that are part of the committee can file requests as well, if needed.)

This process structure attempts to bring together all the points of view of the different stakeholders of the Glean ecosystem. The volume of the incoming requests is expected to decrease over time, as the Glean offering becomes more complete and comprehensive.

The following sections describe the three roles identified to move the process along.

The requester

This is the team or individuals asking for the change or the addition of a new metric type in Glean.

Team name: Requester
Member name(s): (depends on the request)
Domain of expertise/angle:
  • Why is the new metric type required?
  • What is the data that needs to be collected?
  • What is the specific question that the data needs to answer?

The triage owner

This is the team or individuals responsible for triaging incoming requests and making sure that all the requests are acted on.

Team name: Triage owner
Member name(s): The owner of the Bugzilla component
Domain of expertise/angle:
  • Is not required to be involved in the decision making process.
  • Guarantees that each request is acted on within 6 business days.

The Glean end-to-end tech lead

The individual who has an end-to-end understanding of the Glean ecosystem and oversees its strategy and long-term goals.


Team name: Glean end-to-end tech lead
Member name(s): Michael Droettboom
Domain of expertise/angle:
  • Does the requested type fit into the product strategy?
  • Does the cost for implementing this outweigh its benefits?
  • If the requested change can be technically achieved, should it still be done?

The consulted stakeholders

This table attempts to capture the stakeholders that need to be consulted because they work on, or are affected by, the Glean ecosystem.

Note: Teams can declare that the triage owner can rely on any individual on the team, instead of specific owner names. However, teams that do not want specific individuals to be mentioned are required to provide a mechanism for the triage owner to reach out to the team and make sure to receive an answer in a timely manner.

Team name: Data Science
Member name(s): Any available data scientist (file a JIRA ticket)
Domain of expertise/angle:
  • How would the new type impact analyses?
  • Can the requested type be used to answer meaningful questions in a scalable way?
  • Can the use-case be satisfied by using any existing metric type?
  • Will the data be easy to misinterpret, and are there ways to minimize that?
  • Help vet definitions posted to organization

Team name: Data Stewards
Member name(s): Any available steward (to reach out)
Domain of expertise/angle:
  • Does the new type pose privacy challenges?
  • Should the data-review process be changed to address this new type?

Team name: Data Tools
Member name(s): Marina Samuel, Rob Hudson
Domain of expertise/angle:
  • Can this metric type be unambiguously aggregated?
  • Will this metric pose problems when trying to plot it?
  • Can this metric type be accessed?
  • Will this metric type introduce previously unseen complexity to our aggregation process?
  • Will this metric type's data (or its aggregate data) introduce new complexity when we import it into the low latency dbs used by our web-based data tools?

Team name: Pipeline
Member name(s): Frank Bertsch, Mark Reid
Domain of expertise/angle:
  • Would this create problems with the payload?
  • How would the new type translate to BQ types?
  • Can the new type be represented at all in a convenient way?
  • Can the use-case be satisfied by using any existing metric type?

Team name: SDK
Member name(s): Alessio Placitelli, Beatriz Rizental
Domain of expertise/angle:
  • Would the new type violate SDK principles?
  • Would the API be reasonably ergonomic?
  • Would the API work on all the supported platforms?
  • Can the use-case be satisfied by using any existing metric type?
  • Are there any performance concerns? (speed, size on disk, bandwidth, etc.)

The processes

This section outlines the two processes involved in changing or adding metric types. The workflow starts with the user making the request. After that, the requested changes are discussed among the different stakeholders listed in the previous sections.

Requirements

Before any of the following processes can take place, the following requirements need to be satisfied:

  1. Representatives for each team of the committee must be nominated and added to this document in the committee section.
  2. Managers or representatives for each team of the committee must sign-off on this proposal, at the top of the document.
  3. A new Bugzilla component, "Data platforms & tools::Glean Metric Types", must be created.
  4. All the members of the committee must subscribe to the Bugzilla component.
  5. Representatives for each team must nominate the triage owner for the Bugzilla component.
  6. A Bugzilla form for submitting requests must be created (see the related paragraph).
  7. A discussion document template must be available to be forked by the triage owner.
  8. Documentation for requesting new metric types or changing existing ones must be available on the Book of Glean.

The proposal process

This section describes how users should file a request for either changing or adding a new metric type.

  1. User files a bug using a custom form in the Data platforms & tools::Glean Metric Types component in Bugzilla.
  2. The triage owner of the Bugzilla component prioritizes this within 6 business days and kicks off the decision making process.
  3. Once the decision process is completed, the bug is closed with a comment outlining the decision that was made.
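The 6-business-day triage window in step 2 can be computed with a small helper. This is only a sketch under stated assumptions: `add_business_days` is a hypothetical function, it treats Monday to Friday as business days, it ignores public holidays, and it counts from the day after the bug is filed.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


filed = date(2021, 3, 3)  # a Wednesday (example date, not from the source)
triage_deadline = add_business_days(filed, 6)
print(triage_deadline)  # prints 2021-03-11, the Thursday of the following week
```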

The custom Bugzilla form

The form contains the following information:

  • A description of the data that needs to be recorded.
  • A raw sample of the data that needs to be recorded. The sample should describe the data in the abstract, without any implementation details about its representation in the payload or the database.
  • The business question/use-case that requires the data to be recorded.
  • How the data would be consumed.
  • Why existing metric types are not enough.
  • The timeline by which the data needs to be collected.
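The form fields above can be pictured as a simple record with a completeness check. The sketch below is illustrative only: `MetricTypeRequest`, its field names and `missing_fields` are hypothetical and do not reflect the actual Bugzilla form implementation.

```python
from dataclasses import dataclass, fields


@dataclass
class MetricTypeRequest:
    """Hypothetical container mirroring the form fields listed above."""

    data_description: str
    raw_sample: str
    business_question: str
    consumption: str
    why_existing_types_insufficient: str
    timeline: str

    def missing_fields(self) -> list[str]:
        # A field counts as missing when it is empty or whitespace only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


request = MetricTypeRequest(
    data_description="Time spent per page, bucketed coarsely",
    raw_sample="[120, 340, 95]",
    business_question="Which pages keep users engaged?",
    consumption="Weekly dashboard",
    why_existing_types_insufficient="Existing distributions are too fine-grained",
    timeline="",
)
print(request.missing_fields())  # prints ['timeline']
```

A check like this would let the triage owner spot incomplete requests before the decision making process starts.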

The decision making process

This section outlines the process with which a decision is made when a new request comes in.

  1. The triage owner of the Bugzilla component triages the request.

  2. The triage owner copies all the information from the bug into a document (for allowing easier async communication through comments). Rich text should be copied, not raw markdown.

  3. The triage owner attaches the document to the bug.

    • The triage owner files a ticket in JIRA for the Data Science team with the same template used for Bugzilla.
    • The triage owner cross-links the JIRA issue and the Bugzilla bug. Any required update and public communication still happens exclusively on Bugzilla; the JIRA ticket is used only by Data Science to track the work item.
  4. The Glean end-to-end tech lead designates the initial group required to suggest a technical solution (non-required stakeholders are still allowed and encouraged to contribute at this stage).

    • This is the design stage and should last, at most, 2 weeks.
    • The triage owner flags all the members of the committee on Bugzilla with a needinfo request (and comments on JIRA).
  5. Members of the committee discuss the content of the document using the Google Docs commenting system.

  6. The Glean end-to-end tech lead ends the design stage once the proposal reaches maturity.

    • The Glean end-to-end tech lead writes a short summary of the proposal at the top of the document.
    • The proposal moves to the comment stage which should last 6 business days.
    • The triage owner flags all the members of the committee on Bugzilla with a needinfo request (and comments on JIRA).
    • Reviewers are expected to leave comments within 6 business days.
  7. After 6 business days, the comment stage is declared finished and the Glean end-to-end tech lead marks the proposal as either approved or rejected.

  8. (optional) The Glean end-to-end tech lead can call for a meeting to be organized to further discuss the document, if needed.

  9. If more information is required from the requester, they are flagged on the document with a comment.

  10. The teams that will need to do the implementation work (whatever work is needed to fully support the discussed change in the Glean ecosystem including, but not limited to, SDK, pipeline, tooling, ...) will need to provide an estimate of the required work. The effort has to be weighed in the final decision.

  11. All the members of the committee sign off on the document (in the traceability matrix at the top of the document) or leave a comment about their concern with the proposal.

  12. The triage owner closes the bug and leaves a comment with the decision outcome.

    • The triage owner makes the document publicly accessible, if needed, by attaching it to the bug as a Markdown document.

    • If the decision is to proceed with the change, the triage owner files the required bugs in the relevant components (e.g. for new metric types, a bug for SDK changes is likely required).

    • If the decision is to not make any change to metric types, the committee must identify and recommend an existing metric type to use instead.

Important: the triage owner is responsible for driving the conversations and making sure that a decision is reached within the expected timeframe.

The discussion document

In addition to a copy of all the data present in the Bugzilla form, the discussion document must contain the following sections:

  • a traceability matrix at the top of the document, listing all the committee members for facilitating the discovery of the sign-offs;
    • this should have at least two sections: Glean end-to-end tech lead and consulted.
  • a section for each team represented in the committee to add their considerations about the request; ideally each section should summarise the team's position on the proposal and highlight any critical issue.

The template is available here (v2).

Q&A

Q: What are the acceptance criteria for changes/new metrics?

A: In addition to the food for thought from this section, the committee must consider the following aspects:

  • the proposed changes/new metric must not break any of the Glean principles;
  • if changing an existing metric type, the change must not break any existing usage;
    • an email must be sent to fx-data-dev@mozilla.org to announce the intent to change the metric.
    • if it breaks existing uses, the requester must commit to fixing the breakage (or partner with the appropriate stakeholders, if identifiable) as agreed with the committee.

Q: What if the committee disagrees on the decision for the changes/new metrics?

A: The Glean end-to-end tech lead has veto power: they can decide either to proceed with the change (e.g. because of strategic product needs) or to reject the request (e.g. because of the cost/benefit ratio).