New test environments

Overview

From time to time, a need arises to upgrade the underlying operating system of a platform, in sync with new major releases of the operating systems that form part of the CI infrastructure.

For instance, as of 2019-08-07, all Firefox builds for Linux are executed on Ubuntu 16.04.5 Docker containers. In other words, the Linux distribution used for testing is at least two major releases behind the likely dominant version on the market, Ubuntu 18.04.

Upgrading the underlying operating system has, in the past, been considered a large undertaking, often taking upwards of six months. This creates a chicken-and-egg problem: regular upgrades do not occur because of the perceived amount of work, which in turn causes issues to multiply when the upgrade is finally tackled.

The aim of this document is to establish a standardized process that anyone in Mozilla engineering can use to perform operating system upgrades.

Task Overview

Broadly speaking, the following discrete phases are involved when adding new platforms.

  • ensure availability of machines (if hardware)
    • responsibility: Release Engineering
    • some people (that I've worked with):
      • jwatkins
      • markco
      • rthijssen
      • dhouse

Everything below can be executed in parallel among several engineers:

  • begin greening process
    • responsibility: CI-A
  • address issues with test case/platform
    • responsibility: developers
  • create, review and land migration patches
    • responsibility: CI-A

Enable platform on taskgraph

The first step of any new test environment is to enable the test platform on Tryserver.

At a bare minimum, ensure the taskgraph remains sound after each change. This can be verified using ./mach taskgraph full -v.
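
For example, a quick way to confirm that the new tasks appear in the generated graph (the grep pattern is illustrative; substitute the new platform's label):

    # Regenerate the full taskgraph; schema or reference errors in the
    # new platform's YAML will surface here.
    ./mach taskgraph full -v

    # Optionally, filter the generated task labels for the new platform.
    ./mach taskgraph full 2>/dev/null | grep win64-aarch64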

Enable build

This step may have already been performed by other teams (e.g. RelEng). If so, skip to the next step.

First, the platform must have builds enabled before tests can be run.

Within the taskcluster/ci/build directory, edit the appropriate YAML file for the platform. For example, if adding a new Windows build type, edit taskcluster/ci/build/windows.yml.

Define all of the required attributes, using existing configurations as a template.

Example with windows10-aarch64 builds:

win64-aarch64/opt:
    description: "AArch64 Win64 Opt"
    index:
        product: firefox
        job-name: win64-aarch64-opt
    attributes:
        enable-full-crashsymbols: true
    treeherder:
        platform: windows2012-aarch64/opt
        symbol: B
        tier: 1
    worker-type: b-win2012
    worker:
        max-run-time: 7200
        env:
            TOOLTOOL_MANIFEST: "browser/config/tooltool-manifests/win64/aarch64.manifest"
            PERFHERDER_EXTRA_OPTIONS: aarch64
    run:
        actions: [get-secrets, build]
        options: [append-env-variables-from-configs]
        script: mozharness/scripts/fx_desktop_build.py
        secrets: true
        config:
            - builds/releng_base_firefox.py
            - builds/taskcluster_base_windows.py
            - builds/taskcluster_base_win64.py
        extra-config:
            stage_platform: win64-aarch64
            mozconfig_platform: win64-aarch64
    fetches:
        toolchain:
            - win64-clang-cl
            - win64-rust
            - win64-rust-size
            - win64-cbindgen
            - win64-sccache
            - win64-nasm
            - win64-node

Example

Bug 1503366

Checklist

  • has a meta bug been created in Bugzilla, in the Firefox Build System :: Task Configuration component?
  • does ./mach taskgraph full succeed?
  • does build successfully complete on Tryserver?
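
To check the last item, the new build can be pushed to Tryserver on its own. A sketch (the fuzzy query is illustrative and should match the build label chosen above):

    # --full exposes the complete task set to the fuzzy selector.
    ./mach try fuzzy --full -q "'build-win64-aarch64"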

Worker configuration

This step may have already been performed by other teams (e.g. RelEng), or may not be required at all (e.g. for an OS upgrade). If so, skip to the next step.

Once the build task has been successfully enabled, test workers must be defined.

There are several files that need to have the new platform added in order to satisfy the taskgraph algorithm:

  • taskcluster/taskgraph/transforms/tests.py
  • taskcluster/taskgraph/util/workertypes.py

Check and add the new platform details in the following categories (an illustrative sketch follows the list):

  • worker types
  • tiers
  • treeherder name translations
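
The exact contents of these files change over time, but the additions are typically small dictionary entries. A hypothetical sketch for taskcluster/taskgraph/util/workertypes.py (the dictionary name, worker type, and values are illustrative, not the actual identifiers):

    # Illustrative only: map the new worker type to its implementation and
    # operating system so the taskgraph can resolve tasks scheduled on it.
    WORKER_TYPES = {
        # ... existing entries ...
        'aws-provisioner-v1/gecko-t-win64-aarch64': ('generic-worker', 'windows'),
    }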

Example

Bug 1527469
Bug 1550826 - a more involved example

Checklist

  • does ./mach taskgraph full succeed?

Test configuration

Once worker configuration is complete, test configuration must be added for the new platform.

Several files must be modified to support running tests against the new platform:

  • taskcluster/ci/test/test-platforms.yml
  • taskcluster/ci/test/test-sets.yml

The following is an example platform where the full suite of desktop Firefox tests is run, and thus the list of defined tests is long. Depending on the nature of the platform, the list of tests will vary.

Example test-sets.yml:

macosx1014-64-tests:
    - cppunit
    - crashtest
    - firefox-ui-functional-local
    - firefox-ui-functional-remote
    - gtest
    - jittest
    - jsreftest
    - marionette
    - mochitest
    - mochitest-a11y
    - mochitest-browser-chrome
    - mochitest-chrome
    - mochitest-devtools-chrome
    - mochitest-devtools-webreplay
    - mochitest-gpu
    - mochitest-media
    - mochitest-remote
    - mochitest-webgl1-core
    - mochitest-webgl1-ext
    - mochitest-webgl2-core
    - reftest
    - telemetry-tests-client
    - test-verify
    - test-verify-gpu
    - test-verify-wpt
    - web-platform-tests
    - web-platform-tests-reftests
    - web-platform-tests-wdspec
    - xpcshell

Example test-platforms.yml:

macosx1014-64-shippable/opt:
    build-platform: macosx64-shippable/opt
    test-sets:
        - macosx1014-64-tests
        - macosx64-talos
        - desktop-screenshot-capture
        - awsy
        - raptor-chromium
        - raptor-firefox
        - raptor-profiling
        - marionette-media-tests
        - web-platform-tests-wdspec-headless

The exact list of tests will vary depending on the platform; it is not possible to run cppunit on Android, for example.

Note that:

  • the list of tests defined in test-sets.yml is referenced from test-platforms.yml
  • the build-platform attribute in test-platforms.yml should match the build name chosen in the previous step
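
A minimal illustrative pairing (platform and set names are placeholders):

    # test-sets.yml
    new-platform-tests:
        - xpcshell

    # test-platforms.yml
    new-platform/opt:
        build-platform: new-platform/opt    # must match the build name from the build step
        test-sets:
            - new-platform-tests            # must match the set defined in test-sets.yml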

Example

Bug 1527469
Bug 1550826

Checklist

  • does ./mach taskgraph full succeed with test sets and test platforms defined?
  • do the tests show up in the try fuzzy selector?

Obtain baseline

Once build and tests are enabled on Tryserver, it is time to run a baseline push to take inventory of suites that pass and fail.

Using ./mach try fuzzy --no-artifact --rebuild 10, push the tests belonging to the new platform to the Tryserver. Do not use artifact builds for the baseline push, as artifact builds affect some test results beyond the compiled tests (such as cppunit).
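
For example (the fuzzy query is illustrative):

    # Full (non-artifact) builds, with each selected task run 10 times to
    # expose intermittent failures.
    ./mach try fuzzy --no-artifact --rebuild 10 -q "'test-macosx1014"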

Once the checklist is complete, create a diff on Phabricator and have it reviewed by:

  • intermittent-reviewers
  • triage owner or test owner

Example

Debian 10

Checklist

  • does the build succeed?
  • are all tests scheduled?
  • how many tests abort/retry?
  • how many tests pass?
  • how many tests fail?
  • what test chunks appear intermittent?
  • for failed tests, is it due to platform configuration issues or legitimate failures?

Green up tests

Once the build and tests are running on Tryserver, it is time to begin greening the tests. This task typically consumes the most time, and there are subtle nuances in how it should be handled.

For this step, it is possible to have several engineers each tackle a selected suite/subsuite:

  • engineer A
    • mochitest-browser-chrome
    • gtest
  • engineer B
    • web-platform-tests
    • reftests

Make sure to create meta bugs for each broad category in Bugzilla. A typical breakdown:

  • platform meta bug
    • xpcshell
    • mochitest
    • compiled tests
    • reftest/crashtest
    • web-platform-tests

and so on.

A useful reference is the Debian 10 meta bug.

Reporting failures

Use Treeherder where possible when logging failures; it makes the appropriate backend connections automatically and includes the debug logs where available.

There are some nuances to consider when logging failures; notably, has this bug been reported before? This is critical, as everyone should strive to keep the number of duplicate bugs in the system to a minimum.

Example workflows in different situations will be given below.

Example (with a previously logged bug that matches the description)

  • click on the chunk with failures and wait for the Failure Summary pane to open
  • open an instance of the run logs in a new tab/window
  • the nested list below the failure shows the bugs that Treeherder thinks match the error description, based on the bug title
  • find the bug that best matches the error from the current run; quickly check the logs in both the current run and the bug to ensure they describe the same issue
  • if satisfied, click on the pushpin icon
  • in the new pane that appears above the Failure Summary, click Save

Example (with a previously logged bug that does not match the description)

  • follow the same steps as above, up to the second-to-last step
  • since none of the bugs match the description, click on the bug icon beside the error description
  • a new overlay titled Intermittent Bug Filer will appear

Checklist

  • meta bugs created in Bugzilla?