New test environments
- 1 Overview
- 2 Task Overview
- 3 Preparation work
- 4 Enable platform on taskgraph
- 5 Green up tests
- 5.1 Reporting failures
- 5.2 Annotate failing tests
- 5.3 Migrate over a test suite
From time to time, the need arises to upgrade the underlying operating system of a platform, typically in step with new major releases of the operating systems that form part of the CI infrastructure.
For instance, as of 2019-08-07, all Firefox builds for Linux are executed in Ubuntu 16.04.5 docker containers. In other words, the version of the Linux distribution used for testing is at least 2 major releases behind the likely dominant version on the market, which is Ubuntu 18.04.
Upgrading the underlying operating system version has, in the past, been considered a large undertaking, often taking upwards of 6 months. This causes a chicken-and-egg problem: regular upgrades do not occur because of the perceived amount of work, which in turn causes the number of issues to multiply once the upgrade is finally tackled.
The aim of this document is to establish a standardized process that anyone in Mozilla engineering can use to perform operating-system upgrades.
Broadly speaking, the following discrete phases are involved when adding new platforms.
- preparation work - hardware
- responsibility: Release Engineering
- preparation work - docker/virtual
- responsibility: CI-A
- enable on taskgraph/tryserver
- run test suites on tryserver
all tasks below can be parallelized
- begin greening process - file bugs
- responsibility: CI-A
- task checklist: bugs checklist
- address issues with test case/platform
- responsibility: developers
- create, review and land migration patches
- responsibility: CI-A
- task checklist: patch checklist
The underlying test environment must first be set up.
At Mozilla there are two different machine types, both self-explanatory: hardware and virtual (docker).
Both require different approaches.
The availability of test hardware is outside of CI-A's (and therefore your) control. With some exceptions, Mozilla's server farms are managed by the Release Engineering team, which works in concert with CI-A to develop a plan to upgrade and maintain test hardware.
For instance, the OS upgrade from macosx1010 to macosx1014 was handled by :dividehex. See Bug 1530474.
With some exceptions, CI-A is not particularly involved in this process.
Virtual test environments, by contrast, are within the scope of CI-A's work.
Both examples involve the creation of new docker images that run on AWS instances.
Enable platform on taskgraph
The first real step of any new test environment is to enable the test platform on Tryserver.
At the bare minimum, ensure the taskgraph is sound with each step. This can be verified using
./mach taskgraph full -v
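Because the check is meant to be repeated after each step, it can help to wrap it in a small helper. The following is a minimal sketch, assuming it is run from the root of a source checkout and that a zero exit status means the graph generated cleanly; the helper name `taskgraph_is_sound` and its `run` parameter are made up for illustration.

```python
import subprocess
from typing import Callable, Sequence

# The command quoted in the text above.
TASKGRAPH_CMD = ["./mach", "taskgraph", "full", "-v"]


def taskgraph_is_sound(
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> bool:
    """Return True when the taskgraph command exits with status 0.

    Assumption: exit status 0 means the full graph was generated cleanly.
    `run` is injectable so the check can be exercised without a checkout.
    """
    return run(TASKGRAPH_CMD) == 0
```

Calling `taskgraph_is_sound()` with no arguments runs the real command; passing a stub such as `lambda cmd: 0` exercises the logic alone.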
This step may have already been performed by other teams (e.g. Releng). If so, skip to the next step.
First, the platform must have builds enabled before tests can be run.
In the taskcluster/ci/build directory, edit the appropriate YAML file for the platform. For example, if adding a new Windows build type, edit the YAML file that holds the Windows build definitions.
Define all of the required attributes, using existing configurations as a template.
description: "AArch64 Win64 Opt"
index:
    product: firefox
    job-name: win64-aarch64-opt
attributes:
    enable-full-crashsymbols: true
treeherder:
    platform: windows2012-aarch64/opt
    symbol: B
    tier: 1
worker-type: b-win2012
worker:
    max-run-time: 7200
    env:
        TOOLTOOL_MANIFEST: "browser/config/tooltool-manifests/win64/aarch64.manifest"
        PERFHERDER_EXTRA_OPTIONS: aarch64
run:
    actions: [get-secrets, build]
    options: [append-env-variables-from-configs]
    script: mozharness/scripts/fx_desktop_build.py
    secrets: true
    config:
        - builds/releng_base_firefox.py
        - builds/taskcluster_base_windows.py
        - builds/taskcluster_base_win64.py
    extra-config:
        stage_platform: win64-aarch64
        mozconfig_platform: win64-aarch64
fetches:
    toolchain:
        - win64-clang-cl
        - win64-rust
        - win64-rust-size
        - win64-cbindgen
        - win64-sccache
        - win64-nasm
        - win64-node
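When copying an existing configuration as a template, a quick comparison of top-level keys can catch an attribute that was dropped along the way. The sketch below is illustrative only: the `EXPECTED_KEYS` set is simply the list of top-level keys seen in the example definition above, and treating it as a checklist is an assumption rather than a taskgraph rule. The `draft` dictionary stands in for a parsed YAML definition.

```python
# Top-level keys that appear in the example definition above. Treating them
# as a checklist is an assumption of this sketch, not a rule taken from taskgraph.
EXPECTED_KEYS = {
    "description", "index", "attributes", "treeherder",
    "worker-type", "worker", "run", "fetches",
}


def missing_keys(job_definition: dict) -> list:
    """Return the expected top-level keys absent from a parsed build definition."""
    return sorted(EXPECTED_KEYS - set(job_definition))


# A deliberately incomplete definition, as it might look after parsing the YAML.
draft = {
    "description": "AArch64 Win64 Opt",
    "treeherder": {"platform": "windows2012-aarch64/opt", "symbol": "B", "tier": 1},
    "worker-type": "b-win2012",
}
print(missing_keys(draft))
```

The taskgraph command remains the authoritative check; this only narrows down what to look at when a definition was adapted by hand.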
- has a meta bug been created in Bugzilla in the Firefox Build System::Task Configuration component?
- does ./mach taskgraph full succeed?
- does the build successfully complete on Tryserver?
This step may have already been performed by other teams (e.g. Releng), or may not be required at all (e.g. for an OS upgrade). If so, skip to the next step.
Once the build task has been successfully enabled, test workers must be defined.
There are several files that need to have the new platform added in order to satisfy the taskgraph algorithm:
Check and add the new platform details in the following categories:
- worker types
- treeherder name translations
- does ./mach taskgraph full succeed?
Once worker configuration is complete, test configuration must be added for the new platform.
Several files must be modified to support running tests against the new platform:
The following is an example platform where the full suite of desktop Firefox tests is run, and thus the list of defined tests is long. Depending on the nature of the platform, the list of tests will vary.
- cppunit
- crashtest
- firefox-ui-functional-local
- firefox-ui-functional-remote
- gtest
- jittest
- jsreftest
- marionette
- mochitest
- mochitest-a11y
- mochitest-browser-chrome
- mochitest-chrome
- mochitest-devtools-chrome
- mochitest-devtools-webreplay
- mochitest-gpu
- mochitest-media
- mochitest-remote
- mochitest-webgl1-core
- mochitest-webgl1-ext
- mochitest-webgl2-core
- reftest
- telemetry-tests-client
- test-verify
- test-verify-gpu
- test-verify-wpt
- web-platform-tests
- web-platform-tests-reftests
- web-platform-tests-wdspec
- xpcshell
build-platform: macosx64-shippable/opt
test-sets:
    - macosx1014-64-tests
    - macosx64-talos
    - desktop-screenshot-capture
    - awsy
    - raptor-chromium
    - raptor-firefox
    - raptor-profiling
    - marionette-media-tests
    - web-platform-tests-wdspec-headless
The exact list of tests will vary depending on the platform. It is not possible to run cppunit on android platforms, for example.
- is the list of tests defined in test-sets.yml used in test-platforms.yml?
- does the build platform in test-platforms.yml match the build name chosen in the previous step?
- does ./mach taskgraph full succeed with test sets and test platforms defined?
- do the tests show up in the try fuzzy selector?
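The first checklist item, that every test set referenced by a platform is actually defined, can be checked mechanically once both files are parsed. Below is a minimal sketch working on plain dictionaries; the platform key `macosx1014-64-shippable/opt`, the contents of the `awsy` set, and the function name are assumptions chosen for illustration, and the data is trimmed down from the examples above.

```python
def undefined_test_sets(test_sets: dict, test_platforms: dict) -> dict:
    """Map each platform to the test sets it references that are not defined."""
    problems = {}
    for platform, config in test_platforms.items():
        missing = [
            name for name in config.get("test-sets", []) if name not in test_sets
        ]
        if missing:
            problems[platform] = missing
    return problems


# Trimmed-down stand-ins for the parsed test-sets.yml and test-platforms.yml.
test_sets = {
    "macosx1014-64-tests": ["cppunit", "crashtest", "gtest", "xpcshell"],
    "awsy": ["awsy"],
}
test_platforms = {
    "macosx1014-64-shippable/opt": {  # hypothetical platform key
        "build-platform": "macosx64-shippable/opt",
        "test-sets": ["macosx1014-64-tests", "awsy", "desktop-screenshot-capture"],
    },
}
print(undefined_test_sets(test_sets, test_platforms))
```

Here `desktop-screenshot-capture` is reported because the trimmed `test_sets` stand-in does not define it.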
Once build and tests are enabled on Tryserver, it is time to run a baseline push to take inventory of suites that pass and fail.
Using ./mach try fuzzy --no-artifact --rebuild 10, push tests belonging to the new platform to Tryserver. Do not use artifact builds for the baseline push, as some test results (outside of the compiled tests such as cppunit) are affected by the artifact build.
Once the checklist is complete, create a diff on Phabricator and have it reviewed by
- triage owner or test owner
- does the build succeed?
- are all tests scheduled?
- how many tests abort/retry?
- how many tests pass?
- how many tests fail?
- what test chunks appear intermittent?
- for failed tests, are the failures due to platform configuration issues or legitimate failures?
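With each task rebuilt 10 times, the inventory questions above reduce to looking at the spread of outcomes per chunk. The following sketch shows one way to tally a baseline push. It assumes each run has already been reduced to a "pass" or "fail" string (aborts and retries are not modelled), and the chunk names and category labels are invented for the example.

```python
from collections import Counter


def classify_chunk(results: list) -> str:
    """Classify one chunk from the outcomes of its repeated runs.

    `results` holds one "pass" or "fail" string per rebuild.
    """
    if not results:
        return "not scheduled"
    failures = results.count("fail")
    if failures == 0:
        return "green"
    if failures == len(results):
        return "permanent failure"
    return "intermittent"


def summarize(push: dict) -> Counter:
    """Count how many chunks fall into each category."""
    return Counter(classify_chunk(results) for results in push.values())


# Hypothetical outcomes for four chunks of a baseline push.
push = {
    "mochitest-1": ["pass"] * 10,
    "mochitest-2": ["pass"] * 7 + ["fail"] * 3,
    "xpcshell-1": ["fail"] * 10,
    "reftest-1": [],
}
summary = summarize(push)
print(dict(summary))
```

Chunks classified as permanent failures are the natural starting point for the platform-configuration question, while intermittent chunks usually need individual bugs.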
Green up tests
Once the build and tests are running on Tryserver, it is time to begin greening the tests. This task typically consumes the most time, and there are subtle nuances in how it should be handled.
For this step, it is possible to have several engineers each tackle a selected suite/subsuite:
- engineer A
- engineer B
Make sure to create meta bugs for each broad category in Bugzilla. A typical breakdown:
- platform meta bug
- compiled tests
- web-platform-tests (including web-platform variants of reftests/crashtests)
- miscellaneous tests
Mochitest is a particularly large suite and has many subsuites, so in practice a separate meta bug for each mochitest subsuite could be useful.
Reference: Ubuntu1804 meta bug.
Use Treeherder where possible when logging failures: this lets the backend connect the bug to the failure, and attaches the debug logs if available.
There are some nuances to consider when logging failures; notably, has this bug been reported before? Avoid filing duplicate bugs wherever possible.
Example workflows in different situations will be given below.
Example (with a previously logged bug that matches the description)
- click on the chunk with failures, and wait for Failure Summary pane to open
- open an instance of the run logs in a new tab/window
- the nested list below the failure shows the bugs that Treeherder thinks match the error description, based on the bug title.
- find the bug that best matches the error description from the current run. Quickly check the logs in both the current run and the bug to ensure it describes the same issue.
- if satisfied, click on the pushpin icon.
- in the new pane that pops above the Failure Summary, click on Save.
Example (no bugs match description or no bugs filed for particular failure)
- follow the same steps as above, until the second to last step.
- open an instance of the live_backing.log in a separate tab.
- since none of the bugs suit the description or no bugs have been filed for this particular failure, click on the bug icon beside the error description.
- a new overlay titled Intermittent Bug Filer will appear.
- use ./mach file-info bugzilla-component <file_path> to identify the bug component.
- prefix the bug title with an appropriate name for the platform.
- in the body, specify the platform, subsuite and paste the relevant portion of the log.
- fill in the see also bug as appropriate.
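The title and body conventions from the steps above can be captured in a small template so that every engineer files bugs in the same shape. This is only a sketch: the bracketed prefix, the field labels and the `draft_bug` helper are assumptions made for illustration, not a format required by Bugzilla or the Intermittent Bug Filer.

```python
def draft_bug(platform: str, subsuite: str, title: str, log_excerpt: str) -> dict:
    """Build a bug summary and description following the steps above.

    The "[platform]" prefix and the field labels are illustrative choices.
    """
    description = "\n".join(
        [
            f"Platform: {platform}",
            f"Subsuite: {subsuite}",
            "",
            "Relevant log:",
            log_excerpt.strip(),
        ]
    )
    return {"summary": f"[{platform}] {title}", "description": description}


# Placeholder values; the log line is invented for the example.
bug = draft_bug(
    platform="macosx1014",
    subsuite="mochitest-browser-chrome",
    title="Intermittent failure in an example test",
    log_excerpt="TEST-UNEXPECTED-FAIL | example log line\n",
)
print(bug["summary"])
print(bug["description"])
```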
Needinfo test owners
This is the most important part of the process.
Get the bug in front of the triage owner, feature owner or test developer and provide as much information as possible about the nature of the bug, reproduction steps, any extra commands or steps necessary, etc.
Set a needinfo request and ask them to chime in. If no response is received after some time, usually a week, add some activity to 'bump' the bug.
Depending on the migration timeframe, this may happen between 2 and 4 times.
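To keep track of when a bug is due for its next bump, the cadence described above can be turned into a list of dates. The sketch below assumes a fixed one-week interval and up to four bumps, which matches the upper end of the range given; the function name and defaults are illustrative.

```python
from datetime import date, timedelta


def bump_dates(needinfo_set: date, bumps: int = 4, interval_days: int = 7) -> list:
    """Return the dates on which a silent bug should be bumped.

    Assumes a fixed interval between bumps, starting from the needinfo date.
    """
    return [
        needinfo_set + timedelta(days=interval_days * n) for n in range(1, bumps + 1)
    ]


for due in bump_dates(date(2019, 8, 7)):
    print(due.isoformat())
```

For a needinfo set on 2019-08-07, this prints 2019-08-14, 2019-08-21, 2019-08-28 and 2019-09-04.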
- meta bugs created in Bugzilla?
- responsibilities for suites/subsuites/chunks divided amongst engineers (if multiple engineers assigned)?
- for each failure in a chunk, associate an existing bug or create a new bug.
- for each bug created, has triage owner been notified with needinfo describing the nature of the problem?
Annotate failing tests
It is not always possible for developers to address an issue, especially if the failure is deemed to be lower in priority.
Because the new test environment setup should be time-bounded, it often becomes necessary to disable the particular test and/or manifest file.
In the example above, it became necessary to disable certain tests that continued to fail on macosx1014 despite the test/feature owner being notified of the failure for several weeks.
Generally speaking, the following steps should be taken before a test is disabled:
- if a bug has not been filed, please do so.
- needinfo the triage/test/module owner.
- comment on the bug to maintain activity every 2-3 weeks, preferably with status update and newer logs.
If no movement occurs:
- annotate the test case with narrowest possible criteria.
- include comment that specifies platform shorthand and bug number.
- submit to Phabricator.
- submit a Try push with the proposed patch applied to verify the work.
Be sure to include in the review the test/module/triage owner and intermittent-reviewers group.
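As an illustration of "narrowest possible criteria" plus a comment carrying the platform shorthand and bug number, the sketch below assembles an annotation line for an ini-style test manifest. It assumes the manifest accepts a `skip-if` expression joined with `&&` and a trailing `#` comment. The specific clauses, the helper name and the bug number are placeholders, and other manifest formats (reftest lists, for example) use a different syntax.

```python
def skip_if_line(clauses: list, platform_shorthand: str, bug_number: int) -> str:
    """Build a skip-if annotation that only matches when every clause holds.

    Joining clauses with "&&" keeps the annotation as narrow as the clauses given.
    """
    condition = " && ".join(clauses)
    return f"skip-if = {condition} # {platform_shorthand}, Bug {bug_number}"


# Placeholder clauses and bug number, not taken from a real patch.
line = skip_if_line(
    clauses=['os == "mac"', 'os_version == "10.14"', "debug"],
    platform_shorthand="macosx1014",
    bug_number=1234567,
)
print(line)
```

Each added clause narrows the set of configurations on which the test is skipped, so the test keeps running everywhere it still passes.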
- has triage/test/module owner been needinfo'd?
- has consistent activity been reported on the bug for a few weeks?
- is the annotation using narrowest possible criteria?
- has a bug number been added to the comment for the annotation?
- has the owner been notified in the Phabricator patch?