Auto-tools/Projects/Stockwell/Isolation

From MozillaWiki

Stockwell Test Isolation Experiment (draft)

The experiment is open to new Test Isolation runs.

Test Isolation is an experiment to determine if it is possible to diagnose new intermittent test failures by:

  1. cloning the original test job
  2. cloning the original test job for each directory containing test failures
  3. cloning the original test job for each test which failed

The Test Isolation experiment will run from June 17, 2019 through June 30, 2019 or until 100-200 failures have been retriggered.

Supported Test Frameworks

Test Isolation supports the following test frameworks:

  • crashtest
  • mochitest
  • reftest
  • xpcshell

on the following repositories:

  • autoland
  • mozilla-central
  • mozilla-inbound
  • try

When and How to run Test Isolation

During the experiment, Sheriffs should run Isolation Tests after new bugs are filed for test failures of supported test frameworks.

Android emulators, linux, linux64, windows7 and windows10 (but not windows10 AArch64) should be preferred over macosx or android-hw because of capacity constraints on the physical hardware used for macosx and android-hw.

  1. After filing a new bug for a test failure, check that the test framework supports Test Isolation, that a test path is associated with the test failure in the Failure Summary, and that isolation tests should be run on the platform. If all three hold, open the bug, enter [test isolation] in the whiteboard and save the bug.
  2. Select the failing test job in Treeherder.
  3. Click the "hamburger" menu in the Job Details Action bar.
  4. Click "Run Isolation Tests".
  5. Enter the number of times to run the test in the dialog box and click OK. We recommend 100 iterations for intermittent failures.
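
The checks in step 1 can be sketched as a small predicate. This is only an illustration: the function names and the platform-name prefixes (for example, a platform string containing "aarch64") are assumptions and are not part of Treeherder or Bugzilla.

```python
# Frameworks listed under "Supported Test Frameworks".
SUPPORTED_FRAMEWORKS = {"crashtest", "mochitest", "reftest", "xpcshell"}

# Hypothetical platform-name prefixes for hardware with limited capacity.
CONSTRAINED_PREFIXES = ("macosx", "android-hw")


def is_preferred_platform(platform):
    """Return True if the platform is one where isolation runs are preferred."""
    name = platform.lower()
    if name.startswith(CONSTRAINED_PREFIXES):
        return False
    # windows10 AArch64 is excluded; matching on "aarch64" is an assumption.
    if name.startswith("windows10") and "aarch64" in name:
        return False
    return True


def should_tag_for_isolation(framework, platform, test_path):
    """Apply the three checks from step 1 before adding [test isolation]."""
    return (
        framework in SUPPORTED_FRAMEWORKS
        and bool(test_path)
        and is_preferred_platform(platform)
    )
```

For instance, a mochitest failure on linux64 with a known test path would qualify, while the same failure on a macosx platform would not.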

Once you have started the Test Isolation job, a decision task will be created in the Gecko Decision Task. When it runs, it will create new Tier 3 jobs as follows:

  • The new tests will be placed in a new Job Group Symbol created from the original Job Group Symbol by adding the suffix -I.
  • Clones of the original test job will be created in the new Job Group using the original Job Symbol as Tier 3 jobs.
  • Clones of the original test job restricted to each of the failing test directories will be created as Tier 3 jobs in the new Job Group using the original Job Symbol with suffix -id.
  • Clones of the original test job restricted to each of the failing test paths will be created as Tier 3 jobs in the new Job Group using the original Job Symbol with suffix -it.

For example, the M-spi(mda3) job has one failing test path. Running Test Isolation on it with 100 iterations created 100 mda3 jobs, 100 mda3-id jobs and 100 mda3-it jobs in the M-spi-I[tier 3] job group.

If the failing job does not have a valid test path for the failure, Test Isolation will still clone the original job the specified number of times but it will not create clones for the directories (-id) or individual tests (-it).
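
The naming scheme can be illustrated with a short sketch that plans the cloned jobs, using -id for directories and -it for individual tests as in the paragraph above. The IsolationJob type, the plan_isolation_jobs function, and the example test path are hypothetical; the real jobs are created by the decision task, not by this code.

```python
import posixpath
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IsolationJob:
    group_symbol: str          # e.g. "M-spi-I"
    job_symbol: str            # e.g. "mda3", "mda3-id", "mda3-it"
    test_path: Optional[str]   # None means an unrestricted clone


def plan_isolation_jobs(group_symbol, job_symbol, failing_paths, iterations):
    # type: (str, str, List[str], int) -> List[IsolationJob]
    """Return the jobs Test Isolation would create, per the rules above."""
    group = group_symbol + "-I"
    # Plain clones of the original job are always created.
    targets = [(job_symbol, None)]
    # One -id target per distinct directory containing a failure.
    dirs = sorted({posixpath.dirname(p) for p in failing_paths})
    targets += [(job_symbol + "-id", d) for d in dirs]
    # One -it target per distinct failing test.
    targets += [(job_symbol + "-it", p) for p in sorted(set(failing_paths))]
    return [
        IsolationJob(group, symbol, path)
        for symbol, path in targets
        for _ in range(iterations)
    ]


# Hypothetical failing path for the M-spi(mda3) example.
jobs = plan_isolation_jobs("M-spi", "mda3", ["dom/media/test/test_example.html"], 100)
counts = Counter(job.job_symbol for job in jobs)
```

With one failing path and 100 iterations, counts holds 100 entries each for mda3, mda3-id and mda3-it, all in the M-spi-I group. With an empty list of failing paths, only the plain clones are planned.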

Bugzilla Triage

Bob Clary will triage the Test Isolation results and classify them using [test isolation-] for an isolation test that was not useful, [test isolation+] for a useful isolation test and [test isolation?] for an isolation test of questionable usefulness.

Limitations

Test Isolation determines which tests have failed by inspecting the error summary log for the test job. If the test job does not produce an error summary log or does not record the path of the test failure in the error summary log, it is not possible to determine the failing tests.
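
As a rough sketch of this inspection, the function below collects failing test paths from an error summary log. It assumes the log holds one JSON object per line and that failures are records with "action" set to "test_result", a "test" path, and differing "status" and "expected" values. These field names are assumptions for illustration and may not match the actual log format.

```python
import json


def failing_test_paths(lines):
    """Return the distinct test paths of unexpected results, in log order."""
    paths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON.
            continue
        if not isinstance(record, dict):
            continue
        if record.get("action") != "test_result":
            continue
        test = record.get("test")
        # Records without a test path cannot be isolated.
        if not test:
            continue
        if record.get("status") != record.get("expected") and test not in paths:
            paths.append(test)
    return paths
```

An empty result corresponds to the case described above: without a recorded path, the failing tests cannot be determined.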

When creating the test isolation jobs, the environment variable MOZHARNESS_TEST_PATHS is used to restrict the test run to a specific directory or individual test. If the test suite does not support MOZHARNESS_TEST_PATHS, then it is not possible to use Test Isolation.
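
The following sketch shows one way a cloned job's environment might be restricted. The value format used here, a JSON-encoded mapping from a suite name to a list of paths, is an assumption made for illustration and may differ from what the harness expects. The function name restrict_to_test_path is likewise hypothetical.

```python
import json


def restrict_to_test_path(env, suite, test_path):
    """Return a copy of env with MOZHARNESS_TEST_PATHS set for one path.

    The JSON mapping format is an assumption, not a documented contract.
    """
    restricted = dict(env)  # leave the original job's environment untouched
    restricted["MOZHARNESS_TEST_PATHS"] = json.dumps({suite: [test_path]})
    return restricted
```

Copying the environment rather than mutating it keeps the original job definition available for the unrestricted clones.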

Known Issues

  • web-platform-tests-reftests (Wr) can be invoked via Test Isolation but will not handle directory tests (-id) or individual tests (-it) due to the difference between wpt test ids and test paths.
  • Leak failures do not generate test paths.
  • ASAN errors do not generate test paths.
  • Failures of the form TEST-UNEXPECTED-TIMEOUT | ... | application timed out after 370 seconds with no output do not appear in the error summary log, but failures of the form TEST-UNEXPECTED-FAIL | ... | Test timed out! do.
  • Reftests on Android will fail with "Could not find manifest ..." and will result in a "[taskcluster:error] Task timeout after 7200 seconds. Force killing container." error.

Contacts

  • Bob Clary irc: bc
  • Joel Maher irc: jmaher