Sheriffing/How To/Intermittent bugs

From MozillaWiki

When you find a test that fails sometimes, you've hit an intermittent failure. This is the most annoying class of failures for both sheriffs and developers because it is not necessarily related to the code under test, but more likely indicates that the test itself might need to change to improve stability.

The test failure may have happened before and a bug may already be on file. In such cases, Treeherder should suggest the bug number and title under the failure, e.g.:

Treeherder suggestions

About ADB failures on Android

adb is a program that lets an external computer control a phone, install software on it, and so on. Its failures fall into these classes:

  • ADBTimeoutError: The command didn't finish for an unknown reason, e.g. a connection issue.
  • ADBError, ADBProcessError: One of the commands the phone was supposed to run failed.

Each distinct failure needs its own bug. For example,

Intermittent raise ADBError("ADBDevice.__init__: ls could not be found")

means the program "ls" was not found. If it were a different program, a new bug should be filed.
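To illustrate why each ADB failure needs its own bug, a failure line can be reduced to the pair (exception type, message); two failures belong in the same bug only if both parts agree. This is a minimal sketch: the helper name `adb_failure_key` and the regular expression are assumptions made for illustration and are not part of Treeherder or any Mozilla tool.

```python
import re

# Assumed pattern: an exception name that starts with "ADB" and ends with
# "Error", followed by its message, optionally wrapped in (" ... ").
_ADB_FAILURE = re.compile(r"\b(ADB\w*Error)\b[\s:(\"']*(.*?)[\"')\s]*$")


def adb_failure_key(failure_line):
    """Return (exception_type, message) for an ADB failure line, or None."""
    match = _ADB_FAILURE.search(failure_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


line = 'Intermittent raise ADBError("ADBDevice.__init__: ls could not be found")'
print(adb_failure_key(line))
```

If the message named a different program than "ls", the key would differ, which corresponds to filing a separate bug.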

TV (test-verify) failure to test suite mapping

The Test-Verify job checks whether added or modified test files fail by running them many times, both sequentially and in parallel, in one or more sessions. To check whether they also fail in their normal test suite, that suite has to be identified first. Search the log of the TV job for the test name to find a line such as:

Per-test run found test toolkit/mozapps/extensions/test/browser/browser_webapi_access.js (mochitest-browser-chrome/None)

The test belongs to mochitest browser-chrome - the bc tasks on Treeherder.
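The lookup can also be scripted once the log has been downloaded. The sketch below assumes the log line has exactly the format of the example above; the function name `suite_for_test` is hypothetical, and treating the part after the slash as a subsuite is an assumption.

```python
import re

# Format copied from the example log line; other formats are not handled.
_PER_TEST_LINE = re.compile(r"Per-test run found test (\S+) \(([^/)]+)/([^)]+)\)")


def suite_for_test(log_text, test_name):
    """Return (suite, subsuite) for the test, or None if no line matches.

    The text after the slash is assumed to be a subsuite; the literal
    string "None" is converted to Python's None.
    """
    for match in _PER_TEST_LINE.finditer(log_text):
        path, suite, subsuite = match.groups()
        if path.endswith(test_name):
            return suite, (None if subsuite == "None" else subsuite)
    return None


log = (
    "Per-test run found test "
    "toolkit/mozapps/extensions/test/browser/browser_webapi_access.js "
    "(mochitest-browser-chrome/None)\n"
)
print(suite_for_test(log, "browser_webapi_access.js"))
```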

Check for intermittency

Don't retrigger or backfill build jobs or Gecko decision tasks, only tests. If you are not sure whether a failure line is a new frequent failure, or even a perma-failure that requires backing out the change which causes it, you can request more runs of the job:

  • On the same push ("retrigger") by pressing the r key or clicking the button (circular arrow).
  • On previous pushes by opening the actions menu in the bottom toolbar ("...") and calling "Backfill". This will run the job on the 5 previous pushes, regardless of whether it already ran for those pushes. If the job depends on a build which is missing, that build will be generated before the test runs.
    • A quicker way - if supported - to check if this fails frequently is test-verify backfill.
      1. Get the full path and name of the failing test, e.g. with dxr.mozilla.org, and copy it to the clipboard.
      2. From the action menu at the bottom ("..."), select 'Custom action'
        • Set "inclusive" to true
        • Set "depth" to 1 (to run it on the current push and only the one before).
        • Add "testPath:" and the test path and name from the clipboard.
      3. The test will be run multiple times in a job TV-bf. If it fails on the later push but passes on the previous one, that is a strong indicator that the failure is related to the changes in the push with the TV-bf failure.

There is also a dashboard to check for permafailing tasks.

General failure messages - deciding if new bug needed

Sometimes CI jobs only provide a general failure message, e.g. bug 1411358: "Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container."

If a job type that did not previously fail with such a general failure message starts to do so, and the bug for the general message is not new, create a new bug specifically for that job type and failure message.

Example: Btup builds started to also fail intermittently with the message from bug 1411358. The logs for these jobs showed no output before the timeout was hit, often for more than 40 minutes. Bug 1480494 was created, and because its scope was limited to that build type, investigation by developers started quickly.

The jobs classified as bug 1411358 - sort by "Test Suite" and look for Test Suite "opt" - showed the issue started on August 3rd while there had been many similar failure messages for other job types already before that.
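The rule of thumb above can be sketched as a small check over the classifications of a general bug. Everything here is hypothetical: the function name, the job type labels and the dates are made up for illustration, and the cut-off date for "recently started" is a judgement call rather than a documented threshold.

```python
from datetime import date


def job_types_needing_own_bug(classifications, bug_filed_on, recent_since):
    """Return job types whose first failure is on or after recent_since.

    classifications: iterable of (job_type, failure_date) pairs that were
    classified against one general bug. An empty list is returned when the
    general bug itself was filed on or after recent_since.
    """
    if bug_filed_on >= recent_since:
        return []  # the general bug is new, so there is nothing to split off yet
    first_seen = {}
    for job_type, failed_on in classifications:
        if job_type not in first_seen or failed_on < first_seen[job_type]:
            first_seen[job_type] = failed_on
    return sorted(jt for jt, seen in first_seen.items() if seen >= recent_since)


# Hypothetical data: job type "B" only started failing recently.
history = [
    ("A", date(2020, 1, 10)),
    ("A", date(2020, 8, 4)),
    ("B", date(2020, 8, 3)),
    ("B", date(2020, 8, 5)),
]
print(job_types_needing_own_bug(history, date(2019, 10, 1), date(2020, 8, 1)))
```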

If a task fails with a message mentioning "ADBError" or something which starts with "ADB" and ends with "Error", it is a failure reported to the controlling computer by the phone which runs the task. Each failure message needs its own bug (e.g. for "init failed", "failed to create directory" etc.)

How to file a bug for an intermittent failure

If there's no bug on file, you'll need to file one.

Bugfiler

Clicking the little bug icon beside a failure opens a tool called "bugfiler" that automates most of the manual steps. You still have to open the log, copy the relevant lines for the failure (e.g. from the last TEST-PASS to the failure, or including the stack trace below the failure) and paste them into Treeherder's bug filing form.

The bug must meet two requirements so that Treeherder can automatically suggest it when this intermittent failure happens again:

  1. In the summary: Intermittent test_file test failure
  2. In the Keyword field choose the keyword: intermittent-failure

Manually filing a bug

Let's imagine there are issues with bugfiler or you can't use it for other reasons (e.g. a security-sensitive bug which should not be public), and you have a test failure like TEST-UNEXPECTED-TIMEOUT | /navigation-timing/test_timing_xserver_redirect.html | expected OK in Treeherder with no bug on file for it.

  1. Open the Treeherder Log
  2. Log in to Bugzilla in a different tab/window
  3. Find the Product/Component where you need to file this bug (note: dxr and hg.mozilla.org can be very helpful if you are in doubt)
    1. Copy the file path from the failure line /navigation-timing/test_timing_xserver_redirect.html
    2. Find it in the repository, either with the search term 'path:/navigation-timing/test_timing_xserver_redirect.html' on DXR or '/navigation-timing/test_timing_xserver_redirect.html' in the right path filter field of searchfox. If you don't find anything, then there are still folders from outside the source folder in the path. Delete everything e.g. up to 'gecko' or 'build' and try again.
    3. Copy the full folder and file path, e.g. testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html
    4. In a console in the mozilla-unified folder, run the following command to get the Bugzilla product and component in which bugs related to the file should be filed:
        ./mach file-info bugzilla-component testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html
      In this case, we get:
        Core :: DOM
          testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html
    5. If the failure is a real crash (not a crash report generated because the test execution hung and the application eventually had to be shut down), try to identify the component from the crash:
      1. Check if the crash signature itself discloses where the bug belongs, e.g. [@ webrtc::MouseCursorMonitorX11::CaptureCursor()] goes into Core :: WebRTC.
      2. In case it's not obvious from the crash signature, check the crashing thread. The files mentioned in the first stack frames (the lowest numbers) can belong to the crash handling itself (e.g. contain "report" or "panic"). Skip those. Once you find a frame which looks like "real" code, look up the component as described above. Example:
 Thread 24 (crashed)
 0  libxul.so!GeckoCrash [nsAppRunner.cpp:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 5076 + 0x15]
 1  libxul.so!gkrust_shared::panic_hook [lib.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 241 + 0x9]
 2  libxul.so!core::ops::function::Fn::call [function.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 69 + 0x9]
 3  libxul.so!rust_panic_with_hook [panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 482 + 0x6]
 4  libxul.so!std::panicking::begin_panic [panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 412 + 0x1e]
 5  libxul.so!webrender::profiler::TimeProfileCounter::profile [profiler.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 282 + 0xaa]

      This is a Webrender failure and belongs in Core :: Graphics: Webrender.
    6. If it's unknown which product and component the bug belongs in, put it in Firefox :: Untriaged and developers will take a look at it.
  4. Copy the failure text from the log window into the bug
  5. Set the Summary as: Intermittent navigation-timing/test_timing_xserver_redirect.html | expected OK
  6. In the keyword field choose intermittent-failure
  7. Submit the bug
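The frame-skipping advice for crashes can be sketched as follows. The markers "report" and "panic" come from the text above; "crash", "core::" and "std::" are assumptions added so that the example stack resolves to the Webrender frame, and the frame regular expression is likewise an assumption based only on that example. Treat the result as a hint, not a verdict.

```python
import re

# Substrings assumed to mark crash-handling or language-runtime frames.
HANDLER_MARKERS = ("report", "panic", "crash", "core::", "std::")

# Frame format as in the example: " 5  libxul.so!function [file.rs:rev : line + offset]"
_FRAME = re.compile(r"^\s*(\d+)\s+\S+?!(\S+)\s+\[([^:\]]+)")


def first_real_frame(stack_lines):
    """Return (function, source_file) of the first frame without a marker, or None."""
    for line in stack_lines:
        match = _FRAME.match(line)
        if match is None:
            continue  # e.g. the "Thread 24 (crashed)" header
        function, source_file = match.group(2), match.group(3)
        haystack = f"{function} {source_file}".lower()
        if not any(marker in haystack for marker in HANDLER_MARKERS):
            return function, source_file
    return None


stack = [
    "Thread 24 (crashed)",
    " 0  libxul.so!GeckoCrash [nsAppRunner.cpp:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 5076 + 0x15]",
    " 1  libxul.so!gkrust_shared::panic_hook [lib.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 241 + 0x9]",
    " 5  libxul.so!webrender::profiler::TimeProfileCounter::profile [profiler.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 282 + 0xaa]",
]
print(first_real_frame(stack))
```

The returned source file can then be looked up in the repository and passed to ./mach file-info as in the non-crash case.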

The bug should look like https://bugzilla.mozilla.org/show_bug.cgi?id=1172135
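For reference, the summary used in the manual steps can be derived mechanically from the Treeherder failure line. This is a minimal sketch; the function name `summary_from_failure_line` is hypothetical, and it assumes the failure line has the "STATUS | test path | message" shape of the example.

```python
def summary_from_failure_line(failure_line):
    """Build an "Intermittent ..." bug summary from a "STATUS | path | message" line."""
    parts = [part.strip() for part in failure_line.split("|")]
    if len(parts) < 3:
        raise ValueError("expected 'STATUS | test path | message'")
    test_path = parts[1].lstrip("/")   # drop the leading slash
    message = " | ".join(parts[2:])    # keep any further '|' separated text
    return f"Intermittent {test_path} | {message}"


line = ("TEST-UNEXPECTED-TIMEOUT | "
        "/navigation-timing/test_timing_xserver_redirect.html | expected OK")
print(summary_from_failure_line(line))
```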

Treeherder syncs with Bugzilla several times a day. Once your bug is added and the systems sync, Treeherder will suggest your new bug as a match for the next intermittent failure of this type.

Machine-specific failures

Machines can get into a bad state or be in one from the start (e.g. bad memory). Such a machine fails all tests, or just more tests than usual, often within the same test type.

webgl ("gl") and reftests ("R") might fail because of dead pixels which can be far away from any content that gets rendered. In the following zoomed out example, the red rectangle is at the bottom and is a dead pixel which causes the test to fail.

reftest analyzer with highlighted dead pixel outside of area with content created for testing

Terminate the machine if you discover such an issue.

How to file a security bug

When we see failures which contain “use-after-poison” in the log, it usually means that we have to file a security bug. Security bugs are not visible unless you are on the CC list.

Failure example:

Sanitizer failure.png

NOTE: “SEGV on unknown address 0x000000000000” failures don’t require a security bug.

In the example above, the bug should be filed for the second failure line: “SUMMARY: AddressSanitizer: use-after-poison (...)”
Examples of failures to file as security bugs:

  • access-violation
  • bad-malloc_usable_size
  • global-buffer-overflow
  • heap-buffer-overflow
  • heap-use-after-free
  • stack-buffer-underflow
  • use-after-poison
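As a rough aid, a log line can be checked against this list. The sketch below assumes the failure type is the first word after "AddressSanitizer: ", as in the example summary line; the function name `needs_security_bug` is hypothetical, and the list above is a set of examples, so a negative result does not prove that a failure is harmless.

```python
import re

# Failure types taken from the list above (examples, not exhaustive).
SECURITY_SENSITIVE = frozenset({
    "access-violation",
    "bad-malloc_usable_size",
    "global-buffer-overflow",
    "heap-buffer-overflow",
    "heap-use-after-free",
    "stack-buffer-underflow",
    "use-after-poison",
})

# Assumed format: "... AddressSanitizer: <failure-type> ..."
_ASAN_TYPE = re.compile(r"AddressSanitizer: (\S+)")


def needs_security_bug(log_line):
    """Return True if the line names one of the listed sanitizer failure types."""
    match = _ASAN_TYPE.search(log_line)
    return match is not None and match.group(1) in SECURITY_SENSITIVE


print(needs_security_bug(
    "SUMMARY: AddressSanitizer: use-after-poison "
    "/builds/worker/workspace/build/src/layout/generic/nsIFrame.h:4139:35 in IsFrameModified"
))
```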

The bug should be filed manually from Bugzilla, and not from Treeherder. How to file such a bug:

  1. Access Bugzilla (https://bugzilla.mozilla.org/enter_bug.cgi) and search for the relevant Component, in this case Core :: Layout.
  2. Go to the bottom of the page and check the box: “Many users could be harmed by this security problem: it should be kept hidden from the public until it is resolved”
  3. For the Summary, write “Intermittent” + “second failure line”, in this case: “Intermittent SUMMARY: AddressSanitizer: use-after-poison /builds/worker/workspace/build/src/layout/generic/nsIFrame.h:4139:35 in IsFrameModified”
  4. Select "Show Advanced Fields" and add “intermittent-failure” as Keyword
  5. In the Description field, add the log file’s URL and the relevant part of the log file
  6. Submit the bug

NOTE: As with most things at Mozilla, this is a judgement call: when you encounter a security failure you can either file a bug or do a backout. Intermittent security bugs can be hard to tackle, so a backout can have a much more satisfactory outcome. In that case the normal process applies: retrigger until you find the culprit, then back out the revision which started the issue.

Note: If you need to leave a security bug for the next shift for a follow up, make sure to add one member of that shift on the CC list.

Infrastructure related issues

  • Treeherder only displays jobs and allows retriggering them.
  • Taskcluster runs them.
  • The task generation is stored in Firefox Build System :: Task Configuration.

If it is necessary to check whether an issue is related to Treeherder or Taskcluster, open an affected job by selecting it and then clicking on the task link at the bottom left. If the job shows as expected on the Taskcluster page, Treeherder might not be receiving the (correct) data and the bug should be filed against Treeherder; otherwise, file it against Taskcluster.