DevTools/Intermittents: Difference between revisions

= Intermittent Test Failures =


While working on devtools, you will inevitably encounter an [...] documents some tips for finding and debugging these failures.


= Finding Intermittents =


Normally you will have no trouble finding out that a particular test is intermittent, because a bug will be filed and you will see it through the normal mechanisms.


However, it can still be useful to see intermittents in context.  The [[https://brasstacks.mozilla.com/orangefactor/ War on Oranges]] site shows intermittents ranked by frequency.  The orange factor robot also posts weekly updates to the relevant bugs in Bugzilla.


You can also see oranges in Bugzilla.  Go to [[https://bugzilla.mozilla.org/userprefs.cgi?tab=settings the settings page]] and enable "When viewing a bug, show its corresponding Orange Factor page".


= Reproducing Test Failures locally =


The first step to fixing an orange is to reproduce it.


If a test fails at a different place on each failure, it might be a timeout.  The current mochitest timeout is 45 seconds, so if successful runs of an intermittent take ~40 seconds, it might just be a real timeout.  This is particularly true if the failure is most often seen on the slower builds, for example Linux 32 debug.  In this case you can either split the test or call <code>requestLongerTimeout</code>.
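
The arithmetic behind that judgment can be sketched as follows. This is only an illustration: <code>suggestTimeoutFactor</code> and its <code>headroom</code> parameter are hypothetical names, not part of the test harness, and the sketch assumes that <code>requestLongerTimeout</code> takes an integer multiplier of the default timeout.

```javascript
// Default mochitest timeout quoted in the text above.
const DEFAULT_TIMEOUT_MS = 45 * 1000;

// Hypothetical helper (not part of the harness): returns the smallest integer
// factor such that a typical successful run uses at most `headroom` of the
// extended timeout budget.
function suggestTimeoutFactor(observedRunMs, headroom = 0.5) {
  const budgetPerFactorMs = DEFAULT_TIMEOUT_MS * headroom;
  return Math.max(1, Math.ceil(observedRunMs / budgetPerFactorMs));
}

// A test whose good runs take ~40 seconds sits too close to the 45 second limit.
console.log(suggestTimeoutFactor(40 * 1000)); // prints 2
```

With a result of 2, the test would call <code>requestLongerTimeout(2)</code> near its start, assuming splitting the test is not practical.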


Sometimes reproducing can only be done in automation, but it's worth trying locally, because this makes it much simpler to debug.


First, try running the test in isolation.  You can use the <code>--repeat</code> and <code>--run-until-failure</code> flags to <code>mach mochitest</code> to automate this a bit.  It's nice to do this sort of thing in a VM (or using Xnest on Linux) to avoid locking up your machine.  Mozilla provides an [[https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Using_the_VM easy-to-use VM]].


Sometimes, though, a test will only fail if it is run in conjunction with one or more other tests.  You can use the <code>--start-at</code> and <code>--end-at</code> flags to <code>mach mochitest</code> to run a group of tests together.


For some jobs, but not all, you can get an [[https://jonasfj.dk/2016/03/one-click-loaners-with-taskcluster/ interactive shell from TaskCluster]].


There's also a [[https://wiki.mozilla.org/Electrolysis/e10s_test_tips handy page of e10s test debugging tips]] that is worth a read.


Because intermittents are often caused by race conditions, it's sometimes useful to enable Chaos Mode.  This changes timings and event orderings a bit.  The simplest way to do this is to enable it in a specific test, by calling <code>SimpleTest.testInChaosMode</code>.  You can also set the <code>MOZ_CHAOSMODE</code> environment variable, or even hack <code>mfbt/ChaosMode.cpp</code> directly.


The amazing rr has [[http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html its own chaos mode]].  This can also sometimes reproduce a failure that isn't ordinarily reproducible.  While it's difficult to debug JS bugs using rr, often if you can reliably reproduce the failure you can at least experiment (see below) to attempt a fix.


= That Didn't Work =


You couldn't reproduce locally.  You feel doomed, but you are not.  You can fight on.


One useful approach is to add additional logging to the test, then push again.  Sometimes log buffering makes the output weird; you can add a call to <code>SimpleTest.requestCompleteLog()</code> to fix this.


You can run a single test on try using <code>mach try FILE</code>.  You can also use the <code>--rebuild</code> flag to retrigger test jobs multiple times, or you can do this easily from treeherder.


= Solving =


Once you've reproduced the failure, it's time to fix it.  Sometimes you can do this by reading the code or using your usual debugging techniques.  However, here are some useful hints about debugging intermittents specifically.


Sometimes the problem is a race at a specific spot in the test.  You can test this theory by adding a short wait to see if the failure goes away, like: <code>yield new Promise(r => setTimeout(r, 100));</code>.  See the <code>waitForTick</code> and <code>waitForTime</code> functions in <code>DevToolsUtils</code> for similar functionality.
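
As a rough illustration of why a short wait can confirm a race, here is a self-contained sketch that runs outside the test harness.  <code>createPanel</code> and <code>checkPanel</code> are hypothetical stand-ins for the code under test and the test itself, and <code>async</code>/<code>await</code> is used in place of <code>yield</code> so the snippet does not depend on the harness's task runner.

```javascript
// Hypothetical stand-in for code under test: it becomes ready only after a delay.
function createPanel(readyDelayMs) {
  const panel = { ready: false };
  setTimeout(() => { panel.ready = true; }, readyDelayMs);
  return panel;
}

// Same shape as the one-liner in the text, wrapped in a helper.
function wait(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Hypothetical "test": reports whether the panel was ready when it was inspected.
async function checkPanel(extraWaitMs) {
  const panel = createPanel(20);
  if (extraWaitMs > 0) {
    // The experimental wait: if the failure disappears, a race is likely.
    await wait(extraWaitMs);
  }
  return panel.ready;
}

checkPanel(0).then(ready => console.log("no wait:", ready));
checkPanel(100).then(ready => console.log("100ms wait:", ready));
```

Without the wait the check observes the panel before it is ready; with the wait it does not.  The wait is a diagnostic only: the actual fix is usually to wait for the specific event or state the test depends on.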


You can use a similar trick to "pause" the test at a certain point.  This is useful when debugging locally because it will leave Firefox open and responsive, at the specific spot you've chosen.  Do this using <code>yield new Promise(r => r);</code>.
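
The pause works because the executor <code>r => r</code> receives the resolve function but never calls it, so the promise stays pending indefinitely.  A minimal sketch that checks this outside the harness follows; <code>settlesWithin</code> is a hypothetical helper written only for this illustration.

```javascript
// Hypothetical helper: resolves to true if `promise` fulfills within `ms`
// milliseconds, and to false if it is still pending when the timer fires.
function settlesWithin(promise, ms) {
  const timer = new Promise(resolve => setTimeout(() => resolve(false), ms));
  return Promise.race([promise.then(() => true), timer]);
}

// The "pause" promise from the text: resolve is received but never invoked.
const pause = new Promise(r => r);

settlesWithin(pause, 50).then(settled => console.log("pause settled:", settled));
settlesWithin(Promise.resolve(), 50).then(settled => console.log("resolved settled:", settled));
```

Remember to remove the pause before pushing, since a test that waits on it can only end by timing out.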


<code>shared-head.js</code> also has some helpers, like <code>once</code>, to bind to events with additional logging.
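
To show the general idea of an event helper with logging, here is a simplified sketch.  It is not the actual <code>shared-head.js</code> implementation: the signature, the <code>log</code> parameter, and the use of the standard <code>EventTarget</code> interface are assumptions made so the example is self-contained.

```javascript
// Hypothetical, simplified take on a `once` helper: returns a promise that
// resolves with the event, and logs both when the wait starts and when it ends.
function once(target, eventName, log = console.log) {
  log(`Waiting for event "${eventName}"`);
  return new Promise(resolve => {
    target.addEventListener(eventName, event => {
      log(`Got event "${eventName}"`);
      resolve(event);
    }, { once: true });
  });
}

// Usage: the log lines show where the test is blocked if the event never fires.
const emitter = new EventTarget();
const onReady = once(emitter, "ready");
emitter.dispatchEvent(new Event("ready"));
onReady.then(event => console.log("resolved with event type:", event.type));
```

When an intermittent hangs, the last "Waiting for event" line without a matching "Got event" line points at the event that never arrived.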


You can also binary search the test by either commenting out chunks of it, or hacking in early <code>return</code>s.  You can do a bunch of these experiments in parallel without waiting for the first to complete.


= Verifying =


It's difficult to verify that an intermittent has truly been fixed.  One thing you can do is push to try, and then retrigger the job many times in treeherder.  Exactly how many times you should retrigger depends on the frequency of the failure.
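
One rough way to pick a number, assuming each run fails independently with the same probability, is to ask how many consecutive green runs would be unlikely if the bug were still present.  If the failure rate is p, then n passes in a row happen by chance with probability (1 - p)^n, so n should be at least ln(1 - c) / ln(1 - p) for confidence c.  The function name and the default confidence below are hypothetical choices for this sketch.

```javascript
// Hypothetical helper: number of consecutive green runs needed so that, if the
// test still failed at `failureRate`, seeing all of them pass would happen with
// probability at most (1 - confidence). Assumes independent runs.
function retriggersNeeded(failureRate, confidence = 0.95) {
  if (!(failureRate > 0 && failureRate < 1)) {
    throw new RangeError("failureRate must be between 0 and 1 (exclusive)");
  }
  if (!(confidence > 0 && confidence < 1)) {
    throw new RangeError("confidence must be between 0 and 1 (exclusive)");
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

console.log(retriggersNeeded(0.1));  // prints 29
console.log(retriggersNeeded(0.05)); // prints 59
```

So a test that fails about one run in ten needs roughly 29 green retriggers before a fix looks convincing at the 95% level.  Real failures often cluster on particular platforms or machines, so treat the result as a lower bound.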