Performance/Fenix/Performance reviews

From MozillaWiki

Do you want to know if your change impacts Fenix or Focus performance? If so, here are the methods you can use, in order of preference:

  1. Benchmark: use an automated test to measure the change in duration
  2. Timestamp benchmark: add temporary code and manually measure the change in duration. Practical for non-UI measurements or very simple UI measurements
  3. Profile: take a profile, identify the start and end points of your measurement, and measure the change in duration

We don't necessarily recommend these techniques though they have their place:

  1. Screen recording, side-by-side: take a screen recording of before and after your change, synchronize the videos, and put them side-by-side with timestamps using the perf-tools/combine-videos-side-by-side.sh script.

The trade-offs for each technique are mentioned in their respective section.

Benchmark locally

A benchmark is an automated test that measures performance, usually the duration from point A to point B. Automated benchmarks have similar trade-offs to automated functionality tests when compared to one-off manual testing: they can continuously catch regressions and minimize human error. For manual benchmarks in particular, it can be tricky to be consistent about how we aggregate each test run into the results. However, automated benchmarks are time-consuming and difficult to write, so sometimes it's better to perform manual tests.

Unfortunately, we don't yet support benchmarks in CI so you'll have to run them manually. Please use a low-end device.

To benchmark, do the following:

  1. Select a benchmark that measures your change or write a new one yourself
  2. Run the benchmark on the commit before your change
  3. Run the benchmark on the commit after your change
  4. Compare the results: generally, this means comparing the median
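
The comparison in step 4 can be sketched as follows. This is a minimal illustration and not part of perf-tools: the function name compare_medians and the sample durations are hypothetical, and it assumes you have already collected one list of durations (in milliseconds) per commit.

```python
from statistics import median


def compare_medians(before_ms, after_ms):
    """Return (median_before, median_after, percent_change) for two runs."""
    if not before_ms or not after_ms:
        raise ValueError("both runs need at least one duration")
    med_before = median(before_ms)
    med_after = median(after_ms)
    # A positive percent_change means the change made things slower.
    percent_change = (med_after - med_before) / med_before * 100
    return med_before, med_after, percent_change


if __name__ == "__main__":
    # Made-up durations in milliseconds, for illustration only.
    before = [1010, 990, 1000, 1200, 1005]
    after = [900, 910, 1150, 905, 895]
    med_before, med_after, change = compare_medians(before, after)
    print(f"before: {med_before} ms, after: {med_after} ms, change: {change:+.1f}%")
```

The median is used rather than the mean because a single noisy run (such as the 1200 ms and 1150 ms outliers in the sample data) barely moves it.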

We currently support the following benchmarks:

Measuring cold start up duration

To measure the cold start up duration, the approach is usually simple:

  1. From the mozilla-mobile/perf-tools repository, use measure_start_up.py.
    The arguments for start-up should include your target (Fenix or Focus).
  2. Determine the start-up path that your code affects. This could be:
    1. cold_main_first_frame: when clicking the app's homescreen icon, this is the duration from process start until the first frame drawn
    2. cold_view_nav_start: when opening the browser through an outside link (e.g. a link in gmail), this is the duration from process start until roughly Gecko's Navigation::Start event
  3. After determining the path your changes affect, these are the steps that you should follow:

Example:

  • Run measure_start_up.py located in perf-tools. Note:
    • The usual iteration count used is 25. Running fewer iterations might affect the results due to noise
    • Make sure the application you're testing is a fresh install. If testing the Main intent (which is where the browser ends up on its homepage), make sure to clear the onboarding process before testing
 python3 measure_start_up.py -c=25 --product=fenix nightly cold_view_nav_start results.txt

where -c refers to the iteration count. The default of 25 should be good.

  • Once you have gathered your results, you can analyze them using analyze_durations.py in perf-tools.
  python3 analyze_durations.py results.txt


NOTE: To compare before and after a change to Fenix, repeat these steps for the code before the changes. To do so, you could check out the parent commit (i.e. using git rev-parse ${SHA}^, where ${SHA} is the first commit on the branch where the changes are).

An example of using these steps to review a PR can be found (here).

Testing non start-up changes

Testing non start-up changes is a bit different from the steps above since the performance team doesn't currently have tools to test other parts of the browser.

  1. The first step here would be to instrument the code to take (manual timings). Comparing timings taken before and after the changes could indicate any changes in performance.
  2. Using profiles and markers.
    1. (Profiles) can be a good visual representation of performance changes. A simple way to find your code and its changes could be through the call tree, the flame graph, or the stack chart. NOTE: some code may be missing from the stack since ProGuard may inline it, or because the profiler's sampling interval is longer than the time taken by the code.
    2. Another useful tool to find changes in performance is markers. Markers can be good to show the time elapsed between point A and point B or to pinpoint when a certain action happens.

Timestamp benchmark

A timestamp benchmark is a manual test where a developer adds temporary code to log the duration they want to measure and then performs the use case on the device themselves to get the values printed. Here's a simple example:

val start = SystemClock.elapsedRealtime()
thingWeWantToMeasure()
val end = SystemClock.elapsedRealtime()
Log.e("benchmark", "${end - start}") // result is in milliseconds

We recommend this approach for non-UI measurements only. Since the framework doesn't notify us when the UI is visually complete, it's challenging to instrument that point and thus accurately measure a duration that waits for the UI.

Like automated benchmarks, these tests can accurately measure what users experience. They are fairly quick to write, but running them by hand is tedious and time-consuming, and there are many places to introduce errors.

Here's an outline of a typical timestamp benchmark:

  1. Decide the duration you want to measure
  2. Do the following once for the commit before your changes and once for the commit after your changes...
    1. Add code to measure the duration.
    2. Build & install a release build like Nightly or Beta (debug builds have unrepresentative perf)
    3. Do a "warm up" run first: the first run after an install will always be slower because the JIT cache isn't primed so you should run and ignore it, i.e. run your test case, wait a few seconds, force-stop the app, clear logcat, and then begin testing & measuring
    4. Run the use case several times (maybe 10 times if it's quick, 5 if it's slow). You probably want to measure "cold" performance: we assume users will generally only perform a use case a few times per process lifetime. However, the more times a code path is run during the process lifetime, the more likely it'll execute faster because it's cached. Thus, if we want to measure a use case in a way that is similar to what users experience, we must measure the first time an interaction occurs during the process. In practice this means after you execute your use case once, force-stop the app before executing it again
    5. Capture the results from logcat. If you log "average <number-in-ms>", you can use the following command to capture all the results and find the median:
      adb logcat -d > logcat && python3 perf-tools/analyze_durations.py logcat
  3. Compare the results, generally by comparing the median of the two runs
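
If you log only the bare number, as in the simple example above, the capture step could be sketched like this. It is a hypothetical helper, not the perf-tools implementation: the function name extract_durations and the sample log lines are made up, and it assumes logcat's default line layout where the message follows "<tag>: ".

```python
import re
from statistics import median


def extract_durations(logcat_text, tag="benchmark"):
    """Collect integer millisecond values logged under `tag` in a logcat dump."""
    # Matches lines ending in "<tag>: <digits>", e.g.
    # "01-02 12:00:01.100  1234  1234 E benchmark: 142"
    pattern = re.compile(r"\b" + re.escape(tag) + r"\s*:\s*(\d+)\s*$")
    durations = []
    for line in logcat_text.splitlines():
        match = pattern.search(line)
        if match:
            durations.append(int(match.group(1)))
    return durations


if __name__ == "__main__":
    # Made-up logcat output, for illustration only.
    sample = (
        "01-02 12:00:01.100  1234  1234 E benchmark: 142\n"
        "01-02 12:00:01.200  1234  1234 I OtherTag: unrelated 99\n"
        "01-02 12:00:09.300  2345  2345 E benchmark: 130\n"
        "01-02 12:00:17.500  3456  3456 E benchmark: 151\n"
    )
    values = extract_durations(sample)
    print(f"runs: {values}, median: {median(values)} ms")
```

In practice you would read the text from the file produced by adb logcat -d > logcat instead of the inline sample.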

Example: page load

TODO... the duration for a page load is a non-UI use case that is more complex than the very simple example provided above

Example: very simple UI measurements

TODO... if the screen content is drawn synchronously, you can do something like:

view.doOnPreDraw {
  end = SystemClock.elapsedRealtime()
  // Be sure to verify that this draw call is the draw call where the UI is visually complete
  // e.g. post to the front of the main thread queue and Thread.sleep(5000) and check the device
}

Profile

You can take profiles with the Firefox Profiler, identify the start and end points for the duration you're measuring in your profile, and use the difference between them to measure the duration. It's quick to take these profiles but there are big downsides: profilers add overhead so the duration will not be precise, it's difficult to avoid noise in the results because devs can only take so many profiles, and it may be non-trivial to correctly identify the start and end points of the duration especially when the implementations you compare have big differences.

Follow the example below to see how to measure the change in duration for a change with profiles.

Example: time to display homescreen

On a low-end device...

  1. We pick the specific duration we want to measure: the time from hitting the home button when a tab is open until the homescreen is visually complete.
  2. We build & install a release build (e.g. Nightly, Beta; debug builds have unrepresentative perf). You can also use a recent Nightly, like this example does.
  3. We do a "warm up" run to populate the JIT's cache (the first run has unrepresentative perf). We start the app, set up the start state (open a tab), do our use case (click the home button and wait for the UI to fully load). Then we force-stop the app.
  4. We profile: start the app (which should launch to the most recent tab), start the profiler (see here for instructions), perform the use case (click the home button in the toolbar and wait for the homescreen to finish loading), and stop the profiler. Don't forget to enable the profiler permissions (3-dot menu -> Settings -> Remote debugging via USB).
  5. We identify the duration of the change in the raw profile. The most accurate and reproducible way to do this is using the Marker Chart.
    1. In this case, we can identify the start point through the dispatchTouchEvent ACTION_UP marker, right-click it, and choose "Start selection here" to narrow the profile's timeline range. We can then click the magnifying glass with the + in the timeline to clamp the range.
    2. The end point is trickier: we don't have a marker to identify that the UI is visually complete. As such, we can use the information in the Marker Chart and Stack Chart to make a best guess as to when the UI is visually complete (notice that this creates a point of inaccuracy). If we temporarily clamp our range to after the last marker (onGlobalLayout) is run, we see that there is a measure/layout pass for Compose after it. We make a best guess that the content isn't visually complete until this last measure/layout/draw pass completes. To clamp the range to this, we can double-click on the draw method above measureAndLayout to shrink our range to that method – this lets us accurately capture the end point. Then we can drag the selection handle to re-expand the range all the way to the left, back to our start point. Then we can clamp the range given that the start and end points we want to measure are the start and end points of the range. The final profile – https://share.firefox.dev/3o7EvOI – gives us our final duration, which we can see in the value at the top left of the profiler: 1.4s in this case.

With the measurement in hand, repeat these steps for your changes and compare the resulting times. Note: it's possible the device was under load when you took the profile so you may wish to take more than one profile if you suspect that is the case.