B2G/QA/2014-10-20 Performance Acceptance: Difference between revisions

From MozillaWiki
 
(28 intermediate revisions by the same user not shown)
Line 23: Line 23:
Note when comparing to past reports that the test timings changed with Bug [[bug|1079700]]. Baselines for 2.0 have been regenerated with updated timing code, and are faster in a few cases than in the previous comparison. Where this is significant, it will be noted as a faster 2.0 time will be assumed to mean the 2.1 results were also affected by the timing changes, affecting study-over-study comparison.
Note when comparing to past reports that the test timings changed with Bug [[bug|1079700]]. Baselines for 2.0 have been regenerated with updated timing code, and are faster in a few cases than in the previous comparison. Where this is significant, it will be noted as a faster 2.0 time will be assumed to mean the 2.1 results were also affected by the timing changes, affecting study-over-study comparison.


This initial report is from half-length (240 data points maximimum) runs for 2.0 and 2.1 for expediency. It will be updated with full results for future baselines and any notable differences (unlikely) will be communicated.
This initial report has a limited (360 maximum) data point set for 2.0. It will be updated with full results for future baselines and any notable differences (unlikely) will be communicated.


== Startup -> Visually Complete ==  
== Startup -> Visually Complete ==  
Line 35: Line 35:
=== Execution ===
=== Execution ===


These results were generated from 240 application data points per release, generated over 8 different runs of make test-perf as follows:
These results were generated from 480 application data points per release, generated over 16 different runs of make test-perf as follows:


# Flash to base build
# Flash to base build
Line 44: Line 44:
# GAIA_OPTIMIZE=1 NOFTU=1 NO_LOCK_SCREEN=1 make reset-gaia
# GAIA_OPTIMIZE=1 NOFTU=1 NO_LOCK_SCREEN=1 make reset-gaia
# make reference-workload-light
# make reference-workload-light
# For 8 repetitions:
# For 16 repetitions:
## Reboot the phone
## Reboot the phone
## Wait for the phone to appear to adb, and an additional 30 seconds for it to settle.
## Wait for the phone to appear to adb, and an additional 30 seconds for it to settle.
Line 82: Line 82:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 1139 ms
* Median: 1136 ms
* p95: 1385 ms
* p95: 1385 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 1191 ms
* Median: 1191 ms
* p95: 1320 ms
* p95: 1299 ms


'''Result''': '''FAIL''' (small regression, over guidelines)
'''Result''': '''FAIL''' (small regression, over guidelines)


'''Comment''': Calendar's times showed an improvement from the last study, coming out about 75 ms ahead. However, results still show a small regression from 2.0, and remain well over the 1000 ms guidelines, even for best case.
'''Comment''': Calendar's times showed an improvement from the last study, coming out about 75 ms ahead. However, results still show a small regression from 2.0, and remain well over the 1000 ms guidelines, even for best case.  


The p95 behavior has improved as well, both over the last study and 2.0, suggesting more consistent performance.
The p95 behavior has improved as well, both over the last study and 2.0, suggesting an overall better experience.


==== Camera ====
==== Camera ====
Line 105: Line 105:


'''2.0'''
'''2.0'''
* 180 data points
* 300 data points
* Median: 1479 ms
* Median: 1477 ms
* p95: 1738 ms
* p95: 1776 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 1525 ms
* Median: 1521 ms
* p95: 1629 ms
* p95: 1629 ms


Line 118: Line 118:
'''Comment''': Camera's 2.1 results have improved by ~50 ms from the last study.
'''Comment''': Camera's 2.1 results have improved by ~50 ms from the last study.


While Camera does show a ~45 ms regression from 2.0 per these results, investigation by the developers have shown this to not reflect an actual regression. Rather, the investigation found that a change in how the calculation was made occurred between the two branches, and that Camera's real-world performance has actually significantly improved between the branches. It does remain over the absolute release acceptance guidelines.  
While Camera does show a ~50 ms regression from 2.0 per these results, investigation by the developers have shown this to not reflect an actual regression. Rather, the investigation found that a change in how the calculation was made occurred between the two branches, and that Camera's real-world performance has actually significantly improved between the branches. It does remain over the absolute release acceptance guidelines.  


The p95 behavior has improved significantly from both the last study and the 2.0 results, suggesting more consistent performance.
The p95 behavior has improved significantly from both the last study and the 2.0 results, and is now closer to the median behavior. This suggests more consistent performance.


==== Clock ====
==== Clock ====
Line 130: Line 130:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 915 ms
* Median: 916 ms
* p95: 1143 ms
* p95: 1141 ms


'''2.1'''
'''2.1'''
* 210 data points
* 420 data points
* Median: 999 ms
* Median: 997 ms
* p95: 1096 ms
* p95: 1100 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''
Line 153: Line 153:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 807 ms
* Median: 812 ms
* p95: 905 ms
* p95: 910 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 844 ms
* Median: 846 ms
* p95: 909 ms
* p95: 914 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''


'''Comment''': Contacts remains well within acceptance guidelines. While the median (but not p95) has regressed from 2.0, both are significantly better than as measured in the last study.
'''Comment''': Contacts remains well within acceptance guidelines. While the median (but not p95) has regressed from 2.0, both are significantly better than as measured in the last study and the gap between median and p95 has shrunk, suggesting more consistently good performance.


==== Dialer ====
==== Dialer ====
Line 174: Line 174:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 560 ms
* Median: 564 ms
* p95: 674 ms
* p95: 674 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 529 ms
* Median: 528 ms
* p95: 571 ms
* p95: 583 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''


'''Comment''': Dialer remains well within guidelines. Even so, its median startup has improved significantly since the last study, by over 100 ms. Its p95 performance improved even more, by almost 150 ms, suggesting more consistently good performance.
'''Comment''': Dialer remains well within guidelines. Even so, its median startup has improved significantly since the last study, by over 100 ms, and is now faster than 2.0. Its p95 performance improved even more, by almost 150 ms, suggesting more consistently good performance as well.


==== Email ====
==== Email ====
Line 195: Line 195:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 347 ms
* Median: 347 ms
* p95: 494 ms
* p95: 504 ms


'''Result''': '''N/A'''
'''Result''': '''N/A'''
Line 213: Line 213:


'''2.0'''
'''2.0'''
* 180 data points
* 300 data points
* Median: 629 ms
* Median: 627 ms
* p95: 771 ms
* p95: 767 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 503 ms
* Median: 504 ms
* p95: 719 ms
* p95: 696 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''


'''Comment''': FM Radio remains well within guidelines. Its numbers have shown a large improvement since the last study, with median startup improving by almost 200 ms and p95 startup improving by around 150 ms.
'''Comment''': FM Radio remains well within guidelines. Its numbers have shown a large improvement since the last study, with median startup improving by almost 200 ms and p95 startup improving by around 170 ms.


==== Gallery ====
==== Gallery ====
Line 234: Line 234:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 956 ms
* Median: 958 ms
* p95: 1225 ms
* p95: 1226 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 954 ms
* Median: 953 ms
* p95: 1186 ms
* p95: 1186 ms


Line 255: Line 255:


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 846 ms
* Median: 845 ms
* p95: 967 ms
* p95: 979 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''


'''Comment''': Music is not tested in 2.0. Startup number have improved radically for 2.1 since the last study, by around 250 ms for both median and p95. Results are now unimodal, suggesting that the patch mentioned above fixed timing issues for this application, and these numbers do correspond with the "faster" mode of the previous bimodal results. It is well within release acceptance guidelines.
'''Comment''': Music is not tested in 2.0. Startup numbers have improved radically for 2.1 since the last study, by around 250 ms for both median and p95. Results are now unimodal, suggesting that the patch mentioned above fixed timing issues for this application, and these numbers do correspond with the "faster" mode of the previous bimodal results. It is well within release acceptance guidelines.


==== Settings ====
==== Settings ====
Line 271: Line 271:


'''2.0'''
'''2.0'''
* 240 data points
* 330 data points
* Median: 3385 ms
* Median: 3391 ms
* p95: 3803 ms
* p95: 3806 ms


'''2.1'''
'''2.1'''
* 210 data points
* 450 data points
* Median: 3137 ms
* Median: 3131 ms
* p95: 3398 ms
* p95: 3393 ms


'''Result''': '''FAIL''' (well over guidelines, significant regression from last comparison)
'''Result''': '''FAIL''' (well over guidelines, significant regression from last comparison)


'''Comment''': TBD
'''Comment''': Settings numbers have regressed radically since the last study, by ~550 ms median and over 600 ms p95. While Settings had previously shown much better performance than 2.0, now it is only somewhat better at median.
 
Interestingly, the 2.0 numbers remain bimodal even though the timing change that should have fixed bimodality was applied to both branches prior to the study. Also interestingly, albeit possibly coincidentally, the 2.1 numbers now correspond closely with the "faster" mode of the 2.0 results.
 
Though not shown here, an intermediate (but unpublished) study on 10-17 produced results that were somewhat better (but still regressed from 10-02) at 2577 median/2790 p95. Settings' performance appears to be fluctuating late in the release.


==== SMS ====
==== SMS ====
Line 292: Line 296:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 1739 ms
* Median: 1742 ms
* p95: 1919 ms
* p95: 1932 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 1191 ms
* Median: 1190 ms
* p95: 1251 ms
* p95: 1250 ms


'''Result''': '''FAIL''' (over guidelines)
'''Result''': '''FAIL''' (over guidelines)


'''Comment''': TBD
'''Comment''': SMS's results are now unimodal, suggesting the timing fix outlined above did work well for this app. The calculated results are radically faster than in the previous study, showing a nearly 500 ms improvement in median startup time and a similar improvement in p95 performance. The new results are now even better than the "faster" mode of the previous study. They still remain outside the acceptance guidelines, but the gap has been closed significantly.


==== Video ====
==== Video ====
Line 313: Line 317:


'''2.0'''
'''2.0'''
* 240 data points
* 360 data points
* Median: 960 ms
* Median: 958 ms
* p95: 1202 ms
* p95: 1200 ms


'''2.1'''
'''2.1'''
* 240 data points
* 480 data points
* Median: 935 ms
* Median: 936 ms
* p95: 1035 ms
* p95: 1042 ms


'''Result''': '''PASS'''
'''Result''': '''PASS'''


'''Comment''': TBD
'''Comment''': Video's 2.1 median numbers improved by around 20 ms since the previous study, with the p95 performance improving by around 30 ms. These may or may not reflect actual performance improvements, as the 2.0 numbers also improved by around 25 ms suggesting the harness timing changes may have contributed to the delta. It remains within the acceptance guidelines.


== Raw Data ==
== Raw Data ==


Will be added after full dataset is updated.
Will be added after full dataset is updated.

Latest revision as of 22:21, 23 October 2014

2014-10-20 Performance Acceptance Results

Overview

These are the results of performance release acceptance testing for FxOS 2.1, as of the Oct 20, 2014 build.

Our acceptance metric is startup time from launch to visually complete, as measured by the Gaia Performance Tests, with the system initialized via make reference-workload-light.

For this release, results are compared against two baselines: 2.0 performance, and our responsiveness guidelines, which target a startup time of no more than 1000 ms.

The Gecko and Gaia revisions of the builds being compared are:

2.0:

  • Gecko: mozilla-b2g32_v2_0/c17df9fe087d
  • Gaia: 9c7dec14e058efef81f2267b724dad0850fc07e4

2.1:

  • Gecko: mozilla-b2g34_v2_1/12dc9b782f2a
  • Gaia: 2904ab80816896f569e2d73958427fb82aebaea5

Note when comparing to past reports that the test timings changed with Bug 1079700. Baselines for 2.0 have been regenerated with the updated timing code, and are faster in a few cases than in the previous comparison. Where this is significant, it will be noted, because a faster 2.0 time is assumed to mean that the 2.1 results were also affected by the timing changes, which in turn affects study-over-study comparison.

This initial report has a limited (360 maximum) data point set for 2.0. It will be updated with full results for future baselines and any notable differences (unlikely) will be communicated.

Startup -> Visually Complete

Startup -> Visually Complete times the interval from launch when the application is not already loaded in memory (cold launch) until the application has initialized all initial onscreen content. Data might still be loading in the background, but only minor UI elements related to this background load such as proportional scroll bar thumbs may be changing at this time.

This is equivalent to Above the Fold in web development terms.

More information about this timing can be found on MDN.

Execution

These results were generated from up to 480 data points per application per release (the 2.0 set is currently limited to 360, as noted above), collected over 16 different runs of make test-perf as follows:

  1. Flash to base build
  2. Flash stable FxOS build from tinderbox
  3. Constrain phone to 319MB via bootloader
  4. Clone gaia
  5. Check out the gaia revision referenced in the build's sources.xml
  6. GAIA_OPTIMIZE=1 NOFTU=1 NO_LOCK_SCREEN=1 make reset-gaia
  7. make reference-workload-light
  8. For 16 repetitions:
    1. Reboot the phone
    2. Wait for the phone to appear to adb, and an additional 30 seconds for it to settle.
    3. Run make test-perf with 31 replicates
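
The repetition loop in step 8 can be sketched as a small Python planner. This is only an illustration of the sequence described above, not the harness actually used for the study. The function names are hypothetical, and passing the replicate count through a RUNS variable is an assumption that should be checked against the Gaia Makefile in use.

```python
import subprocess


def build_plan(repetitions=16, replicates=31, settle_seconds=30):
    """Return the command sequence for step 8 as a list of argv lists."""
    plan = []
    for _ in range(repetitions):
        plan.append(["adb", "reboot"])            # 8.1 reboot the phone
        plan.append(["adb", "wait-for-device"])   # 8.2 wait for adb to see it
        plan.append(["sleep", str(settle_seconds)])  # 8.2 extra settle time
        # 8.3 ASSUMPTION: replicate count is passed as RUNS=<n>.
        plan.append(["make", "test-perf", f"RUNS={replicates}"])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_plan(build_plan()) prints the 64 commands without touching a device, which makes it easy to review the sequence before a real run.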

Result Analysis

First, any repetitions showing app errors are thrown out.

Then, the first data point is eliminated from each repetition, as it has been shown to be a consistent outlier, likely due to being the first launch after reboot. The remaining results are typically consistent within a repetition, leaving 30 data points per repetition.
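
A minimal sketch of these two filtering steps follows. The input layout (a list of dicts with "times_ms" and "had_error" keys) is hypothetical and chosen only for illustration; the real test output format is not described in this report.

```python
def clean_repetitions(repetitions):
    """Drop errored repetitions, then drop the first launch of each one.

    repetitions: list of {"times_ms": [...], "had_error": bool}
    (hypothetical layout). Returns one flat list of launch times in ms.
    """
    combined = []
    for rep in repetitions:
        if rep["had_error"]:
            continue  # any repetition showing app errors is thrown out
        # The first launch after reboot is a consistent outlier; skip it.
        combined.extend(rep["times_ms"][1:])
    return combined
```

With 31 replicates per repetition, each clean repetition contributes 30 data points, so 16 clean repetitions yield 480.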

These are combined into a large data point set. Each set has been graphed as a 32-bin histogram so that its distribution is apparent, with comparable sets from 2.0 and 2.1 plotted on the same graph.
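
For readers who want to reproduce the binning, here is a sketch of a 32-bin histogram over equal-width bins spanning the observed minimum and maximum. The bin boundaries used for the published graphs are not stated, so this choice is an assumption.

```python
def histogram(values, bins=32):
    """Count values into equal-width bins between min(values) and max(values).

    Returns (counts, edges), where edges has bins + 1 entries.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end; clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges
```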

For each set, the median and the 95th percentile results have been calculated. Their real-world significance is as follows:

Median
50% of launches are faster than this. This can be considered typical performance, but it's important to note that 50% of launches are slower than this, and they could be much slower. The shape of the distribution is important.
95th Percentile (p95)
95% of launches are faster than this. This is a more quality-oriented statistic commonly used for page load and other task-time measurements. It is not dependent on the shape of the distribution and better represents a performance guarantee.
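
The two statistics can be computed as in the sketch below. The percentile method used for the published figures is not stated, so the nearest-rank definition here is an assumption; other interpolation methods may differ by a few milliseconds.

```python
import statistics


def p95(values):
    """Nearest-rank 95th percentile: the smallest value with at least
    95% of the data at or below it (ASSUMED method)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(values):
    """Return the data point count, median, and p95 for one data set."""
    return {
        "n": len(values),
        "median": statistics.median(values),
        "p95": p95(values),
    }
```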

Distributions for launch times are positively skewed and asymmetric, rather than normal. This is typical of load-time and other task-time tests where a hard lower bound on completion time applies. Therefore, other statistics that apply to normal distributions, such as mean, standard deviation, and confidence intervals, are potentially misleading and are not reported here. They are available in the summary data sets, but their validity is questionable.
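
To illustrate why the mean and standard deviation can mislead on such data, consider the small example below. The sample is synthetic, made up purely for illustration, and is not taken from the study's measurements.

```python
import statistics

# SYNTHETIC launch times in ms: most launches cluster near a hard lower
# bound, with one slow outlier producing a long right tail.
sample = [900] * 6 + [950] * 3 + [1500]

mean = statistics.mean(sample)
median = statistics.median(sample)
spread = statistics.pstdev(sample)

# The single slow launch pulls the mean well above the typical launch,
# and "mean minus one standard deviation" falls below the fastest
# launch actually observed, a value that can never occur.
print(f"mean={mean} median={median} mean-stdev={mean - spread:.1f} min={min(sample)}")
```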

On each graph, the solid line represents median and the broken line represents p95.

Pass/Fail Criteria

Pass/Fail is determined according to our documented release criteria for 2.1. This boils down to launch time being under 1000 ms.

Median launch time has been used for this, per current convention. However, as mentioned above, p95 launch time might better capture a guaranteed level of quality for the user. In cases where this is significantly over 1000 ms, more investigation might be warranted.
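
The basic rule can be sketched as follows. This covers only the 1000 ms guideline check; the reported results also take regressions from 2.0 and from the previous study into account, which this sketch does not model. Flagging any p95 above the guideline is an assumption, since the text does not define "significantly over".

```python
GUIDELINE_MS = 1000


def verdict(median_ms, p95_ms, guideline_ms=GUIDELINE_MS):
    """Return (result, p95_over_guideline) for one application.

    result is "PASS" when the median launch time is under the guideline.
    The second value flags a p95 above the guideline (ASSUMED threshold)
    as a hint that more investigation might be warranted.
    """
    result = "PASS" if median_ms < guideline_ms else "FAIL"
    return result, p95_ms > guideline_ms
```

For example, using the 2.1 numbers reported below, verdict(997, 1100) for Clock gives ("PASS", True) and verdict(528, 583) for Dialer gives ("PASS", False).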

Results

Calendar

FxOS Performance Comparison Results, 2.1 2014-10-20 Calendar


Previous Comparison

2.0

  • 360 data points
  • Median: 1136 ms
  • p95: 1385 ms

2.1

  • 480 data points
  • Median: 1191 ms
  • p95: 1299 ms

Result: FAIL (small regression, over guidelines)

Comment: Calendar's times showed an improvement from the last study, coming out about 75 ms ahead. However, results still show a small regression from 2.0, and remain well over the 1000 ms guidelines, even for best case.

The p95 behavior has improved as well, both over the last study and 2.0, suggesting an overall better experience.

Camera

FxOS Performance Comparison Results, 2.1 2014-10-20 Camera


Previous Comparison

2.0

  • 300 data points
  • Median: 1477 ms
  • p95: 1776 ms

2.1

  • 480 data points
  • Median: 1521 ms
  • p95: 1629 ms

Result: FAIL (over guidelines)

Comment: Camera's 2.1 results have improved by ~50 ms from the last study.

While Camera does show a ~50 ms regression from 2.0 per these results, investigation by the developers has shown this to not reflect an actual regression. Rather, the investigation found that a change in how the calculation was made occurred between the two branches, and that Camera's real-world performance has actually significantly improved between the branches. It does remain over the absolute release acceptance guidelines.

The p95 behavior has improved significantly from both the last study and the 2.0 results, and is now closer to the median behavior. This suggests more consistent performance.
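
The distance between the median and p95 can be checked directly from the Camera numbers reported above. The helper name is hypothetical, and treating this gap as a consistency indicator is the reading taken in this report rather than a formal metric.

```python
def median_to_p95_gap(median_ms, p95_ms):
    """Distance in ms between typical (median) and near-worst-case (p95)."""
    return p95_ms - median_ms


# Camera numbers as reported in this study.
gap_2_0 = median_to_p95_gap(1477, 1776)
gap_2_1 = median_to_p95_gap(1521, 1629)
print(f"2.0 gap: {gap_2_0} ms, 2.1 gap: {gap_2_1} ms")
```

This yields a gap of 299 ms for 2.0 and 108 ms for 2.1.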

Clock

FxOS Performance Comparison Results, 2.1 2014-10-20 Clock


Previous Comparison

2.0

  • 360 data points
  • Median: 916 ms
  • p95: 1141 ms

2.1

  • 420 data points
  • Median: 997 ms
  • p95: 1100 ms

Result: PASS

Comment: Clock's median startup performance has improved from the last study by nearly 50 ms. The improvement is sufficient to put Clock barely within acceptance guidelines, though it does still show an ~80 ms regression from 2.0 (997 ms vs. 916 ms).

The p95 behavior has improved from both the last study and 2.0 and suggests more consistently good results.

Contacts

FxOS Performance Comparison Results, 2.1 2014-10-20 Contacts


Previous Comparison

2.0

  • 360 data points
  • Median: 812 ms
  • p95: 910 ms

2.1

  • 480 data points
  • Median: 846 ms
  • p95: 914 ms

Result: PASS

Comment: Contacts remains well within acceptance guidelines. While the median (but not p95) has regressed from 2.0, both are significantly better than as measured in the last study and the gap between median and p95 has shrunk, suggesting more consistently good performance.

Dialer

FxOS Performance Comparison Results, 2.1 2014-10-20 Dialer


Previous Comparison

2.0

  • 360 data points
  • Median: 564 ms
  • p95: 674 ms

2.1

  • 480 data points
  • Median: 528 ms
  • p95: 583 ms

Result: PASS

Comment: Dialer remains well within guidelines. Even so, its median startup has improved significantly since the last study, by over 100 ms, and is now faster than 2.0. Its p95 performance improved even more, by almost 150 ms, suggesting more consistently good performance as well.

Email

FxOS Performance Comparison Results, 2.1 2014-10-20 Email


Previous Comparison

2.0

  • 360 data points
  • Median: 347 ms
  • p95: 504 ms

Result: N/A

Comment: Email has been removed from the 2.1 test manifest. New 2.0 baseline results are given here, but this application will be eliminated in future reports unless the test is restored.

However, 2.1 is almost certainly well under launch requirement guidelines with this test and should not be a concern.

FM Radio

FxOS Performance Comparison Results, 2.1 2014-10-20 FM Radio


Previous Comparison

2.0

  • 300 data points
  • Median: 627 ms
  • p95: 767 ms

2.1

  • 480 data points
  • Median: 504 ms
  • p95: 696 ms

Result: PASS

Comment: FM Radio remains well within guidelines. Its numbers have shown a large improvement since the last study, with median startup improving by almost 200 ms and p95 startup improving by around 170 ms.

Gallery

FxOS Performance Comparison Results, 2.1 2014-10-20 Gallery


Previous Comparison

2.0

  • 360 data points
  • Median: 958 ms
  • p95: 1226 ms

2.1

  • 480 data points
  • Median: 953 ms
  • p95: 1186 ms

Result: PASS

Comment: Gallery's median startup performance has improved by over 50 ms since the last study, with the p95 performance improving only slightly. Performance is now virtually identical to 2.0, and within acceptance guidelines.

Music

FxOS Performance Comparison Results, 2.1 2014-10-20 Music


Previous Comparison

2.1

  • 480 data points
  • Median: 845 ms
  • p95: 979 ms

Result: PASS

Comment: Music is not tested in 2.0. Startup numbers have improved radically for 2.1 since the last study, by around 250 ms for both median and p95. Results are now unimodal, suggesting that the patch mentioned above fixed timing issues for this application, and these numbers do correspond with the "faster" mode of the previous bimodal results. It is well within release acceptance guidelines.

Settings

FxOS Performance Comparison Results, 2.1 2014-10-20 Settings


Previous Comparison

2.0

  • 330 data points
  • Median: 3391 ms
  • p95: 3806 ms

2.1

  • 450 data points
  • Median: 3131 ms
  • p95: 3393 ms

Result: FAIL (well over guidelines, significant regression from last comparison)

Comment: Settings numbers have regressed radically since the last study, by ~550 ms median and over 600 ms p95. While Settings had previously shown much better performance than 2.0, now it is only somewhat better at median.

Interestingly, the 2.0 numbers remain bimodal even though the timing change that should have fixed bimodality was applied to both branches prior to the study. Also interestingly, albeit possibly coincidentally, the 2.1 numbers now correspond closely with the "faster" mode of the 2.0 results.

Though not shown here, an intermediate (but unpublished) study on 10-17 produced results that were somewhat better (but still regressed from 10-02) at 2577 ms median / 2790 ms p95. Settings' performance appears to be fluctuating late in the release.

SMS

FxOS Performance Comparison Results, 2.1 2014-10-20 SMS


Previous Comparison

2.0

  • 360 data points
  • Median: 1742 ms
  • p95: 1932 ms

2.1

  • 480 data points
  • Median: 1190 ms
  • p95: 1250 ms

Result: FAIL (over guidelines)

Comment: SMS's results are now unimodal, suggesting the timing fix outlined above did work well for this app. The calculated results are radically faster than in the previous study, showing a nearly 500 ms improvement in median startup time and a similar improvement in p95 performance. The new results are now even better than the "faster" mode of the previous study. They still remain outside the acceptance guidelines, but the gap has been closed significantly.

Video

FxOS Performance Comparison Results, 2.1 2014-10-20 Video


Previous Comparison

2.0

  • 360 data points
  • Median: 958 ms
  • p95: 1200 ms

2.1

  • 480 data points
  • Median: 936 ms
  • p95: 1042 ms

Result: PASS

Comment: Video's 2.1 median numbers improved by around 20 ms since the previous study, with the p95 performance improving by around 30 ms. These may or may not reflect actual performance improvements, as the 2.0 numbers also improved by around 25 ms, suggesting the harness timing changes may have contributed to the delta. It remains within the acceptance guidelines.
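
Across all applications above that have both a 2.0 and a 2.1 result, the branch-to-branch deltas can be recomputed from the reported medians and p95 values with a short script. The table layout and function name are hypothetical; the numbers are copied from the Results section. Email and Music are omitted because each has results for only one branch.

```python
# app: ((2.0 median, 2.0 p95), (2.1 median, 2.1 p95)), all in ms,
# copied from the Results section of this report.
RESULTS = {
    "Calendar": ((1136, 1385), (1191, 1299)),
    "Camera": ((1477, 1776), (1521, 1629)),
    "Clock": ((916, 1141), (997, 1100)),
    "Contacts": ((812, 910), (846, 914)),
    "Dialer": ((564, 674), (528, 583)),
    "FM Radio": ((627, 767), (504, 696)),
    "Gallery": ((958, 1226), (953, 1186)),
    "Settings": ((3391, 3806), (3131, 3393)),
    "SMS": ((1742, 1932), (1190, 1250)),
    "Video": ((958, 1200), (936, 1042)),
}


def deltas(results):
    """Return {app: (median delta, p95 delta)} as 2.1 minus 2.0.

    Positive values mean 2.1 is slower than 2.0.
    """
    return {
        app: (new[0] - old[0], new[1] - old[1])
        for app, (old, new) in results.items()
    }


for app, (d_median, d_p95) in deltas(RESULTS).items():
    print(f"{app:10s} median {d_median:+5d} ms   p95 {d_p95:+5d} ms")
```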

Raw Data

Will be added after full dataset is updated.