Electrolysis/Release Criteria/Jank

{{bug|1251377|Bug 1251377 - <nowiki>[e10s release criteria]</nowiki> Jank, responsiveness should not regress}}


= RASCI =

* Responsible: chutten
* Accountable: bsmedberg
* Supporting: data team, RyanVM, rvitillo, avih, Softvision
* Consulted:
* Informed: cpeterson, elan, release management


= Metrics =


== FX_REFRESH_DRIVER_{CHROME,CONTENT}_FRAME_DELAY_MS ==


Note from billm: {{bug|1228147}} seems invalid because it considers users outside the experiment.


This measures the delay from when the underlying platform informs us of a vsync edge to when we handle it on the main thread of the stated process. As such, this is a reasonable measure of how main thread lag influences perceived jank (since it measures ''some'' of how long it takes changed pixels to show on-screen).


* In non-e10s, CHROME is the only measure with values.
* In e10s, CHROME and CONTENT both have values.
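
Conceptually, the probe records how late the main thread is in servicing each vsync notification. A minimal sketch of the idea in Python, for illustration only (the names here are invented; the real probe lives in Gecko's refresh driver, in C++):

<pre>
import time

frame_delay_ms = []  # stand-in for the FRAME_DELAY_MS telemetry histogram

def on_main_thread_vsync(vsync_edge_ms):
    """Runs when the main thread finally services a vsync notification."""
    now_ms = time.monotonic() * 1000.0
    # Record how late the main thread is relative to the vsync edge.
    frame_delay_ms.append(now_ms - vsync_edge_ms)

# Demo: note a vsync edge, let "main thread work" block for ~25 ms,
# then service the notification late.
edge_ms = time.monotonic() * 1000.0
time.sleep(0.025)  # a busy main thread (script, layout, GC, ...)
on_main_thread_vsync(edge_ms)
print("recorded delay: %.1f ms" % frame_delay_ms[0])
</pre>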
 


These metrics are only useful if the distribution of paint requests prompting the measurements remains comparable. Both e10s and APZ change that distribution (the first by splitting the work and changing which work is performed in which process, the second by changing how many scrolling events these metrics capture), so these measures cannot be meaningfully compared between cohorts that have different e10s or APZ settings.


== BHR/Chrome hangs ==


We have concerns about the accuracy of the data being collected for each of these measures, see {{bug|1240887}}. But we have agreed to accept the existing analysis which says that BHR and chromehangs improved with e10s and consider this requirement PASSed.

Followup may be required if BHR data is used to validate future addon-related jank.


* [https://github.com/vitillo/e10s_analyses/blob/master/beta45-withaddons/e10s_experiment.ipynb e10s_experiment.ipynb for Beta45ex1] - calculates hangs_per_minute (see the sketch after this list), shows an improvement in parent-only hangs and no statistically-significant change in child+parent hangs.
* [https://github.com/vitillo/e10s_analyses/blob/master/beta45-withaddons/e10s_top_hang_stacks.ipynb e10s_top_hang_stacks.ipynb for Beta45ex1] - shows the top hang stacks for the parent process in the e10s-enabled cohort.
* {{bug|1182637|Original "measure e10s jank" bug}} - I sidetracked the discussion at the end to INPUT_EVENT_RESPONSE_MS instead of hangs_per_minute. For more on that measure, see the "Event loop lag" section.
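
For reference, hangs_per_minute is a usage-normalized rate rather than a raw hang count. A toy re-derivation of the idea (the field names are illustrative, not the notebook's actual code; Telemetry's activeTicks counts 5-second intervals of user activity):

<pre>
from statistics import median

def hangs_per_minute(pings):
    """Per-ping hang rates, normalized by minutes of active usage."""
    rates = []
    for p in pings:
        active_minutes = p["active_ticks"] * 5 / 60.0  # each tick is 5 s
        if active_minutes > 0:
            rates.append(p["hang_count"] / active_minutes)
    return rates

# Made-up cohorts of two pings each:
e10s = [{"hang_count": 3, "active_ticks": 240},
        {"hang_count": 1, "active_ticks": 120}]
non_e10s = [{"hang_count": 6, "active_ticks": 240},
            {"hang_count": 2, "active_ticks": 120}]
print(median(hangs_per_minute(e10s)), "vs", median(hangs_per_minute(non_e10s)))
</pre>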
 
== Event loop lag ==


INPUT_EVENT_RESPONSE_MS is a better measure for e10s/non-e10s comparisons than the originally proposed EVENTLOOP_UI_ACTIVITY_EXP_MS: it is valid across more than one OS and more than one process, whereas EVENTLOOP_UI_ACTIVITY_EXP_MS is valid only on Windows and only in the chrome process. I used the analysis of this measure as the primary reason for closing {{bug|1223780}}, based on analyses of beta45ex1 (preliminary analysis: https://gist.github.com/chutten/9b9e29df10e0f7306f99 ; analysis on the later data was performed but not published, as it was largely identical) and preliminary data from beta45ex2 ( https://gist.github.com/chutten/3129baf8d5e0f10ef54a ).
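
These comparisons operate on the submitted histograms rather than raw samples. A sketch of pulling an approximate median out of a binned, INPUT_EVENT_RESPONSE_MS-style histogram (the bucket layout below is made up for the example; real Telemetry buckets are exponential):

<pre>
def approx_median(histogram):
    """histogram: dict of bucket lower bound (ms) -> sample count.

    Returns the lower bound of the bucket holding the median sample;
    coarse, but the exponential bucketing limits precision anyway."""
    total = sum(histogram.values())
    seen = 0
    for lower_bound in sorted(histogram):
        seen += histogram[lower_bound]
        if seen * 2 >= total:
            return lower_bound
    return None

# Made-up cohort aggregates:
e10s_content = {1: 500, 2: 300, 4: 120, 8: 50, 16: 20, 32: 10}
non_e10s = {1: 400, 2: 280, 4: 180, 8: 90, 16: 35, 32: 15}
print(approx_median(e10s_content), "ms vs", approx_median(non_e10s), "ms")
</pre>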


This metric has been manually verified to have the expected characteristics: a busy chrome script slows down both parent and content events, while a busy content script slows down only content events.
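
A toy model of why this is the expected shape (not Gecko code; just the event-routing argument in miniature, since content-process input events are forwarded through the parent):

<pre>
def simulate(chrome_busy_ms=0, content_busy_ms=0):
    """Toy two-process model: an input event arrives at t=0, the parent
    dispatches it once any busy chrome script finishes, and the content
    process handles it once any busy content script finishes. Returns
    (parent_ms, content_ms) response times."""
    parent_free_at = chrome_busy_ms       # chrome script blocks the parent loop
    parent_response = parent_free_at      # parent event handled when loop frees
    dispatched_at = parent_free_at        # content events route through parent
    content_response = max(dispatched_at, content_busy_ms)
    return parent_response, content_response

print(simulate(chrome_busy_ms=100))   # (100, 100): chrome jank slows both
print(simulate(content_busy_ms=100))  # (0, 100): content jank slows content only
</pre>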


Holistic camera-based responsiveness testing would only detect problems that manifest on that one machine. Jank is a property of distributions recorded across populations of users, not just something experienced by one user at one time.


I have heard no concerns. I consider this a pass.


== jank per minute of active usage ==


This is a combined metric, {{bug|1198650}}. We have made the decision that this no longer blocks e10s, because we are looking at the individual components.


== Talos tp5o_responsiveness ==
 
* e10s comparison validated: jimm
* Current e10s diff: much better - ~90% on all platforms
** No results on OS X
* Note: this test measures browser responsiveness during page load. In e10s it is measured only in the chrome process, so the improvement appears to be real. It would still be useful to also collect data for the content process. TBD.
* {{bug|631571}} - add the test to talos
* {{bug|710296}} - enable the test in e10s (later comments)
 
== GC pauses ==


<code>GC_MAX_PAUSE_MS</code>, <code>CYCLE_COLLECTOR_MAX_PAUSE</code>


These two metrics are better in e10s than in non-e10s, as are the other <code>GC.*PAUSE</code> and <code>CYCLE_COLLECTOR.*PAUSE</code> metrics. This is to be expected, as they no longer contend for process resources. Analysis was performed on [https://github.com/vitillo/e10s_analyses/blob/master/beta45-withaddons/e10s_experiment.ipynb Beta45ex1]. Analysis on Beta45ex2 will be [https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/e10s_experiment.ipynb available here] once it is out of review.


PASS


= Bugs =


<bugzilla>
{
    "blocks": "1251377",
    "resolution": "---",
    "include_fields": "id, summary, whiteboard, keywords, assigned_to"
}
</bugzilla>
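
The <nowiki><bugzilla></nowiki> block above is this wiki's Bugzilla extension. A sketch of the same query against the Bugzilla REST API (assuming <code>blocks</code> is accepted as a search parameter, as it is on bugzilla.mozilla.org):

<pre>
import requests

# Same criteria as the <bugzilla> block above.
resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "blocks": "1251377",
        "resolution": "---",  # unresolved bugs only
        "include_fields": "id,summary,whiteboard,keywords,assigned_to",
    },
    timeout=30,
)
for bug in resp.json().get("bugs", []):
    print(bug["id"], bug["summary"])
</pre>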
