15
edits
Alexandrui (talk | contribs) |
Bacasandrei (talk | contribs) (→Alert caused by a backout: removed section because we added the details elsewhere) |
||
(22 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
= Context = | = Context = | ||
Performance sheriffs, along with the standard [[TestEngineering/Performance/Sheriffing#Overview|by-the-book definition]], are responsible for checking for performance regressions on a daily basis. This is done by reviewing alerts from our performance tests on [https://treeherder.mozilla.org/perf.html Perfherder]. Any time a test or tests exceed [[TestEngineering/Performance/Sheriffing/Workflow#Thresholds|the set threshold for its framework]], one or multiple alerts will be generated. The goal of the sheriff is to identify the commit or revision responsible for the regression, and file a bug with a :needinfo flag for the author(s) of that commit or revision. | Performance sheriffs, along with the standard [[TestEngineering/Performance/Sheriffing#Overview|by-the-book definition]], are responsible for checking for performance regressions on a daily basis. This is done by reviewing alerts from our performance tests on [https://treeherder.mozilla.org/perf.html Perfherder]. Any time a test or tests exceed [[TestEngineering/Performance/Sheriffing/Workflow#Thresholds|the set threshold for its framework]], one or multiple alerts will be generated. The goal of the sheriff is to identify the commit or revision responsible for the regression, and file a bug with a :needinfo flag for the author(s) of that commit or revision. | ||
= How to Become a Sheriff = | |||
To become a sheriff(and have sheriffing permissions) you will need to file 2 bugs to have sheriffing actions allowed in treeherder. | |||
1) Request to be added to the treeherder sheriff group | |||
- Add your name in the title and in the description add your ldap email address in the | |||
[https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=nobody%40mozilla.org&bug_ignored=0&bug_severity=--&bug_status=NEW&bug_type=task&cf_a11y_review_project_flag=---&cf_accessibility_severity=---&cf_fx_iteration=---&cf_fx_points=---&comment=I%20will%20be%20sheriffing%20performance%20alerts%20on%20Perfherder%2C%20so%20I%20would%20like%20to%20be%20added%20to%20the%20Treeherder%20sheriff%20group.%0D%0A%0D%0AMy%20LDAP%20address%20is%3A%20&component=Treeherder%3A%20Infrastructure&contenttypemethod=list&contenttypeselection=text%2Fplain&defined_cc=cpeterson%40mozilla.com%2C%20cvalaas%40mozilla.com%2C%20dave.hunt%40gmail.com%2C%20jdescottes%40mozilla.com%2C%20sclements%40mozilla.com&defined_groups=1&filed_via=standard_form&flag_type-4=X&flag_type-41=X&flag_type-607=X&flag_type-803=X&flag_type-936=X&needinfo_role=other&needinfo_type=needinfo_from&op_sys=Unspecified&priority=--&product=Tree%20Management&rep_platform=Unspecified&short_desc=Add%20%3Cname%3E%20to%20Treeherder%20sheriff%20group&target_milestone=---&version=--- linked bug] | |||
2) Add yourself to the perf_sheriff group | |||
- Add name and ldap email in title and email in the description of the | |||
[https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=infra%40infra-ops.bugs&bug_ignored=0&bug_severity=--&bug_status=NEW&bug_type=task&cf_fx_iteration=---&cf_fx_points=---&comment=I%20will%20be%20sheriffing%20performance%20alerts%20on%20Perfherder%2C%20so%20I%20would%20like%20to%20be%20added%20to%20the%20Treeherder%20sheriff%20group.%0D%0A%0D%0AMy%20LDAP%20address%20is%3A%20&component=Infrastructure%3A%20LDAP&contenttypemethod=list&contenttypeselection=text%2Fplain&defined_cc=cpeterson%40mozilla.com%2C%20&defined_groups=1&filed_via=standard_form&flag_type-4=X&flag_type-607=X&flag_type-674=X&flag_type-803=X&flag_type-936=X&groups=mozilla-employee-confidential&needinfo_role=other&needinfo_type=needinfo_from&op_sys=Unspecified&priority=--&product=Infrastructure%20%26%20Operations&rep_platform=Unspecified&short_desc=Add%20me%28%3CLDAP%20EMAIL%3E%29%20to%20perf_sheriff%20LDAP%20group&target_milestone=---&version=unspecified linked bug] | |||
Once both bugs are completed check that you can do sheriffing actions. You should be able to now update alerts through the ability to add notes and tags to alerts. | |||
= Flowchart = | = Flowchart = | ||
Line 114: | Line 129: | ||
When an alert is created, it is not necessarily caused by a bug. It can be also caused by the instability/noise of the test or by other causes that are unrelated to the repo code, like CI setup. | When an alert is created, it is not necessarily caused by a bug. It can be also caused by the instability/noise of the test or by other causes that are unrelated to the repo code, like CI setup. | ||
<br /> | <br /> | ||
<br /> | |||
A bug associated with an alert needs to have the perf-alert keyword. The regression bugs have that automatically when the File bug button is pressed, but for the bugs that are not regression bugs needs to be added manually. | |||
== Thresholds == | == Thresholds == | ||
There are different thresholds above which the alerts are considered regressions and they vary depending on the framework: | There are different thresholds above which the alerts are considered regressions and they vary depending on the framework: | ||
Line 133: | Line 150: | ||
=== Harness alerts === | === Harness alerts === | ||
Harness alerts are usually caused by re-recordings or changes to code from testing/raptor component changing the baseline. If the regression is assumed, then link the culprit directly to the alert, close as WONTFIX and add the "harness" tag. Otherwise, open a regression bug. | Harness alerts are usually caused by re-recordings or changes to code from testing/raptor component changing the baseline. If the regression is assumed, then link the culprit directly to the alert, close as WONTFIX and add the "harness" tag. Otherwise, open a regression bug. | ||
===== Baseline Changes ===== | |||
Baseline changes are hard to identify, but we have some questions we can ask ourselves and with other people on the team to help determine if we are dealing with a real regression or a baseline change: | |||
1) Should the patch that caused a change in the baseline metrics because we are aiming for testing in a different environment? | |||
If yes, then the new baseline may be accepted. As the conditions for testing have changed we have to expect that some things will change. Take for instance adding a blank profile to the conditioned profile changes, as this is testing in a different environment we expect things will change | |||
2) Is the patch changing something that shouldn't cause a shift in metrics? | |||
If yes, then do not accept the new baseline immediately, do an analysis and discuss with other sheriffs to get second opinions. | |||
3) Does the patch change the test itself? | |||
If yes, a new baseline may be accepted depending on the changes. If the changes are not functional, then we shouldn’t accept a new baseline. | |||
4) Are test recordings changing? | |||
If yes, then a new baseline should be accepted after checking to make sure that the page that is being tested is different from the previous recording. If they aren’t, then the differences should be investigated further - it’s possible that they still may be a valid baseline change even if the page is the same, as non-visual changes may be different. | |||
5) Any other cases that do not fall in these categories should be discussed with the team for second opinions and with patch authors if needed. | |||
=== Backout/regression-fix alerts === | === Backout/regression-fix alerts === | ||
These alerts are caused by backouts or fixes of regressions and the associated tags are regression-backedout respectively regression-fix. | These alerts are caused by backouts or fixes of regressions and the associated tags are regression-backedout respectively regression-fix. The status of regressions can be changed to WONTFIX. | ||
For alerts caused by backout the process is to reply to the backout comment with the summary of the alert and adding the '''''perf-alert''''' keyword. | |||
<br/> | <br/> | ||
=== Infra alerts === | === Infra alerts === | ||
==== What is an infra alert ==== | |||
The regressions caused by infra changes are probably the most difficult to identify. Excepting the case when the infra change is announced and known of, an infra regression is usually most likely to be detected by the sheriff after all the suspect commits/bugs were removed from the list. | The regressions caused by infra changes are probably the most difficult to identify. Excepting the case when the infra change is announced and known of, an infra regression is usually most likely to be detected by the sheriff after all the suspect commits/bugs were removed from the list. | ||
Line 150: | Line 193: | ||
<br /> | <br /> | ||
For an easy follow-up, there’s a [https://changelog.dev.mozaws.net/ changelog] containing the changes related to the infrastructure that is very useful when the investigation is leading to this kind of regression: https://changelog.dev.mozaws.net/ | For an easy follow-up, there’s a [https://changelog.dev.mozaws.net/ changelog] containing the changes related to the infrastructure that is very useful when the investigation is leading to this kind of regression: https://changelog.dev.mozaws.net/ | ||
<br /> | |||
<br /> | |||
==== How to handle infra alerts ==== | |||
Infra alerts became more frequent starting sometime in 2022, but the causes are still unknown for the moment. They usually come up in pairs of regression-improvement. When a regression comes up: | |||
<br /> | |||
* it has to be tagged with the '''infra''' tag and move it to '''investigating''' queue | |||
As soon as it comes back, if it is an improvement: | |||
<br /> | |||
* it has to be tagged with the '''infra, regression-fix, improvement''' tags and mark it as WONTFIX | |||
<br /> | |||
'''OR''' if the alert contains regression alerts: | |||
<br /> | |||
* it has to be tagged with the '''infra, regression-fix'''' tags and mark it as WONTFIX | |||
<br /> | |||
as well as the regression alert | |||
<br /> | |||
* has to be changed the status to '''FIXED''' | |||
=== Invalid alerts === | === Invalid alerts === | ||
Line 166: | Line 226: | ||
A particular case are '''sccache hit rate''' tests. Most often, those alerts are invalid, but is the hit rate drops stay stays low for at least '''12-24h''' then a regression bug should be open.<br /> | A particular case are '''sccache hit rate''' tests. Most often, those alerts are invalid, but is the hit rate drops stay stays low for at least '''12-24h''' then a regression bug should be open.<br /> | ||
<br /> | <br /> | ||
== Handling Regressions == | == Handling Regressions == | ||
There are two different approaches to handling regressions: | There are two different approaches to handling regressions: | ||
Line 252: | Line 309: | ||
You also need to add the 'perf-alert' keyword, to indicate there is a performance alert associated with this bug. | You also need to add the 'perf-alert' keyword, to indicate there is a performance alert associated with this bug. | ||
Note: If by mistake you left a comment that is actually invalid, just add to the comment the tag '''obsolete'''. | |||
'''Attention!''' The summary could contain alerts reassigned from other alerts. You have to tick the box next to '''each untriaged alert''' and change its status to ''Acknowledge''. <br /> | '''Attention!''' The summary could contain alerts reassigned from other alerts. You have to tick the box next to '''each untriaged alert''' and change its status to ''Acknowledge''. <br /> | ||
Line 277: | Line 336: | ||
* harness - patches that updated that harness and caused improvements | * harness - patches that updated that harness and caused improvements | ||
* regression-backedout - patches backed out due to causing regressions | * regression-backedout - patches backed out due to causing regressions | ||
** ''applies to the alerts linked to bugs that backed out a regression'' '''NOT''' ''to alerts that are linked to regression bugs'' | |||
* infra - improvements caused by infra changes (changes not related to repository code) | * infra - improvements caused by infra changes (changes not related to repository code) | ||
* regression-fix - patches fixing a reported regression bug | * regression-fix - patches fixing a reported regression bug | ||
** ''applies to the alerts linked to bugs that fixed a regression'' '''NOT''' ''to alerts that are linked to regression bugs'' | |||
* improvements - patches causing improvements | * improvements - patches causing improvements | ||
Line 284: | Line 345: | ||
* harness - patches that updated that harness and caused regressions | * harness - patches that updated that harness and caused regressions | ||
* regression-backedout - patches backed out due to causing regressions | |||
** ''applies to the alerts linked to bugs that backed out a regression'' '''NOT''' ''to alerts that are linked to regression bugs'' | |||
* infra - regressions caused by infra changes (changes not related to repository code) | * infra - regressions caused by infra changes (changes not related to repository code) | ||
* regression-fix - patches fixing a reported regression bug | |||
** ''applies to the alerts linked to bugs that fixed a regression'' '''NOT''' ''to alerts that are linked to regression bugs'' | |||
* improvement-backedout - regression caused by backing out an earlier improvement | * improvement-backedout - regression caused by backing out an earlier improvement | ||
Line 313: | Line 378: | ||
= Special Cases = | = Special Cases = | ||
== Multiple Bug IDs on the same push == | == Multiple Bug IDs on the same push == | ||
=== Regressions === | |||
If the push that has been identified as the culprit has multiple revisions and bug ids linked to it, the sheriff can fill in the "Regressed By" field for each of the bugs with the value of the filed bug for the associated alert. <br/> | If the push that has been identified as the culprit has multiple revisions and bug ids linked to it, the sheriff can fill in the "Regressed By" field for each of the bugs with the value of the filed bug for the associated alert. <br/> | ||
If there's a regression-fix, regression-backedout or improvement-backedout bug to be linked to the alert but the revision contains multiple bugs, link the most relevant bug to the alert. If you're unsure, ni? the author to tell which of the bugs they think caused the alert. | If there's a regression-fix, regression-backedout or improvement-backedout bug to be linked to the alert but the revision contains multiple bugs, link the most relevant bug to the alert. If you're unsure, ni? the author to tell which of the bugs they think caused the alert. | ||
=== Improvements === | |||
The general approach for improvements is to leave needinfos to the authors to ask for help in identifiying the corresponding bug. | |||
== Sheriffing Firefox-Android Alerts == | == Sheriffing Firefox-Android Alerts == | ||
Line 330: | Line 398: | ||
Lastly, backfilling on the mozilla-central side when a change is detected in a GeckoView commit is done with try runs. [https://wiki.mozilla.org/Performance/Fenix/Performance_reviews#Backfilling_to_determine_culprit_commit| You can find instructions for that here]. | Lastly, backfilling on the mozilla-central side when a change is detected in a GeckoView commit is done with try runs. [https://wiki.mozilla.org/Performance/Fenix/Performance_reviews#Backfilling_to_determine_culprit_commit| You can find instructions for that here]. | ||
=== How to triage a firefox-android alert === | |||
1. Read the graph, check for job gaps (browsertime & build_metrics): | |||
* click on the graph icon next to an alert item (usually the highest magnitude) | |||
a thin vertical line will show for all the alerts associated with the test, you need to make sure you’re looking at the right one by hovering or * clicking on the datapoint | |||
* click on the job link to go to treeherder job view | |||
* make sure you’re seeing the jobs from current revision back and look for gaps | |||
2. Trigger the jobs in the revisions in the gap (browsertime & build_metrics) | |||
* For every revision in the gap click '''''Add new jobs''''' and trigger the needed job manually | |||
3. Find the responsible push that caused the regression | |||
* If the regression is caused by a mozilla-central push, then you have to take the build id (Update GeckoView (Nightly) to 121.0.'''20231106094018'''.) and search for the revision in [https://buildhub.moz.tools/ Buildhub] | |||
* Build-metrics: file a generic performance regression | |||
* Browsertime: it should be an autoland alert to mark as downstream for | |||
4. If it’s caused by a firefox-android push then the process of filling a regression bug should be straight-forward - the default one from [[TestEngineering/Performance/Sheriffing/Workflow#Filing_a_regression_bug|documentation]]. (browsertime & build_metrics) | |||
==== Special case ==== | |||
There’s a special case that happens very very rare when code from geckoview build (mozilla-central) doesn’t affect geckoview pageload but affects fenix (firefox-android). It is the case for alert [https://treeherder.mozilla.org/perfherder/alerts?id=40028&hideDwnToInv=0 40028]. In this case there’s no other option than bisection on mozilla-central with fenix build. The steps are: | |||
# Follow bisection workflow from documentation | |||
# Checkout to the previous commit of the suspect one | |||
# Download the [https://treeherder.mozilla.org/jobs?repo=firefox-android&group_state=expanded&tochange=71d772f64a8acfd1f916d8cc7fde01957c5ca2d5&fromchange=71d772f64a8acfd1f916d8cc7fde01957c5ca2d5&searchStr=fenix-android-all%2Copt%2CNightly-related%2Ctasks%2Cthat%2Crun%2Con%2Ceach%2Cgithub%2Cpush%2Cbuild-apk-fenix-nightly-simulation%2CB&selectedTaskRun=Zxf1CST0Ru61vzqrElJp3A.0 fenix-android-all fenix-nightlySim(B) build]: [https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Zxf1CST0Ru61vzqrElJp3A/runs/0/artifacts/public/build/target.armeabi-v7a.apk target.armeabi-v7a.apk] (artifacts) | |||
# Add it to the code using the command: ''./mach try perf --browsertime-upload-apk path/to//target.armeabi-v7a.apk'' | |||
# Commit ''hg commit -m="fenix apk"'' | |||
# Push to try the pageload job for geckoview ''Android 11.0 Samsung A51 AArch64 Shippable'' | |||
# Repeat steps 1-6 for pushing to try the suspected revision and then compare the results to try. | |||
== Alerts caused by modifying release flags (EARLY_BETA_OR_EARLIER) / bumping the Firefox version == | |||
These alerts are generated on mozilla-beta and have no bugs associated with the revisions: | |||
# Check the milestone version by looking into the changeset details. | |||
# Check if there was already a bugzilla bug created for this particualr milestone version. | |||
# If not, then clone the following bug https://bugzilla.mozilla.org/show_bug.cgi?id=1879080. <br/> | |||
# Leave a comment with the alert summary you were investigating. | |||
Comment 0 could include the following information: <br/> | |||
''There were some alerts generated on mozilla-beta, that have no bugs associated with the revisions. <br/>'' | |||
''Alert: https://treeherder.mozilla.org/perf.html#/alerts?id=<alert#number> <br/>'' | |||
''Pushlog: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=<rev>&tochange=<rev> <br/>'' | |||
''Changeset Details: https://hg.mozilla.org/releases/mozilla-beta/rev/<rev> <br/>'' | |||
''A few notes to consider:'' | |||
*''for mozilla-beta alerts, we normally search for downstreams on autoland'' | |||
*''most of the time, when a change is pushed to autoland, it is also pushed to mozilla-central during the same day'' | |||
*''on mozilla-beta the change is merged after ~ 6 weeks'' | |||
== build_metrics special cases == | |||
# '''''decision''''' alerts usually are just spikes and not relevant, which is why we are setting them as ''invalid''. In the case that, by looking at the coresponding graph, the regression wasn't improved / fixed, then check with the other sheriffs in the Sheriffs of Performance channel on Element since it might be a cause of concern. | |||
# '''''instrumented''''' in the Tags & Options column should be set as ''invalid''. Alerting will be deactivated for these tests in the future. | |||
== Regression which was already backedout == | |||
If we identify a regression alert summary and that particular changeset was already backout then the safe approach would be to file a bug by following the usual workflow, but mention in the comment0 that this was already backedout and there's no action item for the author. This bug is only created for tracking purposes / being able to acknowledge the change summary of the alert. | |||
== Jobs are failing to run == | |||
In this case we usually resort to bisection. | |||
== Invalid comments == | |||
If by mistake you left a comment that is actually invalid, just add to the comment the tag '''''obsolete'''''. | |||
= Resources = | = Resources = | ||
* [https://wiki.mozilla.org/TestEngineering/Performance/FAQ FAQ] | * [https://wiki.mozilla.org/TestEngineering/Performance/FAQ FAQ] |
edits