The Firefox Nightly respin process
|Purpose||Update our users' broken nightlies|
|Why||A patch on mozilla-central broke Nightly badly (crash, broken UI…)|
|Results||Nightly population retention|
|People involved||Relman, releng, sheriffs|
This page details the process to back out a faulty patch and, if needed, get nightlies respun.
- 1 Summarized process
- 2 When should we back out a patch?
- 3 How to find the patch to back out
- 4 Bug filing
- 5 Stopping automatic background updates
- 6 Asking for a back out of the patch and new nightlies
- 7 Communicating about the issue
Summarized process
Desktop
- File a bug with as much detail as possible about the regression or crash, if one hasn't been filed yet
- Ask releng to stop automatic nightly updates, referencing the bug filed in the first step
- Warn our users about the regression via our Twitter account and the #nightly IRC channel, and give the bug number
- Investigate to find the faulty patch, using mozregression or, for crashes, stack traces
- Ask sheriffs to back out the patch and respin the nightlies, giving the bug number as reference
- Mark the bug as blocking the bug referenced by the faulty patch
- Ask the patch author to investigate the regression (NeedInfo in Bugzilla)
- When updates are back, announce on Twitter and IRC that the fix is being served
Mobile
- Find the regression as soon as possible and notify the author of the regression via #mobile-android-team (tag managers)
- Identify the affected population
- Identify the best method to limit the population exposed to the unstable build
- Warn our users about the regression via Matrix (#Fenix or #Nightly channel)
- Ask the patch author or the Mobile team to investigate the regression (NeedInfo in Bugzilla or a tag on GitHub)
- When updates are back, announce that the fix is available
When should we back out a patch?
Desktop and Mobile
We want to back out a patch when a significant regression is identified. This is usually either a functional regression (browser unusable, content rendering broken) or a sudden spike of crashes on the Nightly channel.
We want the nightly channel to be as stable and usable by our community as possible. If the browser is crashing or barely usable, our users leave, and we need to make sure we have a sizable nightly community. Deciding whether to back out a patch, block automatic updates, and rebuild nightlies after the back out is a balance between the fact that bugs are expected on this channel and the fact that we want this channel to be of the highest possible quality so that it keeps a user base. The Release Management team is in charge of finding this balance.
How to find the patch to back out
If it is a functional regression (reproducible case), then we should use mozregression. If it is a spike in crashes that is not necessarily reproducible (random crashes while surfing), then our crash analysis experts in the Release Management team should be contacted. Analyzing the stack trace together with the hg logs on mozilla-central often reveals the bug number that introduced the instability.
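For a reproducible regression, the search is conceptually a bisection between a known good build and a known bad build. The sketch below illustrates that idea over nightly build dates only; it is not how mozregression is implemented. The `is_bad` callback is hypothetical and stands in for a person installing the build from a given date and checking whether the bug reproduces.

```python
from datetime import date, timedelta
from typing import Callable


def bisect_nightlies(good: date, bad: date, is_bad: Callable[[date], bool]) -> date:
    """Return the date of the first broken nightly between `good` and `bad`.

    Assumes the build from `good` works, the build from `bad` is broken,
    and every build after the first broken one is broken too.
    """
    if good >= bad:
        raise ValueError("the good build must be older than the bad build")
    while (bad - good).days > 1:
        middle = good + timedelta(days=(bad - good).days // 2)
        if is_bad(middle):
            bad = middle   # the regression landed on or before `middle`
        else:
            good = middle  # the regression landed after `middle`
    return bad
```

Each round halves the date range, so a few weeks of nightlies need only a handful of manual checks. The changesets that landed between the last good and the first bad build are the candidates for the back out.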
Bug filing
- If the bug was already filed by a community member, then use it to track the regression and qualify it. Add the nightly-community keyword if missing.
- If it is a crash, get a Crash ID from the people that reported it and file a bug via Socorro.
- If it is a functional regression and no bug was filed yet, file it.
Set the status-firefoxN tracking flag to affected, the tracking-firefoxN flag to blocking, and the target milestone to mozillaN, where N is the version number for Nightly.
Once the back out is done, mark the bug as FIXED and change the status-firefoxN tracking flag from affected to fixed.
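As a worked illustration of the naming scheme above, the following sketch builds the flag values for a given Nightly version, before and after the back out. The dictionary keys are the labels used in this page, not Bugzilla API field names, and the function name is hypothetical.

```python
from typing import Dict


def regression_bug_flags(nightly_version: int, backed_out: bool = False) -> Dict[str, str]:
    """Return the flag values described above for Nightly version N.

    Keys are display labels taken from this page (an assumption for
    illustration), not Bugzilla REST field names.
    """
    n = nightly_version
    flags = {
        f"status-firefox{n}": "fixed" if backed_out else "affected",
        f"tracking-firefox{n}": "blocking",
        "target milestone": f"mozilla{n}",
    }
    if backed_out:
        # Once the back out has landed, the bug itself is marked FIXED.
        flags["resolution"] = "FIXED"
    return flags
```

Calling the function with `backed_out=True` gives the state the bug should be in once the back out has landed.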
The bug number will be used to track the work to fix the regression. Communicate to our community that a bug exists so as to avoid having many duplicate bugs filed.
Stopping automatic background updates
If you think that a lot of people are going to be impacted by a regression, ask releng or relman to stop automatic updates.
Blocking automatic updates will not prevent new users from installing Firefox Nightly from mozilla.org, but it will greatly mitigate the impact on our existing user base.
We can ask releng for automatic updates to be stopped for a specific OS and potentially set up a fallback update mechanism to the last good known builds.
Most of the time, it is Relman that stops updates via Balrog, which stops them for all OSes.
Most major regressions are reported immediately by followers of our @FirefoxNightly Twitter account. When more than 2 people report a similar regression, there is a high chance that it is serious, and automatic updates should be stopped rapidly.
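The "more than 2 people" rule of thumb can be sketched as a small triage helper. This is an illustration only: the `(reporter, symptom)` input format and the function name are assumptions, and grouping reports into a "similar regression" is left to human judgment before the data reaches this function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def symptoms_to_escalate(reports: Iterable[Tuple[str, str]], threshold: int = 2) -> List[str]:
    """Return symptoms reported by more than `threshold` distinct people.

    `reports` holds (reporter, symptom) pairs; the same person reporting
    the same symptom twice is counted once.
    """
    reporters_by_symptom: Dict[str, Set[str]] = defaultdict(set)
    for reporter, symptom in reports:
        reporters_by_symptom[symptom].add(reporter)
    return sorted(
        symptom
        for symptom, reporters in reporters_by_symptom.items()
        if len(reporters) > threshold
    )
```

Counting distinct reporters rather than raw messages keeps one very vocal user from triggering an update freeze alone.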
On mobile, if you think that a lot of people are going to be impacted by a regression, you need to identify the best method to limit the population exposed to the unstable build.
If the offending commit or issue has not reached the mobile build, you can stop the mobile nightly builds by cancelling the scheduled hook.
If the significant regression is due to GeckoView changes, the Relman team can create a new commit to "rollback" the GeckoView bump to the last stable build. If a mobile change or commit is causing a crash spike or significant regression, the Relman team can back out the patch. Depending on the timing of either of these changes, you might need to manually trigger the nightly builds.
If the issue is specific to a device or OS version, the app can be limited or blocked from those devices in the Google Play Store via the device catalog.
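The mobile options above can be summarized as a small decision helper. This is a sketch under stated assumptions: the enum, the function, and its three boolean inputs are hypothetical names, and the order in which options are returned is an illustrative choice rather than documented policy.

```python
from enum import Enum
from typing import List


class Mitigation(Enum):
    CANCEL_SCHEDULED_HOOK = "cancel the scheduled nightly hook"
    ROLL_BACK_GECKOVIEW = "roll back the GeckoView bump to the last stable build"
    BACK_OUT_PATCH = "back out the mobile patch"
    LIMIT_DEVICES = "limit or block affected devices in the Play Store device catalog"


def choose_mitigations(
    reached_mobile_build: bool,
    caused_by_geckoview: bool,
    device_specific: bool,
) -> List[Mitigation]:
    """Map the situation described above to the applicable mitigations."""
    if not reached_mobile_build:
        # Nothing bad has shipped yet: stopping the next build is enough.
        return [Mitigation.CANCEL_SCHEDULED_HOOK]
    options: List[Mitigation] = []
    if device_specific:
        options.append(Mitigation.LIMIT_DEVICES)
    if caused_by_geckoview:
        options.append(Mitigation.ROLL_BACK_GECKOVIEW)
    else:
        options.append(Mitigation.BACK_OUT_PATCH)
    return options
```

After a rollback or back out, a manual nightly trigger may still be needed depending on timing, as noted above.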
Asking for a back out of the patch and new nightlies
You can contact sheriffs in the #sheriffs IRC channel to back out the patch that caused the regression when you have identified it. The back out commit will reference the bug number.
If the next nightly is about to be built and the impact is moderate, it is not necessarily worth asking for nightly builds to be respun after the back out.
Note: Some members of the release management team have the technical knowledge and permissions to back out patches.
On mobile, if the next nightly is about to be triggered, it can be canceled via Taskcluster.
If a new build is needed, the Relman team can trigger new mobile nightly builds via a Taskcluster hook once the branch is in a stable state.
Communicating about the issue
We should not hesitate to communicate the issue with a reference to the bug number to our community so as to minimize the number of duplicate bugs. If the issue needs steps to reproduce which are not obvious or a specific hardware/OS combination, having all communications centralized in a single bug helps.
We should also remember to communicate when the issue is resolved and urge people to upgrade to the updated nightly (so as to reduce automatic crash reports and unneeded bug filings).
Communicating about major regressions in Nightly is also part of the informal social contract we have with our alpha testers. Making sure they are informed of major technical issues impacting them helps keep them engaged.
When updates are stopped, this will be automatically indicated on https://whattrainisitnow.com/release/?version=nightly with the reason message (usually linking to a bug) entered by release managers or sheriffs in Balrog when they stopped updates.
Once a major regression is identified, we should communicate the situation to the right party or team, along with a mitigation plan. Include the affected population, or set out a plan to investigate this metric. Be sure to tag mobile managers via Slack when communicating the problem. In case of an emergency or if immediate escalation is required, use Mozilla People resources to contact the Mobile Managers.
The main channels for communicating a mobile nightly regression are our #Fenix chatroom on Matrix/Element and our #Mobile-Android-Team Slack channel.