Notes for 67.0 release post mortem
10:30am PT Thursday June 11 (after the channel meeting) Zoom Meeting ID: 266 532 573 IRC: #release-drivers
marcia, Aryx, jcristau, neha, Ritu, lizzard, overholt, tania, RyanVM
- On nightly: 7 weeks
- On beta: 9 weeks
- Uplifts to beta: 395
** We had 19 betas instead of 17 because we delayed shipping by 1 week due to Armagaddon
** We also uplifted features and bug fixes for Trailhead (Firefox Account avatar, password manager entry points, Facebook container patches, new about:welcome)
** 66: 323 uplifts (15 betas)
- uplifts to RC: 6
** lower number than previous releases (64: 13, 65: 16, 66: 13)
* 38 issues during central-as-beta simulations (+3 issues when version would have increased to 68; no numbers for 66)
* (Ritu) Can we add update rollout %age and dates in the stats so we can look back and see how quickly we go from 0 to 100% release over release?
- fixed in 67: patches from 4437 bugs (66: 3475, 65: 4121)
* Fx67 PI requests and deadlines
** We had a total of 15 desktop features (1 bugwork); 5 features were moved to Backlog/future releases
** PI Request deadline was Jan 23; 3 features missed the deadline and were submitted by early Feb
** Technical documentation deadline was Jan 31; docs were delayed for 6 features (high compared to the last few releases)
** Code readiness date was Feb 5; features landed on time
* Total # of bugs reported by QA in 67 Nightly: 100, bugs reported in 67 Beta: 104
** Details: https://docs.google.com/spreadsheets/d/1IWUx-8AOADdzuKCRl7jo5oQ8djdS0kCjfqkAVAI8jxU/edit#gid=0
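Since 67 had more betas than 66, one rough way to compare uplift volume between the two cycles is to normalize by the number of betas. The sketch below uses only the uplift and beta counts listed above; the names `beta_stats` and `uplifts_per_beta` are illustrative and not part of any release tooling.

```python
# Figures copied from the stats above; the structure and names are illustrative only.
beta_stats = {
    "67": {"uplifts": 395, "betas": 19},
    "66": {"uplifts": 323, "betas": 15},
}


def uplifts_per_beta(uplifts: int, betas: int) -> float:
    """Average number of uplifts per beta build."""
    return uplifts / betas


# Rounded to one decimal place for readability.
rates = {
    version: round(uplifts_per_beta(s["uplifts"], s["betas"]), 1)
    for version, s in beta_stats.items()
}

for version, rate in rates.items():
    print(f"{version}: {rate} uplifts per beta")
```

By this measure 67 averaged about 20.8 uplifts per beta against about 21.5 for 66, so the higher total mostly reflects the longer beta cycle.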
What went well
- WNP was again not a problem
- Coordination with the teams that needed uplifts for Trailhead went generally well
- Stability on Release is good with no major crash caused by external software
- [marcia] Happy to see https://bugzilla.mozilla.org/show_bug.cgi?id=1556076 fixed so that macOS 10.15 users will be in a better state
- No forced dot release because of stability or security issues \o/ amazing!!!
- (liz) I think working with antivirus and the inject/eject project helped here...
- Fennec stability is decent on release (it was a bit high on beta)
- We agreed on a plan of action regarding LSNG with the DOM storage team and stuck to it (uplift to early betas, disabled with beta 10). This really helped the implementation!
- The bulk of uplifts came at the beginning of the beta cycle (30 to 50 uplifts up to beta 11), but late in the beta cycle we had fewer and less risky uplifts (~10 per beta); we could feel that we were stabilizing
- (Ritu) We should report uplifts that were rejected from next cycle, if possible. This can help the team learn which uplifts are higher risk.
- Few uplifts in RC, probably also because we had an extra week of beta
- All known regressions, including carry overs, had been dealt with and a decision taken about them before RC week. Great cooperation with the REO. \o/
- No bad surprise on release week
- [Fx67] Communication with Engineering was good and productive for some features – NextGen Local Storage, WebRender, Dedicated Profiles
- [Fx67] We added an additional round of AV testing in Beta (Total 3 rounds) and also added another AV to our Test Coverage (McAfee)
- [Fx67] Argentina team was onboarded quickly, thanks to Release Management and everyone else for helping QA to answer any questions during build testing/WNP
- [Trailhead] Communication between QA teams went well, despite the overall pace of things.
- [Trailhead] Postponing some of the Fx68 features helped us accommodate 67.0.1 requests more easily.
- We learned a few things from the 66.0.5 chemspill we had during the weekend of May 4 - 5 and we’re already taking steps to ensure better coverage and provide more detailed documentation for these emergency situations.
- We had to delay the release by one week because the team was firefighting Armagaddon in parallel and the state of Nightly 68 was too bad to merge to beta. One merge blocker happened on localized builds only, and not for the first time. What can we do about that? (Better support for localized build failures and support for localized builds in mozregression would be great.)
- (Ritu) Let's take this to l10n team. I can try!
- We had to manage the Trailhead release in parallel, so it felt like shipping two big releases at a 2-week interval during the whole beta cycle.
- (julien) should we have assigned a separate release owner for 67.0.1?
- ... discussion suggests probably not
- I was on PTO the week before Trailhead, Liz and Ryan covered for me but maybe it caused stress to other teams to see the release owner change for a week
- (liz) it was fine with me! i was 2ndary owner anyway.
- (Ryan) Pascal was diligent about summarizing things, so filling in wasn't too painful
- We had a couple of bad crashers during the whole beta cycle and it took time to identify the causes and get fixes. Our top crasher (https://bugzilla.mozilla.org/1535699) was fixed in RC after several attempts at fixing it over the beta cycle. (For a bit of context, this was a Fission-related change that manifested itself with a Service Worker + Necko interaction that, as you mention, took a while to figure out.)
- Consider bubbling up blockers (attention attention!) to eng mngrs going forward.
- Can we improve nagbot emails to better-highlight newly-added issues? Ritu noted that some people ignore nagbot emails :)
- We could especially note when one team has multiple blockers assigned to it.
- NI triage owner as another mitigation plan
- Late in the beta cycle, Armagaddon uplifts + Trailhead uplifts + regular uplifts were a lot to manage in parallel. Fortunately we are more than 3 release managers now, but I don't think we could have done it with a smaller team (2 years ago)
- We had few uplifts in RC (6) but 3 of them were release blockers (thanks to QA for finding out https://bugzilla.mozilla.org/1552156 and https://bugzilla.mozilla.org/1551455). Maybe testing partner build updates should be done earlier in beta?
- [Tania] taking this as an action item on me, will discuss with SV team
- We had problems with nucleus (our Release Notes CMS) for 67.0.1 and 67.0.2. We are following up with the webdev team on both the sync issues and improving it (ex: https://github.com/mozilla/bedrock/issues/7277).
- One uplift in 65 to fix a regression (https://bugzil.la/1495363) caused a worse regression (https://bugzil.la/1542912), but we realized that too late in the 67 beta cycle to just back it out without risking new regressions, so we decided with bz to back it out first in pre-release and later in a dot release. We (collectively) could have done better at triaging and prioritizing duplicates of this regression.
- Can bugbug (ML) group these duplicates and therefore help with criticality of this regression from 65?
- On a more general note, I feel that we lack a proper way in Bugzilla to indicate that we are not taking an uplift in the initial release (not enough bake time in pre-release channels) but will most likely take it in a dot release. If we do more trailhead style releases, we should think of improving our tracking flags.
- (Ritu) We should standardize this in our team meeting
- (julien) with 4 week release cycles maybe those no longer exist as they can wait until the next one? ;) more seriously i use the release tracking spreadsheet for dot release uplift candidates
- I use tracking/blocking for this. (liz)
- Test failure on release: https://bugzilla.mozilla.org/show_bug.cgi?id=1551347
- Doing weekly beta-as-release simulations from now on (already caught 2 issues for 68)
- (julien) Thanks! \o/
- (marcia) Confusing to have a 67.0.1 show up for Fennec even when we didn't ship it. Turns out it is the Chinese build. Not sure if there is anything we can do about it.
- A few other Socorro-related things called out in https://wiki.mozilla.org/Firefox/Channels/Meetings/2019-06-06#Roundtable
- (marcia) Having the release version not match the tracking flag was a bit confusing
- (Ritu) +1000 :)
- The use of the firefox67 flag changed: at first firefox67 was used for pre-Trailhead and firefox67.0.5 for Trailhead; later firefox67 was what had to be set as fixed even for post-Trailhead
- (julien) did we have the same confusion around 50.1 a couple of years ago? was it managed differently?
- (Ritu) I think I managed this and the version was not open to discussion or change ;)
- [Trailhead] There were many 67.0.1 related requests made at the last minute. This translated to more pressure, stress and the need for overtime for QA to test everything in time.
- [Trailhead] The general feeling was that the QA deadline for 67.0.1 requests was considered a hard deadline, while Engineering making requests well after their deadline was treated as OK.
- [Trailhead][FxA] Communication with devs was slow and the feature kept changing. The changes were not documented and we weren’t notified in time to update the test cases.
- [Trailhead][Mozilla.org] Documentation arrived late and the scope of testing was reduced several times; in the end we were required to check only the basics, disregarding the UX specifications.
- [Trailhead] There was a bit of confusion caused by the versioning used for this release: 67.5 to 67.0.5 to 67.0.1.
- [Trailhead] Mobile QA team - Confusion caused by the update tests for build 66.0.4 because the mobile process is different from desktop. Mobile team doesn't perform update tests.