Firefox/Channels/Postmortem/46
Contents
- 1 Firefox 46 Post-mortem
- 1.1 please add more issues here!
- 1.1.1 Issue: build promotion
- 1.1.2 Use of the Blocking flag
- 1.1.3 Great teamwork on critical, release blocking issue
- 1.1.4 Issue: Need for backup QE manual, update testers in US time zones
- 1.1.5 Issue: Too many builds, heavy load on QE
- 1.1.6 Issue: More uplifts, more tracked bugs
- 1.1.7 Issue: untracked, untagged bugs
- 1.1.8 Issue: antivirus/junkware/malware
- 1.1.9 Fun with Bizdev
- 1.1.10 Issue: Release notes
- 1.1.11 Issue: Last minute surprises
Firefox 46 Post-mortem
Firefox 46 release post-mortem
Rescheduled to Tues. May 17, 10:30am PST (or just after the channel meeting finishes)
Vidyo room: Release Coordination
irc: #relman
please add more issues here!
Release health status at halfway point in 46 release: https://mozilla.github.io/releasehealth/?channel=release
Anything relman could be doing better from your perspective? Let us know here...
Issue: build promotion
- We knew this would have some issues to iron out, and it did. Some builds just failed. For some betas, we needed multiple builds.
- At some point, we tried to use build promotion to get a jump on shipping a newer changeset. In some ways this backfired, since it meant SV tested two builds.
- Mozmill broke
- l10n broke
- doing things for the first time on a Friday means fewer people are around when problems come up
- summary of beta builds
- Beta 1 only released to en-US.
- Beta 2 was 2 days late.
- Beta 3 skipped.
- Beta 4, OK but needed several builds.
- Beta 5: Easter.
- Beta 6: still Easter; also, a sec issue meant it was not shipped.
- Beta 7 shipped 3 days late. At this point we pushed back the release date by 1 week.
- More detailed notes on what went wrong below (under the line of =====)
Use of the Blocking flag
- We used the "blocking" flag and the release health dashboard to mark blockers for release and for the point release. This worked well for me (liz) to keep everyone focused on the blocking issues and to communicate clearly across teams.
- Also showed on the TVs in various offices
Great teamwork on critical, release blocking issue
- Shout out to philipp, mayhemer, and FoxShadow for last-minute work on bug 1268922 and in SUMO: https://bugzilla.mozilla.org/show_bug.cgi?id=1268922 https://support.mozilla.org/en-US/questions/1120558
- philipp and FoxShadow did a lot of work very quickly, finding regression ranges and downloading test builds to diagnose the issue and then verify the fix on release. Sending FoxShadow some Mozilla swag.
Issue: Need for backup QE manual, update testers in US time zones
Action:
(for holidays and for unexpected situations) Handoff is important; we need to communicate better. Michelle F. is now able to run the update testing. (But if there are problems, we still don't have help on a Friday to fix them.)
Issue: Too many builds, heavy load on QE
Action:
- plan better around holidays (deliberately eliminate a beta build?)
- More people at SV to help
- Plan better for the overlap of 2 ESRs (we could note it on the release calendar)
Issue: More uplifts, more tracked bugs
Action:
- We increased the number of bugs we looked at (in the 46 beta cycle) with platform triage queries. More churn in beta.
- The backlog/ workload on relman should slowly improve as we catch up on resolving carryover regressions
- The extra work from the platform triage team helped find and resolve more issues! (but also meant more work for relman)
Issue: untracked, untagged bugs
Action:
7 or 8 blocking issues were put in as uplift requests on Monday morning after the beta to release merge, on the wrong channel (mozilla-beta rather than m-r)! None were tracked or tagged "regression". That is too sloppy. Devs and QE both need to add "regression" or request tracking!
Issue: antivirus/junkware/malware
- Causing crashes or other issues, we don't have a clear path to fix this stuff
- frequent driver of dot releases/ panic in late beta
- Maybe we could add more early beta testing with win/mac common a/v software?
Is there any automated testing we could do like this?
- build on Windows with a/v, check for startup crash and see that pages load
- analyze crashes which contain unusual DLLs on earlier channels?
Action: try building some tools to expose these crashes sooner, investigate automation/canary idea, follow up with bsmedberg too
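A rough sketch of how the "unusual DLLs" idea might be prototyped. Everything here is an assumption for illustration: the crash report shape (a dict with a "modules" list), the function name unusual_modules, and the allowlist are hypothetical and do not reflect any existing crash-stats schema or tool.

```python
from collections import Counter


def unusual_modules(crash_reports, known_modules, min_reports=2):
    """Return (module, report_count) pairs for modules not in the allowlist.

    crash_reports: iterable of dicts with a "modules" list (hypothetical shape).
    known_modules: set of expected module names, compared case-insensitively.
    min_reports:   ignore modules seen in fewer than this many reports.
    """
    known = {name.lower() for name in known_modules}
    counts = Counter()
    for report in crash_reports:
        # Count each module at most once per report.
        seen = {name.lower() for name in report.get("modules", [])}
        for name in seen - known:
            counts[name] += 1
    flagged = [(name, n) for name, n in counts.items() if n >= min_reports]
    # Most frequent first, then alphabetical for stable output.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))
```

Run against crash data from an earlier channel, a list like this could point at third-party modules worth testing before they reach release.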
Fun with Bizdev
Issue: last minute partner deal uplift surprise
Action:
Desktop (Search)
- https://bugzilla.mozilla.org/show_bug.cgi?id=1264786
- https://bugzilla.mozilla.org/show_bug.cgi?id=1266462
Fennec (Distribution partners)
- https://bugzilla.mozilla.org/show_bug.cgi?id=1260758 Less than 2 weeks before release?!
- https://bugzilla.mozilla.org/show_bug.cgi?id=1262591
- We did 2 extra RC builds for this in the last week of beta. Then, it (Search) partially drove a dot release, 46.0.1
- Google deal: we knew about it earlier and expected Google to need more time, but they didn't.
- Point release vs. doing late betas?
- Set up a private repo/branch for testing, earlier.
- Test plan from BizDev (mkaply should have done that)
Congrats on your baby mconnor :) \o/
Issue: Release notes
Action:
- not very many release notes. no one was nominating bugs for relnotes
- it is hard on every channel to put together notes while doing other last minute work
- mistakes in android notes, fixed a day after release
- default/profiles removed, important for enterprise, no release note
Issue: Last minute surprises
Action:
Add-on team: several issues. Action:
- Add-on signing cert expiration issue drove the 45.1.1esr, 46.0.1 mobile and desktop dot releases. Our tests caught this just before the release (over the weekend).
- last minute beta / addon cookie compatibility stuff
https://bugzilla.mozilla.org/show_bug.cgi?id=1259169#c48 Turned out not to be needed on beta after all. There was no try push, which would have told us it was a bad idea.
This should have had a security review before moving to aurora:
- https://bugzilla.mozilla.org/show_bug.cgi?id=1245956 side loaded addon certs (landed in feb; last minute problem in beta 11)
Sync client FxA traffic overload!
action:
FxA server db was overloaded by client updates from Sync client code. https://bugzilla.mozilla.org/show_bug.cgi?id=1262312
Postmortem on this specific issue - https://docs.google.com/document/d/1OxHpHxqgEHMNW7ue7_qd8PYqHGPQvsEx3xla9vDbKGE/edit
Root cause: a bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1262312) in the Fx46.0.0 FxA client code that sent a POST every time the device name was simply read (rather than actually changed). This caused an unexpected increase in POST operations sent to FxA, which put excessive load on the FxA db server.
Resolution: A hotfix (46.0.1) was released, which uplifted the existing fix to the original bug.
Next steps: see doc. Various process changes, and monitoring changes to improve detection of problems in Beta release.
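To make the root cause above concrete, here is a minimal sketch of the pattern that avoids it: reads never touch the network, and a write only notifies the server when the value actually changes. This is not the actual Firefox or FxA client code; the class name DeviceRegistration and the post callable are hypothetical stand-ins.

```python
class DeviceRegistration:
    """Hypothetical client-side holder for a device name."""

    def __init__(self, name, post):
        self._name = name
        # `post` is a stand-in for the network call to the server.
        self._post = post

    def get_name(self):
        # Reading the name must not generate any server traffic.
        return self._name

    def set_name(self, new_name):
        # Only send an update when the name actually changes.
        if new_name == self._name:
            return False
        self._name = new_name
        self._post({"name": new_name})
        return True
```

With a guard like this, repeated reads and no-op writes generate no requests, so server load scales with real changes rather than with how often the client looks at the value.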
GTK2 watershed
- We forgot to put in a watershed to prevent people using gtk2 from updating to 46. There was a bug for this that should have blocked the feature moving to release. https://bugzilla.mozilla.org/show_bug.cgi?id=1227023 But that bug was only tracked (and marked fixed) for 45. We didn't realize this until a week and a half after releasing 46 and just after releasing 46.0.1 (updates were turned off but 22MM instances were already on 46.0)
BSD builds were broken
- No one escalated till after the (dot) release; then they complained a lot. 2nd (3rd?) tier platform.
- is this fixed for 47 now? liz will check
Miscellaneous problems
- 45 dot releases meant no one had time to work on 48 in nightly. We came into aurora 48 with many issues untouched. uplifts are backed up. This situation should improve with Marcia now on the team focusing on tracking requests/regressions in nightly + the Uptime team.
- fx android beta download links broken https://bugzilla.mozilla.org/show_bug.cgi?id=1262460
Issue: Missing signatures/key/hash for 45.0.1/.0.2
Issue: EME free repacks not offered till late in beta 46
Issue: We lacked documentation on partial builds for RC builds. I did it wrong; nthomas corrected it and I added it to the beta checklist wiki page.
- 48 aurora on Android had an update blocker (Tuesday of release week). Sat for 4 days before it was fixed over the weekend. No clear owner to move it forward
- more stuff: metrics about how many people are running which version? how to expose this better?
============
Nitty gritty details of Beta issues (mostly build promotion, l10n, infrastructure, also the sec issue) We don't have to go over this point by point!
- (KaiRo) After release promotion caused us to have the builds later than usual (expected), we did run into the issue of mozmill being broken with Firefox 46+ and had to switch update tests to Marionette on short notice
- https://bugzilla.mozilla.org/show_bug.cgi?id=1255566 and https://github.com/mozilla/mozmill-ci/issues/765
- Henrik (whimboo) put in a lot of work within a short time to get this up and running
- KaiRo was unprepared for the very different workflow for finding out what the test errors are (not much treeherder experience)
- On the plus side, we now run the same tests as nightly/aurora and they report to treeherder in a nice fashion
- This was also happening on a Friday, which was extra stressful, and SV was in Vegas and not around
- (KaiRo) once update tests for b1 ran correctly (on Friday), we ran into some locales failing and realized they were completely broken. Finally tracked this to L10n-merge being broken on the releng side
- https://bugzilla.mozilla.org/show_bug.cgi?id=1255811 has the details
- really stressful to run into yet another issue on Friday, when b1 is far overdue
- apparently we don't even do any basic testing that localized builds run, catlee filed https://bugzilla.mozilla.org/show_bug.cgi?id=1255825 but we probably need some very basic UI testing as well to catch if the browser window even comes up correctly
- Liz decided in the end to ship b1 updates for en-US only
- beta 2
- released 2 days late (build promotion/l10n issues) on Thurs. Mar 17
- This was the first beta for non-en-US locales. So most crash data is from March 17 onwards.
- beta 3 skipped
- we did not have time to build and release anything significantly different from beta 2.
- beta 4 went ok. But it had to have a build 2 for fennec and build 3 for desktop. Stress for releng + relman + SV
- beta 5 - Easter Friday, stressful, not enough people around
- Beta 6 also stressful, still Easter. In retrospect we should have planned to move the date to Tues/Wed.
- BUT just as we were about to ship, we ran into the sec/infrastructure issue. Could not ship beta 6.
- build post processing was also not right so we would have needed a build 2.
Beta 7 mobile + desktop
- Build failures (bringing servers back online from "infrastructure issues")
- Could not go with the earlier beta 6 build as there were some errors there anyway
- could have shipped beta 7 late on Friday, but didn't realize this till Monday
- Beta 7 released on Monday morning April 4
- Decision to push back release date
- Beta 8