Firefox/Channels/Postmortem/46

From MozillaWiki

Firefox 46 Post-mortem

Firefox 46 release post-mortem
Rescheduled to Tues. May 17, 10:30am PST (or just after the channel meeting finishes)
Vidyo room: Release Coordination
irc: #relman

Please add more issues here!

Release health status at halfway point in 46 release: https://mozilla.github.io/releasehealth/?channel=release


Anything relman could be doing better from your perspective? Let us know here...



Issue: build promotion

  • We knew this would have some issues to iron out, and it did. Some builds just failed; others needed multiple builds.
  • At one point, we tried to use build promotion to get a jump on shipping a newer changeset. In some ways this backfired, since it meant SV tested two builds.
  • Mozmill broke
  • l10n broke
  • Doing things for the first time on a Friday means fewer people are around to deal with problems
  • summary of beta build
    • Beta 1 only released to en-US.
    • Beta 2 was 2 days late.
    • Beta 3 skipped.
    • Beta 4, OK but needed several builds.
    • Beta 5: Easter.
    • Beta 6: still Easter. Also, a sec issue meant it was not shipped.
    • Beta 7 shipped 3 days late. At this point we pushed back the release date by 1 week.
    • More detailed notes on what went wrong below (under the line of =====)


Use of the Blocking flag

  • We used the "blocking" flag and the release health dashboard to mark blockers for release and for the point release. This worked well for me (liz) to keep everyone focused on the blocking issues and to communicate clearly across teams.
    • Also showed on the TVs in various offices

Great teamwork on critical, release blocking issue


Issue: Need for backup QE manual, update testers in US time zones

Action:

  • (For holidays and for unexpected situations.)
  • Handoff is important; we need to communicate better.
  • Michelle F. is now able to run the update testing. (But if there are problems, we still don't have help on a Friday to fix them.)

Issue: Too many builds, heavy load on QE

Action:

  • plan better around holidays (deliberately eliminate a beta build?)
  • More people at SV to help
  • Plan better for the overlap of 2 ESRs (we could note it on the release calendar)


Issue: More uplifts, more tracked bugs

Action:

  • We increased the number of bugs we looked at (in the 46 beta cycle) with platform triage queries. More churn in beta.
  • The backlog/ workload on relman should slowly improve as we catch up on resolving carryover regressions
  • The extra work from the platform triage team helped find and resolve more issues! (but also meant more work for relman)

Issue: untracked, untagged bugs

Action:

7 or 8 blocking issues were put in as uplift requests on Monday morning after the beta to release merge, on the wrong channel (mozilla-beta rather than m-r)! None were tracked or tagged "regression". That is too sloppy. Devs and QE both need to add "regression" or request tracking!

Issue: antivirus/junkware/malware

  • Causing crashes or other issues, we don't have a clear path to fix this stuff
  • frequent driver of dot releases/ panic in late beta
  • Maybe we could add more early beta testing with win/mac common a/v software?
  • Is there any automated testing we could do like this?
    • Build on Windows with a/v, check for startup crashes and check that pages load.
    • Analyze crashes which contain unusual DLLs on earlier channels?
Action:  try building some tools to expose these crashes sooner, investigate automation/canary idea, follow up with bsmedberg too
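
As a starting point for the "unusual DLLs" idea, here is a minimal sketch of the kind of tool that could be tried. It is not an existing tool and does not use any real crash-stats API: the crash report shape (a dict with a "modules" list), the KNOWN_MODULES allowlist, and the min_crashes threshold are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical allowlist of modules we expect to see in a Firefox process.
# A real list would be much longer and would need regular maintenance.
KNOWN_MODULES = {"xul.dll", "mozglue.dll", "nss3.dll", "kernel32.dll", "ntdll.dll"}


def unusual_dlls(crash_reports, known=KNOWN_MODULES, min_crashes=2):
    """Return (dll, crash_count) pairs for unknown DLLs, most frequent first.

    crash_reports: iterable of dicts with a "modules" list of DLL file names.
    This is an assumed, simplified shape; real crash reports carry more detail.
    """
    counts = Counter()
    for report in crash_reports:
        # Count each DLL at most once per crash, ignoring case.
        seen = {name.lower() for name in report.get("modules", [])}
        counts.update(seen - known)
    flagged = [(dll, n) for dll, n in counts.items() if n >= min_crashes]
    # Highest crash count first, then by name so the order is stable.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        {"modules": ["xul.dll", "ntdll.dll", "avhook.dll"]},
        {"modules": ["xul.dll", "AVHook.dll", "toolbar.dll"]},
        {"modules": ["xul.dll", "kernel32.dll"]},
    ]
    print(unusual_dlls(sample))
```

Running something like this against nightly/aurora crash data and diffing the result between builds might surface a/v or junkware modules before they hit late beta, but whether the signal is useful in practice would need investigation.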

Fun with Bizdev

Issue: last minute partner deal uplift surprise

Action:

Desktop (Search)

Fennec (Distribution partners)

  • We did 2 extra RC builds for this in the last week of beta. Then, it (Search) partially drove a dot release, 46.0.1

  • Google deal: we knew about it earlier and expected Google to need more time(?), but they didn't.
  • Point release vs. doing late betas?
  • Set up a private repo/branch for testing, earlier.
  • Test plan from BizDev (mkaply should have done that)


Congrats on your baby mconnor :) \o/


Issue: Release notes

Action:

  • Not very many release notes. No one was nominating bugs for relnotes
  • It is hard on every channel to put together notes while doing other last minute work
  • Mistakes in the Android notes, fixed a day after release
  • default/profiles was removed, which is important for enterprise, but there was no release note

Issue: Last minute surprises

Action:

Add-on team: several issues.

Action:

  • Addon signing cert expiration issue drove the 45.1.1esr, 46.0.1 mobile and desktop dot releases. Our tests caught this just before the release (over the weekend).
  • Last minute beta / addon cookie compatibility stuff: https://bugzilla.mozilla.org/show_bug.cgi?id=1259169#c48
    • Turned out not to be needed on beta after all.
    • No try push, which would have told us it was a bad idea.

This should have had a security review before moving to aurora:


Sync client FxA traffic overload!

Action:

  • FxA server db was overloaded by client updates from Sync client code. https://bugzilla.mozilla.org/show_bug.cgi?id=1262312
  • Postmortem on this specific issue: https://docs.google.com/document/d/1OxHpHxqgEHMNW7ue7_qd8PYqHGPQvsEx3xla9vDbKGE/edit

   Root cause: a bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1262312) in the Fx46.0.0 FxA client code sent a POST every time the device name was simply read (rather than actually changed). This caused an unexpected increase in POST operations sent to FxA, putting excessive load on the FxA db server.
   Resolution: A hotfix (46.0.1) was released, which uplifted the existing fix for the original bug.
   Next steps: see doc. Various process changes and monitoring changes to improve detection of problems in the Beta release.
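
To illustrate the class of bug described in the root cause, here is a minimal sketch of the intended behavior: reading the device name is purely local, and a request goes out only when the name actually changes. This is not the Firefox client code; the DeviceNameSync class and the post_update callable are hypothetical stand-ins for the real client and its network request.

```python
class DeviceNameSync:
    """Illustrative client-side holder for a device name.

    `post_update` is a hypothetical callable standing in for the
    network request to the server.
    """

    def __init__(self, name, post_update):
        self._name = name
        self._post_update = post_update

    def get_name(self):
        # Reading is a local operation: no request is sent.
        return self._name

    def set_name(self, new_name):
        # Only notify the server when the value actually changes.
        if new_name == self._name:
            return False
        self._name = new_name
        self._post_update(new_name)
        return True


if __name__ == "__main__":
    posts = []  # Records every simulated request.
    device = DeviceNameSync("laptop", posts.append)
    device.get_name()           # read: no request
    device.set_name("laptop")   # unchanged: no request
    device.set_name("desktop")  # changed: one request
    print(posts)
```

A check along these lines (counting requests per read/write in a test) is one possible way to catch this kind of regression before it reaches release, alongside the monitoring changes in the doc.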


GTK2 watershed

  • We forgot to put in a watershed to prevent people using gtk2 from updating to 46. There was a bug for this that should have blocked the feature moving to release. https://bugzilla.mozilla.org/show_bug.cgi?id=1227023 But that bug was only tracked (and marked fixed) for 45. We didn't realize this until a week and a half after releasing 46 and just after releasing 46.0.1 (updates were turned off but 22MM instances were already on 46.0)



BSD builds were broken

  • No one escalated till after the (dot) release; they complained a lot
  • 2nd (3rd?) tier
  • Is this fixed for 47 now? liz will check


Miscellaneous problems

  • 45 dot releases meant no one had time to work on 48 in nightly. We came into aurora 48 with many issues untouched. Uplifts are backed up. This situation should improve with Marcia now on the team focusing on tracking requests/regressions in nightly + the Uptime team.

Issue: Missing signatures/key/hash for 45.0.1/.0.2

Issue: EME free repacks not offered till late in beta 46

Issue: We lacked documentation on partial builds for RC builds. I did it wrong. nthomas corrected it and I added it to the beta checklist wiki page.

  • 48 aurora on Android had an update blocker (Tuesday of release week). Sat for 4 days before it was fixed over the weekend. No clear owner to move it forward


  • more stuff: metrics about how many people are running which version? how to expose this better?
============

Nitty gritty details of Beta issues (mostly build promotion, l10n, infrastructure, also the sec issue). We don't have to go over this point by point!


  • (KaiRo) After release promotion caused us to have the builds later than usual (expected), we did run into the issue of mozmill being broken with Firefox 46+ and had to switch update tests to Marionette on short notice
    • https://bugzilla.mozilla.org/show_bug.cgi?id=1255566 and https://github.com/mozilla/mozmill-ci/issues/765
    • Henrik (whimboo) put in a lot of work within a short time to get this up and running
    • KaiRo was unprepared for the very different workflow for finding out what the test errors are (not much treeherder experience)
    • On the plus side, we now run the same tests as nightly/aurora and they report to treeherder in a nice fashion
    • This was also happening on a Friday, extra stressful, SV in Vegas and not around
  • (KaiRo) once update tests for b1 ran correctly (on Friday), we ran into some locales failing and realized they are completely broken. Finally tracked this to L10n-merge being broken on the releng side
  • beta 2
    • released 2 days late (build promotion/l10n issues) on Thurs. Mar 17
    • This was the first beta for non-en-US locales. So most crash data is from March 17 onwards.
  • beta 3 skipped
    • we did not have time to build and release anything significantly different from beta 2.
  • beta 4 went ok. But it had to have a build 2 for fennec and build 3 for desktop. Stress for releng + relman + SV
  • beta 5 - Easter Friday, stressful, not enough people around
  • Beta 6 also stressful, still Easter. In retrospect, we should have planned to move the date to Tues/Wed.
    • BUT just as we were about to ship, we ran into the sec/infrastructure issue. Could not ship beta 6.
    • build post processing was also not right so we would have needed a build 2.

Beta 7 mobile + desktop

  • Build failures (bringing servers back online from "infrastructure issues")
  • Could not go with the earlier beta 6 build as there were some errors there anyway
  • could have shipped beta 7 late on Friday, but didn't realize this till Monday
  • Beta 7 released on Monday morning April 4
  • Decision to push back release date
  • Beta 8