Releases/Post-mortems/Firefox 3.6.4

Schedule / Location / Call Information

  • Thursday, 2010-08-12 @ 11:00 am PST (scheduled to last no longer than 1 hour 45 mins, shooting for 1 hour)
  • In Warp Core
  • 650-903-0800 x92 Conf# 8605 (US/INTL)
  • 1-800-707-2533 (pin 369) Conf# 8605 (US)

Other communication channels

  • join irc.mozilla.org #post-mortem for back channel (will be logged and attached here after)
  • Etherpad for meeting notes can be found here if people want to use it

Overview

The project can be analyzed by slicing it into the following components:

  1. Feature development
  2. "One more blocking bug"/schedule slipping
  3. Crash-stats irregularities
  4. Quick spin for bug 574905

Timeline

  1. Firefox 3.6.3 shipped on April 1
  2. Christian originally scheduled 3.6.4 for May 11, got feedback and tightened up to May 4
  3. Lorentz / 3.6.3plugin1 beta went out on April 8
  4. Firefox 3.6.4 build #1 went as an opt-in beta on April 16
  5. Bug 561308 prevented build 2 from being built on schedule and then bug 561817 made us start the builds a day late
  6. Firefox 3.6.4 build #2 had an issue and never went out
    • nthomas found that bug 534666 landed on default and not the relbranch
  7. Christian emails socorro team with OOPP reporting concerns on April 28
  8. Firefox 3.6.4 build #3 went as an opt-in beta on May 4
    • Bug 563847 was determined to be a blocker, decided we couldn't go to the entire beta audience with build #3
  9. Christian posted in Farmville forums on May 5th asking for testing. The thread was promptly deleted
  10. chofmann asked on May 8 if we had Zynga contacts as some users were complaining about Farmville via Hendrix, beltzner said he would reach out and suggested we post in the user forums
  11. This was the status on May 10th
  12. On May 12th we decided to create build #4 even though there were outstanding issues.
    • This was also when it was first decided that 3.5.10 would stay tied to 3.6.4, in response to KaiRo
  13. Firefox 3.6.4 build #4 went to beta on May 14
  14. On May 17th metrics/Daniel set up the super-useful page at https://metrics.mozilla.com/stats/firefox.shtml
  15. Firefox 3.6.4 build #5 went to beta on May 26
    • Didn't go on the 25th due to MV network issues
    • Found bug 568129 before releasing to beta, knew we would have to respin but decided it wouldn't hurt to ship it to beta users on older builds
  16. Firefox 3.6.4 build #6 went to beta on May 28
    • We called this a "release candidate" and did more press/blog posts. Weren't comfortable calling out 80% improvement as we weren't sure it would hold up in the release audience
    • We were watching bug 563361 and bug 569104
  17. We were trying to get a handle on the Cnet issue. Also, this is the first time we seriously started to discuss turning off OOPP
    • We had escalated with Adobe and were getting to the right people at Cnet
    • More talk of splitting 3.5.10 and 3.6.4 at this time, TB guys getting antsy
  18. Christian got to the lead Flash developer and download.com product manager at Cnet on June 4
  19. Around June 10th (probably earlier) we noticed and became concerned about the crash spike
    • Main tracking bug was bug 571118
    • Also dbaron asked in conversation if socorro would be able to handle the increase in "crash" volume due to oopsies after releases
  20. Decided not to block on Cnet issue, but bug 562198 became a blocker as it prevented Linux users from using banking sites
  21. Firefox 3.6.4 build #7 went to beta on June 14
  22. Most crash-stats investigations were wrapping up by June 15
  23. TB team ships on June 17, didn't disclose security vulns that affect Firefox
  24. dbaron asked about socorro capacity on June 21
  25. crash-stats spike finally solved on June 21 (configuration problem)
  26. Firefox 3.6.4 shipped on June 22
  27. A security researcher got turned around and went public with this bug, thinking it was fixed in 3.6.4 (it wasn't shipped until 3.6.7)
  28. This bug flared up. RRRT saw it, and Lilly contacted Zynga. Required a quick 3.6.6 (as 1.9.2.5 was taken by Fennec)
  29. Socorro team had to turn throttling to 10% on June 25 as the system was overwhelmed
  30. Firefox 3.6.6 shipped on June 26

Discussion points

  • What does baking on trunk mean to us now?
  • Could we have foreseen every new blocking bug and/or the bug that caused the respin? How can we not get blindsided in the future?
  • Did we make the right call keeping 3.5.10 and 3.6.4 tied together?
  • Did the "project branch → opt-in beta → beta → release" format work well? How might we do it differently/better?
  • How can schedule be better communicated when things are in flux?
  • Is socorro at a state we are confident with? Are there more changes that need to be made? Are there future projects that may have the same sort of issues?
  • Do we need an action plan for dealing with 3rd parties? How long are we expected to wait? Should we have a more formal outreach/partner program?
  • What sort of things might we backport in the future? Are the lessons here specific to OOPP or can they be applied generally?
  • clear mails (with correct subjects) on rel-drivers for record keeping
    • not everyone reads never-ending scrollback
    • easy to miss an important handoff if not reply-all, with changed subject
    • hard to figure out historical state after the fact
  • problems tracking patches across branches
    • bsmedberg reported problems tracking fixes on lorentz to m-c and then to moz192 and relbranch
    • nthomas caught missed fix on relbranch with build#2
  • any way to avoid one patch per respin?
  • Metrics can work on Operational Metrics dashboards for systems that have complex interactions, or for systems that can be monitored for the trending effect of events such as a config change or a release. See [[1]]

Things that went right

Things that went wrong

Suggested improvements

  • Release codenames to reduce confusion (?) (clegnitto)
  • Branch landing verifier scripts (clegnitto)
  • Don't rely on IRC and meetings alone for decisions; we need a written record
  • Emails to release-drivers should have clear subjects with the version # and not be threaded
  • Date-scoped queries for historical mining of bug state
  • Need better defined/more formal beta program and feedback channels
  • Create alternate plans at the beginning and add firebreaks with mitigation plans
  • Use a range for certain schedule items (shipping in particular), to give some wiggle room and prevent excessive schedule churn
  • Would be useful for RelEng to have SLAs so that release drivers can set expectations/urgency for each build. The information can also be reviewed in post-mortems
  • Front-load / "pre-mortem" (meeting, etc) the QA test plan to bring in new ideas and unique testing. Formally modify the plan as needed, to prevent ad-hoc QA test plans
  • Socorro team came up with better trends/operational stats and linking them with events. Make this general / use it everywhere
  • "Sightings" in bugs will potentially make sure fixes aren't missed when porting between branches
  • Implement a "Related items in external systems" (key/value fields) to link bugs to commits via automation
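As a rough illustration of the "branch landing verifier scripts" idea above, the sketch below checks whether every bug expected in a build is mentioned in the commit messages of each branch it should have landed on. It is only a sketch under stated assumptions: the function names, the "Bug NNNNNN" commit-message convention, and the sample log entries are hypothetical, and a real script would read the logs from the repositories rather than from hard-coded lists.

```python
import re

# Assumption: commit messages reference bugs as "Bug 123456" (case-insensitive).
BUG_RE = re.compile(r"\bbug\s+#?(\d+)", re.IGNORECASE)


def bugs_in_log(commit_messages):
    """Return the set of bug numbers mentioned in an iterable of commit messages."""
    found = set()
    for message in commit_messages:
        found.update(int(number) for number in BUG_RE.findall(message))
    return found


def missing_landings(expected_bugs, branch_logs):
    """Map each branch name to the sorted expected bugs not mentioned in its log."""
    report = {}
    for branch, messages in branch_logs.items():
        landed = bugs_in_log(messages)
        report[branch] = sorted(set(expected_bugs) - landed)
    return report


if __name__ == "__main__":
    # Hypothetical log contents, loosely modelled on the build #2 situation
    # where a fix was on default but not on the relbranch.
    logs = {
        "default": ["Bug 534666 - placeholder summary. r=someone"],
        "relbranch": ["Unrelated placeholder commit"],
    }
    print(missing_landings({534666}, logs))
    # prints {'default': [], 'relbranch': [534666]}
```

A check like this could run before each build is tagged, so a fix missing from the relbranch is flagged before the build starts rather than caught by hand afterwards.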

Other reference material