Releases/Post-mortems/Firefox 8.0.1


Schedule / Location / Call Information

Other communication channels

  • join irc.mozilla.org #post-mortem for back channel (will be logged and attached here after)

Overview

We rolled an 8.0.1 update to all Mac users and non-FF8 Windows/Linux users with fixes for a Mac top crasher (bug 700835), and an extension blocklist top crasher (bug 699134) which was also believed to help mitigate the most serious startup crash (bug 691847) in the absence of a low risk fix. An additional fix was taken in support of bug 699134 to prevent a newly found regression for a subset of affected users. This release took 12 days from start to finish.

Timeline

Entries are correct on a per-day basis, but events within the same day are not necessarily listed in order.

Tue Nov 8

FF8 released

Wed Nov 9

Turned off updates on Mac due to Mac crasher, plan on 8.0.1, consider ridealongs

  • Started getting rumblings about Mac crasher bug 700835
    • We were unsure of where the regression was - 3rd party or FF
  • Turned off automatic updates for Mac in bug 701148
  • Font redirect bug 701262 brought to our attention
  • Started planning a chemspill for bug 700835

Thu Nov 10

Startup crash found, turned off all auto updates, 8.0.1 expanded to all platforms, tried to understand drivers, desire to bring ridealongs

  • Crash on update bug 691847 found
    • Immediately started looking at the comments and tried to find correlations
    • Cheng started to reach out to affected users that provided emails, pulled phone #s out of the comments just in case
  • Started planning a chemspill for both bug 691847 and bug 700835
  • Turned off automatic updates for all platforms (left manual/web on)
  • bug 699633 also brought to our attention
    • Landed on Beta in preparation for possible release inclusion
  • Also landed Font redirect bug 701262 on Beta for possible release inclusion
  • After channel/triage meeting, considered taking the following ride-alongs as well
    • bug 699776 to blocklist Rapport after seeing it have a significant number of crashes
    • bug 687220 was considered as the crash was very easily reproducible
  • Because bug 691847 was believed to be related to debug JS code, we asked QA to test the top-50 extensions, especially those related to web development (Firebug)
  • By the end of the day, we had a reproducible case for bug 691847 (system restoring to FF7 after upgrading to FF8) because of leads in crash comments, but did not know the reason behind system restores or whether this fully covered the crashes.
  • We considered the following for reasons why people would be system restoring
    • perf/crashes
    • "losing" addons due to the new disable dialog
    • external software
    • Support made documentation re startup crashes and top known issues (Java update and Roboform) as well as other issues.

Fri Nov 11

Ways to mitigate system restore crashes considered, continued to try to find drivers, weekend planned

  • Roboform crashes (bug 691271) were spiking so we considered taking bug 699134
    • also considered as a possible driver for system restores
  • Jeff continued his investigation into bug 691847
  • We met throughout the day to discuss bug 691847
    • Still did not understand the cause behind it - why hadn't we seen this before if system restores cause it? Were we causing the system restores?
  • We decided to try to be proactive with tackling bug 691847
    • Considered bug 701944 for a dialog that would explain that the user should reinstall if omni.jar had incorrect byte codes
    • Considered disabling the disable add-on dialog
    • Considered taking the hot-fix add-on if ready (it wasn't close to being ready to land on release)
  • Trusteer extension/DLL block taken off the table due to a lack of multiple versions of the binary, inability to fully test block
    • Also could not get in touch with the Trusteer engineers
  • A 7.0.2 update was discussed to mitigate the system restore issue, but the timeline necessary
  • Attempted to devise an experiment for Saturday which would enable updates for 12 hours on Saturday and compare to ADUs, hoping for the ratio to go down significantly if this was caused by an external problem
  • Alex sent out email to all@ asking for people to update their OS and 3rd party software to help figure out drivers for system restores
  • After testing updates/system restores from older versions of FF (6->7, etc.), we realized that this could have occurred previously but would have failed silently
  • Initial patch for Mac crasher bug 700835 resolved and checked into the release branch

Sat Nov 12

Continued trying to find sysrestore drivers, new crashers investigated, 12hr experiment done

  • bug 701944 discussed as too risky, localization effort too great - nixed
  • Spent a lot of time in the office trying to figure out drivers for the system restores
  • Continued to look at crash comments and data for the system restore crash
  • QA asked to find affected RoboForm DLL names/versions for bug 699134
  • Enabled automatic updates for 12 hours - 10AM-10PM
  • New crash spikes found on XP - bug 702040, bug 702041, bug 702042
    • decided to investigate in case this was somehow related to the system restores
  • Started to discuss another fix to mitigate system restores: renaming the omni.jar file with incorrect byte codes so that it wouldn't be picked up by system restores (bug 701875)
  • The fix in bug 691847 gets us back to the unreported crash, so we decided to not take on release

Sun Nov 13

Decide to mitigate sysrestore crashes instead of fixing them

  • Final patch for Mac crasher bug 700835 resolved and checked into the release branch
  • Decided data from Saturday experiment was inconclusive
  • Held meeting to discuss current state of blockers, decide on blockers
    • Engineering brought up risk of ride-alongs and dialog box
    • Decided we did not have enough data that points to the need to disable the add-on disable dialog to mitigate the system restore crash
    • Decided to find affected RoboForm population before making final decision on blocklist
    • Decided bug 701875 and bug 701944 were off the table due to risk
    • Asked Gilbert to come up with another 24 hour experiment if we wanted to see if we were past the issue

Mon Nov 14

Decided on update strategy to only give 8.0.1 to non-8 windows users, no ride-alongs, thought we finalized changeset

    • All ride-alongs denied for release because ride-alongs no longer being considered to keep rendering in versions in sync
  • Engineering asked to look at crash spikes found on XP - bug 702040, bug 702041, bug 702042
    • Determined that the crash rate was actually low, and related to malware
  • Came up with DLL versions and patch for bug 699134, tested try build, went to build (8.0.1b1)
  • Decided on "final" changeset
    • bug 699134 – Extension block request: Roboform
    • bug 700835 – [Mac] Firefox 8 and up crash with Apple's latest Java updates for OS X 10.6 and 10.7 closing tab/window containing Java applet

Tue Nov 15

New crashers found on XP with blocklist patch

  • Theme update bug 702558 brought to our attention, but the scope of the release had become much narrower and we'd already gone to build
  • A fix for the accidentally enabled highlighter feature was also denied, since we weren't rolling out to all users and we'd already gone to build
  • Placed 8.0.1 on hold because during QA we found that unblocked versions of RoboForm were causing new crashes
    • Ended up being an XP specific issue for unblocked versions of blocked DLLs

Wed Nov 16

Pursued blocklist fix patch, found out affected populations to decide whether to move forward

  • Patch to fix bug 699134 attached, found to not fix the issue
  • Found affected RoboForm users (startup crashes) for the situation where we did/didn't fix bug 699134
Population on XP with Roboform <7.6.2
% cat ~/Downloads/roboform_vamo_pings.csv | egrep -v "7\.6\.(2|3)"  | egrep "NT 5\.(1|2)" | awk -F"|" '{sum+=$3} END {print sum}'
24110
Population on XP with Roboform 7.6.2/3
% cat ~/Downloads/roboform_vamo_pings.csv | egrep "7\.6\.(2|3)"  | egrep "NT 5\.(1|2)" | awk -F"|" '{sum+=$3} END {print sum}' 
17396
Population on Vista/Win7/Win8 with Roboform <7.6.2
% cat ~/Downloads/roboform_vamo_pings.csv | egrep -v "7\.6\.(2|3)"  | egrep "NT 6" | awk -F"|" '{sum+=$3} END {print sum}' 
63655
Population on Vista/Win7/Win8 with Roboform 7.6.2/3
% cat ~/Downloads/roboform_vamo_pings.csv | egrep "7\.6\.(2|3)"  | egrep "NT 6" | awk -F"|" '{sum+=$3} END {print sum}' 
58809
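
The four pipelines above can be folded into a single pass. The following is a minimal sketch, not the script that was actually run: it assumes only what the commands themselves imply (pipe-delimited lines with the ping count in the third field) and reuses the same regular expressions. The function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Same patterns as the egrep calls above.
ROBOFORM_762_OR_763 = re.compile(r"7\.6\.(2|3)")
WINDOWS_XP = re.compile(r"NT 5\.(1|2)")
WINDOWS_VISTA_PLUS = re.compile(r"NT 6")


def bucket_counts(lines):
    """Sum the third pipe-delimited field per (OS, RoboForm version) bucket.

    Patterns are matched against the whole line, as egrep does. Unlike the
    four independent pipelines, a line is counted in at most one OS bucket.
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue  # not enough fields to carry a count
        try:
            count = int(fields[2])
        except ValueError:
            continue  # skip header rows or non-numeric counts
        if WINDOWS_XP.search(line):
            os_name = "xp"
        elif WINDOWS_VISTA_PLUS.search(line):
            os_name = "vista_plus"
        else:
            continue  # other platforms are ignored, as in the originals
        version = "7.6.2/3" if ROBOFORM_762_OR_763.search(line) else "other"
        totals[(os_name, version)] += count
    return totals


if __name__ == "__main__":
    # Hypothetical sample lines; the real column layout is an assumption.
    sample = [
        "7.6.2|Windows NT 5.1|10",
        "7.5.0|Windows NT 5.1|4",
        "7.6.3|Windows NT 6.1|7",
    ]
    for bucket, total in sorted(bucket_counts(sample).items()):
        print(bucket, total)
```

To run it against the real data, pass an open file handle for the CSV instead of the sample list.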

Thu Nov 17

Pursued blocklist fix patch

  • New (believed) working patch to fix bug 699134 attached
  • QA could not complete testing because of failed try builds
  • We don't go to build since we don't have the necessary r+

Fri Nov 18

Testing confirmed, ready to go to build

  • Final revision of working patch to fix bug 699134 attached
  • QA tests try server builds successfully
  • Land this on all branches
    • Decide to allow the fix to bake

Sat Nov 19

Holding

  • People lived on nightly, didn't report issues

Sun Nov 20

Go to build, Start QA Testing of 8.0.1b2

  • People lived on nightly, didn't report issues
  • Sunday afternoon, go to build for 8.0.1b2. Final changeset
    • 2 patches from bug 699134 – Extension block request: Roboform
    • 1 patch from bug 700835 – [Mac] Firefox 8 and up crash with Apple's latest Java updates for OS X 10.6 and 10.7 closing tab/window containing Java applet
  • Sunday night Romanian QA contractors perform initial testing

Mon Nov 21

Good to go, shipped

  • Monday morning, final QA testing performed
  • Pushed build out to all, done

Discussion points

  • 3rd party software
    • knowing
      • Alex to talk with Kev about getting on mailing lists for major 3rd party software (and vice versa)
      • ADC
      • Windows updates
      • Flash
      • AVG
      • QA/Matt - to figure out testing pre-release software (contractors?). Also will be blocked on 3rd party software list from Kev
    • testing
  • system restores
  • Plugin crashes
    • Kev - flash plugin blocklist - https://bugzilla.mozilla.org/show_bug.cgi?id=704158
    • Sheila - to come up with top plugin crashers and plugin startup crashers today (12/2)
    • Alex/Sheila/Marcia - meet today 12/2 about startup crashers
    • startup crash mitigation?
  • Unique crashes
    • Gilbert - looking into unique information
  • Public communication
  • communication to internal (all@mozilla.com)
    • There is value in info like "nothing for your group to do before xx:xx, you can go get dinner", especially as a chemspill drags on. We also need to communicate and plan human availability and handoffs across timezones.
  • Value of 12-hour experiments
  • Availability of people who could add information
  • What could we have seen in betas? QA?
  • Value of ridealongs
  • Roboform testing during 8.0.1 builds
    • Why blocklist testing
    • We need to understand the blocklisting system better (in general, we should 100% understand our mitigation tools)
  • System restores seem to be more common than we think
  • Should we have done outreach to the Roboform developers?
  • format for communicating chemspills
    • wiki page?
    • chemspill fatigue
  • making sure we have somebody on point for every issue
  • what thresholds will we roll for (.5 million users?) - startup crashes versus normal crashes
  • urgency? do we have to ship 8?
  • Alex - Don't ship on patch Tuesday!!!
  • Alex - shared calendar for releases/3rd party schedules
  • Consider something similar to the FF9 release strategy for future strategies
  • Alex/Christian - blocklisting ownership/strategy, worthy of a fulltime job?
    • partner phonebook
  • parallelized builds (try and go-to build)
  • FF8.0 shipped on same day as patch tuesday (win32 patches, java update).
    • Mozilla had no preview of what else was being shipped on tuesday. This complicated debugging.
    • can we get previews of upcoming patch tuesday changes and add to QA test matrix? akeybl to ask kev for contacts
    • on our 6 week cadence, when do we next hit a patch tuesday?
      • I don't think they're scheduled more than a few weeks in advance. I THINK the rule is "2nd tuesday, except if things are delayed and sometimes a bonus one on the 4th tuesday" From a wikipedia reference: "Microsoft releases security updates on the second Tuesday of every month and Windows Update releases non-security updates on the fourth Tuesday of every month."
    • to avoid patch tuesdays, should we start releasing on any-day-but-tuesday?
      • I like it :) -- Cww
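
The question of when the 6-week cadence next lands on a patch Tuesday can be checked mechanically. This is a minimal sketch under two assumptions that are not guaranteed by the notes above: releases fall exactly every 42 days starting from the Firefox 8 release on Tue Nov 8, 2011, and "patch Tuesday" means the second Tuesday of the month (the occasional fourth-Tuesday release is ignored). The function names are hypothetical.

```python
from datetime import date, timedelta


def is_patch_tuesday(day: date) -> bool:
    """True if the date is the second Tuesday of its month."""
    # Monday is 0, so Tuesday is 1; the second Tuesday falls on day 8..14.
    return day.weekday() == 1 and 8 <= day.day <= 14


def next_collision(start: date, cadence_days: int = 42, max_releases: int = 20):
    """Return the first later release date that lands on a patch Tuesday.

    Assumes a strict cadence with no slipped releases; returns None if no
    collision is found within max_releases.
    """
    for i in range(1, max_releases + 1):
        candidate = start + timedelta(days=cadence_days * i)
        if is_patch_tuesday(candidate):
            return candidate
    return None


if __name__ == "__main__":
    ff8_release = date(2011, 11, 8)  # from the timeline above
    print(next_collision(ff8_release))
```

Under those assumptions the first later release date that lands on a second Tuesday is 13 March 2012, the third release after Firefox 8.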

Things that went right

  • good careful choice of "when to say go to build".
    • The temptation is to say "go to build" and scrub if there is a problem. However, this costs time rather than saving it. It's faster to wait for the build to go green and then say "go to build" than to say "go to build" while we are still checking whether builds/tests are good. If we "go to build" and then have to abort/clean up/restart, it's slower overall and involves lots of manual work.

Things that went wrong

Couldn't use the CDN when we were ready to ship build2

The build1 files had been pushed to the internal mirrors, and the CDN uses one of those as the origin server. When we respun, the files were deleted from the mirrors, and IT purged the CDN (see bug 703729). After pushing build2 to the mirrors, requests to the CDN were still returning build1 bits. The CDN was disabled and we waited for the mirrors to pick up normally, which took an extra 60-90 minutes. IT doesn't have any information on what went wrong with the CDN; finding out would require further testing and follow-up with the provider. joduinn to follow up with IT about testing/debugging this (bug 707560 filed).
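
One way the follow-up testing could detect stale bits is to compare what the CDN returns against digests of the build that is supposed to be live. The sketch below is illustrative only and is not a description of any existing release tooling: the function names are hypothetical, SHA-256 is an arbitrary choice of hash, and fetching the files from the CDN is left out so the check itself stays self-contained.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex digest of a downloaded file's bytes."""
    return hashlib.sha256(payload).hexdigest()


def find_stale_files(expected_digests, fetched_payloads):
    """Return file names whose fetched bytes do not match the expected digest.

    expected_digests: name -> hex digest of the build that should be served.
    fetched_payloads: name -> bytes actually returned by the CDN.
    A file missing from fetched_payloads is also reported.
    """
    stale = []
    for name, digest in sorted(expected_digests.items()):
        payload = fetched_payloads.get(name)
        if payload is None or sha256_hex(payload) != digest:
            stale.append(name)
    return stale


if __name__ == "__main__":
    # Hypothetical file name and contents standing in for build1/build2 bits.
    expected = {"installer.exe": sha256_hex(b"build2 bits")}
    served = {"installer.exe": b"build1 bits"}
    print(find_stale_files(expected, served))
```

Running such a check after a purge and before re-enabling updates would flag the situation described above, where build1 bits were still being served.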

Emailing all crashers

The response rate was overwhelming, and we could not keep up with the volume of crash emails.

Suggested improvements

  • Have a single source of truth for what's going on in a chemspill

Other reference material