Releases/Post-mortems/Firefox 3.6.9
Schedule / Location / Call Information
- Thursday, 2010-09-30 @ 1:00 pm PDT
- In Very Good Very Mighty (3rd floor)
- 650-903-0800 x92 Conf# 8605 (US/INTL)
- 1-800-707-2533 (pin 369) Conf# 8605 (US)
- join irc.mozilla.org #post-mortem for back channel
Overview
- 3.6.9 took bug 532730, which caused a crash on startup for some users
- Mark noticed the new Thunderbird top crash (it also affected SeaMonkey)
- Updates were turned off during investigation
- Press learned of updates being turned off
- The engineering investigation took a while and still hadn't found the cause
- Cheng contacted people to try to figure out root cause
- Eventually fixed by bug 594699
- The 3.6.10 release process itself was bumpy:
- Less hurry as the updates were turned off
- Mirror uptake and when/if to use the CDN
- QA coverage in relation to the QA offsite
Things that went right
- chofmann had produced reports of new crashes
Things that went wrong
- The crashes were there in nightlies (with much less volume), but masked by different crashes with the same signature
- Below is a copy of the notes Al Billings left for me:
3.6.10 and 3.5.13 Post-Mortem Items

1) Lack of information on progress going live:
- Shipping 3.6.10 (and 3.5.13) took an especially long time. The initial "go" to go live was given at 12:45 PM. More than an hour and a half later, QA and Release Management (Christian) were told that pickup on mirrors was still low and we had to wait. Eventually, Christian tried to get us to use the paid network (CDN?) because of uptake issues, but that was blocked by JustinF.
- It turns out that when the "go" was given at 12:45 PM, RelEng began copying the release bits from one internal server to another. Only when this copying was done could we go live. All of the delay and apparent lack of uptake was because it took over 2 hours for the bits to be copied internally. We were not waiting on external uptake at all, because we had not yet offered the bits to external mirrors, but no one outside RelEng knew this. Once the internal copying was done, we had enough uptake to do release testing within 30 minutes.

So the overall problems are:
a) RelEng may or may not have the right processes to enable a quick release. For example, why did it take two hours to copy all of the data between internal boxes, but only 30 minutes for third parties outside of Mozilla to deploy it to the level of being able to go live? Can RelEng frontload certain internal tasks in order to facilitate quicker releases?
b) Lack of RelEng transparency: within the RelEng team, what was actually going on appears to have been known, but it was not communicated to anyone outside RelEng. Based on this lack of knowledge, decisions were made (such as the use of a paid network that costs Mozilla money) that were not necessary. RelEng commonly insists on transparency and details about what and how other groups do things, but not for its own processes. Those of us outside RelEng should have a clearer idea of what is going on during the various parts of the release process.
c) Lack of communication: as part of the lack of transparency, there was no clear minute-by-minute communication. We (QA and Release Management) would be told that things were in process and then have to ask again 30 or 40 minutes later to get an update.

Potential solutions:
a) Clear checklists of what is being done, and by whom, at each stage of the release process.
b) As items in the checklist are cleared, communicate the status change via e-mail (as John O'Duinn insists for all official communications) to all parties (probably the release-drivers e-mail list).
c) Clear criteria for the expected time of each task in the checklist: we need to know how long an item is expected to take and at what point that time has been exceeded enough to invoke some kind of emergency response (such as paying for network bandwidth, throwing more on-call engineers at the problem, etc.).

I hope that this is helpful.
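To make the "expected time per checklist item" suggestion concrete, here is a minimal sketch in Python of how a release checklist with expected durations and an escalation threshold could be tracked. The task names, owners, durations, and the 1.5x escalation factor are all assumptions for illustration, not actual RelEng process steps or times:

 from datetime import datetime, timedelta

 # Hypothetical checklist: (task, owner, expected duration). The steps and
 # times below are illustrative guesses, not the real RelEng process.
 CHECKLIST = [
     ("Copy bits to internal staging servers", "RelEng", timedelta(minutes=45)),
     ("Offer bits to external mirrors",        "RelEng", timedelta(minutes=30)),
     ("Wait for mirror uptake",                "RelEng", timedelta(minutes=30)),
     ("Release update testing",                "QA",     timedelta(minutes=60)),
 ]

 # Assumed policy: escalate once a task has run 50% past its expected time.
 ESCALATION_FACTOR = 1.5

 def check_progress(task_index, started_at, now=None):
     """Say whether the current checklist item is on time or needs escalation."""
     task, owner, expected = CHECKLIST[task_index]
     elapsed = (now or datetime.now()) - started_at
     if elapsed > expected * ESCALATION_FACTOR:
         return "ESCALATE: '%s' (%s) at %s vs expected %s" % (task, owner, elapsed, expected)
     return "OK: '%s' (%s) at %s of expected %s" % (task, owner, elapsed, expected)

 # Example: the "go" was given at 12:45 PM and the internal copy is still
 # running two hours later (the date here is a placeholder).
 print(check_progress(0, started_at=datetime(2010, 9, 15, 12, 45),
                      now=datetime(2010, 9, 15, 14, 50)))

The same list could also drive the per-item status mails to release-drivers as each task is marked done.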
Suggested Improvements
- Make sure we can differentiate different crashes with the same signature (bug 600929); a rough sketch of one approach appears after this list
- Come up with a policy for when/when not to use the CDN
- Investigate improvements to the local IT mirroring infrastructure to let us get builds to the mirror sites faster
- Come up with a contingency plan to rely less on mirrors for future updates (bug 596839?)
- We didn't learn that much from the user outreach around the crashes, but we got a turnaround of about 48 hours. Let's keep that tool on the table.
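As an illustration of the signature-differentiation item above (bug 600929), here is a minimal sketch that buckets reports sharing a top-frame signature by their full stack. The data shape and function names are assumptions for illustration and do not reflect the actual Socorro schema:

 from collections import defaultdict
 import hashlib

 def stack_hash(frames):
     """Hash the full stack so distinct crashes hiding behind one top-frame
     signature land in separate buckets."""
     return hashlib.sha1("|".join(frames).encode("utf-8")).hexdigest()[:12]

 def split_by_stack(reports):
     """Group reports sharing a top-frame signature by their full stack.

     Each report is assumed to look like:
     {"signature": "nsFoo::Bar", "frames": ["nsFoo::Bar", "nsBaz::Quux", ...]}
     """
     buckets = defaultdict(lambda: defaultdict(int))
     for report in reports:
         buckets[report["signature"]][stack_hash(report["frames"])] += 1
     return buckets

 def masked_signatures(reports, min_buckets=2):
     """Return signatures whose reports split into several distinct stacks,
     i.e. signatures that are really more than one crash."""
     return dict((sig, dict(stacks))
                 for sig, stacks in split_by_stack(reports).items()
                 if len(stacks) >= min_buckets)

Splitting nightly crash volume along these lines might have surfaced the new Thunderbird/SeaMonkey crash as a separate, growing bucket instead of noise inside an existing signature.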