Releases/Post-mortems/Firefox 3.6.9
Schedule / Location / Call Information
- Thursday, 2010-09-30 @ 1:00 pm PDT
- In Very Good Very Mighty (3rd floor)
- 650-903-0800 x92 Conf# 8605 (US/INTL)
- 1-800-707-2533 (pin 369) Conf# 8605 (US)
- join irc.mozilla.org #post-mortem for back channel
Overview
- 3.6.9 took bug 532730, which caused a crash on startup for some users
- Mark noticed the new Thunderbird top crash (it also affected SeaMonkey)
- Updates were turned off during investigation
- Press learned of updates being turned off
- The engineering investigation took a while and still hadn't found the cause
- Cheng contacted people to try to figure out root cause
- Eventually fixed by bug 594699
- The 3.6.10 release process itself was bumpy:
- Less hurry as the updates were turned off
- Mirror uptake and when/if to use the CDN
- QA coverage in relation to the QA offsite
Things that went right
- chofmann had produced reports of new crashes
Things that went wrong
- The crashes were there in nightlies (with much less volume), but masked by different crashes with the same signature
- Below is a copy of the notes Al Billings left for me:
3.6.10 and 3.5.13 Post-Mortem Items

1) Lack of information on progress going live:
- Shipping 3.6.10 (and 3.5.13) took an especially long time. The initial "go" to go live was given at 12:45 PM. More than an hour and a half later, QA and Release Management (Christian) were told that pickup on mirrors was still low and we had to wait. Eventually, Christian tried to get us to use the paid network (CDN?) because of uptake issues, but that was blocked by JustinF.
- It turns out that when the "go" was given at 12:45 PM, RelEng began copying the release bits from one internal server to another. Only when this copying was done could we go live. All of the delay and apparent lack of uptake was because it took over 2 hours for the bits to be copied internally. We were not waiting on external uptake at all, because we had not yet offered the bits to external mirrors, but no one outside RelEng knew this. Once the internal copying was done, we had enough uptake to do release testing within 30 minutes.

So the overall problems are:
a) RelEng may or may not have the right processes to enable a quick release. For example, why did it take two hours to copy all of the data between internal boxes, but only 30 minutes for third parties outside of Mozilla to deploy it to the level of being able to go live? Can RelEng frontload certain internal tasks in order to facilitate quicker releases?
b) Lack of RelEng transparency: within the RelEng team, what was actually going on appears to have been known, but it was not communicated to anyone outside RelEng. Based on this lack of knowledge, decisions were made (such as the use of a paid network that costs Mozilla money) that were not necessary. RelEng commonly insists on transparency and details about what and how other groups do things, but not for its own processes. Those of us outside RelEng should have a clearer idea of what is going on during the various parts of the release process.
c) Lack of communication: as part of the lack of transparency, there was no clear minute-by-minute communication. We (QA and Release Management) would be told that things were in process and then have to ask again 30 or 40 minutes later to get an update.

Potential solutions:
a) Clear checklists of what is being done, and by whom, at each stage of the release process.
b) As items in the checklist are cleared, communicate the status change via e-mail (as John O'Duinn insists for all official communications) to all parties (probably the release-drivers e-mail list).
c) Clear criteria for the expected time of each task in the checklist: we need to know how long an item is expected to take and at what point that time has been exceeded enough to invoke some kind of emergency response (such as paying for network bandwidth, throwing more on-call engineers at the problem, etc.).

I hope that this is helpful.
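To make the "expected time per checklist item" suggestion concrete, here is a minimal sketch in Python of how a release checklist with expected durations and an escalation threshold could be tracked. The task names, owners, durations, and the 1.5x escalation factor are all assumptions for illustration, not actual RelEng process steps or times:

 from datetime import datetime, timedelta

 # Hypothetical checklist: (task, owner, expected duration). The steps and
 # times below are illustrative guesses, not the real RelEng process.
 CHECKLIST = [
     ("Copy bits to internal staging servers", "RelEng", timedelta(minutes=45)),
     ("Offer bits to external mirrors",        "RelEng", timedelta(minutes=30)),
     ("Wait for mirror uptake",                "RelEng", timedelta(minutes=30)),
     ("Release update testing",                "QA",     timedelta(minutes=60)),
 ]

 # Assumed policy: escalate once a task has run 50% past its expected time.
 ESCALATION_FACTOR = 1.5

 def check_progress(task_index, started_at, now=None):
     """Say whether the current checklist item is on time or needs escalation."""
     task, owner, expected = CHECKLIST[task_index]
     elapsed = (now or datetime.now()) - started_at
     if elapsed > expected * ESCALATION_FACTOR:
         return "ESCALATE: '%s' (%s) at %s vs expected %s" % (task, owner, elapsed, expected)
     return "OK: '%s' (%s) at %s of expected %s" % (task, owner, elapsed, expected)

 # Example: the "go" was given at 12:45 PM and the internal copy is still
 # running two hours later (the date here is a placeholder).
 print(check_progress(0, started_at=datetime(2010, 9, 15, 12, 45),
                      now=datetime(2010, 9, 15, 14, 50)))

The same list could also drive the per-item status mails to release-drivers as each task is marked done.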
Suggested Improvements
- Make sure we can differentiate different crashes with the same signature (bug 600929); a rough sketch of one approach appears after this list
- Come up with a policy for when/when not to use the CDN
- Investigate improvements to the local IT mirroring infrastructure to let us get builds to the mirror sites faster
- Come up with a contingency plan to rely less on mirrors for future updates (bug 596839?)
- We didn't learn that much from the user outreach around the crashes, but we got a turnaround of about 48 hours. Let's keep that tool on the table.
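As an illustration of the signature-differentiation item above (bug 600929), here is a minimal sketch that buckets reports sharing a top-frame signature by their full stack. The data shape and function names are assumptions for illustration and do not reflect the actual Socorro schema:

 from collections import defaultdict
 import hashlib

 def stack_hash(frames):
     """Hash the full stack so distinct crashes hiding behind one top-frame
     signature land in separate buckets."""
     return hashlib.sha1("|".join(frames).encode("utf-8")).hexdigest()[:12]

 def split_by_stack(reports):
     """Group reports sharing a top-frame signature by their full stack.

     Each report is assumed to look like:
     {"signature": "nsFoo::Bar", "frames": ["nsFoo::Bar", "nsBaz::Quux", ...]}
     """
     buckets = defaultdict(lambda: defaultdict(int))
     for report in reports:
         buckets[report["signature"]][stack_hash(report["frames"])] += 1
     return buckets

 def masked_signatures(reports, min_buckets=2):
     """Return signatures whose reports split into several distinct stacks,
     i.e. signatures that are really more than one crash."""
     return dict((sig, dict(stacks))
                 for sig, stacks in split_by_stack(reports).items()
                 if len(stacks) >= min_buckets)

Splitting nightly crash volume along these lines might have surfaced the new Thunderbird/SeaMonkey crash as a separate, growing bucket instead of noise inside an existing signature.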