Gecko:Shutdown issues

Shutdown problems

Bugzilla is filled with gruesome shutdown war stories. We have put a lot of effort into improving the situation with good results, but it is still pretty bad (as shown by this crash-stats search). Properly shutting down is hard, but it is important in order to accurately check for memory leaks.

Fundamentally, I believe that many of our issues come from the fact that we write code to work well during what we'll call the "stable state" of the program: modules are started up, and the program is running and hasn't shut down yet. This is most of the program's lifetime. With a project as complex as Gecko, it is already quite an achievement to get anything to work during the stable state. Unless shutdown issues show up on the first try push or at the time of landing, most developers will not think about what could happen if an object were still alive during shutdown while things that it relies on are getting into an invalid state (we usually assume all objects are long gone by that point). As a result, most changesets aren't designed to prevent such situations from happening. Later, a thread is introduced here, a process there, timings change, and some things that were not designed with shutdown in mind start causing trouble.

After working on shutdown-related issues for a while, we started to think in terms of two categories of objects:

  • static objects: not truly static in the C++ sense, but typically abstractions with extended lifetimes that short-lived objects depend on. For example, top-level IPDL protocols, singletons, threads, the JS engine, etc.
  • dynamic objects with shorter lifetimes such as textures, DOM elements, etc.

This distinction is quite trivial but we'll use this terminology later.

Another way to categorize most objects is to separate manually managed objects with deterministic lifetimes from automatically managed objects (reference-counted or garbage-collected), which for all intents and purposes have non-deterministic lifetimes. It turns out that these categories correlate: dynamic objects are often automatically managed and static objects are often manually managed. For example, most of the modules (static objects) are destroyed synchronously, one after the other, in ShutdownXPCOM.

A lot of the shutdown issues boil down to dynamic objects depending on static objects. During shutdown, many important static objects are destroyed, and if dynamic objects that depend on them are still alive, we run into use-after-free bugs. Let's call this the first family of shutdown issues.

A second family of shutdown issues comes from improperly shutting down IPDL protocols. It's harder than it looks. One has to make sure that both sides have properly processed all incoming messages before closing, make sure that no new message will be lost in-flight when the protocol is destroyed, and take into account that it's common that a message sent may expect an asynchronous response (which should not race with the destruction either).

Right now, every time Firefox is closed, we can see loads of IPDL warnings that PContent messages are received too late while shutting down the protocol and are dropped. It could be that it is not a problem for PContent, however we used to have a similar story with various gfx IPDL protocols. In the case of the gfx protocols it led to some nasty crashes/leaks/corruptions which required us to redesign a lot of our IPC code. My intuition is that if it is fine to lose an IPDL protocol's messages now, it will likely be an issue at some point in the future, because a functionality that works "most of the time" often ends up being used by something that relies on it working every time. The PContent warnings are used as an example here because they are visible, but they may not be what we should focus on fixing right now.

The third family of shutdown issues is shutdown hangs. With e10s, the parent process waits for all content processes to be destroyed before it finishes shutting itself down, and if anything goes wrong on a content process at the wrong moment, we end up in a situation where the parent waits for something that will never happen. These hangs can be symptoms of different problems, some of which are caused by the first two families of shutdown issues presented earlier.

The first family of shutdown issues

I presented it earlier: dynamic, automatically managed resources outlive the static resources they depend on, and this causes crashes. Ideally we would not have manually managed resources that get destroyed one after the other in ShutdownXPCOM; they would all be automatically managed so that everything keeps its dependencies alive. The reality is that it would be hard to get anything to shut down at all in such a situation, or it would require us to rethink how every single module is shut down. For example, graphics resources depend on threads, so they have to be shut down before ShutdownPhase::ShutdownThreads (after which, well, you can't use threads). XPCOM threads themselves depend on other things, which depend on other things, and so on, and you quickly find out that to automatically manage the lifetime of a certain module you have to make everything else automatically managed. It is probably not a manageable change (I'd be happy for someone to prove me wrong).

The current status is that modules are shut down sequentially in a way that (implicitly) tries to respect the dependencies between modules. Except that this dependency graph has cycles. The cycle collector ends up being destroyed very late, which means some cycle-collected DOM elements end up being destroyed after things that they depend on. As a result, some canvas and media elements end up being destroyed after the modules they depend on (media, gfx), which causes some issues. The graphics and media teams have put a lot of effort into mitigating this by trying to find live objects and force them to shut down even if something else will keep them alive longer, but keeping track of all live objects is hard, especially if these objects may be used on other threads. The result is that while we brought the crash volume down significantly, some objects fall through the cracks and these crashes still exist today.

Two-phase shutdown proposal

Currently we go from stable state to shutdown, and the shutdown is this delicate mixture of destroying dynamic and static objects as well as we can. It would help a lot if we had a two-phase shutdown:

  • Phase 1: All modules destroy all of their dynamic objects, but stay in a usable state. No more DOM objects, documents or GPU textures floating around at the end of this phase. A cycle collection runs at the end of the phase to release the last dynamic objects. If there are still dynamic objects creating reference cycles after this, that should be considered a bug.
  • Phase 2: All modules shut down the way they do now, but without having to destroy all of their dynamic objects synchronously (which is the cause of much of our trouble) since the latter are already gone.

This is certainly a lot easier said than done, but managing shutdown the way we do now is arguably even harder.
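
The two phases can be sketched in a few lines of C++. This is only an illustration of the proposal, not existing Gecko code: the `Module` interface, `FakeModule` and `TwoPhaseShutdown` are hypothetical names, and the count of live dynamic objects stands in for whatever bookkeeping a real module would need.

```cpp
#include <cassert>
#include <vector>

// Hypothetical interface: none of these names exist in Gecko.
class Module {
public:
  virtual ~Module() = default;
  // Phase 1: drop every dynamic object but stay in a usable state.
  virtual void ReleaseDynamicObjects() = 0;
  // Phase 2: tear down the module itself.
  virtual void Shutdown() = 0;
  virtual int LiveDynamicObjects() const = 0;
};

// Toy module that only counts its dynamic objects (textures, elements, ...).
class FakeModule : public Module {
public:
  explicit FakeModule(int aDynamicObjects) : mDynamic(aDynamicObjects) {}
  void ReleaseDynamicObjects() override { mDynamic = 0; }
  void Shutdown() override {
    // By phase 2 nothing dynamic should still depend on this module.
    assert(mDynamic == 0);
    mAlive = false;
  }
  int LiveDynamicObjects() const override { return mDynamic; }
  bool IsAlive() const { return mAlive; }

private:
  int mDynamic;
  bool mAlive = true;
};

// Runs both phases. Returns false without shutting anything down if
// dynamic objects survive phase 1, which the proposal treats as a bug.
bool TwoPhaseShutdown(const std::vector<Module*>& aModules) {
  for (Module* module : aModules) {
    module->ReleaseDynamicObjects();
  }
  // A real implementation would run a cycle collection here.
  for (Module* module : aModules) {
    if (module->LiveDynamicObjects() != 0) {
      return false;
    }
  }
  // Phase 2: sequential shutdown, in whatever order the caller passed.
  for (Module* module : aModules) {
    module->Shutdown();
  }
  return true;
}
```

The point of the split is that the check between the two loops has a well-defined place to live: once phase 1 is over, any surviving dynamic object is a bug that can be reported, instead of a crash waiting to happen during phase 2.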

Shutting down IPDL protocols

In this section we'll look at how we approach shutting down IPDL protocols in graphics, to avoid the second family of shutdown issues mentioned earlier. It may or may not be useful for other modules. Brace yourselves for the worst ASCII diagrams in the history of software engineering.

Some IPDL actors are short-lived cross-process representations of shared resources. Let's look at PTexture as an example. PTexture wraps a shared texture (can be a Shmem, a GPU side texture, or anything that can represent a texture and be shared). Textures are mostly managed by the content process in the sense that the latter creates them, paints into them, tells the compositor which ones to use and where, and eventually decides that the texture is not needed anymore and can be destroyed. PTexture messages can be sent from both the parent and child process.

Destroying a PTexture is therefore an initiative of the content process. We send the PTexture::Destroy message and put the PTextureChild in a state where it will not send any other messages after that, although it will continue receiving and handling messages sent by the PTextureParent. When PTextureParent receives the Destroy message, it sends any other messages that it needs to and then sends the __delete__ message. Obviously it cannot send any other message after that, since __delete__ is the way you tell IPDL that you are done with an actor and that it should delete the other side as well.

        Child process             .         Parent process
 PTextureChild::SendDestroy -.    .
                              \   .
                               \  .   ... can receive and send ...
                                \ .
       Can receive               `--> PTextureParent::RecvDestroy
       Can't send                 .   |
                                  .-- | Send__delete__
                                / .   ... can't receive nor send ...
                               /  .
            Recv__delete__ <--'   .
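
The same sequence can be modelled as a small state machine. The following is a toy sketch, not IPDL: two in-memory FIFO queues stand in for the channel, and `Channel`, `TextureChild`, `TextureParent` and the `Msg` values are hypothetical names chosen for the illustration.

```cpp
#include <cassert>
#include <deque>

// Toy model, not IPDL. Update stands for any regular protocol message.
enum class Msg { Update, Destroy, Delete };

struct Channel {
  std::deque<Msg> toParent;
  std::deque<Msg> toChild;
};

class TextureChild {
public:
  explicit TextureChild(Channel& aChan) : mChan(aChan) {}

  // Returns false if the message is refused because Destroy was already sent.
  bool Send(Msg aMsg) {
    if (mDestroySent) return false;
    if (aMsg == Msg::Destroy) mDestroySent = true;
    mChan.toParent.push_back(aMsg);
    return true;
  }

  // The child keeps receiving after Destroy, until __delete__ arrives.
  void ProcessIncoming() {
    while (!mChan.toChild.empty()) {
      Msg msg = mChan.toChild.front();
      mChan.toChild.pop_front();
      if (msg == Msg::Delete) mDeleted = true;
      else ++mHandled;
    }
  }

  bool Deleted() const { return mDeleted; }
  int Handled() const { return mHandled; }

private:
  Channel& mChan;
  bool mDestroySent = false;
  bool mDeleted = false;
  int mHandled = 0;
};

class TextureParent {
public:
  explicit TextureParent(Channel& aChan) : mChan(aChan) {}

  // Returns false once __delete__ has been sent.
  bool Send(Msg aMsg) {
    if (mDeleteSent) return false;
    if (aMsg == Msg::Delete) mDeleteSent = true;
    mChan.toChild.push_back(aMsg);
    return true;
  }

  void ProcessIncoming() {
    while (!mChan.toParent.empty()) {
      Msg msg = mChan.toParent.front();
      mChan.toParent.pop_front();
      if (msg == Msg::Destroy) {
        Send(Msg::Update);  // any last message the parent still needs to send
        Send(Msg::Delete);  // nothing can follow __delete__
      } else {
        ++mHandled;
      }
    }
  }

  int Handled() const { return mHandled; }

private:
  Channel& mChan;
  bool mDeleteSent = false;
  int mHandled = 0;
};
```

In this model every message that was queued ends up handled by the other side: the child stops sending first, and the parent only stops after it has seen the child's last message.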

This simple hand-shake solves the problem of messages racing with the destruction of the actor. We apply the same hand-shake to other similar protocols (for example PCompositable). This works well under one condition: you don't ever need to destroy a PTexture pair synchronously. The problem is that if protocol PManagee is managed by protocol PManager, destroying a PManager actor will synchronously destroy all of its PManagee actors without giving you a chance to do the asynchronous hand-shake. So we have to implement a similar hand-shake for the protocols that manage PTexture, and this constraint cascades all the way to the top-level protocol. When destroying our hypothetical PManager actor, we iterate over all of its PManagee actors, ask them to destroy themselves (send their Destroy message), and then we send the PManager's own Destroy message. This guarantees that messages will be processed in the proper order and that the PManager's destruction will not interrupt the destruction of its PManagees.

        Child process             .         Parent process
      PManagee::SendDestroy  -.   .
                               \  .
      PManagee::SendDestroy  -. \ .
                               \ `- PManagee::RecvDestroy
      PManager::SendDestroy -.  /`- PManagee::RecvDestroy
                              \/ /. 
                              /\/ .   
  PManagee::Recv__delete__<--' /\ .
                              /  `-  PManager::RecvDestroy
  PManagee::Recv__delete__<--'   /.
                                / .
                               /  .
  PManager::Recv__delete__ <--'   .
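
The ordering guarantee comes entirely from the channel being FIFO. A minimal sketch, assuming a single ordered queue per direction and using hypothetical helper names (`Message`, `DestroyManager`, `ParentProcess`), shows why the manager's __delete__ cannot overtake those of its managees:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy message: which actor it targets and what it says.
struct Message {
  std::string actor;
  std::string name;
};

// Child side: queue Destroy for every managee first, then for the manager.
std::deque<Message> DestroyManager(const std::string& aManager,
                                   const std::vector<std::string>& aManagees) {
  std::deque<Message> toParent;
  for (const std::string& managee : aManagees) {
    toParent.push_back({managee, "Destroy"});
  }
  toParent.push_back({aManager, "Destroy"});
  return toParent;
}

// Parent side: answer each Destroy with __delete__, in arrival order.
std::deque<Message> ParentProcess(std::deque<Message> aToParent) {
  std::deque<Message> toChild;
  while (!aToParent.empty()) {
    Message msg = aToParent.front();
    aToParent.pop_front();
    if (msg.name == "Destroy") {
      toChild.push_back({msg.actor, "__delete__"});
    }
  }
  return toChild;
}
```

Because the manager's Destroy is queued last, its __delete__ is also the last reply, so every managee has completed its own hand-shake before the manager goes away.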

For top-level protocols we do things a bit differently. We usually have to destroy them synchronously (because we are already in ShutdownXPCOM), so we have a synchronous WillClose message that the content process sends after asking all of the managed actors to shut themselves down. The synchronous nature of the message ensures that after it is done, any message sent as a result of destroying the managed protocols is already in our event loop, waiting to be processed. The parent side can't send any message after it has received WillClose, and no message is sent from the content side after we have done the synchronous WillClose round-trip. The next thing we do is schedule a task on our own event loop that will call the built-in Close() IPDL method that closes the channel, and manually spin the event loop until this task has been processed. This way we know that although we may have IPDL messages to process in our event loop, Close() will be called after we have processed them. By the time we close the channel we have the guarantee that no IPDL traffic will get interrupted and no message will be dropped.

                 Child process             .         Parent process
               PManagee::SendDestroy  -.   .
                                        \  .
      begin PTopLevel::SendWillClose  -. \ .
                                        \ `-  PManagee::RecvDestroy
     (__delete__ msg in event-loop)  <--'  `- PTopLevel::RecvWillClose
    /                                      /
   /                                      /.
  /     end PTopLevel::SendWillClose  <--' .
  \     Schedule CloseTask    ---.         .
   \                              \        .
    `-> PManagee::Recv__delete__   \       .
                                   /       .
         PTopLevel::Close()  <----'        .
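
The "schedule Close() behind the pending messages, then spin" step can be sketched with a toy event loop. Everything here is an assumption made for illustration: `EventLoop`, `TopLevelChild` and `SendWillCloseSync` are hypothetical, and the synchronous round-trip is reduced to "the parent's replies are already queued when the call returns".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy single-threaded event loop: tasks run in FIFO order.
class EventLoop {
public:
  void Post(std::function<void()> aTask) { mTasks.push_back(std::move(aTask)); }

  // Runs queued tasks until aDone becomes true or the queue is empty.
  void SpinUntil(const bool& aDone) {
    while (!aDone && !mTasks.empty()) {
      std::function<void()> task = std::move(mTasks.front());
      mTasks.pop_front();
      task();
    }
  }

  std::size_t Pending() const { return mTasks.size(); }

private:
  std::deque<std::function<void()>> mTasks;
};

struct TopLevelChild {
  EventLoop loop;
  std::vector<std::string> log;
  bool closed = false;

  // Stand-in for the synchronous WillClose round-trip: when it returns,
  // the parent's __delete__ replies are already queued on our event loop.
  void SendWillCloseSync(int aPendingManagees) {
    for (int i = 0; i < aPendingManagees; ++i) {
      loop.Post([this] { log.push_back("Recv__delete__"); });
    }
  }

  void Shutdown(int aPendingManagees) {
    SendWillCloseSync(aPendingManagees);
    // The close task is queued behind every message already received.
    loop.Post([this] {
      log.push_back("Close");
      closed = true;
    });
    loop.SpinUntil(closed);
  }
};
```

Since the close task is posted after the round-trip, FIFO ordering alone ensures that every pending __delete__ is handled before the channel is closed.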

One last, but very important thing: ActorDestroy. Every IPDL actor can optionally override this method, but overriding it should probably be mandatory. At any time something can go wrong (a process crashes, for example), and IPDL actors will automatically get destroyed without the regular and carefully planned shutdown sequence running. Fortunately, IPDL calls ActorDestroy on an actor just before destroying it in every situation, be it normal or abnormal shutdown. So ActorDestroy is the only reliable way to get notified that an actor is going to be destroyed. If you implement a shutdown sequence like PTexture's described earlier, the code that needs to run when a texture is destroyed should be called from ActorDestroy and not when receiving the Destroy message you implemented. ActorDestroy will be called automatically when the __delete__ message is received.

Here is an example of a bug that was fixed by relying on ActorDestroy: the parent process waits for some of the IPDL protocols that use the compositor thread to be destroyed before it finishes shutting down. Some of these protocols would send the notification as part of their destruction sequence, but not from ActorDestroy (the notification itself is more involved than a simple reference count, I'll spare the gory details). It worked well unless a child process crashed, in which case we would miss a notification and hang the parent process's shutdown forever.

Another example was shared resources managed by PTexture never being properly deallocated in the parent process if the child had crashed, which potentially leaked a lot of memory until we hooked the deallocation of the shared resource into ActorDestroy instead of doing it when receiving the Destroy message.
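
To make the pattern concrete, here is a simplified model of an actor whose cleanup lives in ActorDestroy. It does not use the real IPDL classes: `TextureParentActor`, `SharedResourcePool`, `DestroyActor`, `SimulateChildCrash` and the two-value `DestroyReason` enum are hypothetical stand-ins, with `DestroyActor` playing the role of the IPDL machinery that always calls ActorDestroy before an actor goes away.

```cpp
#include <cassert>

// Simplified stand-in for the reason IPDL passes to ActorDestroy.
enum class DestroyReason { Deletion, AbnormalShutdown };

// Counts shared resources currently allocated on the parent side.
struct SharedResourcePool {
  int live = 0;
};

class TextureParentActor {
public:
  explicit TextureParentActor(SharedResourcePool& aPool) : mPool(aPool) {
    ++mPool.live;
  }

  // Handler for the protocol's own Destroy message. It only drives the
  // hand-shake; in this model, sending __delete__ destroys the actor.
  void RecvDestroy() { DestroyActor(DestroyReason::Deletion); }

  // Abnormal path: the child crashed, so no Destroy message ever arrives.
  void SimulateChildCrash() { DestroyActor(DestroyReason::AbnormalShutdown); }

  bool Destroyed() const { return mDestroyed; }
  DestroyReason LastReason() const { return mReason; }

private:
  // The only place where cleanup happens, so it runs on every path.
  void ActorDestroy(DestroyReason aWhy) {
    mReason = aWhy;
    --mPool.live;
  }

  // Stand-in for IPDL internals: calls ActorDestroy exactly once.
  void DestroyActor(DestroyReason aWhy) {
    if (mDestroyed) return;
    mDestroyed = true;
    ActorDestroy(aWhy);
  }

  SharedResourcePool& mPool;
  bool mDestroyed = false;
  DestroyReason mReason = DestroyReason::Deletion;
};
```

Had the deallocation been placed in `RecvDestroy` instead, the crash path would skip it and the pool's count would never return to zero, which is the shape of the leak described above.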

Shutdown hangs