Gecko:Shutdown issues: Difference between revisions

Clarification in the 2-phase shutdown proposal
(Discussions about Gecko shutdown related issues)
 
(Clarification in the 2-phase shutdown proposal)
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== Shutdown problems ==
== Shutdown problems ==


Bugzilla is filled with gruesome shutdown war stories. We have put a lot of effort into improving the situation with good results, but it is still pretty bad (look at crash-stats).
Bugzilla is filled with gruesome shutdown war stories. We have put a lot of effort into improving the situation with good results, but it is still pretty bad (as shown by this [https://crash-stats.mozilla.com/search/?product=Firefox&shutdown_progress=shutdown&_facets=signature&_columns=date&_columns=signature&_columns=version&_columns=build_id&_columns=platform&_columns=shutdown_progress#facet-signature crash-stats search]).
Properly shutting down is hard, but it is important in order to accurately check for memory leaks.
Properly shutting down is hard, but it is important in order to accurately check for memory leaks.


Line 7: Line 7:
Fundamentally I believe that a lot of our issues come from the fact that we write code focusing on making it work well during what we'll call the "stable state" of the program. That is, when modules are started up, the program is running and and hasn't shut down yet, which is most of the duration of the program. With a project of the complexity of gecko, it is already quite an achievement to get anything to work during the stable state, and unless shutdown issues show up on the first try push or at the time of landing, most developers will not think about what could happen if an object was still alive during shutdown while things that it relies on are getting into an invalid state (we usually assume all objects are long gone by that point), and thus most changesets aren't designed in a way that tries to prevent such situations from happening. Later, a thread is introduced here, a process there, timings change and some things that were not designed with shutdown in mind start causing trouble.
Fundamentally I believe that a lot of our issues come from the fact that we write code focusing on making it work well during what we'll call the "stable state" of the program. That is, when modules are started up, the program is running and and hasn't shut down yet, which is most of the duration of the program. With a project of the complexity of gecko, it is already quite an achievement to get anything to work during the stable state, and unless shutdown issues show up on the first try push or at the time of landing, most developers will not think about what could happen if an object was still alive during shutdown while things that it relies on are getting into an invalid state (we usually assume all objects are long gone by that point), and thus most changesets aren't designed in a way that tries to prevent such situations from happening. Later, a thread is introduced here, a process there, timings change and some things that were not designed with shutdown in mind start causing trouble.


After working on shutdown related issues for a while we started to think in terms of two category of objects:
After working on shutdown related issues for a while we started to think in terms of two categories of objects:
* static objects, not truly static in the C++ sense, but typically abstractions with extended lifetime, than short-lived object depend on. For example top-level IPDL protocols, singletons, the JS engine, etc.
* static objects, not truly static in the C++ sense, but typically abstractions with extended lifetime, that short-lived object depend on. For example top-level IPDL protocols, singletons, a thread, the JS engine, etc.
* dynamic objects with shorter lifetimes such as textures, DOM elements, etc.
* dynamic objects with shorter lifetimes such as textures, DOM elements, etc.
This distinction is quite trivial but we'll use this terminology later.
This distinction is quite trivial but we'll use this terminology later.


An other way to categorize most objects is to separate manually managed objects with deterministic lifetime and automatically managed objects (reference-counted or garbage-collected) which for all intents and purposes have non-deterministic memory management. It turns out that there is a correlation between these categories: dynamic objects are often automatically managed and static objects are often manually managed.
An other way to categorize most objects is to separate manually managed objects with deterministic lifetime and automatically managed objects (reference-counted or garbage-collected) which for all intents and purposes have non-deterministic memory management. It turns out that there is a correlation between these categories: dynamic objects are often automatically managed and static objects are often manually managed.
For example, about all of the modules (static objects) being destroyed synchronously one after the other in ShutdownXPCOM.
For example, most of the modules (static objects) being destroyed synchronously one after the other in ShutdownXPCOM.


A lot of the shutdown issues boil down to dynamic objects depending on static objects. During shutdown a lot of important static objects are destroyed and if dynamic objects that depend on them are still alive, we run into use-after-free bugs. Let's call this the '''first family of shutdown issue'''.
A lot of the shutdown issues boil down to dynamic objects depending on static objects. During shutdown a lot of important static objects are destroyed and if dynamic objects that depend on them are still alive, we run into use-after-free bugs. Let's call this the '''first family of shutdown issue'''.
Line 19: Line 19:
A '''second family of shutdown issues''' comes from improperly shutting down IPDL protocols. It's harder than it looks. One has to make sure that both sides have properly processed all incoming messages before closing, make sure that no new message will be lost in-flight when the protocol is destroyed, and take into account that it's common that a message sent may expect an asynchronous response (which should not race with the destruction either).
A '''second family of shutdown issues''' comes from improperly shutting down IPDL protocols. It's harder than it looks. One has to make sure that both sides have properly processed all incoming messages before closing, make sure that no new message will be lost in-flight when the protocol is destroyed, and take into account that it's common that a message sent may expect an asynchronous response (which should not race with the destruction either).


Right now, every time Firefox is closed, we can see loads of IPDL warnings that PContent messages are received too late while shutting down the protocol and are dropped. It could be that it is not a problem for PContent, however we used to have a similar story with various gfx IPDL protocols. In the case of the gfx protocols it led to some nasty crashes/leaks/corruptions which required us to redesign a lot of our IPC code. My intuition is that if it is fine to loose an IPDL protocol's messages now, it will likely be an issue at some point in the future, because a functionality that works "most of the time" often ends up being used by something that relies on it working every time.
Right now, every time Firefox is closed, we can see loads of IPDL warnings that PContent messages are received too late while shutting down the protocol and are dropped. It could be that it is not a problem for PContent, however we used to have a similar story with various gfx IPDL protocols. In the case of the gfx protocols it led to some nasty crashes/leaks/corruptions which required us to redesign a lot of our IPC code. My intuition is that if it is fine to lose an IPDL protocol's messages now, it will likely be an issue at some point in the future, because a functionality that works "most of the time" often ends up being used by something that relies on it working every time.
The PContent warnings are used as an example here because they are visible, but they may not be what we should focus on fixing right now.
The PContent warnings are used as an example here because they are visible, but they may not be what we should focus on fixing right now.


The '''third family of shutdown issues''' is shutdown hangs. With e10s, the parent process waits for all content processes to be destroyed before it finishes shutting itself down, and if anything goes wrong on a content process at the wrong moment, we end up in a situation where the parent waits for something that will never happen. These hangs can be symptoms of different problems, some of which being caused by the first two families of shutdown issues presented earlier.
The '''third family of shutdown issues''' is shutdown hangs. With e10s, the parent process waits for all content processes to be destroyed before it finishes shutting itself down, and if anything goes wrong on a content process at the wrong moment, we end up in a situation where the parent waits for something that will never happen. These hangs can be symptoms of different problems, some of which are caused by the first two families of shutdown issues presented earlier.


== The first family of shutdown issues ==
== The first family of shutdown issues ==


I presented it earlier. Dynamic and automatically managed resources outlive static resources they depend on, and it causes crashes. Ideally we would not have manually managed resources that get destroyed one after the other in ShutdownXPCOM and they would all be automatically managed so that a everything maintains its dependencies alive. The reality is that it would be hard to get anything to shut down at all in such a situation, or it would require us to rethink how every single module is shut down. For example graphics resources depend on threads so they have to be shutdown before ShutdownPhase::ShutdownThreads (after which, well you can't use threads.) XPCOM threads themselves depend on other things, which depend on other things, and so on, and you quickly find out that to automatically manage the lifetime of a certain module you have to make everything else automatically managed. It is probably no a manageable change (I'd be happy that someone prove me wrong).
I presented it earlier. Dynamic and automatically managed resources outlive static resources they depend on, and it causes crashes. Ideally we would not have manually managed resources that get destroyed one after the other in ShutdownXPCOM and they would all be automatically managed so that everything maintains its dependencies alive. The reality is that it would be hard to get anything to shut down at all in such a situation, or it would require us to rethink how every single module is shut down. For example graphics resources depend on threads so they have to be shut down before ShutdownPhase::ShutdownThreads (after which, well you can't use threads.) XPCOM threads themselves depend on other things, which depend on other things, and so on, and you quickly find out that to automatically manage the lifetime of a certain module you have to make everything else automatically managed. It is probably not a manageable change (I'd be happy that someone prove me wrong).


The current status is that modules are shutdown sequentially in a way that (implicitly) tries to respect the dependencies between modules. Except that this dependency graph has cycles. The cycle collector ends up being destroyed very late, which means some cycle-collected DOM elements end up being destroyed after things that they depend on. As a result, some canvas and media elements end up being destroyed after the modules they depend on (media, gfx), which causes some issues. The graphics and media team have put a lot of effort into mitigating this by trying to find live objects and force them to shut down even if something else will keep them alive longer, but it's hard to this kind of things well, especially if these objects may be used on other threads. The reality is that while we brought the crash volume down significantly, these crashes still exist today.
The current status is that modules are shutdown sequentially in a way that (implicitly) tries to respect the dependencies between modules. Except that this dependency graph has cycles. The cycle collector ends up being destroyed very late, which means some cycle-collected DOM elements end up being destroyed after things that they depend on. As a result, some canvas and media elements end up being destroyed after the modules they depend on (media, gfx), which causes some issues. The graphics and media team have put a lot of effort into mitigating this by trying to find live objects and force them to shut down even if something else will keep them alive longer, but keeping track of all live objects is hard, especially if these objects may be used on other threads. The result is that while we brought the crash volume down significantly, some objects are falling through the cracks and these crashes still exist today.


== Two-phase shutdown proposal ==
== Two-phase shutdown proposal ==


Currently we go from stable state to shutdown, and the shutdown is this delicate mixture of destroying dynamic and static objects as well as we can. It would help a lot if we had a two-phase shutdown:
Currently we go from stable state to shutdown, and the shutdown is this delicate mixture of destroying dynamic and static objects as well as we can. It would help a lot if we had a two-phase shutdown:
* Phase 1: All modules destroy all of their dynamic objects, but stay in a usable state. No more DOM objects or documents or gpu textures floating around at the end of this phase. Id the cycle collector is still retaining resources at this point, it's a bug.
* Phase 1: All modules destroy all of their dynamic objects, but stay in a usable state. No more DOM objects or documents or gpu textures floating around at the end of this phase. A cycle collection runs at the end of the phase to release the last dynamic objects. If there are still dynamic objects creating reference cycles after this, it should be considered to be a bug.
* Phase 2: All modules shut down the way they do now, but without having to destroy all of their dynamic objects synchronously (which is the cause of much of our trouble) since the latter are already gone.
* Phase 2: All modules shut down the way they do now, but without having to destroy all of their dynamic objects synchronously (which is the cause of much of our trouble) since the latter are already gone.


This is certainly a lot easier said than done, but managing shutdown the way we do now is harder.
This is certainly a lot easier said than done, but managing shutdown the way we do now is arguably even harder.


== Shutting down IPDL protocols ==
== Shutting down IPDL protocols ==
Line 45: Line 45:
Textures are mostly managed by the content process in the sense that the latter creates them, paints into them, tells the compositor which ones to use and where, and eventually decides that the texture is not needed anymore and can be destroyed. PTexture messages can be sent from both the parent and child process.
Textures are mostly managed by the content process in the sense that the latter creates them, paints into them, tells the compositor which ones to use and where, and eventually decides that the texture is not needed anymore and can be destroyed. PTexture messages can be sent from both the parent and child process.


Destroying a PTexture is therefore an initiative of the content process. we send the PTexture::Destroy message and put the PTextureChild in a state where it will not send any other messages after that, although it will continues receiving and handling messages sent by the PTextureParent. when PTexturearent receives the Destroy message, it sends any other message that it needs and sends the __delete__ message. Obviously it cannot send any other message after that since __delete__ is the way you tell IPDL that you are done with an actor and that it should delete the other side as well.
Destroying a PTexture is therefore an initiative of the content process. we send the PTexture::Destroy message and put the PTextureChild in a state where it will not send any other messages after that, although it will continues receiving and handling messages sent by the PTextureParent. when PTextureParent receives the Destroy message, it sends any other message that it needs and sends the __delete__ message. Obviously it cannot send any other message after that since __delete__ is the way you tell IPDL that you are done with an actor and that it should delete the other side as well.




Line 74: Line 74:
                                 \ `- PManagee::RecvDestroy
                                 \ `- PManagee::RecvDestroy
                                 \/.
                                 \/.
       PManager::SendDestroy -.  /`- PManager::RecvDestroy
       PManager::SendDestroy -.  /`- PManagee::RecvDestroy
                               \/ /.  
                               \/ /.  
                               /\/ .   
                               /\/ .   
Confirmed users
138

edits