Improving Firefox behavior under low memory conditions

From MozillaWiki

Introduction

Dealing properly with low-memory conditions is important to ensure reliable operation and a pleasant experience for Firefox users. Data from telemetry and crash reports has shown us that even users with high-end machines can experience poor performance or crashes under certain conditions, and users with old machines can be severely affected by crashes or swapping. Additionally, Firefox should behave properly with respect to the rest of the system and avoid consuming resources that might be needed elsewhere.

While we do have mechanisms in place to deal with these conditions - some of which are very effective - there's never been a holistic approach to them, and their effectiveness depends on the platform and the specific memory conditions.

In particular we should work towards two separate goals:

  • ensuring that crashes happen as rarely as possible and cause minimal disruption when they do
  • remaining responsive under low-memory conditions, in particular when the machine is actively swapping

Historical context

The first efforts to improve Firefox behavior under low-memory conditions date back to the introduction of the AvailableMemoryWatcher in bug 670967 for Windows. This was based on a poll-like mechanism hooked to Windows memory allocation functions and would send memory-pressure events both in case of address space exhaustion (a potential problem on 32-bit hosts) and low physical memory. Various components within Gecko would respond to these events in an attempt to reduce overall memory usage: purging the image cache, shrinking buffers to their minimum required size, running the garbage and cycle collectors, etc. At this point the logic assumed only one Firefox process and, while ostensibly designed to deal with both swapping and OOM avoidance, had limitations. In particular the polling-like mechanism had performance costs and was not particularly effective, in part because the thresholds we set turned out to be too low.

The next step was taken during Firefox OS development. Firefox OS was the first multi-process product we shipped and had extensive access to Android low-level facilities. The former called for the introduction of the ProcessPriorityManager in bug 768832. This attacked the problem from a completely different angle: we could now prioritize which processes needed to be kept alive instead of trying to shrink the amount of memory they used. Letting the kernel kill low-priority processes would free memory for those visible to the user; in turn, the killed processes could be restored by leveraging the session manager.

Additionally, Firefox OS leveraged Android kernel-based low-memory detection to send events when the system was running out of memory, before any process would be killed (bug 771195). This mechanism was hooked up to the existing memory-pressure events and sped up by allowing it to bypass the event queues (bug 876029). This mechanism would later be used in Fennec and ultimately Fenix too.

At this stage several different low-memory event types would be sent depending on the conditions we encountered. For example, a low-memory condition that persisted would trigger "ongoing" memory-pressure events, whose listeners would not attempt to purge caches once more (as these were unlikely to have been refilled already).

No such trigger-based mechanism existed on desktop platforms and the existing poller was Windows-specific. Additionally, child processes on desktop platforms couldn't detect low-memory conditions on their own. So in bug 1451005 the Windows-specific polling mechanism was replaced with a timing-based mechanism that would run opportunistically based on user interaction with the browser. This also added a mechanism to forward the memory-pressure events to child processes, to avoid running several polling loops at the same time.

The combination of low-memory triggers and process prioritization proved very successful on Firefox OS, allowing it to run on machines with as little as 128 MiB of memory. The next logical step would be to leverage similar mechanisms on desktop too, which led to the introduction of automatic tab unloading in bug 675539. The idea was to opportunistically unload tabs before the user would run out of memory, and reload them as needed. Measurements quickly showed that this mechanism was hampered by the other low-memory listeners, and thus it was refactored so that it would run first (bug 1587762).

In spite of all the work up until this point, none of these mechanisms worked on desktop machines as well as the Android-based solutions. This led to a new avenue of work: finding non-polling mechanisms to reduce reaction time. New low-memory detectors were thus implemented: for Windows based on memory resource notifications (bug 1586236), for macOS based on GCD's DISPATCH_SOURCE_TYPE_MEMORYPRESSURE events (bug 1595627) and for Linux, unfortunately still using a polling mechanism (bug 1532955).

Monitoring the effectiveness of those events in the wild showed that, again, none had a significant impact. They worked some of the time but not always, and sometimes they were even counterproductive, which led to disabling tab unloading on some platforms again.

It was at that time that we did a more thorough investigation of the causes of out-of-memory crashes and how systems would behave in low-memory conditions. This led to the realization that we had sometimes been measuring values that were not reflective of the state of the system (Windows), that we expected conditions that would never happen (macOS) and that we had not been leveraging the best mechanisms we had available (Linux).

In turn this research led to the introduction of the two most effective mechanisms we currently have: delaying allocations instead of crashing on Windows (bug 1716727) and leveraging the kernel OOM killer to reap low-priority processes on Linux (bug 1771712). Both mechanisms greatly reduced OOM crashes (see this article) but also highlighted flaws in our existing mechanisms and approach.

Platform-specific behavior

In order to improve our behavior under low-memory conditions it's important to define what those are on different platforms, understand how different operating systems react to them and how this is visible from userspace.

Android

Android has the simplest behavior among all platforms. It usually does not swap and the kernel automatically kills applications in order to free memory; malloc() or mmap() calls will never fail unless address space has been exhausted. To observe low-memory conditions an application can implement an onTrimMemory() callback, which is invoked to give applications a chance to reduce their memory consumption before they are killed. We already use this mechanism in the MemoryController class. As for process prioritization, Android already assigns different priorities based on the process behavior (e.g. if a process is visible, if it's running background activities, etc.) and instructs the kernel OOM killer to reap the "least useful" processes. On top of this we leverage Context.updateServiceGroup to inform Android about the relative priority of our own processes so that, for example, older tabs will be killed before newer ones (see bug 1625326).

Android supports using a swap file via the zswap or zRAM modules but we're not aware of their use in the wild. For the time being I think it's safe to ignore those unless we find a significant number of users having them enabled.

Linux

By virtue of having the same kernel, Linux behaves somewhat similarly to Android. malloc() and mmap() calls will never fail unless address space has been exhausted, the kernel will make extensive use of the swap file (if available) to keep applications alive, and when it completely runs out of memory it will kill processes to make room by sending them uncatchable SIGKILL signals.

Linux does not have a trigger-based mechanism to detect low-memory conditions, forcing us to rely on polling, and prior to the introduction of Pressure Stall Information (PSI) it did not have a good way to detect swapping. We recently added support for PSI in bug 1982963. Memory-pressure events are implemented by polling and rely on the information in /proc/meminfo.
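As an illustration of what such a poller reads, here is a minimal Python sketch that parses the text formats of /proc/meminfo and /proc/pressure/memory. It is not Firefox's implementation: the function names and the 10% threshold are hypothetical, and using the PSI "some avg10" value as a proxy for swapping is an assumption made for the example.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def parse_psi(text):
    """Parse /proc/pressure/memory-style text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def memory_stall_exceeds(psi, threshold_percent=10.0):
    """Hypothetical heuristic: True if tasks stalled on memory for more than
    threshold_percent of the last 10 seconds. The threshold is an assumption."""
    return psi["some"]["avg10"] >= threshold_percent


def read_if_present(path):
    """Return the file contents, or None if the kernel does not expose it."""
    p = Path(path)
    return p.read_text() if p.exists() else None


if __name__ == "__main__":
    meminfo = read_if_present("/proc/meminfo")
    psi = read_if_present("/proc/pressure/memory")
    if meminfo:
        print("MemAvailable (kB):", parse_meminfo(meminfo).get("MemAvailable"))
    if psi:
        print("stalling on memory:", memory_stall_exceeds(parse_psi(psi)))
```

The parsers take text rather than paths so they can be exercised on any platform; only the final block touches /proc.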

As for process prioritization, the OOM killer will normally pick the process that will yield the most memory when killed - which is usually a bad pick - but it can be nudged toward other processes by using the /proc/<pid>/oom_score_adj value. The process priority manager uses this mechanism to instruct the kernel to reap the least important processes first (such as the preallocated content processes) and then content processes roughly in the same order as the tab unloader would have unloaded them.
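The nudging described above can be sketched as follows. This is a hedged illustration rather than the process priority manager's actual logic: the function names, the starting score of 900 and the step of 50 are all hypothetical. Only the /proc/<pid>/oom_score_adj file and its -1000 to 1000 range come from the kernel interface.

```python
from pathlib import Path

# Valid range of oom_score_adj as defined by the Linux kernel.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def plan_oom_scores(pids_least_important_first, highest=900, step=50, floor=0):
    """Assign descending scores so the kernel reaps the first pid first.

    `highest`, `step` and `floor` are illustrative values, not Firefox's.
    """
    scores = {}
    for index, pid in enumerate(pids_least_important_first):
        scores[pid] = max(floor, highest - index * step)
    return scores


def apply_oom_score(pid, score):
    """Write the score for one process; higher means killed earlier."""
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {score}")
    Path(f"/proc/{pid}/oom_score_adj").write_text(f"{score}\n")


if __name__ == "__main__":
    # Hypothetical pids: a preallocated process, then tabs from oldest to newest.
    for pid, score in plan_oom_scores([4001, 4002, 4003]).items():
        print(pid, score)
```

Keeping the planning step separate from the write makes the ordering easy to test without touching /proc.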

Tab unloading is disabled on Linux.

macOS

Contrary to other platforms, not only will memory allocations never fail on macOS, but processes will never be killed either. macOS will extend the swap file up to two times the size of the machine's physical memory (!) to avoid an OOM condition. This usually slows the machine to a crawl, and if memory usage keeps increasing above this level a pop-up will be shown asking the user which application they want to kill to make room in memory.

macOS has several ways to inform an application of low-memory conditions: these are either UIKit callbacks or events of the DISPATCH_SOURCE_TYPE_MEMORYPRESSURE type sent to dispatch queues, see here.

Firefox currently uses all these mechanisms to send memory-pressure events, but unfortunately it is unclear what kind of condition each of them represents and what the relevant thresholds are (free physical memory, swap available, etc.).

No process prioritization is done on macOS.

Tab unloading is enabled on macOS.

Windows

Windows is the platform most likely to exhibit out-of-memory and low-memory conditions due to its inability to overcommit memory. To understand how Windows works it's important to keep in mind that memory goes through three distinct states instead of two as on the other platforms.

Address space can be reserved via VirtualAlloc() and similar functions; this yields only the requested address range, not the underlying memory. Touching these memory areas will cause an access error. Reserved address space can then be committed. When it is, Windows guarantees that it will be possible to back that range using either physical memory or the swap file. At this point the address ranges do not occupy physical memory just yet but can be touched. Storing data into those ranges will finally populate them, forcing the kernel to allocate physical memory to them.

Because of this peculiarity Windows is the only platform where memory allocations will fail if the system runs out of commit space. This can be counterintuitive, as committed memory need not be used, so a system might have significant amounts of physical memory available yet be unable to allocate more memory because commit space has been exhausted.
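The three states can be made concrete with a toy accounting model. This is a deliberately simplified sketch in Python, not the Windows API: the class and method names are hypothetical, there is a single process, and the commit limit is assumed to be physical memory plus the swap file.

```python
class CommitError(MemoryError):
    """Raised when the (simulated) system-wide commit limit is exhausted."""


class ToyWindowsMemory:
    """Page-count model of the reserve -> commit -> touch life cycle."""

    def __init__(self, physical_pages, pagefile_pages):
        self.physical_pages = physical_pages
        # Simplifying assumption: commit limit = RAM + swap file.
        self.commit_limit = physical_pages + pagefile_pages
        self.reserved = 0
        self.committed = 0
        self.touched = 0

    def reserve(self, pages):
        # Reserving consumes address space only, so it never hits the limit here.
        self.reserved += pages

    def commit(self, pages):
        if pages > self.reserved - self.committed:
            raise ValueError("cannot commit pages that were not reserved")
        if self.committed + pages > self.commit_limit:
            raise CommitError("commit space exhausted")
        self.committed += pages

    def touch(self, pages):
        if pages > self.committed - self.touched:
            raise ValueError("access error: page reserved but not committed")
        # Only now does the page need physical memory (or swap) behind it.
        self.touched += pages

    @property
    def free_physical(self):
        return max(0, self.physical_pages - self.touched)


if __name__ == "__main__":
    mem = ToyWindowsMemory(physical_pages=100, pagefile_pages=50)
    mem.reserve(1000)
    mem.commit(150)   # uses the whole commit limit
    mem.touch(10)     # but only 10 pages are actually populated
    print("free physical pages:", mem.free_physical)
    try:
        mem.commit(1)
    except CommitError as exc:
        print("allocation failed:", exc)
```

In this model the final commit fails even though most physical pages are still free, which is the counterintuitive case described above.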

When running out of commit space Windows will attempt to increase the size of the swap file, so our memory allocator delays allocations for a bit instead of crashing, in the hope that the low-memory condition is resolved in the meantime.
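The delay-and-retry idea can be sketched in a few lines. This is an illustrative Python sketch, not mozjemalloc's code: the function name, the retry count and the delay are hypothetical, and the allocator is passed in as a callable that returns None on failure.

```python
import time


def allocate_with_retry(allocate, size, retries=10, delay_s=0.05, sleep=time.sleep):
    """Call allocate(size); on failure wait and retry instead of giving up.

    Returns the allocation, or None if every attempt failed. `retries` and
    `delay_s` are illustrative values only.
    """
    for attempt in range(retries + 1):
        block = allocate(size)
        if block is not None:
            return block
        if attempt < retries:
            # Give the OS time to grow the swap file or other processes
            # time to release memory.
            sleep(delay_s)
    return None


if __name__ == "__main__":
    failures_left = [2]

    def flaky_allocate(size):
        # Stand-in allocator: fails twice, then succeeds.
        if failures_left[0] > 0:
            failures_left[0] -= 1
            return None
        return bytearray(size)

    block = allocate_with_retry(flaky_allocate, 4096)
    print("allocated" if block is not None else "gave up")
```

Injecting `sleep` keeps the loop testable without real delays; a caller that still gets None would then fall back to its normal out-of-memory handling.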

Firefox uses a trigger-based mechanism to detect low-memory conditions using memory resource notifications. This system has several limitations though: it measures physical memory, not commit space, making it less useful for detecting OOMs. Windows also does not offer a trigger-based mechanism to detect when more memory has become available, forcing us to poll to figure it out.

No process prioritization is done on Windows, but the facilities to do so are present: one can set MEMORY_PRIORITY_INFORMATION via a SetProcessInformation() call, though it's unclear what effect the different memory priorities have.

Tab unloading is enabled on Windows.

Work outline

Improving Firefox low-memory behavior will require different approaches on different platforms, but they all revolve around two important goals:

  • reliably detecting when the system is approaching the stage where Firefox will crash
  • reliably detecting when the system is swapping

This will require splitting the low-memory notification events (bug 1782178) to distinguish between swapping and near-OOM conditions, tweaking our detectors to distinguish between them and finally auditing the low-memory notification listeners to make them respond appropriately. For example, running the garbage collector while swapping would be highly detrimental and should be avoided, but it is useful to reclaim memory if we're approaching an OOM. Conversely, unloading a tab causes a surge in memory usage - something we would not want to do in a near-OOM scenario - but would later free a large amount of it, making it a suitable response to swapping.
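The split between the two conditions can be sketched as a small dispatcher in which each listener registers only for the condition it should react to. This is a hypothetical Python sketch: the enum, class and listener names are made up for illustration and do not correspond to Gecko's observer topics.

```python
from enum import Enum, auto


class MemoryPressure(Enum):
    SWAPPING = auto()   # the machine is actively swapping
    NEAR_OOM = auto()   # an out-of-memory crash is likely soon


class PressureDispatcher:
    """Routes each kind of event only to the listeners registered for it."""

    def __init__(self):
        self._listeners = {kind: [] for kind in MemoryPressure}

    def add_listener(self, kind, callback):
        self._listeners[kind].append(callback)

    def notify(self, kind):
        # Return what each listener did, in registration order.
        return [callback() for callback in self._listeners[kind]]


def run_garbage_collector():
    return "gc"


def unload_tab():
    return "unload-tab"


if __name__ == "__main__":
    dispatcher = PressureDispatcher()
    # GC reclaims memory but touches a lot of it: useful near OOM, harmful while swapping.
    dispatcher.add_listener(MemoryPressure.NEAR_OOM, run_garbage_collector)
    # Unloading a tab briefly increases memory usage: acceptable while swapping only.
    dispatcher.add_listener(MemoryPressure.SWAPPING, unload_tab)
    print(dispatcher.notify(MemoryPressure.SWAPPING))
    print(dispatcher.notify(MemoryPressure.NEAR_OOM))
```

Auditing the listeners then amounts to deciding, for each one, which of the two registrations it belongs under.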

Last but not least process prioritization should be used where applicable, to make the operating system react in an optimal way to Firefox requirements.

More specifically this work could be split into the following steps:

  • Introduce separate memory-pressure events and their associated types
  • On Windows:
    • Change the available memory watcher to check for swapping via memory resource notifications and send swapping events when it happens
    • Extend the memory allocation delay mechanisms to all the allocators we have and not just mozjemalloc
    • Leverage this system to detect near-OOM conditions. Whenever a memory allocation fails we should consider ourselves in a near-OOM scenario and send the appropriate memory-pressure event. This would be cleared following later successful allocations.
    • Study how memory priorities work and eventually use them in the process priority manager
  • On macOS:
    • Establish what the different callbacks and event sources represent and identify which ones are suitable to detect swapping.
    • Send appropriate memory-pressure events to inform listeners when the system starts swapping.
  • On Linux:
    • Detect when a process has been killed by the OOM killer and use that as a signal that we're in a near-OOM condition, and send the appropriate events. Because this doesn't require polling it's a much more robust and less performance intensive method than what we have. This will require fixing bug 1752625 first.
    • Figure out what changes are needed to accommodate specific setups, such as the use of cgroups to limit how much memory is available to a process, and whether these can lead to failed allocations that we would not expect under normal conditions. Also figure out if we need to adjust our detection mechanisms to take those limitations into account.
    • Modify the AvailableMemoryWatcher to only detect swapping conditions via PSI and send the appropriate events instead.
  • Validate the reliability of the new low-memory
  • Once all of the above is done re-enable tab unloading on all platforms.
  • Eventually audit the codebase and find more subsystems that could also respond to low-memory events - our graphical subsystems are a particularly good candidate as they can hold very large amounts of memory in the form of textures and surfaces.
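The Linux step above that treats a killed child process as a near-OOM signal can be sketched as follows. This is a hedged Python illustration for POSIX systems only, with hypothetical function names. Note the assumption it makes: a SIGKILL exit is treated as a hint, since a process can receive SIGKILL for reasons other than the OOM killer.

```python
import signal
import subprocess
import sys


def killed_by_sigkill(returncode):
    """True if a subprocess return code reports death by SIGKILL.

    On POSIX, subprocess reports "terminated by signal N" as -N.
    """
    return returncode is not None and returncode == -signal.SIGKILL


def should_send_near_oom(returncode):
    """Assumption for this sketch: any SIGKILL of a child is taken as a sign
    that the kernel OOM killer may be active. A real detector would need
    additional evidence before drawing that conclusion."""
    return killed_by_sigkill(returncode)


def demo():
    """Spawn a sleeping child, kill it with SIGKILL and inspect its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.kill()  # sends SIGKILL on POSIX, standing in for the OOM killer
    return should_send_near_oom(child.wait())


if __name__ == "__main__":
    print("near-OOM event warranted:", demo())
```

Because the parent learns about the kill when it reaps the child, no polling of memory statistics is involved.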