Gecko:CrossProcessLayers
Proposal
- Have a dedicated "GPU Process" responsible for all access to the GPU via D2D/D3D/OpenGL
- This process can be killed and restarted to fix leaks or cover for other driver bugs
- This process would be privileged, allowing content processes to be sandboxed but still use the GPU remotely
- In some configurations the GPU process could be a set of threads in a regular browser process (the master process)
- Browser processes (content and chrome) maintain a layer tree on their main threads
- Layer tree is maintained by layout code
- Each transaction that updates a layer tree pushes a set of changes over to a "shadow layer tree"
- This shadow layer tree is what we use for rendering off the main thread
- This is necessary for layer-based animation to work without being blocked by the main thread
- Since we're already pushing changes across threads, we might as well push them across process boundaries at the same time: push them all the way to the GPU process
- Therefore, the GPU process maintains a set of shadow layer trees and is responsible for compositing them
- Question: why a "set of" shadow layer trees rather than a single layer tree? The latter option would have the main thread maintain a "master tree" and push updates of it to the compositor process. Scheduling could be done by setting "frame-rate goal" attributes on layers (or something).
- Compositing a shadow layer tree either results in a buffer that's rendered into a window, or a buffer that is later composited into some other layer tree
- We can reduce VRAM usage at the expense of increased recomposition work by recompositing a content process layer tree every time we composite its parent layer tree and not having a persistent intermediate buffer
- Question: would this require a synchronous request to content to re-composite, or would composition be done on content pixels in shared memory?
- We can control the scheduling of content layer tree composition
This proposal lets us use a generic remoting layer backend. Hardware/platform specific backends are isolated to the GPU process and do not need to do their own remoting.
Implementation Steps
The immediate need is to get something working for Fennec. Proposal:
- Initially, let the GPU process be a thread in the master process
- Build the remoting layers backend
- Publish changes from the content process layer trees and the master process chrome layer tree to shadow trees managed by the GPU thread
Implementation Details for Fennec
- Key question: what cairo backend do we use to draw into ThebesLayers?
- Image backend?
- Allocate shared system memory or bc-cat buffers for the regions of ThebesLayers to update
- Gecko processes draw into those areas using cairo image backend
- GL backend uploads textures from system memory, or acquires a texture handle for the bc-cat buffer across processes (e.g. using texture_from_pixmap); composites those changes into its ThebesLayer buffers
- Windowless plugins suffer, except for Flash where we have NPP_DrawImage and can use layers to composite those images together
- GTK theme rendering suffers
- do we care?
- Xlib backend?
- Allocate pixmaps for changed ThebesLayer areas and bind them to textures using texture_from_pixmap
- Gecko processes draw into those pixmaps using cairo Xlib backend
- GL backend acquires texture handle across processes; composites those changes into its ThebesLayer buffers
- Maybe we can have some XShm or bc-cat hack that lets us do it all ... an X pixmap that we can also poke directly through shared memory and that is also a texture!
Future Details
- How to handle D2D?
- Direct access
- Allocate D3D buffer for changed ThebesLayer areas
- Gecko processes draw into it using cairo D2D backend
- Indirect access (for sandboxed content processes etc)
- Remote cairo calls across to the GPU process, creating a command queue that gets posted instead of a new buffer
Important cases on fennec
Will the browser process need to see all layers in a content process, or just a single container/image/screen layer? We need a plan for optimal performance (responsiveness and frame rate) in the following cases.
- Panning: browser immediately translates coords of either single screen layer or container layer before delivering event to content process. Content (or browser) later uses event to update region painting heuristics.
- Volume rocker zoom: browser immediately sets scaling matrix for either single screen layer or container layer (fuzzy zoom). Content (or browser) later uses event to update region painting heuristics.
- Double-tap zoom
- Question: How long does it typically take to determine the zoom target?
- Single screen layer: need to propagate event into content process before repainting so that it can determine target of zoom.
- Container layer: can we use layer-tree heuristics to do a fuzzy zoom while content process figures out target? (Better perceived responsiveness)
- Video
- Single screen layer: decoding needs to be done in the content process (?). Possibly better parallelism for SW-only decoding. The content process controls frame-rate allocation for multiple videos. Harder to adjust frame rates across browser/content because it relies on the OS for CPU scheduling.
- Container layer: can extract video layers from the content container and schedule them centrally. The browser (decoding thread) decodes all videos. Possibly more efficient with HW-accelerated decoding because commands for several videos can be batched. Easier to allocate frame rates across all visible videos.
- canvas: For write-only canvases, it seems best to keep backing memory in VRAM. But this makes synchronous reads expensive, and it's not clear how updates would be published from the content process without access to the GPU. For read/write canvases, it's probably better to store in shared system memory. We could potentially switch backing stores at runtime, but shmem-only seems a simpler first approach.
- CSS transforms and SVG filters: scheduling work between browser and content probably needs to be viewed as a distributed optimization problem. For SW-only transforms/filters, we probably want to do as much work as possible in the content process. Unclear for HW acceleration, but that's future work.
- Animations: cjones doesn't know enough to comment on this.
General comment: I think we'll want the browser process to be able to see each content process's full published layer tree (which may not be equivalent to its local layer tree). Good scheduling of work is a tricky problem that likely changes per device and possibly per page; we probably want to be flexible about which gfx operations are done in the browser vs. content. E.g., the content process should be able to partially composite layer subtrees.
Question: for a given layer subtree, can we reasonably estimate how much CPU and GPU time the transformation/compositing operations will take? We could use this information for distributed scheduling.
cjones is in favor of a first implementation where the content process only publishes a single "screen layer". Unsure how video fits into this, although decoding in content process seems fine. Tentatively in favor of initially assuming content process can use GPU so that we can get baseline perf numbers to compare to if we decide to take away GPU access from content.
Concurrency model for remote layers
This is somewhat a lower-level implementation detail, somewhat not.
Assume we have a master process M and a slave process S. M and S maintain their own local layer trees M_l and S_l. M_l may have a leaf RemoteContainer layer R into which updates from S are published. The contents of R are immutable with respect to M, but M may freely modify R itself (e.g. its attributes and position in M_l). R contains the "shadow layer tree" R_s published by S. R_s is semantically a copy of a (possibly) partially-composited S_l.
Updates to R_s are atomic wrt painting. When S wishes to publish updates to M, it sends an "Update(cset)" message to M containing all R_s changes to be applied. This message is processed in its own "task" ("event") in M. This task will (?? create a layer tree transaction and ??) apply cset. cset will include layer additions, removals, and attribute changes. Initially we probably want Update(cset) to be synchronous. (cjones believes it can be made asynchronous, but that would add unnecessary complexity for a first implementation.) Under the covers (opaque to M), in-place updates will be made to existing R_s layers.
Question: how should M publish updates of R_s to its own master MM? One approach is to apply Update(cset) to R_s, then synchronously publish Update(cset union M_cset) to its master MM. This is an optimization that allows us to maintain copy semantics without actually copying.
A cset C can be constructed by implementing a TransactionRecorder interface for layers (layer managers?). The recorder will observe all mutations performed on a tree and package them into an IPC message. (This interface could also be used for debugging, to dump layer modifications to stdout.)