Everybody's heard stories about mobile GPUs being different from desktop GPUs in that they do "deferred" rendering instead of "immediate" rendering.

Do we actually know what this means? What are the implications for the performance of our mobile gfx code? What should we change?

Gathering raw documentation from the source

There doesn't seem to be a good source of information on "deferred" GPUs in general. Worse, "deferred" means different things on different GPUs, with different performance implications.

So, for lack of something better, let's start with the only information available: documentation from GPU vendors, which is going to be biased toward their own GPUs. Once we understand that, we'll hopefully be able to aggregate it into a big vendor-neutral picture with only well-identified vendor-specific parts.

ARM (Mali)

Mali Developer Center

Mali GPU Application Optimization Guide (2011)

Qualcomm (Adreno)

Adreno 200 Performance Optimization (2010)

NVIDIA (Tegra)

Tegras are the only major mobile GPUs that are immediate, like desktop GPUs, rather than deferred like other mobile GPUs.

NVIDIA White Papers

NVIDIA Tegra 4 Family GPU Architecture

Imagination Technologies (PowerVR)

POWERVR Series5 Graphics

PowerVR -- A Master Class in Graphics Technology and Optimization (2012)

More documentation seems to exist at their "PowerVR Insider" site, but it appears to be behind a paywall.

Intel

Intel currently just licenses PowerVR. They seem to be preparing different hardware for the future, but for now we only need to care about PowerVR-based Intel.

What does "deferred" actually mean in various GPUs?

The best document that I could find on this is POWERVR Series5 Graphics, section 3. However, we won't use exactly the terminology of this document because it reserves the word "deferred" solely for PowerVR's version of "deferred".

In our terminology here, there are 3 types of GPUs: immediate, tile-based deferred rasterization, and tile-based deferred HSR, where HSR stands for Hidden Surface Removal.

We will abbreviate "tile-based deferred rasterization" as tbd-rast and "tile-based deferred HSR" as tbd-hsr.

Here's a table summarizing how this maps to various GPU vendors' terminology, what GPUs fall into which category, and what each term actually means.

Our terminology (category)	Immediate	Deferred	Deferred
Our terminology (type)	Immediate	Tile-based deferred rasterization, abbreviated as tbd-rast	Tile-based deferred HSR, abbreviated as tbd-hsr
ImgTec terminology	Immediate rendering	Tile-based rendering (TBR)	Tile-based deferred rendering (TBDR)
ARM terminology	Immediate rendering	Interchangeably "tile-based rendering" or "tile-based deferred rendering"
Hardware	NVIDIA Tegra, desktops	ARM Mali, Qualcomm Adreno	ImgTec PowerVR
Meaning	Submitted geometry is immediately rendered; no tiling is used.	Submitted geometry is immediately transformed and stored in per-tile lists. Rasterization is then done separately for each tile.	Submitted geometry is immediately transformed and stored in per-tile lists. HSR is then done for each tile, yielding a list of visible fragments.
Performance implications	Good old desktop GPU optimization	Optimizations discussed below for deferred GPUs	Optimizations discussed below for deferred GPUs; the only difference is that there is no need for front-to-back sorting, as HSR is efficiently handled by hardware.

In a tbd-rast GPU, upon submitting geometry, vertex shaders are run and the resulting triangles are clipped, but instead of proceeding further down the pipeline as they would in an immediate renderer, the triangles are only recorded in tile-specific triangle lists. The actual rasterization of the triangles in each tile is delayed until the frame needs to be resolved, whence the name: tile-based deferred rasterization. Deferring rasterization until all the triangles in a given tile are known allows tbd-rast GPUs to achieve higher efficiency, if only through higher cache coherency of framebuffer accesses: in practice, the tile size is small enough that the framebuffer tile fits in cache memory, which considerably reduces framebuffer memory bandwidth. There are probably further gains too, although they depend on GPU specifics. For example, deferred rendering may allow GPUs to sort primitives by texture, achieving higher texture cache coherency.
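To make the "tile-specific triangle lists" idea concrete, here is a minimal sketch of the binning step. It is only an illustration: the 16-pixel tile size, the function name `bin_triangles`, and the use of bounding boxes to decide which tiles a triangle touches are all assumptions, not something taken from any vendor's documentation.

```python
import math
from collections import defaultdict

TILE_SIZE = 16  # assumed tile edge in pixels; real tile sizes vary by GPU


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Record each triangle in the list of every tile its bounding box touches.

    `triangles` is a sequence of triangles, each given as three (x, y)
    screen-space vertices that are already transformed and clipped.
    Returns a dict mapping (tile_x, tile_y) to a list of triangle indices.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    tile_lists = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [vertex[0] for vertex in tri]
        ys = [vertex[1] for vertex in tri]
        # Conservative binning: use the bounding box, clamped to the screen.
        first_tx = max(math.floor(min(xs) / tile_size), 0)
        last_tx = min(math.floor(max(xs) / tile_size), tiles_x - 1)
        first_ty = max(math.floor(min(ys) / tile_size), 0)
        last_ty = min(math.floor(max(ys) / tile_size), tiles_y - 1)
        for ty in range(first_ty, last_ty + 1):
            for tx in range(first_tx, last_tx + 1):
                tile_lists[(tx, ty)].append(index)
    return dict(tile_lists)


# A small triangle inside one tile, and a larger one straddling four tiles.
example = [
    [(1, 1), (5, 1), (1, 5)],
    [(10, 10), (20, 10), (10, 20)],
]
tile_lists = bin_triangles(example, width=32, height=32)
```

Once every triangle of the frame has been binned this way, each tile can be rasterized on its own, using only the triangles in its list and a tile-sized piece of framebuffer.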

All the same applies to tbd-hsr GPUs such as PowerVR's, which are similar to tbd-rast GPUs except for an additional optimization that they perform automatically: when a tbd-hsr GPU is about to start rasterizing the triangles in a given tile, it first identifies, for each fragment, which primitives may be visible at that fragment: see Section 4.4 in this PowerVR document. What this means in practice is that a tbd-hsr GPU will be equally efficient regardless of the ordering of opaque primitives, whereas other types of GPUs will perform better if opaque geometry is submitted in front-to-back order.
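The effect of submission order can be illustrated with a toy model of a single pixel. This sketch assumes a GPU that rejects hidden fragments with a depth test before shading them; the function names are made up for the example, and real hardware is considerably more involved.

```python
def shaded_fragments_with_depth_test(depths_in_submission_order):
    """Count the fragments shaded at one pixel when a depth test rejects
    fragments that lie behind what has already been drawn.

    Smaller depth values are closer to the viewer. All primitives are opaque.
    """
    nearest = float("inf")
    shaded = 0
    for depth in depths_in_submission_order:
        if depth < nearest:  # passes the depth test, so it gets shaded
            nearest = depth
            shaded += 1
    return shaded


def shaded_fragments_with_hsr(depths_in_submission_order):
    """With per-tile HSR, only the visible opaque fragment gets shaded,
    whatever the submission order was."""
    return 1 if depths_in_submission_order else 0


back_to_front = [0.9, 0.5, 0.1]
front_to_back = sorted(back_to_front)

print(shaded_fragments_with_depth_test(back_to_front))  # 3
print(shaded_fragments_with_depth_test(front_to_back))  # 1
print(shaded_fragments_with_hsr(back_to_front))         # 1
```

In this model, back-to-front submission shades every layer, front-to-back submission shades only the nearest one, and HSR reaches that best case without any sorting by the application.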

Performance implications of "deferred"

Draw-calls are expensive

Anything that can force immediate framebuffer resolving is expensive

Framebuffer bindings are expensive

glCopyTexImage2D is expensive

Anything that can force saving/restoring framebuffer memory is expensive

Always call glClear immediately after glBindFramebuffer
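The usual rationale, as far as we understand it, is that a clear issued right after the bind tells a tile-based driver that it does not need to load the old framebuffer contents back into tile memory. One cheap way to audit our own code for this is to scan a recorded list of GL call names. The sketch below assumes such a trace is available as a plain list of strings; the helper `binds_without_clear` is hypothetical and is not part of any GL tooling.

```python
def binds_without_clear(call_trace):
    """Return the positions of glBindFramebuffer calls that are not
    immediately followed by a glClear call.

    `call_trace` is a list of GL function names in the order they were issued.
    """
    offenders = []
    for position, call in enumerate(call_trace):
        if call != "glBindFramebuffer":
            continue
        has_next = position + 1 < len(call_trace)
        next_call = call_trace[position + 1] if has_next else None
        if next_call != "glClear":
            offenders.append(position)
    return offenders


trace = [
    "glBindFramebuffer",
    "glClear",
    "glDrawArrays",
    "glBindFramebuffer",  # no clear follows: flagged
    "glDrawArrays",
]
print(binds_without_clear(trace))  # [3]
```

A check this crude will also flag binds where the previous contents are genuinely needed, so its output is a list of places to review rather than a list of bugs.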

Overdraw is still expensive on tbd-rast GPUs

Other performance pitfalls (not related to "deferred")

Here are some other performance pitfalls on mobile GPUs:

Use only triangle strips

Reason: Adreno 200s have a vertex cache size of 2.
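To see why such a small cache favors strip-like ordering, here is a toy simulation that counts how many vertices have to be transformed for a given index sequence. The FIFO replacement policy is an assumption made for the sake of the example; the cache's actual behavior may differ.

```python
from collections import deque


def count_vertex_transforms(indices, cache_size=2):
    """Count vertex shader runs for an index sequence, assuming a FIFO
    cache of already-transformed vertices holding `cache_size` entries."""
    cache = deque(maxlen=cache_size)
    transforms = 0
    for index in indices:
        if index not in cache:
            transforms += 1
            cache.append(index)  # when full, the oldest entry drops out
    return transforms


# The same four triangles, first in strip order, then scattered.
strip_order = [0, 1, 2,  2, 1, 3,  2, 3, 4,  4, 3, 5]
scattered = [0, 1, 2,  4, 3, 5,  2, 1, 3,  2, 3, 4]

print(count_vertex_transforms(strip_order))  # 6
print(count_vertex_transforms(scattered))    # 11
```

In strip order, each triangle after the first reuses the two vertices still in the cache, so six distinct vertices cost six transforms. Once consecutive triangles stop sharing an edge, a two-entry cache gives almost no reuse.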

Make sure that our framebuffer's bit depth matches the hardware framebuffer's

Otherwise we'll get inefficient conversions between 16bpp and 32bpp. The typical pitfall here is that we request FBConfigs with at least 16bpp and blindly take the first one returned, which may actually be 32bpp.
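A minimal sketch of the safer selection logic follows. The configs here are plain dictionaries standing in for whatever the windowing API returns, and `pick_config` is a hypothetical helper; real code would have to query each candidate's channel sizes (for instance with `eglGetConfigAttrib` or `glXGetFBConfigAttrib`) before comparing.

```python
def pick_config(configs, wanted_bpp):
    """Return the first config whose total color depth equals `wanted_bpp`
    exactly, or None if there is no such config."""
    for config in configs:
        total_bits = (config["red"] + config["green"]
                      + config["blue"] + config["alpha"])
        if total_bits == wanted_bpp:
            return config
    return None


# Hypothetical result of asking for "at least 16bpp": the deeper config
# happens to come first.
returned_configs = [
    {"id": 1, "red": 8, "green": 8, "blue": 8, "alpha": 8},
    {"id": 2, "red": 5, "green": 6, "blue": 5, "alpha": 0},
]

blindly_taken = returned_configs[0]            # 32bpp, not what we wanted
matching = pick_config(returned_configs, 16)   # the RGB565 config
```

Returning None when nothing matches leaves the fallback decision to the caller, rather than silently picking a config with the wrong depth.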

Avoid MakeCurrent's

The Adreno 200 document lists MakeCurrent as very expensive. On B2G at least, we should be able to land bug 749678.
