Platform/GFX/DeferredGPUs: Difference between revisions

Replaced content with "This page moved to MobileGPUs as we realized that it was more useful to have a page for all mobile GPUs."
Everybody's heard stories about mobile GPUs being different from desktop GPUs in that they do "deferred" rendering instead of "immediate" rendering.
This page moved to [[Platform/GFX/MobileGPUs|MobileGPUs]] as we realized that it was more useful to have a page for all mobile GPUs.
 
Do we actually know what this means? What are the implications for the performance of our mobile gfx code? What should we change?
 
= Gathering raw documentation from the source =
 
There doesn't seem to exist a good source of information on "deferred" GPUs in general. Worse, "deferred" means different things on different GPUs with different performance implications.
 
So for lack of something better, let's start with the only information available: documentation from GPU vendors, which is going to be biased toward their own GPUs. With that in hand, we try below to aggregate that information into more generally useful knowledge.
 
== ARM (Mali) ==
 
[http://malideveloper.arm.com/documentation/ Mali Developer Center]
 
[http://infocenter.arm.com/help/topic/com.arm.doc.dui0555a/DUI0555A_mali_optimization_guide.pdf Mali GPU Application Optimization Guide (2011)]
 
== Qualcomm (Adreno) ==
 
[https://developer.qualcomm.com/download/adreno200performanceoptimizationopenglestipsandtricksmarch10.pdf Adreno 200 Performance Optimization (2010)]
 
== NVIDIA (Tegra) ==
 
Tegras are the only major mobile GPUs that are immediate, like desktop GPUs, rather than deferred like other mobile GPUs.
 
[http://www.nvidia.ca/object/white-papers.html NVIDIA White Papers]
 
[http://www.nvidia.ca/docs/IO/116757/Tegra_4_GPU_Whitepaper_FINALv2.pdf NVIDIA Tegra 4 Family GPU Architecture]
 
== Imagination Technologies (PowerVR) ==
 
[http://www.imgtec.com/powervr/insider/docs/POWERVR%20Series5%20Graphics.SGX%20architecture%20guide%20for%20developers.1.0.8.External.pdf POWERVR Series5 Graphics]
 
[http://www.imgtec.com/powervr/insider/powervr_presentations/GDC%20HardwareAndOptimisation.pdf PowerVR -- A Master Class in Graphics Technology and Optimization (2012)]
 
More documentation seems to exist on their "PowerVR Insider" site, but it appears to be behind a paywall.
 
== Intel ==
 
Intel currently just licenses PowerVR. They seem to be preparing different hardware for the future, but for now we only need to care about PowerVR-based Intel.
 
== Other GPU vendors without known public documentation ==
 
[http://www.broadcom.com/products/technology/mobmm_videocore.php Broadcom VideoCore]
 
[http://www.vivantecorp.com/index.php/en/technology/3d Vivante]
 
= What does "deferred" actually mean in various GPUs? =
 
The best document that I could find on this is [http://www.imgtec.com/powervr/insider/docs/POWERVR%20Series5%20Graphics.SGX%20architecture%20guide%20for%20developers.1.0.8.External.pdf POWERVR Series5 Graphics], section 3. However, we won't use exactly the terminology of this document because it reserves the word "deferred" solely for PowerVR's version of "deferred".
 
In our terminology, there are 3 types of GPUs: immediate, tile-based deferred rasterization, and tile-based deferred HSR, where '''HSR stands for Hidden Surface Removal'''.
 
We will abbreviate "tile-based deferred rasterization" as '''tbd-rast''' and "tile-based deferred HSR" as '''tbd-hsr'''.
 
Here's a table summarizing how this maps to various GPU vendors' terminology, what GPUs fall into which category, and what each term actually means.
 
{|class="wikitable"
!rowspan="2"|Our terminology
!rowspan="2"|Immediate
!colspan="2" style="text-align: center" |Deferred
|-
|'''Tile-based deferred rasterization''', abbreviated as '''tbd-rast'''
|'''Tile-based deferred HSR''', abbreviated as '''tbd-hsr'''
|-
![http://www.imgtec.com/powervr/insider/docs/POWERVR%20Series5%20Graphics.SGX%20architecture%20guide%20for%20developers.1.0.8.External.pdf ImgTec terminology]
|Immediate rendering
|Tile-based rendering (TBR)
|Tile-based deferred rendering (TBDR)
|-
![http://infocenter.arm.com/help/topic/com.arm.doc.dui0555a/DUI0555A_mali_optimization_guide.pdf ARM terminology]
|Immediate rendering
| scope="row" colspan="2" style="text-align: center" |Interchangeably "tile-based rendering" or "tile-based deferred rendering"
|-
!Hardware
|NVIDIA Tegra, desktops
|ARM Mali, Qualcomm Adreno
|ImgTec PowerVR
|-
!Meaning
|Submitted geometry is immediately rendered; no tiling is used.
|Submitted geometry is immediately transformed and stored in per-tile lists. Rasterization is then done separately for each tile.
|Submitted geometry is immediately transformed and stored in per-tile lists. HSR is then done for each tile, yielding a list of visible fragments.
|-
!Performance implications
|Good old desktop GPU optimization
|Optimizations discussed below for deferred GPUs
|Optimizations discussed below for deferred GPUs; the only difference is that there is no need for front-to-back sorting, as HSR is efficiently handled by hardware.
|}
 
In a '''tbd-rast''' GPU, when geometry is submitted, vertex shaders are run and the resulting triangles are clipped, but instead of proceeding further down the pipeline as they would in an immediate renderer, the triangles are only recorded in tile-specific triangle lists. The actual rasterization of the triangles in each tile is delayed until the frame needs to be resolved, hence the name: ''tile-based deferred rasterization''. Deferring rasterization until all the triangles in a given tile are known allows '''tbd-rast''' GPUs to achieve higher efficiency than immediate GPUs, if only through higher cache coherency of framebuffer accesses: in practice, the tile size is small enough that the framebuffer tile fits in cache memory, which considerably reduces framebuffer memory bandwidth. There are probably further gains too, although they depend on GPU specifics. For example, deferred rendering may allow GPUs to sort primitives by texture, achieving higher texture cache coherency.
 
All of that also applies to '''tbd-hsr''' GPUs such as PowerVR's, which are similar to '''tbd-rast''' GPUs except for an additional optimization that they automatically perform: when a '''tbd-hsr''' GPU is about to start rasterizing the triangles in a given tile, it first identifies for each fragment which primitives may be visible at that fragment: see Section 4.4 in [http://www.imgtec.com/powervr/insider/docs/POWERVR%20Series5%20Graphics.SGX%20architecture%20guide%20for%20developers.1.0.8.External.pdf this PowerVR document]. What this means in practice is that a '''tbd-hsr''' GPU will be equally efficient regardless of the ordering of opaque primitives, whereas other types of GPUs will perform better if opaque geometry is submitted in front-to-back order.
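
As a rough illustration of the binning step described above, here is a minimal Python sketch that assigns already-transformed, screen-space triangles to per-tile lists using their bounding boxes. The 16-pixel tile size, the name <code>bin_triangles</code>, and the bounding-box test are all assumptions made for illustration; real GPUs use their own tile sizes and binning rules, which the vendor documents do not fully specify.

```python
from collections import defaultdict

TILE_SIZE = 16  # assumed tile size in pixels; real hardware varies


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding boxes.

    Each triangle is a tuple of three (x, y) screen-space vertices that have
    already been transformed and clipped to the framebuffer.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Conservative test: every tile touched by the bounding box
        # receives the triangle, even if the triangle itself misses it.
        first_tx = max(0, int(min(xs)) // tile_size)
        last_tx = min(tiles_x - 1, int(max(xs)) // tile_size)
        first_ty = max(0, int(min(ys)) // tile_size)
        last_ty = min(tiles_y - 1, int(max(ys)) // tile_size)
        for ty in range(first_ty, last_ty + 1):
            for tx in range(first_tx, last_tx + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    ((1, 1), (10, 2), (4, 12)),      # fits inside tile (0, 0)
    ((12, 12), (20, 12), (12, 20)),  # straddles four tiles
]
bins = bin_triangles(triangles, width=32, height=32)
```

Rasterization would then walk <code>bins</code> one tile at a time, which is what lets the working set for each tile stay in on-chip memory.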
 
= Performance implications of "deferred" =
 
== Draw-calls are expensive ==
 
They are always expensive regardless of the type of GPU, but from talking with ARM people, it sounded like something about deferred rendering makes them worse. The ARM document explains: ''"The processing required by draw calls include allocating memory, copying data, and processing data. The overhead is the same whether you draw a single triangle or thousands of triangles in a draw call."''
 
The consensus seems to be that it is worth using a texture atlas to minimize draw calls, even if (in the case of a browser) that means dynamically updating the atlas with glTexSubImage2D. Well, at least on drivers where glTexSubImage2D is not broken.
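
To make the dynamic-atlas idea a bit more concrete, here is a minimal sketch of a "shelf" allocator, one generic way to hand out rectangles inside an atlas. This is not taken from the vendor documents: the class name <code>ShelfAtlas</code>, the square atlas, and the lack of any eviction or padding are all simplifying assumptions.

```python
class ShelfAtlas:
    """Hypothetical shelf allocator for a square texture atlas."""

    def __init__(self, size):
        self.size = size
        self.shelf_y = 0       # top edge of the current shelf
        self.shelf_height = 0  # height of the tallest image on the shelf
        self.cursor_x = 0      # next free x position on the shelf

    def allocate(self, width, height):
        """Return (x, y) for a width x height image, or None if it does not fit."""
        if width > self.size or height > self.size:
            return None
        if self.cursor_x + width > self.size:
            # The current shelf is full: start a new one below it.
            self.shelf_y += self.shelf_height
            self.shelf_height = 0
            self.cursor_x = 0
        if self.shelf_y + height > self.size:
            return None
        position = (self.cursor_x, self.shelf_y)
        self.cursor_x += width
        self.shelf_height = max(self.shelf_height, height)
        return position


atlas = ShelfAtlas(64)
first = atlas.allocate(32, 16)
second = atlas.allocate(32, 8)
third = atlas.allocate(16, 16)
too_big = atlas.allocate(64, 64)
```

The returned position would be used as the xoffset and yoffset arguments of glTexSubImage2D when uploading the image into the atlas texture.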
 
== Anything that can force immediate framebuffer resolving is expensive ==
 
Forcing immediate framebuffer resolving negates the benefits of deferred renderers and turns them into liabilities.
 
This can be caused by changing the framebuffer binding, or by anything that will depend on the framebuffer's pixel values.
 
=== Framebuffer bindings are expensive ===
 
Changing the framebuffer binding forces the rendering of the current framebuffer to be resolved immediately. Therefore it is important to sort rendering to minimize framebuffer bindings. The Adreno 200 document, Section 3.2.4, has a useful explanation.
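
As a small illustration of sorting rendering by framebuffer, the sketch below counts binding changes for a list of draws before and after grouping them by target. The draw list and the names <code>fbo_a</code> and <code>fbo_b</code> are made up, and the sketch assumes the draws are independent of each other; real code can only reorder draws that do not depend on one another's results.

```python
def count_framebuffer_binds(draws):
    """Count how many times the bound framebuffer changes over a draw list."""
    binds = 0
    current = None
    for framebuffer, _label in draws:
        if framebuffer != current:
            binds += 1
            current = framebuffer
    return binds


# Hypothetical draw list: (target framebuffer, description).
draws = [
    ("fbo_a", "layer 1"),
    ("fbo_b", "layer 2"),
    ("fbo_a", "layer 3"),
    ("fbo_b", "layer 4"),
]

# sorted() is stable, so draws keep their relative order within each target.
grouped = sorted(draws, key=lambda draw: draw[0])

binds_before = count_framebuffer_binds(draws)
binds_after = count_framebuffer_binds(grouped)
```

In this example the interleaved order binds four times, while the grouped order binds twice.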
 
=== glCopyTexImage2D is expensive ===
 
Traditional GPU optimization lore says that glReadPixels is expensive. The updated version of this story to account for deferred GPUs says that glCopyTexImage2D is quite expensive too (although not as terrible as glReadPixels), because it forces resolving the framebuffer's rendering.
 
== Always call glClear immediately after glBindFramebuffer ==
 
See the Adreno 200 document, section 3.2.1: ''"it is imperative to (a) use clears when switching Frame Buffer Objects (FBOs) to avoid having the driver tries to save/restore GMEM contents, and (b) always clear the depth-buffer at the start of a frame."''
 
That makes sense, so we should always do it. Concretely, this means that we should do a glClear after every glBindFramebuffer call, ideally right after it, or at least before any draw-call.
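
One way to check this rule in our own code would be to scan a recorded list of GL call names for binds that reach a draw call without a clear in between. The sketch below assumes such a call log exists as a plain list of strings; the function name <code>binds_missing_clear</code> and the log format are hypothetical, and only glDrawArrays and glDrawElements are treated as draw calls.

```python
DRAW_CALLS = ("glDrawArrays", "glDrawElements")


def binds_missing_clear(calls):
    """Return indices of glBindFramebuffer calls followed by a draw before any glClear."""
    missing = []
    pending_bind = None  # index of a bind that has not been cleared yet
    for index, name in enumerate(calls):
        if name == "glBindFramebuffer":
            pending_bind = index
        elif name == "glClear":
            pending_bind = None
        elif name in DRAW_CALLS and pending_bind is not None:
            missing.append(pending_bind)
            pending_bind = None
    return missing


# Hypothetical call log: the second bind draws before it clears.
calls = [
    "glBindFramebuffer",
    "glClear",
    "glDrawArrays",
    "glBindFramebuffer",
    "glDrawElements",
    "glClear",
]
offenders = binds_missing_clear(calls)
```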
 
== Overdraw is still expensive on tbd-rast GPUs ==
 
While '''tbd-hsr''' GPUs optimize away overdraw regardless of the order of submission of geometry, '''tbd-rast''' GPUs don't. Therefore it remains important to sort opaque geometry in front-to-back order.
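
The effect of submission order can be seen in a toy model. The sketch below counts how many fragments pass a less-than depth test when opaque full-screen layers are submitted in a given order. It assumes fragments are rejected by the depth test before being shaded, which is a simplification, and the function name and layer depths are made up for illustration.

```python
def shaded_fragments(layer_depths, pixel_count):
    """Count fragments that pass a less-than depth test, in submission order.

    Each entry of layer_depths is the depth of one opaque full-screen layer;
    smaller values are nearer to the viewer.
    """
    depth_buffer = [float("inf")] * pixel_count
    shaded = 0
    for layer_depth in layer_depths:
        for pixel in range(pixel_count):
            if layer_depth < depth_buffer[pixel]:
                depth_buffer[pixel] = layer_depth
                shaded += 1
    return shaded


layers_back_to_front = [0.9, 0.5, 0.1]
back_to_front = shaded_fragments(layers_back_to_front, pixel_count=4)
front_to_back = shaded_fragments(sorted(layers_back_to_front), pixel_count=4)
```

With three layers over four pixels, back-to-front submission shades 12 fragments, while front-to-back shades only 4.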
 
= Misc mobile GPU performance topics =
 
Add here anything that's not directly related to deferred rendering.
 
== Use only triangle strips ==
 
Reason: Adreno 200 GPUs have a vertex cache size of 2.
 
== Make sure that our framebuffer's bit depth matches the hardware framebuffer's ==
 
Or else we'll get inefficient conversions between 16bpp and 32bpp. The typical pitfall is that we request FBConfigs with ''at least'' 16bpp and blindly take the first one returned, which may actually be 32bpp.
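
A sketch of the safer selection logic, under the assumption that the candidate configs have already been queried into plain dictionaries with a <code>bpp</code> field: the helper name <code>pick_config</code> and the dictionary layout are hypothetical, and real code would read these attributes through the windowing API.

```python
def pick_config(configs, wanted_bpp):
    """Prefer an exact bit-depth match; otherwise take the smallest deeper config."""
    exact = [config for config in configs if config["bpp"] == wanted_bpp]
    if exact:
        return exact[0]
    deeper = [config for config in configs if config["bpp"] > wanted_bpp]
    if deeper:
        return min(deeper, key=lambda config: config["bpp"])
    return None


# Hypothetical list as a driver might return it for "at least 16bpp".
configs = [
    {"id": 1, "bpp": 32},
    {"id": 2, "bpp": 16},
    {"id": 3, "bpp": 24},
]

naive_choice = configs[0]               # blindly taking the first entry
matched_choice = pick_config(configs, 16)
```

Here the naive choice is the 32bpp config, while the matched choice is the 16bpp one.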
 
== Avoid MakeCurrent calls ==
 
The Adreno 200 document lists it as very expensive. On B2G at least we should be able to land {{bug|749678}}.
 
== Always clear depth and stencil framebuffers at the same time ==
 
The underlying format may be a DEPTH24_STENCIL8 buffer and failing to clear the two components at the same time may force the GPU to go down a slow path.