Platform/GFX/MobileGPUs: Difference between revisions

Line 18: Line 18:


[https://developer.qualcomm.com/download/adreno200performanceoptimizationopenglestipsandtricksmarch10.pdf Adreno 200 Performance Optimization (2010)]
[https://github.com/freedreno/freedreno/wiki/Adreno-tiling Adreno Tiling]


== NVIDIA (Tegra) ==
Line 91: Line 92:


All of that also applies to '''tbd-hsr''' GPUs such as PowerVR's, which are similar to '''tbd-rast''' GPUs except for an additional optimization that they automatically perform: when a '''tbd-hsr''' GPU is about to start rasterizing the triangles in a given tile, it first identifies for each fragment which primitives may be visible at that fragment: see Section 4.4 in [http://www.imgtec.com/powervr/insider/docs/POWERVR%20Series5%20Graphics.SGX%20architecture%20guide%20for%20developers.1.0.8.External.pdf this PowerVR document]. What this means in practice is that a '''tbd-hsr''' GPU will be equally efficient regardless of the ordering of opaque primitives, whereas other types of GPUs will perform better if opaque geometry is submitted in front-to-back order.
= Notes about tile-based GPUs =
The traditional OpenGL pipeline requires a lot of memory bandwidth, which is very bad for power consumption. Mobile GPUs try to alleviate that by moving the framebuffer out of main memory and into high-speed on-chip memory (GMEM). This memory is very fast and power-efficient, but also very small, so the GPU breaks the framebuffer up into smaller tiles and renders them one after the other using the on-chip memory. The size of tile buffers varies across hardware, but tiles can be as small as 16x16 pixels. To avoid overdraw, the GPU collects all the geometry, computes culling based on the output of the vertex shader, and stores the result in a spatial data structure for later use. When a tile is rendered, the data structure is consulted to see which triangles are relevant.
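
As a rough illustration of the binning step described above, here is a minimal sketch that maps triangles to the tiles their screen-space bounding boxes overlap. The function name, the 16-pixel tile size, and the bounding-box test are all assumptions made for illustration; real hardware uses its own tile sizes and data structures.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels; assumed, real sizes vary by hardware


def bin_triangles(triangles, fb_width, fb_height, tile=TILE):
    """Map each tile (tx, ty) to the indices of the triangles whose
    screen-space bounding box overlaps it.

    Each triangle is a list of three (x, y) pixel coordinates.
    """
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the framebuffer.
        x0, x1 = max(min(xs), 0), min(max(xs), fb_width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), fb_height - 1)
        if x0 > x1 or y0 > y1:
            continue  # entirely off-screen
        for ty in range(int(y0) // tile, int(y1) // tile + 1):
            for tx in range(int(x0) // tile, int(x1) // tile + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(2, 2), (10, 3), (5, 12)],      # fits inside tile (0, 0)
    [(10, 10), (40, 12), (20, 30)],  # spans several tiles
]
bins = bin_triangles(triangles, fb_width=320, fb_height=480)
```

When tile (tx, ty) is later rendered, only the triangles listed in its bin need to be considered. A bounding-box test is conservative: it may list a triangle for a tile it does not actually cover.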


= Performance implications of "deferred" =
Line 98: Line 103:
They are always expensive regardless of the type of GPU, but talking with ARM people it sounded like there was something about deferred that made it worse. The ARM document explains: ''"The processing required by draw calls include allocating memory, copying data, and processing data. The overhead is the same whether you draw a single triangle or thousands of triangles in a draw call."''


The consensus seems to be that it is worth using a texture atlas to minimize draw calls, even if (in the case of a browser) that means dynamically updating the atlas with glTexSubImage2D, at least on drivers where glTexSubImage2D is not broken.
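
To make the atlas idea concrete, here is a minimal sketch of a "shelf" allocator that hands out sub-rectangles of one large texture. The class and its method are hypothetical and not taken from any existing code; the returned position, together with the image size, is the kind of region one would upload with glTexSubImage2D (as its xoffset, yoffset, width and height arguments).

```python
class ShelfAtlas:
    """Hypothetical shelf packer: places images left to right on a shelf,
    and opens a new shelf below when the current one is full."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.shelf_x = 0  # next free x on the current shelf
        self.shelf_y = 0  # top of the current shelf
        self.shelf_h = 0  # height of the tallest image on the current shelf

    def allocate(self, w, h):
        """Return (x, y) for a w*h image, or None if it does not fit."""
        if w > self.width or h > self.height:
            return None
        if self.shelf_x + w > self.width:
            # Current shelf is full: open a new one below it.
            self.shelf_y += self.shelf_h
            self.shelf_x = 0
            self.shelf_h = 0
        if self.shelf_y + h > self.height:
            return None
        pos = (self.shelf_x, self.shelf_y)
        self.shelf_x += w
        self.shelf_h = max(self.shelf_h, h)
        return pos


atlas = ShelfAtlas(256, 256)
first = atlas.allocate(100, 50)
second = atlas.allocate(100, 60)
third = atlas.allocate(100, 40)  # does not fit on the first shelf
too_big = atlas.allocate(300, 10)
```

A shelf packer wastes some space but is simple and fast, which matters when the atlas is updated dynamically. It never frees space, so a real implementation would need some eviction policy.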
== Replacing a texture image can cause a pipeline stall ==
 
Since deferred GPUs accept new GL calls while they are still resolving the previous frame, a texImage2D call that replaces an image in a texture that is still being sampled will stall the pipeline.


== Anything that can force immediate framebuffer resolving is expensive ==
Line 123: Line 130:


While '''tbd-hsr''' GPUs optimize away overdraw regardless of the order of submission of geometry, '''tbd-rast''' GPUs don't. Therefore it remains important to sort opaque geometry in front-to-back order.
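
A minimal sketch of front-to-back sorting, assuming each opaque draw carries a view-space depth where a smaller value means closer to the camera. The dictionary layout and function name are made up for illustration.

```python
def sort_front_to_back(draws):
    """Order opaque draws by increasing depth (nearest first)."""
    return sorted(draws, key=lambda d: d["depth"])


draws = [
    {"name": "background", "depth": 100.0},
    {"name": "panel", "depth": 10.0},
    {"name": "button", "depth": 1.0},
]
ordered = [d["name"] for d in sort_front_to_back(draws)]
```

With the nearest geometry drawn first, the depth test can reject fragments of the farther geometry before they are shaded. Blended geometry still has to be drawn back to front, after the opaque pass.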
== Incremental frame updates ==
While rerendering only the part of the framebuffer that has changed since the last swap looks like an interesting performance trick, it turns out to perform badly on tile-based GPUs. If the previous contents of a tile are not cleared, the GPU must first restore them into the tile buffer, and the bandwidth this costs can be higher than that of just redrawing the entire frame.
Some vendors provide extensions (like QCOM_tiled_rendering) to address this use case. Such extensions make it possible to minimize the number of tiles that need to be restored.
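
To get a feel for how much restoring can be saved, here is a small sketch that counts the tiles overlapped by a dirty rectangle. The 32x32 tile size and the function name are assumptions; the actual tile size is chosen by the driver and is generally not visible to the application.

```python
def tiles_touched(rect, tile_w, tile_h):
    """Count the tiles overlapped by rect = (x, y, width, height) in pixels."""
    x, y, w, h = rect
    if w <= 0 or h <= 0:
        return 0
    first_tx, last_tx = x // tile_w, (x + w - 1) // tile_w
    first_ty, last_ty = y // tile_h, (y + h - 1) // tile_h
    return (last_tx - first_tx + 1) * (last_ty - first_ty + 1)


TILE_W, TILE_H = 32, 32  # assumed tile size
full = tiles_touched((0, 0, 320, 480), TILE_W, TILE_H)
dirty = tiles_touched((40, 40, 50, 20), TILE_W, TILE_H)
```

Under these assumptions a 320x480 framebuffer has 150 tiles, while a 50x20 dirty rectangle at (40, 40) touches only 2 of them. Telling the driver about the rectangle, for example through glStartTilingQCOM and glEndTilingQCOM from QCOM_tiled_rendering, is what lets it skip the other tiles.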
== Per pixel hidden surface removal ==
Some GPUs like PowerVR family do per pixel hidden surface removal to shad as few pixels as possible. Blending, sample masking and fragment shaders that may output transparent pixels or use the discard keyword will disable this optimization.


= Misc mobile GPU performance topics =
Line 135: Line 150:


Reason: Adreno 200s have 256 KB of GMEM. All of the render targets need to fit in GMEM, so using only a color buffer means we can have a larger tile size. The Freedreno driver doesn't currently (Apr 5 2013) optimize for this and always has all three buffers. It would be good to confirm with Qualcomm that their driver does optimize this.
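
The effect on tile size can be estimated with simple arithmetic. The sketch below assumes that the quoted GMEM size means 256 kilobytes, that color is 32 bits per pixel, and that depth and stencil are packed into another 32 bits per pixel; it ignores any alignment or layout constraints real hardware may have.

```python
GMEM_BYTES = 256 * 1024  # Adreno 200 GMEM size, assumed to be 256 kilobytes


def max_tile_pixels(bytes_per_pixel_per_buffer):
    """Pixels that fit in GMEM when every listed buffer must be resident."""
    return GMEM_BYTES // sum(bytes_per_pixel_per_buffer)


color_only = max_tile_pixels([4])      # 32-bit color only
color_depth = max_tile_pixels([4, 4])  # plus packed 24-bit depth / 8-bit stencil
```

Under these assumptions, color only allows 65536 pixels per tile (for example 256x256), while adding depth and stencil halves that to 32768, so twice as many tiles are needed to cover the same framebuffer.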


== Make sure that our framebuffer's bit depth matches the hardware framebuffer's ==
Line 148: Line 164:


The underlying format may be a DEPTH24_STENCIL8 buffer and failing to clear the two components at the same time may force the GPU to go down a slow path.
== Memory Bandwidth ==
Unagi can memcpy 320*480*4 bytes in 0.7 ms, or 834 MB/s.
It can memmove at 200 MB/s.
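
The arithmetic behind these figures can be checked with a few lines. This is only a sanity check on the quoted numbers, not a benchmark, and the helper name is made up.

```python
def bandwidth(bytes_copied, seconds):
    """Return throughput in bytes per second."""
    return bytes_copied / seconds


frame_bytes = 320 * 480 * 4             # one 32-bit 320x480 frame
rate = bandwidth(frame_bytes, 0.7e-3)   # 0.7 ms as quoted above
mib_per_s = rate / (1024 * 1024)        # binary megabytes per second
mb_per_s = rate / 1e6                   # decimal megabytes per second
memmove_ms = frame_bytes / 200e6 * 1e3  # time for one frame at 200 MB/s
```

One frame is 614400 bytes. With the rounded 0.7 ms timing this gives about 837 MiB/s (about 878 MB/s in decimal units), close to the quoted 834 MB/s; the small gap is presumably due to rounding of the timing. At 200 MB/s, moving one frame takes about 3 ms.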