The PlayStation®Vita SoC (system-on-a-chip) contains an SGX543MP4+ GPU. This is a multi-core, tile based deferred rendering GPU, with an advanced unified shader architecture. The key features of the SGX543MP4+ are summarized below.
PowerVR Series5XT SGXMP chips are multi-core variants of the SGX series with some updates. The PlayStation Vita portable gaming device uses the MP4+ model of the PowerVR SGX543; the only intended difference from the base SGX543 is the number of cores, where MP4+ denotes four cores (quad-core).
- Model: SGX543
- Date: Jan 2009
- Cores: 4 (available as 1-16 on other platforms)
- Die Size (mm2): [email protected] nm
- Config core: 4/2
- Fillrate (@ 200 MHz): 35 MTriangles/s, 1000 MPixel/s
- Bus width (bit): 64
- API: DirectX 9.0, OpenGL 2.1
- GFLOPS (@ 200 MHz, per core): 7.2
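As a rough sanity check on the figures above, the per-core throughput can be converted into a per-clock and an aggregate number. This is only arithmetic on the listed values; it assumes ideal linear scaling across the four cores and the 200 MHz reference clock from the list, neither of which is confirmed for the shipping device.

```python
# Values copied from the list above; linear scaling across cores is an assumption.
GFLOPS_PER_CORE = 7.2     # at the 200 MHz reference clock
CLOCK_HZ = 200e6
CORES = 4

# Floating-point operations issued per clock by one core.
flops_per_cycle_per_core = GFLOPS_PER_CORE * 1e9 / CLOCK_HZ

# Theoretical peak for all four cores at the same clock.
peak_gflops_all_cores = GFLOPS_PER_CORE * CORES

print(round(flops_per_cycle_per_core))    # 36
print(round(peak_gflops_all_cores, 1))    # 28.8
```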
SGX543MP4+ Block Overview
The SGX543MP4+ multi-core GPU contains four SGX543+ cores and several master blocks whose role it is to orchestrate and distribute work amongst the cores efficiently.
The master blocks of SGX543MP4+ are described below.
The Master VDM – Master Vertex Data Master – is the front end of the GPU. It is responsible for reading the single command stream constructed by the graphics driver (libgxm) from memory; and for distributing all vertex work amongst the cores.
The Master VDM splits vertex processing work amongst the cores in order to load balance their usage, while at the same time preserving primitive submission order to ensure correct rasterization. Vertex processing work is split according to a pre-defined split threshold (in number of primitives). These primitive chunks are the fundamental unit of vertex processing work at master level. Consequently, it is possible for individual Draw commands to be split over multiple cores. Once the cores accept their requested vertex processing work they operate independently, each generating its own subset of the Parameter Buffer.
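The chunking described above can be sketched as follows. The function name, the round-robin core assignment and the sequence number are assumptions made for illustration; the text only states that work is split by a primitive-count threshold and that submission order is preserved.

```python
def split_draw(primitive_count, split_threshold, core_count=4):
    """Split one draw into ordered chunks and assign each chunk to a core.

    Round-robin assignment is an illustrative assumption; the real
    load-balancing policy is not described in the text.
    Returns a list of (sequence_number, core, first_primitive, count).
    """
    chunks = []
    first = 0
    seq = 0
    while first < primitive_count:
        count = min(split_threshold, primitive_count - first)
        chunks.append((seq, seq % core_count, first, count))
        first += count
        seq += 1
    return chunks


# A 10-primitive draw with a threshold of 4 ends up on three different cores.
chunks = split_draw(primitive_count=10, split_threshold=4)
for chunk in chunks:
    print(chunk)
```

The sequence number carried by each chunk is what would later allow the original submission order to be reconstructed.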
The Master IPF – Master ISP Parameter Fetch – is responsible for processing the Parameter Buffer produced by the multiple cores and initiating the rasterization process. As Master IPF reads the Parameter Buffer, it combines the individual subsets produced by the cores, such that each tile has a single tile command stream. A tile command stream contains references to primitive data (output by vertex programs) and state data required for rasterization and fragment processing.
Using information provided by the Master VDM, the Master IPF also ensures primitives binned in each tile are rasterized in submission order. In parallel to the Parameter Buffer read, the Master IPF will also distribute rasterization and fragment processing amongst the cores; and ensure they are efficiently load balanced. Since each tile is completely independent, distribution across the cores is naturally achieved in terms of tiles; this is the fundamental unit of rasterization work at master level. Each tile assigned to a core is rasterized and processed entirely on that core.
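A minimal sketch of the merge step, assuming each core tags its binned primitives with a submission sequence number (a hypothetical representation; the actual Parameter Buffer layout is not given here). Per tile, the per-core lists are merged back into a single stream in submission order.

```python
import heapq


def merge_tile_streams(per_core_bins):
    """Combine per-core bins into one ordered primitive list per tile.

    per_core_bins: one dict per core, mapping tile -> list of
    (sequence_number, primitive_id) already sorted by sequence number.
    Returns a dict mapping tile -> primitive ids in submission order.
    """
    tiles = set()
    for bins in per_core_bins:
        tiles.update(bins)

    merged = {}
    for tile in sorted(tiles):
        streams = [bins.get(tile, []) for bins in per_core_bins]
        # heapq.merge performs an ordered merge of already-sorted inputs.
        merged[tile] = [prim for _, prim in heapq.merge(*streams)]
    return merged


core0 = {(0, 0): [(0, "a"), (2, "c")]}
core1 = {(0, 0): [(1, "b")], (1, 0): [(1, "b")]}
tile_streams = merge_tile_streams([core0, core1])
print(tile_streams)
```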
The Master DPM – Master Dynamic Parameter Management – (block) is responsible for managing Parameter Buffer memory for all GPU cores. No CPU or user intervention is required to manage the Parameter Buffer; it is handled entirely in hardware.
Parameter Buffer memory is managed as a linked list of pages. As vertex processing proceeds individual cores request memory pages from the Master DPM, which are allocated from a free list. As fragment processing proceeds and sections of the scene are rendered, the Master DPM receives lists of memory pages back from the cores. These page allocation and free operations are performed in parallel to help ensure the maximum amount of free Parameter Buffer memory is maintained.
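The page management described above can be modelled with a toy free-list allocator. The class, its method names and the `owner` key are hypothetical; the sketch only shows pages being taken from a free list during vertex processing and returned in bulk once the data they hold has been rendered.

```python
class PageAllocator:
    """Toy free-list allocator for fixed-size Parameter Buffer pages."""

    def __init__(self, page_count):
        self.free_list = list(range(page_count))
        self.owned = {}   # owner -> pages currently holding its data

    def allocate(self, owner):
        """Hand out one page, or return None when no free page is left."""
        if not self.free_list:
            return None
        page = self.free_list.pop()
        self.owned.setdefault(owner, []).append(page)
        return page

    def release(self, owner):
        """Return every page held by a rendered owner to the free list."""
        pages = self.owned.pop(owner, [])
        self.free_list.extend(pages)
        return len(pages)


allocator = PageAllocator(page_count=2)
first = allocator.allocate("region0")
second = allocator.allocate("region0")
exhausted = allocator.allocate("region1")   # None: the heap is exhausted
freed = allocator.release("region0")        # both pages become free again
```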
The PTLA – Present and Texture Load Accelerator – (block) is a fixed function transfer unit that operates asynchronously to the GPU cores. The PTLA block supports various forms of format conversion, memory layout conversion, downscaling, copying, and filling of 2D images. See "Transfers using PTLA" for additional information.
The SLC – System Level Cache – is the highest level cache within the GPU; sitting between all cores, the master-level blocks and memory. All GPU memory requests go through the SLC in order to avoid re-fetching of memory between and within the cores.
In addition, to avoid thrashing the SLC, not all memory requests are cached. This depends on the type of data and the hardware block issuing the request. For example, texture and vertex attribute data is likely to be read many times and so is cached, whereas the VDM Command Stream is read only once and so is un-cached. Similarly, triangle primitive data in the Parameter Buffer is written only once per macro tile (group of tiles), so these writes are un-cached; this same data is read many times, however, so reads are cached.
The SLC is implemented in four separate cache banks, one per memory access channel, with a crossbar directing the requests based on address. This allows all four cores to access different SLC banks in parallel.
- 256K total cache size
- 64K cache bank per core
- 16-way set associative
- Pseudo-LRU replacement policy
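Address-based bank selection of the kind described above can be illustrated as below. The interleave granularity (`INTERLEAVE_BYTES`) is an assumed value chosen purely for the example; the text does not state how addresses map to banks.

```python
BANK_COUNT = 4
TOTAL_SLC_BYTES = 256 * 1024
INTERLEAVE_BYTES = 256   # assumed granularity, not taken from the text


def slc_bank(address):
    """Pick the bank serving an address under simple block interleaving."""
    return (address // INTERLEAVE_BYTES) % BANK_COUNT


bank_bytes = TOTAL_SLC_BYTES // BANK_COUNT   # 64K per bank, matching the list

# Consecutive blocks land in different banks, so four requesters touching
# neighbouring blocks can be served in parallel.
banks = [slc_bank(block * INTERLEAVE_BYTES) for block in range(5)]
print(banks)   # [0, 1, 2, 3, 0]
```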
Single SGX543+ Core Block Overview
This section provides a high-level overview of a single SGX543+ core in terms of its underlying hardware blocks. There are four such cores within the SGX543MP4+ multi-core GPU, each of them identical.
The hardware blocks of a single SGX543+ core are described below.
The VDM – Vertex Data Master – is responsible for starting vertex processing within the core. The VDM accepts VDM command stream segments from the Master VDM via a FIFO. The VDM command stream contains high-level commands such as Draw Index List and Set Vertex Processing State, which in turn reference primitive (vertex) indices and state data. The state data is used to setup the PDS, USSE pipes and Tiling Accelerator for vertex processing.
Additionally, the VDM parses the index data to determine unique indices; forwarding only these indices to the PDS. This is done to help reduce redundant vertex shading computation.
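The effect of forwarding only unique indices can be sketched as follows. This version tracks every index seen so far, which is an assumption; real hardware may well limit the search to a small window of recent indices.

```python
def unique_indices(index_list):
    """Return each vertex index once, in order of first appearance."""
    seen = set()
    unique = []
    for index in index_list:
        if index not in seen:
            seen.add(index)
            unique.append(index)
    return unique


# Two triangles sharing an edge: six indices, but only four vertices to shade.
indices = [0, 1, 2, 2, 1, 3]
to_shade = unique_indices(indices)
print(to_shade)   # [0, 1, 2, 3]
```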
The PDM – Pixel Data Master – is responsible for starting rasterization and fragment processing within the core. The PDM accepts tile command streams, stored in the Parameter Buffer, from the Master IPF. One tile command stream is received per tile, containing references to primitive data (output by vertex programs) and state data. The tile command streams contain all the information needed to setup and execute all subsequent fragment processing stages.
The PDM comprises the hidden surface removal unit (also known as the Image Synthesis Processor or ISP) and the texture and shader setup unit (known as TSP).
The PDS – Programmable Data Sequencer – controls how vertices and fragments are processed on the USSE pipes; including the fetching of input data and allocation of USSE shared resources.
During vertex processing the VDM issues commands to the PDS, which is responsible for de-indexing each vertex and DMA-fetching its input vertex attribute data. The PDS then issues a command to the USSE pipes so that the associated vertex program is executed.
During fragment processing, the PDS receives groups of 2x2 pixel blocks, known as spans, following hidden surface removal. All fragments in a span are visible and use the same fragment program.
The PDS then issues a series of commands to fetch interpolants, issue non-dependent texture reads and instruct the USSE pipes to execute the associated fragment program for visible fragments.
The DMS – Data Master Selector – arbitrates between the data masters (VDM and PDM) so that vertex and fragment processing can share the single PDS unit on a core.
The USSE – Universal Scalable Shader Engine – is a fully programmable processing unit, primarily used to execute vertex and fragment programs. There are four USSE pipes within a single SGX543+ core, each accepting work (in the form of tasks) from the PDS.
The USSE units have a powerful and highly optimized instruction set for processing vertices and fragments efficiently; the instruction set is also general in nature. Consequently the USSE units are also responsible for executing the GPU Firmware (on core 0 only).
The USSE also has instructions for interfacing other hardware blocks, including instructions to emit shaded vertices to the Tiling Accelerator or entire rendered tiles (on end of tile) to the Pixel Back End (PBE).
Each USSE pipe has its own control unit (for scheduling tasks), pipeline controller (for decoding instructions) and pipeline data path (for executing instructions). Additionally, each USSE pipe has its own local memory, known as unified store, for holding input, temporary and output registers.
Each USSE pipe is responsible for scheduling its own work, and it does so at a fine-grained level to ensure maximum usage of its resources and improve performance. It is therefore possible for a single USSE pipe to execute multiple vertex and fragment programs in parallel.
The TA – Tiling Accelerator – is the collective name for the group of hardware blocks responsible for implementing the tiling process; that is, the binning of primitives following vertex processing into tiles, ready for rasterization and deferred rendering. The Tiling Accelerator comprises the MTE, TE and DPM blocks described below.
The MTE – Macro Tiling Engine – performs the first stage of the tiling process. It accepts shaded vertices from the USSE pipes and primitive index data from the VDM (via the IDX FIFO) and subsequently generates blocks of vertex and index data, known as primitive blocks, binned by macro tiles (a high-level rectangular group of tiles).
During this process, after the viewport transform has been applied, the MTE will cull as many primitives as possible, in order to reduce the amount of vertex and index data written to memory; and subsequently reduce the number of primitives that need to be rasterized and fragment shaded. The MTE's culling methods include: back-face culling, off-screen culling and small-primitive culling.
Primitives culled by the MTE are rejected from all further processing. The remaining accepted primitives then undergo clipping (if necessary), before being forwarded to the TE.
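Two of the culling tests named above can be sketched in screen space. Treating a positive signed area as front-facing is an assumption (the real convention depends on winding and render state), and small-primitive culling is not modelled.

```python
def signed_area(v0, v1, v2):
    """Signed area of a 2D triangle; the sign encodes its winding."""
    return 0.5 * ((v1[0] - v0[0]) * (v2[1] - v0[1])
                  - (v2[0] - v0[0]) * (v1[1] - v0[1]))


def is_culled(v0, v1, v2, width, height, front_is_positive=True):
    """Return True if the triangle is back-facing, degenerate or off screen."""
    area = signed_area(v0, v1, v2)
    front_facing = area > 0 if front_is_positive else area < 0
    if not front_facing:          # also rejects zero-area triangles
        return True
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Off-screen test: the bounding box lies entirely outside the viewport.
    if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
        return True
    return False


front = is_culled((0, 0), (10, 0), (0, 10), 100, 100)        # kept
back = is_culled((0, 0), (0, 10), (10, 0), 100, 100)         # back-facing
outside = is_culled((200, 0), (210, 0), (200, 10), 100, 100) # off screen
```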
The TE – Tiling Engine – performs the second stage of the tiling process. It accepts primitive data from the MTE and performs two incremental tiling algorithms in order to compute the minimal list of tiles that intersects each primitive. The TE then uses this information to create lists of primitives that are contained within each tile. During the tiling process, these lists are written to memory, one per tile, in the form of tile command streams.
Once the MTE indicates the completion of the scene, the TE will then terminate the tile command streams and write their list headers (known as region headers) to memory. Together, the primitive blocks of the MTE, and the region headers and tile command streams of the TE, make up the Parameter Buffer.
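The idea of building per-tile primitive lists can be sketched with simple bounding-box binning. This is a swapped-in, conservative technique: it may list tiles the triangle does not actually touch, whereas the TE is described as computing the minimal tile list. The tile size and grid dimensions are assumed parameters.

```python
import math


def bin_triangle(vertices, tile_size, tiles_x, tiles_y):
    """Return the tiles overlapped by the triangle's bounding box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    tx0 = max(math.floor(min(xs) / tile_size), 0)
    tx1 = min(math.floor(max(xs) / tile_size), tiles_x - 1)
    ty0 = max(math.floor(min(ys) / tile_size), 0)
    ty1 = min(math.floor(max(ys) / tile_size), tiles_y - 1)
    return [(tx, ty) for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]


def build_tile_lists(triangles, tile_size, tiles_x, tiles_y):
    """Map each tile to the ids of the primitives binned into it, in order."""
    lists = {}
    for prim_id, tri in enumerate(triangles):
        for tile in bin_triangle(tri, tile_size, tiles_x, tiles_y):
            lists.setdefault(tile, []).append(prim_id)
    return lists


triangles = [
    [(5, 5), (40, 5), (5, 40)],   # bounding box spans four tiles
    [(1, 1), (2, 1), (1, 2)],     # fits inside tile (0, 0)
]
tile_lists = build_tile_lists(triangles, tile_size=32, tiles_x=4, tiles_y=4)
```

In this example tile (1, 1) receives the first triangle even though the triangle does not reach it, which is exactly the over-binning an exact tiling algorithm would avoid.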
The DPM – Dynamic Parameter Management – (block) manages Parameter Buffer memory page allocations and de-allocations. It works closely with the Master DPM so a single Parameter Buffer can be managed entirely in hardware; and without CPU or user intervention.
During the tiling process the DPM will receive page allocation requests from the MTE and TE.
During fragment processing the DPM will receive page de-allocation requests from the Master DPM once each macro tile has been rendered and its associated list of memory pages can be freed.
The DPM, together with the Master DPM, is also responsible for detecting when the Parameter Buffer memory heap has been exhausted; and subsequently signaling when a Partial Render needs to be executed.
The ISP – Image Synthesis Processor – is the first stage in the fragment processing pipeline and is responsible for performing pixel-accurate (and sample-accurate) hidden surface removal. The ISP does this on a tile basis ahead of fragment shading; a key feature of the SGX architecture known as tile based deferred rendering. This ensures only visible fragments are shaded and USSE cycles are not wasted on occluded fragments, which are discarded at the start of the fragment processing pipeline.
The ISP consists of the following three sub-blocks, which are described in processing stage order:
- ISP Parameter Fetch block parses tile command streams received from the Master IPF, forwarding ISP state information downstream and fetching primitive vertex positional (XYZ) data from Parameter Buffer primitive blocks in memory.
- ISP FPU block then converts all incoming primitives into triangles (including lines and point sprites), before generating the necessary plane equations required for rasterization and depth comparison.
- ISP block will conduct hidden surface removal for the tile, using dedicated on-chip tile depth/stencil/mask memory. This greatly reduces the memory bandwidth required for depth/stencil tests; even reducing it to zero if the depth/stencil values are no longer needed in future passes or the current scene. The ISP block is additionally responsible for initializing the tile (loading depth/stencil/mask values from memory if required), multi-sampling, visibility tests and updating the depth/stencil/mask buffer in memory (if required).
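The benefit of resolving visibility before shading can be shown with a small model of one tile. It assumes opaque geometry only, a "less than" depth test and a hypothetical `shade` callback; blending, stencil and multi-sampling are left out.

```python
def resolve_visibility(fragments, tile_pixels):
    """Depth-test every fragment of a tile and keep the nearest per pixel.

    fragments: iterable of (x, y, depth, primitive_id) in submission order.
    Returns a dict mapping (x, y) -> visible primitive id.
    """
    depth = {pixel: float("inf") for pixel in tile_pixels}
    visible = {}
    for x, y, z, prim in fragments:
        if (x, y) in depth and z < depth[(x, y)]:
            depth[(x, y)] = z
            visible[(x, y)] = prim
    return visible


def shade_tile(visible, shade):
    """Run the (expensive) shading function once per visible pixel only."""
    return {pixel: shade(prim) for pixel, prim in visible.items()}


shade_calls = []

def shade(prim):
    shade_calls.append(prim)
    return "colour of " + prim

fragments = [(0, 0, 0.5, "far"), (0, 0, 0.2, "near"), (1, 0, 0.7, "far")]
visible = resolve_visibility(fragments, tile_pixels=[(0, 0), (1, 0)])
colours = shade_tile(visible, shade)
print(len(shade_calls))   # 2 shading calls for 3 submitted fragments
```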
The TSP – Texture and Shader setup Processor – accepts groups of visible 2x2 pixel blocks (known as spans) from the ISP and performs the necessary setup for texturing and fragment shading to proceed.
The TSP consists of the following two sub-blocks, described in processing stage order:
- TSP Parameter Fetch block first fetches vertex position, color, and texcoord attributes from memory for visible primitives; these are the vertex program outputs stored in primitive blocks during tiling. Once read from memory, this vertex data is stored in a dedicated cache to reduce memory bandwidth (that is, when vertices are shared amongst primitives). Additionally the TSP Parameter Fetch block fetches state information for visible primitives, which is forwarded to the PDS to setup fragment processing of spans.
- TSP FPU block then uses the vertex data fetched by the TSP Parameter Fetch block to perform triangle setup. Like the ISP FPU, it first converts all primitives (including lines and point sprites) into triangles and then generates plane equations (per required vertex attribute), which are forwarded to the iterator (TITR and UITR) blocks.
The UITR – USSE Iterator – (blocks) generate per-fragment color and texcoord data for visible primitives. These values are computed at 32-bit precision using fragment X, Y positions from the PDS, with the vertex attributes fetched by the TSP Parameter Fetch block and the plane equations generated by the TSP FPU.
These per-fragment colors and texcoords (and position, if required) are then forwarded to the USSE pipes as fragment program inputs.
There are two UITR blocks within a core; each shared by two USSE pipes.
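How a plane equation yields a per-fragment attribute value can be sketched as below. The sketch is purely screen-space linear: perspective correction and the hardware's fixed precision are not modelled, and the function names are illustrative.

```python
def plane_equation(p0, p1, p2):
    """Fit value = A*x + B*y + C through three (x, y, value) points."""
    (x0, y0, v0), (x1, y1, v1), (x2, y2, v2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    a = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det
    b = ((x1 - x0) * (v2 - v0) - (x2 - x0) * (v1 - v0)) / det
    c = v0 - a * x0 - b * y0
    return a, b, c


def iterate(plane, x, y):
    """Evaluate the attribute at a fragment position."""
    a, b, c = plane
    return a * x + b * y + c


# One colour channel that is 0 at the first vertex and 1 at the other two.
plane = plane_equation((0, 0, 0.0), (4, 0, 1.0), (0, 4, 1.0))
value = iterate(plane, 2, 2)
```

Setting the plane up once per primitive means each fragment costs only two multiplies and two adds per attribute component.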
The TITR – TAG Iterator – (blocks) generate per-fragment texcoord data for visible primitives. These values are computed first at 32-bit precision using fragment X, Y positions from the PDS, with vertex texture coordinate attributes fetched by the TSP Parameter Fetch block and the plane equations generated by the TSP FPU, but are subsequently quantized to 24-bit precision before perspective correction takes place.
These per fragment texcoords are then forwarded to the TAG blocks to allow texture data to be pre-fetched.
There are two TITR blocks within a core; each shared by two USSE pipes.
The TAG – Texture Address Generator – (blocks) receive texture requests, in the form of texture coordinates, state, and LOD information; these inputs are then used to generate one or more addresses for texture look-ups and coefficients for filtering.
The TAG blocks receive texture requests from two sources:
- From the iterator TITR blocks, when texture coordinates are not modified by the fragment program and so the texture data can be pre-fetched early (known as non-dependent texture reads).
- From the USSE pipes, when texture coordinates are modified by the fragment program and so the texture data cannot be pre-fetched early (known as dependent texture reads).
Vertex texture requests always originate from the USSE pipes and so are always dependent texture reads.
The texture addresses generated by the TAG blocks are forwarded to the Texture Cache Unit (TCU) so the texel data can be read.
There are two TAG blocks within a core; each shared by two USSE pipes.
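As an illustration of turning a texture coordinate into lookups and filter coefficients, the sketch below computes the four texel positions and weights for bilinear filtering. Clamp addressing, normalized coordinates and the half-texel centre convention are assumptions; mip level selection and memory layout are ignored.

```python
import math


def bilinear_taps(u, v, width, height):
    """Return [((tx, ty), weight), ...] for a bilinear lookup at (u, v)."""
    # Move to texel space, with texel centres at half-integer positions.
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def clamp(i, n):
        return max(0, min(n - 1, i))

    taps = []
    for dy, wy in ((0, 1 - fy), (1, fy)):
        for dx, wx in ((0, 1 - fx), (1, fx)):
            texel = (clamp(x0 + dx, width), clamp(y0 + dy, height))
            taps.append((texel, wx * wy))
    return taps


# Sampling the exact centre of a 2x2 texture weights all four texels equally.
taps = bilinear_taps(0.5, 0.5, width=2, height=2)
```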
The TF – Texture Filter – (blocks) are responsible for texture filtering. They receive texel data read by the Texture Cache Unit (TCU), and texture state and filter coefficients from the TAG. These inputs are then used to expand the texel format and filter the texture lookup. sRGB to linear gamma correction is also applied if required. The results are finally forwarded to the USSE pipes for use in fragment (and vertex) programs; they are stored in input or temporary registers.
There are two TF blocks within a core; each shared by two USSE pipes.
The PBE – Pixel Back End – (block) is responsible for the final stage of fragment processing; it receives completely rendered tiles from the USSE pipes, applying numerous conversion operations to fragments before writing the final tile pixel data to memory.
These conversion operations take place in the PBE on-chip color buffer.
PBE conversion operations include: linear to sRGB gamma correction, downscaling for x2 and x4 MSAA resolve and pixel packing format conversion (including dithering).
The PBE is also responsible for address translation, such that color surfaces can be written to memory in linear, tiled, or swizzled layouts.
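Some of the conversion operations listed above can be sketched in software. The sRGB encode uses the standard piecewise transfer function; whether the hardware uses the exact curve or an approximation, and the order in which it resolves and gamma-corrects, are not stated in the text, so the order shown is an assumption. Dithering is omitted.

```python
def resolve_msaa(samples):
    """Box-filter resolve: average the samples of one pixel (2 or 4 here)."""
    return sum(samples) / len(samples)


def linear_to_srgb(c):
    """Encode a linear value in [0, 1] with the standard sRGB curve."""
    c = min(max(c, 0.0), 1.0)
    if c <= 0.0031308:
        return 12.92 * c
    return 1.055 * c ** (1 / 2.4) - 0.055


def pack_unorm8(c):
    """Quantize a value in [0, 1] to an 8-bit channel."""
    return int(round(min(max(c, 0.0), 1.0) * 255))


# One channel of a 4x MSAA pixel on a black/white edge.
resolved = resolve_msaa([0.0, 0.0, 1.0, 1.0])
encoded = linear_to_srgb(resolved)
stored = pack_unorm8(encoded)
```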
The DCU – Data Cache Unit – is a multi-level data cache providing DMA translation and memory caching for the four USSE pipes and PDS.
The PDS can instruct the DCU to fetch data from memory and write it directly back to the USSE pipes.
The USSE pipes may also directly instruct the DCU to read memory, as well as write data to the DCU and/or memory.
The levels of the DCU are as follows:
- L0 – DMA
  - Provides DMA translation; converting burst memory requests into cache line requests and burst write back to the USSE pipes
  - One L0 DMA is shared per two USSE pipes + the PDS
  - There are two L0 DMA units within the DCU
- L1 – Data cache
  - 512B size (16 lines x 256-bit)
  - Pseudo-LRU replacement policy
  - One L1 data cache per L0 DMA
  - There are two L1 data caches within the DCU
- L2 – Data cache
  - 1K size (32 lines x 256-bit)
  - Pseudo-LRU replacement policy
  - Serves all four USSE pipes + the PDS
  - There is one DCU L2 data cache within a core
Any memory not cached by the DCU is fetched from external memory via the BIF.
The TCU – Texture Cache Unit – is a dedicated data cache, responsible for reducing the memory bandwidth and latency of texture lookup operations.
The TCU has two levels:
- L1 – Data cache
  - 512B size (16 lines x 256-bit)
  - 4-way set associative, pseudo-LRU replacement policy
  - One L1 data cache per two USSE pipes + one TAG unit
  - There are two L1 data caches within the TCU
- L2 – Data cache
  - 8K size (256 lines x 256-bit)
  - 4-way set associative, pseudo-LRU replacement policy
  - Serves all four USSE pipes + the two TAG units
  - There is one TCU L2 data cache within a core
Any memory not cached by the TCU is fetched from external memory via the BIF.
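The TCU cache sizes quoted above follow directly from the line counts, and the associativity gives the number of sets. The helper below is just this arithmetic; its name and return format are illustrative.

```python
def cache_geometry(lines, line_bits, ways):
    """Derive size and set count from line count, line width and ways."""
    line_bytes = line_bits // 8
    return {
        "line_bytes": line_bytes,
        "size_bytes": lines * line_bytes,
        "sets": lines // ways,
    }


tcu_l1 = cache_geometry(lines=16, line_bits=256, ways=4)
tcu_l2 = cache_geometry(lines=256, line_bits=256, ways=4)
print(tcu_l1)   # {'line_bytes': 32, 'size_bytes': 512, 'sets': 4}
print(tcu_l2)   # {'line_bytes': 32, 'size_bytes': 8192, 'sets': 64}
```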
The BIF – Bus Interface – is the interface through which the core accesses external memory in the PlayStation®Vita SoC, specifically, main memory and video memory.