Going way back: a quick look at ATI's R520
To understand some of the possible motivations that led ATI to develop a VLIW chip, we must first take a quick look at their previous chip architecture: R520.
Source: ATI (now AMD)
This chip sat somewhere “in between” a pure DX9 chip (e.g. R300/R420) and a DX10 chip such as R600: for example, while it has a central dispatch processor and a decoupled arrangement of units, it still has classical, separate vertex and pixel shader hardware.
Look at the four quad pixel shader cores: each of them is a 4D SIMD unit, physically implemented as a 3-way vector unit plus a single (1D) scalar unit. This can be clearly seen in the picture below:
Source: ATI (now AMD)
As SIMD units (and especially vector units) have a high compute-to-size ratio, this arrangement gives the shader core quite good compute power for a limited transistor count and die size. Paradoxically, this was not so evident with R520: the control logic, texture/ROP units, etc. required a high transistor count and a lot of die area, leaving the shader core less space than would have been ideal.
However, with the transition to R580, things became much clearer: while this chip tripled the number of pixel shaders, sporting a total of 48 PSUs, transistor count and die area grew only modestly (it weighed in at 384M transistors vs R520's 320M, roughly 20% more).
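The scaling argument above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted in the text. The R520 pixel shader count is inferred from "tripled", and the per-shader cost is a rough upper bound, under the assumption (not confirmed by the source) that all the extra transistors went to the added shaders.

```python
# Figures quoted in the text (millions of transistors, pixel shader units).
R520_TRANSISTORS_M = 320
R580_TRANSISTORS_M = 384
R580_PIXEL_SHADERS = 48
# Assumption: "tripled" means R520 had one third of R580's pixel shaders.
R520_PIXEL_SHADERS = R580_PIXEL_SHADERS // 3


def growth_ratio(old, new):
    """Return how many times larger `new` is compared to `old`."""
    return new / old


def marginal_cost_per_shader_m():
    """Rough upper bound on transistors (in millions) per added pixel shader.

    Assumes every additional transistor in R580 was spent on the extra
    shaders, which is a simplification made only for this sketch.
    """
    extra_transistors = R580_TRANSISTORS_M - R520_TRANSISTORS_M
    extra_shaders = R580_PIXEL_SHADERS - R520_PIXEL_SHADERS
    return extra_transistors / extra_shaders


shader_growth = growth_ratio(R520_PIXEL_SHADERS, R580_PIXEL_SHADERS)
transistor_growth = growth_ratio(R520_TRANSISTORS_M, R580_TRANSISTORS_M)

if __name__ == "__main__":
    print(f"pixel shaders: x{shader_growth:.1f}")
    print(f"transistors:   x{transistor_growth:.2f}")
    print(f"<= {marginal_cost_per_shader_m():.1f}M transistors per added shader")
```

In other words, three times the pixel shading hardware came at the price of about one fifth more transistors, which is what makes the vector-based design look so area-efficient.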
The downside of this vector-based approach is the relative lack of flexibility, from two points of view:
- a vector unit executes only one instruction type (e.g. addition) across all vector components. If you have a vector which needs two different kinds of operations (e.g. addition for the first two components, subtraction for the other two) you are out of luck. In fact, the added scalar unit serves to mitigate this very issue: very often the pixel's alpha channel runs different instructions than the three color channels, so the scalar unit is a very welcome enhancement.
- to obtain maximum performance you need to supply the vector unit with a sufficiently wide data vector: if you supply a 3-way vector unit with a 2-wide data vector, you will obtain only two of the theoretical three FLOPs.
Fortunately, when dealing with DX9 graphics these two problems are strongly mitigated: first, DX9-class shaders are rarely very complex and, second, as each pixel generally has 4 data channels (red, green, blue and alpha), the vector+scalar unit can often run at full speed.
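The two limitations listed above, and the reason RGBA pixels avoid them, can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of R520's real scheduler. The function names, the "one opcode per issue" rule and the opcode strings are all assumptions made for illustration.

```python
VECTOR_LANES = 3  # the 3-way vector unit described in the text
SCALAR_LANES = 1  # the single (1D) scalar unit


def vector_utilization(data_width):
    """Fraction of vector lanes doing useful work for one instruction.

    `data_width` is the number of components in the operand (1 to 3).
    """
    if not 1 <= data_width <= VECTOR_LANES:
        raise ValueError("data_width must be between 1 and 3")
    return data_width / VECTOR_LANES


def issues_needed(channel_ops, has_scalar_unit=True):
    """Count issue slots needed for one (r, g, b, a) tuple of opcodes.

    Toy rule: a vector unit can apply only one opcode per issue, so it
    needs one issue per distinct opcode among the channels it handles.
    With a scalar unit, alpha runs in parallel and costs no extra issue.
    Without one, we model a hypothetical pure 4-wide vector unit.
    """
    r, g, b, a = channel_ops
    if has_scalar_unit:
        return len({r, g, b})
    return len({r, g, b, a})


if __name__ == "__main__":
    # Second limitation: a 2-wide operand leaves one vector lane idle.
    print(vector_utilization(2))
    # First limitation: alpha needs a different opcode than the colors.
    typical_pixel = ("mul", "mul", "mul", "add")
    print(issues_needed(typical_pixel, has_scalar_unit=True))
    print(issues_needed(typical_pixel, has_scalar_unit=False))
```

Under this model the common DX9 case (one operation on the three color channels, a different one on alpha) fits in a single issue on the vector+scalar arrangement, while a mix such as add/add/sub/sub still costs two issues either way.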