It's time for R600 – the VLIW chip

As pixel shader complexity inevitably increased, there were cases where R580's inflexible execution unit setup led to lower shader core utilization and, in turn, to lower performance. To solve this problem, and to have a chip that could better handle DX10-class shaders (with their more complex capabilities), ATI decided to develop a unified shader architecture that, while compact in size, removed the restrictions on issuing different instructions for different pixel components: say hello to R600.

R600 architecture

Source: ATI (now AMD)

Can you see the light-red area behind the “streaming processing units” banner? That is R600's shader core. It is composed of 4 independent SIMD engines, each 16 ways wide. To be clearer: each of the light-red columns above is a 16-way SIMD array. Notice the small yellow dots inside each SIMD lane? These are the new R600 execution units: the vector+scalar units are gone, giving way to five scalar execution units organized in a VLIW fashion.

This image depicts the VLIW unit layout:

R600 shader unit

Source: ATI (now AMD)

So, each SIMD array has a total of 16 x 5 = 80 scalar execution units at its disposal.
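The unit count above can be checked with a few lines of arithmetic. This is only a sketch built from the figures quoted in the text (4 SIMD engines, 16 lanes each, 5 scalar units per lane); the variable names are illustrative and not taken from any ATI documentation.

```python
# Figures quoted in the article; the names are illustrative.
SIMD_ENGINES = 4            # independent SIMD engines in the shader core
LANES_PER_SIMD = 16         # each SIMD engine is 16 ways wide
SCALAR_UNITS_PER_LANE = 5   # five scalar units per VLIW unit

units_per_simd = LANES_PER_SIMD * SCALAR_UNITS_PER_LANE
total_units = SIMD_ENGINES * units_per_simd

print(units_per_simd)  # 80
print(total_units)     # 320
```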

This new arrangement solves problem n.1 (as described above): the VLIW paradigm makes it possible to mix different instructions inside the single large instruction that is forwarded to the five execution units. To show this with a picture:

Shader progression

Source: ATI (now AMD)

As you can see, while R520/580 class hardware (the central row) had serious restrictions on instruction issue (only 2 different instructions were possible per clock), the VLIW paradigm enables R600 to issue up to 5 different instructions on the 5 different components. Moreover, as the single instructions inside a VLIW word are statically scheduled, the ALUs require less control logic: a 5-way VLIW setup is considerably more compact than a normal implementation with 5 independent ALUs. So, you can combine compute power and flexibility with a quite compact execution core. For example, consider the RV770 GPU: while it has 2.5X the shader / texture performance, it has roughly 956M transistors, a moderate growth from the ~700M transistors that powered the original R600 chip.
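To put the RV770 comparison in numbers, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above and treats the "~700M" R600 transistor count as exact, which is an assumption; the result is indicative at best.

```python
# Figures quoted in the article; "~700M" is approximate, so treat the
# results as rough estimates rather than measured values.
r600_transistors = 700e6
rv770_transistors = 956e6
perf_ratio = 2.5  # RV770 shader / texture performance relative to R600

transistor_ratio = rv770_transistors / r600_transistors
perf_per_transistor_gain = perf_ratio / transistor_ratio

print(f"{transistor_ratio:.2f}")          # 1.37
print(f"{perf_per_transistor_gain:.2f}")  # 1.83
```

In other words, under these figures roughly 1.4x the transistors delivered 2.5x the performance.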

Unfortunately, the VLIW setup is less successful against problem n.2: the need to supply the GPU with a large data unit that can effectively fill up the various resources. While the R600 setup can somewhat relax this issue (as it is simpler to operate on multiple data sets when you can issue a different instruction for each data item), if you have a data dependency, only some (maybe only one!) of the five execution units can do useful work. Moreover, the need to issue independent instructions places a higher burden on the compiler: while the shader core becomes more flexible, in order to fully utilize it ATI had to develop a very smart compiler, capable of extracting a good amount of parallelism from the instruction and data flows.
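The effect of data dependencies on slot utilization can be illustrated with a toy model. The sketch below is not ATI's compiler and does not model real R600 instruction encoding: `pack_vliw` and `utilization` are hypothetical helpers that greedily pack scalar operations into 5-wide bundles, allowing an operation into a bundle only when everything it depends on was issued in an earlier bundle.

```python
from typing import Dict, List, Set

SLOTS = 5  # scalar execution units per VLIW unit, as described in the text


def pack_vliw(ops: List[str], deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedily pack uniquely named ops (in program order) into 5-wide bundles.

    An op may join a bundle only if every op it depends on has already
    been issued in an earlier bundle.
    """
    done: Set[str] = set()
    pending = list(ops)
    bundles: List[List[str]] = []
    while pending:
        # Ops whose dependencies are all satisfied, capped at the slot count.
        bundle = [op for op in pending if deps.get(op, set()) <= done][:SLOTS]
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        bundles.append(bundle)
        done.update(bundle)
        pending = [op for op in pending if op not in done]
    return bundles


def utilization(bundles: List[List[str]]) -> float:
    """Fraction of the available VLIW slots that hold useful work."""
    return sum(len(b) for b in bundles) / (SLOTS * len(bundles))


ops = ["a", "b", "c", "d", "e"]

# Five independent ops fill one bundle completely.
independent = pack_vliw(ops, {})
print(len(independent), utilization(independent))  # 1 1.0

# A serial chain (each op needs the previous result) uses one slot per bundle.
chain = pack_vliw(ops, {"b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}})
print(len(chain), utilization(chain))  # 5 0.2
```

In this toy model the serial chain keeps only one of the five units busy per bundle, which is the "maybe only one!" case described above; a smarter scheduler can only help when independent work exists to fill the remaining slots.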

So, in a nutshell:

  • R600's design was more flexible than R520's
  • R600's design, while not as compact as R520's, provides excellent scalability and very good compute density
  • the added flexibility meant that, to fully utilize the execution resources, compiler technology required some notable development
  • until the compiler caught up with the hardware capabilities, R600 execution core utilization was lower than optimal, but no worse than R520's.