Graphics performance considerations
The fact that dependent instructions cannot be executed optimally by a VLIW core is often cited as R600's greatest problem. Some sites go as far as stating that often only one ALU per VLIW unit (1 out of 5) would be utilized.
However, this is not true: as we noted before, shaders often operate concurrently on the different channels of each pixel, so the common utilization scenario cannot be that bad. Some very respected sites such as Anandtech and Beyond3D cite a mean utilization of between 3 and 4 ALUs per VLIW unit in games, resulting in average-to-good resource usage (as a note, the latest R600 descendant, the Cayman GPU, has a 4-way VLIW unit).
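To make the utilization argument concrete, here is a minimal sketch of how dependencies limit the filling of a 5-slot VLIW bundle. The `pack_vliw` and `utilization` functions, the `(dest, sources)` instruction format, and the greedy in-order packing are illustrative assumptions; they are not meant to describe the real R600 shader compiler.

```python
def pack_vliw(instrs, width=5):
    """Greedily pack instructions into VLIW bundles of `width` slots.

    `instrs` is a list of (dest, sources) tuples in program order.
    An instruction may join the current bundle only if none of its
    sources is written by an instruction already in that bundle.
    """
    bundles = []
    current, written = [], set()
    for dest, sources in instrs:
        depends = any(s in written for s in sources)
        if depends or len(current) == width:
            # Close the current bundle and start a new one.
            bundles.append(current)
            current, written = [], set()
        current.append(dest)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


def utilization(bundles):
    """Mean number of occupied slots per bundle."""
    if not bundles:
        return 0.0
    return sum(len(b) for b in bundles) / len(bundles)


# Worst case: a fully dependent chain, each result feeds the next op.
chain = [("t0", []), ("t1", ["t0"]), ("t2", ["t1"]), ("t3", ["t2"])]

# Pixel-style work: four independent channel ops plus one scalar op.
pixel = [("r", ["in_r"]), ("g", ["in_g"]), ("b", ["in_b"]),
         ("a", ["in_a"]), ("s", ["in_s"])]

print(utilization(pack_vliw(chain)))  # one slot used per bundle
print(utilization(pack_vliw(pixel)))  # all five slots used
```

Real shaders fall somewhere between these two extremes, which is consistent with the mean of 3 to 4 busy ALUs quoted above.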
Moreover, independent tests show that 2007-era games were hardly bound by shader processing power. For example, consider Techreport's Nvidia 9600 GT review: if you dig into the graphs, you will notice that, while the G94-powered Nvidia GeForce 9600 GT has only ~60% of the theoretical shader performance of the G92-based GeForce 8800 GT, it is only marginally (0-15%) slower. The lesson is clear: at least with antialiasing turned on, the performance bottleneck was not shader power but rather fillrate and memory bandwidth.
So, why was R600 slower than its competitors? Basically, for three reasons:
- the low absolute number of TMUs, resulting in a low texel rate
- the badly implemented (if not broken) antialiasing resolve algorithm, which inefficiently uses the main shader core for pixel resolving rather than a dedicated hardware unit inside the RBEs (or ROPs)
- it was built on TSMC's very leakage-prone 80HS (80nm) process, resulting in lower clock speeds and higher power consumption
While the first and third points are easy to understand (R600 had only 16 TMUs, while competitors had 24 or 32 of them; as for power, it consumes more than an 8800 GTX while delivering lower performance), the second one needs some clarification. Anti-aliasing (AA for short) smooths jaggies by using several subpixels to compute the final, real pixel color; to obtain this result, some kind of filter that selects and blends the subpixels must be used. The usual filter is the “box” resolve filter:
Source: ATI (now AMD)
However, some chips (such as R600) permit the use of custom AA filters, for example the “narrow tent” one:
Source: ATI (now AMD)
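The difference between the two filters can be sketched in a few lines. This is only an illustration under stated assumptions: the `box_resolve` and `tent_resolve` functions, the grayscale sample values, the sample distances, and the linear weight formula are all hypothetical and do not reproduce ATI's actual filter kernels.

```python
def box_resolve(samples):
    """Box filter: every subsample of the pixel gets the same weight."""
    return sum(samples) / len(samples)


def tent_resolve(samples, radius):
    """Tent filter: weight falls off linearly with distance from the center.

    `samples` is a list of (distance, value) pairs, where `distance` is
    measured from the pixel center. Samples at or beyond `radius`
    contribute nothing.
    """
    weights = [max(0.0, 1.0 - d / radius) for d, _ in samples]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no sample falls inside the filter radius")
    weighted = sum(w * v for w, (_, v) in zip(weights, samples))
    return weighted / total


# Four grayscale subsamples on an edge: two covered (1.0), two not (0.0).
values = [1.0, 1.0, 0.0, 0.0]
print(box_resolve(values))

# Same values, but the covered samples sit closer to the pixel center.
placed = [(0.25, 1.0), (0.25, 1.0), (0.75, 0.0), (0.75, 0.0)]
print(tent_resolve(placed, radius=1.0))
```

The box case is a plain average, which is why it maps so well to fixed-function hardware; the tent case needs per-sample weights, which is the kind of work a programmable shader core handles naturally.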
While it is fine to run custom filters on the shader core (as custom filters cannot be implemented in fixed, non-programmable hardware), the usual box filter case really should run on faster, dedicated hardware. Unfortunately, both R600 and its direct descendant, RV670, use the shader core for the box filter as well. This leads to lower performance and, perhaps more seriously, to shader core congestion (the resolve process is quite intensive).
Please note that at 1600x1200 resolution with AA turned off, the Radeon 2900 XT is only ~12% slower than an Nvidia 8800 GTX; however, as you turn on 4X AA, the gap grows to over 30%. In the same manner, at 1600x1200 resolution the 2900 XT was over 46% faster than the previous 1950 XTX, but with 4X AA its advantage shrinks to a mere 23%. Newer ATI drivers, by optimizing the shader-based AA resolve code, somewhat improved this situation, but were never able to match the speed of ROP-based box AA resolving. Computerbase.de's 2900XT review clearly shows R600's weakness with AA.
There were conflicting reports on why ATI used shader-based box AA resolving instead of the common, ROP-based method: some sites claimed that this was a conscious choice, while others wrote that it was the result of a flaw, or bug, in the RBE/ROP design.
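The percentages above can be turned into a single figure for how much extra ground R600 loses when AA is enabled. The sketch below is plain arithmetic on the rounded numbers quoted in the text (treating "over 30%" and "over 46%" as exactly 30% and 46%); the `relative_aa_retention` helper is a hypothetical name introduced only for this illustration.

```python
def relative_aa_retention(ratio_no_aa, ratio_with_aa):
    """Fraction of its relative standing a card keeps once AA is enabled.

    Both arguments are performance ratios against the same reference
    card (1.0 means equal performance).
    """
    return ratio_with_aa / ratio_no_aa


# 2900 XT vs 8800 GTX: ~12% slower without AA, ~30% slower with 4X AA.
vs_gtx = relative_aa_retention(1 - 0.12, 1 - 0.30)

# 2900 XT vs 1950 XTX: ~46% faster without AA, ~23% faster with 4X AA.
vs_xtx = relative_aa_retention(1 + 0.46, 1 + 0.23)

print(f"vs 8800 GTX: {vs_gtx:.2f}")
print(f"vs 1950 XTX: {vs_xtx:.2f}")
```

Both ratios come out well below 1 (roughly 0.80 and 0.84), meaning that turning on 4X AA costs the 2900 XT noticeably more than it costs either the competing Nvidia card or ATI's own previous generation.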
RV770 was the first R600-derived chip to correct this situation, resulting in vastly higher AA performance.