Arithmetic logic units

Bulldozer's unique approach enabled AMD to sell a processor with 8 somewhat streamlined integer cores. A Bulldozer module includes 2 integer cores, each with 2 ALUs and 2 AGUs. Compared to the previous K10 design, each integer core loses an ALU/AGU pair, but AMD correctly stated that a single instruction stream rarely needs 3 ALUs/AGUs. Comparing Bulldozer with Sandy Bridge is interesting: Intel's processor has very powerful OoO logic that drives 3 ALUs and 2 AGUs per core, yet Hyper-Threading still extracts significantly more performance from a single core, indicating that Intel's cores can also be under-utilized in normal workloads.

How do the integer cores and caches add up? The graph below shows the maximum theoretical performance in both compute-intensive and L1-write-intensive patterns:

Bulldozer integer bandwidth

From a compute standpoint, Bulldozer's cores are very well balanced: each has a maximum throughput equal to 66% of a K10 core's, but remember that 1) AMD was able to integrate 2 integer cores per module and 2) the third ALU/AGU pair is rarely used. This means that a Bulldozer module can provide noticeably better results (in the range of 50-100%) than a Phenom core.
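The peak-throughput reasoning above can be sketched as a back-of-the-envelope calculation. It counts only the ALUs per core given earlier (2 for Bulldozer, 3 for K10) and assumes every ALU retires one operation per cycle; the helper name and the "third ALU never used" case are illustrative assumptions, not measured values.

```python
from fractions import Fraction

# ALU counts per integer core, as given in the text.
K10_ALUS_PER_CORE = 3
BD_ALUS_PER_CORE = 2
BD_CORES_PER_MODULE = 2


def peak_ratio(alus_a: int, alus_b: int) -> Fraction:
    """Peak ALU throughput of A relative to B, assuming 1 op per ALU per cycle."""
    return Fraction(alus_a, alus_b)


# One Bulldozer core vs one K10 core: 2/3, i.e. about 66%.
core_vs_core = peak_ratio(BD_ALUS_PER_CORE, K10_ALUS_PER_CORE)

# One Bulldozer module (2 cores) vs one K10 core at full peak: 4/3.
module_alus = BD_ALUS_PER_CORE * BD_CORES_PER_MODULE
module_vs_core_peak = peak_ratio(module_alus, K10_ALUS_PER_CORE)

# Assumption: K10's third ALU is never used, so only 2 of its ALUs count.
module_vs_core_two_alus = peak_ratio(module_alus, 2)

print(float(core_vs_core), float(module_vs_core_peak), float(module_vs_core_two_alus))
```

Under these assumptions a module lands between 1.33x (K10 keeps all three ALUs busy) and 2x (K10's third ALU is never used) the peak of a K10 core, and the 50-100% range quoted above sits inside that window.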

However, when we factor in the reduced L1 write bandwidth, things become a little worse: while 32-bit operations should perform without problems, 64-bit operations can be bandwidth-starved. In write-intensive operations, a single Bulldozer core will be significantly slower than a K10 core, needing an entire Bulldozer module to match K10 performance. On the other hand, you can argue that, as loads are generally far more frequent than stores, Bulldozer should perform adequately in all cases. Also remember the WCC, which can improve things significantly for AMD's newest design.

OK, it's time to see some numbers. To look at ALU performance we can use the SiSoftware Sandra 2011 SP5 Dhrystone test. This benchmark is rather small, so it should fit entirely in the L1 cache. The source review compared the newly released FX-8150 with other chips, disabling any turbo mode on processors that support it.

Bulldozer Sandra Dhrystone

First of all, a little disclaimer: don't be fooled by the low per-thread SB + HT results. Hyper-Threading enables one processor core to execute two threads concurrently, resulting in higher aggregate throughput at the expense of single-thread latency.
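To make the disclaimer concrete, here is a small sketch with made-up normalized scores (not benchmark data): a hypothetical 4-core chip scoring 100 in aggregate without HT, and the same chip gaining an illustrative 35% in aggregate once HT doubles the thread count.

```python
def per_thread(aggregate_score: float, threads: int) -> float:
    """Average score per hardware thread."""
    return aggregate_score / threads


# Hypothetical, normalized numbers chosen only for illustration.
no_ht = per_thread(100.0, threads=4)    # 4 cores, 1 thread each
with_ht = per_thread(135.0, threads=8)  # same 4 cores, 2 threads each

print(no_ht, with_ht)  # 25.0 16.875
```

The per-thread figure drops even though the chip as a whole gets more work done, which is why per-thread bars for HT-enabled parts look worse than they are.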

In this test, we see that Bulldozer fares very well: at the chip level, it is about 30% faster than a Phenom II X6 running at the same clock speed. Remember that the Phenom II X6 uses six fat, full-fledged cores: comparing one of these cores to an entire Bulldozer module, we see that the latter is about 2x faster than the former. If you factor in that Bulldozer is expected to run at higher clocks than previous processors, its performance is nothing short of excellent. Here AMD's strategy of doubling the integer execution cores pays off very well.
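The per-module figure follows from the chip-level one. A quick check, assuming the FX-8150's four modules (eight integer cores) against the Phenom II X6's six cores, and taking the roughly 30% chip-level lead at face value; the helper name is illustrative.

```python
def per_unit_ratio(chip_ratio: float, units_a: int, units_b: int) -> float:
    """Per-unit speed of chip A relative to chip B, given the chip-level ratio."""
    return chip_ratio * units_b / units_a


# 1.30x at the chip level, 4 Bulldozer modules vs 6 Phenom II cores.
module_vs_core = per_unit_ratio(1.30, units_a=4, units_b=6)

# Same chip-level ratio, but counting 8 Bulldozer integer cores instead.
int_core_vs_core = per_unit_ratio(1.30, units_a=8, units_b=6)

print(round(module_vs_core, 2), round(int_core_vs_core, 3))
```

By this arithmetic a module comes out at about 1.95x a Phenom II core, while a single Bulldozer integer core sits just below parity with one, which is consistent with the idea that the gain comes from doubling the integer cores rather than from faster individual cores.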

Comparing Bulldozer to Sandy Bridge, the situation is a little different: while the new AMD architecture seems to hold its ground against non-HT processors, it has some difficulties against HT-enabled ones. In Dhrystone, HT seems to give a disproportionate speedup, boosting aggregate performance by about 35%. The end result is that, at the same clock speed, Sandy Bridge seems to have a 20% advantage over Bulldozer. Sure, Bulldozer was expected to reach very high clock speeds; however, the reality is that current AMD processors don't run at 20% higher clocks than Intel processors.
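The clock argument can be written out as a rough break-even estimate. This sketch assumes performance scales linearly with clock speed, which is optimistic, and uses 3.4 GHz purely as an illustrative rival clock; the function name is hypothetical.

```python
def break_even_clock(rival_clock_ghz: float, rival_advantage: float) -> float:
    """Clock needed to match a rival that is `rival_advantage` faster per clock.

    Assumes performance scales linearly with frequency.
    """
    return rival_clock_ghz * (1.0 + rival_advantage)


# Illustrative rival clock of 3.4 GHz, 20% same-clock advantage from the text.
needed = break_even_clock(3.4, 0.20)
print(f"{needed:.2f} GHz")  # 4.08 GHz
```

Under that assumption, closing a 20% same-clock gap means running about 0.7 GHz faster than a 3.4 GHz rival, a margin the shipping parts do not have.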

Nevertheless, we can say that the FX-8150 was a definite improvement over the previous Phenom II processors in the ALU department. So ALU performance is going to be a strong point of the Bulldozer architecture, at least compared to AMD's previous products.