A premise: why did AMD decide against a 32nm K10-based Phenom II X8 / X10 processor?

A recurring question on forums is why AMD went with the new Bulldozer architecture even though it underperforms. Many think that AMD would have been better off sticking with the aging but trusted K10 architecture and producing an eight- or ten-core Phenom II processor. So why didn't AMD choose this approach? The reason is that AMD understands that it cannot compete with Intel in single-thread performance right now, so it has to change the rules of the game: instead of focusing on single-thread performance, it prioritizes aggregate multi-thread performance by giving its processors a core-count advantage. However, this has to be done without excessively enlarging the die, or its revenues will suffer in the long run. So it must use a micro-architecture that permits an increase in core count without an excessive increase in die size. This cannot be done with the old K10 architecture.

For example, consider the Phenom II X4 vs X6 dies: the former integrates 4 cores, each with 512 KB of L2 cache, and 6 MB of shared L3 cache in an area of 258 mm2, while the latter has 6 of the same cores and the same 6 MB L3 cache in an area of 346 mm2.

A Phenom II X6 (Thuban) die

As a 45nm K10 core plus its L2 cache weighs in at about 16+6=22 mm2, you may wonder why the increase in die size was on the order of 88 mm2 rather than 22x2=44 mm2. The point is that, with this classical approach, increasing the core count leads to a disproportionate increase in die size because of all the wiring (e.g. additional crossbar switch lanes, power lanes, etc.) and padding needed to accommodate the new cores. Moreover, this wiring and padding rarely scale down nicely when moving to finer manufacturing processes. To make things even worse, logic itself does not scale linearly with new manufacturing processes. Take the 32nm K10.5-based Llano processors: one of their cores weighs in at about 11 mm2, versus 16 mm2 on the old 45nm process. As you can see, the scaling from the 45nm to the 32nm core, while good, was not linear (a full 2X shrink would have yielded a core of about 8 mm2).
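The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch built from the figures quoted in the text; the function names are made up for illustration, and the "ideal" shrink assumes that area scales with the square of the process node, which is a simplification.

```python
# Die-size figures quoted in the text (mm^2, 45nm process).
X4_DIE, X6_DIE = 258, 346
CORE, L2 = 16, 6  # one K10 core and its 512 KB L2 cache


def overhead_per_added_core(small_die, large_die, added_cores, core_plus_l2):
    """Area per added core that the core and its L2 alone do not explain."""
    growth = large_die - small_die
    return growth / added_cores - core_plus_l2


def ideal_shrink(area, old_nm, new_nm):
    """Area after an idealized shrink (assumption: area ~ node size squared)."""
    return area * (new_nm / old_nm) ** 2


growth = X6_DIE - X4_DIE                      # total die growth from X4 to X6
overhead = overhead_per_added_core(X4_DIE, X6_DIE, 2, CORE + L2)
ideal_core_32nm = ideal_shrink(CORE, 45, 32)  # what a perfect shrink would give
actual_ratio = 11 / CORE                      # Llano core vs 45nm K10 core

print(f"die growth: {growth} mm2, overhead per added core: {overhead:.1f} mm2")
print(f"ideal 32nm core: {ideal_core_32nm:.2f} mm2, actual ratio: {actual_ratio:.2f}")
```

In other words, each added core costs roughly twice its own footprint, and the Llano core shrank to about 69% of its 45nm area instead of the roughly 51% an idealized shrink would give.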

Caches, on the other hand, scale very well, almost at the linear 2X ratio. With these factors in mind, say hello to Bulldozer:

A Bulldozer Orochi die

A Bulldozer module (there are four of them in an FX-8150 processor) includes 2 integer cores, a powerful multi-pipeline 128-bit FPU and 2 MB of L2 cache, yet it weighs in at only 30.9 mm2. The shared L2 cache serves as a connector between the two cores, while the four modules communicate via the 4x 2 MB L3 cache slices and a crossbar switch connecting them all.

With this design, AMD was able to pack eight integer cores, four powerful FPUs and 16 MB of L2 / L3 cache into an area of 315 mm2 on a 32nm process. In other words, compared to a Phenom II X6 die, it significantly increases compute performance while reducing die size. If AMD had stuck with the old design, the best feasible processor within this area limit would probably have been a Phenom II X8, while an X10 processor would have been significantly larger.
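To put the Bulldozer numbers in perspective, here is a rough area budget computed only from the figures quoted in the text (30.9 mm2 per module, 315 mm2 for the whole die, 346 mm2 for the Phenom II X6). The variable names are illustrative, and the "rest of the die" figure simply lumps together everything outside the four modules, such as the L3 slices and the crossbar.

```python
# Figures quoted in the text (mm^2).
MODULE_AREA = 30.9    # one Bulldozer module: 2 integer cores, FPU, 2 MB L2
MODULES = 4           # modules in an FX-8150
BULLDOZER_DIE = 315   # whole 32nm die
THUBAN_DIE = 346      # Phenom II X6 die at 45nm

modules_total = MODULE_AREA * MODULES           # area taken by the four modules
rest_of_die = BULLDOZER_DIE - modules_total     # L3, crossbar and everything else
module_share = modules_total / BULLDOZER_DIE    # fraction of the die in modules
area_per_int_core = modules_total / (MODULES * 2)  # module area per integer core
saved_vs_thuban = THUBAN_DIE - BULLDOZER_DIE    # die area saved vs Phenom II X6

print(f"modules: {modules_total:.1f} mm2 ({module_share:.0%} of the die)")
print(f"rest of die: {rest_of_die:.1f} mm2")
print(f"module area per integer core: {area_per_int_core:.2f} mm2")
print(f"smaller than Thuban by: {saved_vs_thuban} mm2")
```

Note that the per-core figure already includes each core's share of the FPU and the 2 MB L2 cache, so it is not directly comparable with the 11 mm2 Llano core, which excludes its L2.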

Obviously, there are reasons other than die size behind AMD's choice. For example, as a processor's thermal output does not shrink linearly with die size, a Phenom II X8 / X10 would probably be not only area constrained but also thermally constrained, resulting in low clock speeds (a problem that partially affects Bulldozer too, as we will soon find).

The point is that, to compete with the manufacturing and design powerhouse that is Intel, AMD correctly recognized that it had to develop a new architecture. But is Bulldozer adequate? At the moment, benchmark results don't look good for AMD: as stated above, Bulldozer is sometimes even slower than the previous Phenom II X6 processors. Why? To answer this question, we are going to examine the processor's various subsystems one by one, in this order: memory, caches, ALUs, FPUs and branch predictor. However, let's first talk a little about frequency and power output.