AMD Bulldozer dual-core module: breaking a very complex core into two simpler ones

On the previous page, we saw that Intel believes a single, complex core is often underutilized by a single execution thread. AMD agrees with this premise, but chose to raise core utilization not by increasing the number of threads per core, but by splitting the complex core into two simpler ones. The argument is straightforward: if a complex core is not used to its full extent, maybe a simpler core can be. Moreover, these simpler cores share as much silicon as possible, so the die size increase should stay within reasonable limits.

The net result is something like this:

Bulldozer module

As you can see, much of the transistor-hungry logic (e.g. the decode stage, FPU and L2 cache) is shared by the two simpler integer “cores”. The sharing is so pervasive that the very definition of a core becomes unclear here: some consider this approach a true “dual-core” one, while others prefer to define it as a “single core with dual integer execution clusters”. It is worth noting that the fetch and decode stages, as well as the FPU, are used by the two threads in a time-sharing manner, similar to Intel Hyper-Threading.
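To make the time-sharing idea a bit more concrete, here is a minimal toy sketch in Python. It assumes a strict round-robin policy over the active threads; this policy, like the function name `share_front_end` and the thread labels, is our own illustrative assumption and not AMD's documented arbitration scheme.

```python
from collections import Counter

def share_front_end(cycles, active_threads):
    """Give each cycle of a shared unit to one thread, round-robin over the active threads.

    Toy model only: real hardware arbitration is more elaborate than this.
    """
    if not active_threads:
        return []
    return [active_threads[c % len(active_threads)] for c in range(cycles)]

# Two active threads: each gets half of the shared unit's cycles.
print(Counter(share_front_end(8, ["T0", "T1"])))

# One active thread: it gets every cycle of the shared unit.
print(Counter(share_front_end(8, ["T0"])))
```

In this simplified picture, a lone thread sees the whole shared unit, while two threads each see half of its cycles.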

Please note also that, while simpler, these cores are not lightweight: each of them has 2 full AGUs and 2 full ALUs. However, they are noticeably simpler than previous AMD designs: for example, the two integer cores of a module share the same FPU silicon.

So, what is the impact on die area? Based on AMD slides, it appears to be quite low:

Bulldozer added silicon

A 5% chip-level increase in die size is not much, especially considering the projected 60-80% aggregate performance gain. This small die-size increase should not surprise us: as with Hyper-Threading, the module approach tries to share as many units as possible, replicating only some key structures such as the L1D cache and the integer schedulers and execution units.
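A quick back-of-the-envelope calculation shows why this trade looks attractive. The sketch below assumes that the 5% area increase and the 60-80% throughput gain are measured against the same baseline and apply to a well-threaded workload; the function name `perf_per_area_ratio` is ours, used only for illustration.

```python
def perf_per_area_ratio(perf_gain, area_increase):
    """Throughput per unit of die area, relative to a baseline of 1.0.

    Assumes both fractions are measured against the same baseline chip.
    """
    return (1.0 + perf_gain) / (1.0 + area_increase)

AREA_INCREASE = 0.05  # 5% chip-level die size increase (from the AMD slide)

for gain in (0.60, 0.80):  # projected aggregate performance gain range
    ratio = perf_per_area_ratio(gain, AREA_INCREASE)
    print(f"+{gain:.0%} throughput for +{AREA_INCREASE:.0%} area "
          f"-> {ratio:.2f}x throughput per area")
```

Under these assumptions, throughput per unit of die area improves by roughly 1.5x to 1.7x. Note that this says nothing about single-thread performance.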

When only a single thread is executed on the module, it has access to all shared resources (e.g. decoders, L2 cache, FPU) but it can use only one integer execution core. This should not be a problem in most cases, as typical programs rarely issue more than two instructions that can be executed concurrently. In some cases, however, the hard limit of 2 AGUs and 2 ALUs can hinder single-thread performance, as from an x86 standpoint it will often result in a 2-way wide core (compare this with the 3-way AMD K10 core and the 4-way, 3-way-integer Intel cores).
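The effect of the narrower integer core can be illustrated with a deliberately crude model. Here a program is reduced to a list of groups of mutually independent integer operations, and each group must finish before the next one starts. Both this model and the sample workloads are invented for illustration only; they are not measurements of any real program or CPU.

```python
import math

def cycles_needed(ready_groups, width):
    """Cycles to execute groups of independent ops on a core `width` ops wide.

    Toy model: each group must complete before the next group starts.
    """
    return sum(math.ceil(n / width) for n in ready_groups)

# Hypothetical workloads: number of independent integer ops available per step.
mostly_narrow = [1, 2, 1, 2, 2, 1]  # never more than 2 ops ready at once
bursty = [3, 3, 1, 3, 2, 3]         # often 3 ops ready at once

for name, groups in (("mostly narrow", mostly_narrow), ("bursty", bursty)):
    result = {width: cycles_needed(groups, width) for width in (2, 3)}
    print(name, result)
```

In this sketch, the workload that rarely offers more than two independent operations runs equally fast on the 2-wide and the 3-wide core, while the bursty one needs noticeably more cycles on the 2-wide core. This is the kind of case where the 2 ALU limit can hurt.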

On the other hand, the dedicated integer scheduler and L1D cache should help keep single- and multi-thread performance good and predictable, with a smaller potential negative impact than Hyper-Threading. However, the shared decoders, L2 cache and FPU can all become performance bottlenecks, so they have been significantly overhauled compared with previous AMD micro-architectures.

So, in a nutshell: AMD's module approach tries to increase aggregate performance by “breaking” a very complex core into two simpler ones and supplying each of them with a single instruction thread. The simpler cores share as much silicon as possible.