HINT: if you are interested in the quick & dirty benchmarks only, go to page #4

It is no secret that processor performance grows at a very fast rate, faster than any other PC / server component. This disparity challenges CPU designers, as they have to create ever-faster processors that are impacted as little as possible by the slower system components.

One of these system components, and one that can have a great influence on processor speed, is the Random Access Memory, or RAM for short. In the past years there has been a lot of effort to raise RAM speed: in less than a decade we went from 133 MHz SDR DIMM RAM to 1333 MHz DDR3 DIMM RAM, effectively increasing bandwidth by a factor of 10X. If you consider that modern PC and server platforms use two or more memory channels, you can quickly appreciate the improvements in memory speed over the last ten years.

However, CPU performance goes up at an even faster rate. Also, while memory bandwidth has improved tremendously, memory latency has improved by a factor of 2X or 3X at most. So, while today's RAM is quite fast at moving relatively large data chunks (burst speeds in the range of 6.4 – 12.8 GB/s per DIMM module), its effective access latency remains at around 40–50 ns. As a result, RAM speed can seriously influence CPU speed.

For example, consider the FSTORE unit on Phenom / Phenom II CPUs: it can output a canonical 64 bit wide x87 register each clock, and it is clocked at around 3.0 GHz. Simple math reveals that, under optimal conditions, one single core of a 3.0 GHz Phenom / Phenom II processor can store floating point data at around 24 GB/s. Considering that the Phenom II X4 940 has four cores, a single processor can write floating point data at a peak of 96 GB/s! And this is only part of the story, as the integer input/output rates are almost double. Compare these values to the peak bandwidth delivered by a single memory module and you can see that today's processors can be seriously limited by memory bandwidth.

To alleviate this problem, all current processors use some very interesting strategies to reduce their dependence on memory speed. These improvements are focused on the following areas (sorted from oldest to newest):

  • minimize memory utilization (e.g. by using large on-chip caches)

  • maximize memory bandwidth (e.g. by using multiple memory channels)

  • address memory using a more granular approach (e.g. by splitting one 128 bit channel into two 64 bit channels).

These methods are very effective in fighting memory bandwidth starvation (caches especially have an enormous positive impact on processor performance). According to Intel, saturating two DDR3-1333 channels requires at least three Nehalem-style cores working on a memory-intensive kernel.

This article concentrates on the last trick – the use of a more granular memory addressing method. The Phenom / Phenom II processors are very interesting beasts, as they permit the user to configure their memory channels either ganged (forming a single 128 bit memory channel) or unganged (two independent 64 bit wide memory channels). But why write this kind of article? Simply because there is quite a bit of confusion on this subject on the net. Some write that you should absolutely avoid the unganged mode, as this mode gives (in their opinion) only a 64 bit path to memory for a single CPU core. Others argue that today's programs are well multi-threaded, so you should use the unganged mode and absolutely avoid the ganged one.

In truth, the performance difference between ganged and unganged mode is not earth-shattering: some years ago the smart guys at ixbtlabs did a great job benchmarking ganged vs unganged mode in common applications, and they found that the respective performances were quite close. You can read the entire article here: http://ixbtlabs.com/articles3/cpu/amd-phenom-x4-9850-ganged-unganged-p1.html

However, the above article does not explain why the performances are so close. After all, using one 128 bit channel vs two 64 bit channels seems a quite radical choice, each with its own set of advantages and pitfalls. So, what is the real modus operandi of the Phenom memory controller? Can we expect a performance advantage using one mode rather than the other, and in which applications?

We will answer these questions shortly, but first let's have an in-depth look at the Phenom integrated memory controller. I want to stress that what you will read in the following pages is the result of a careful study of AMD's documentation and of observations made while running a custom test program. While I did my best, I do not pretend to be 100% correct in every case. If you find any error, large or small, please let me know.

So, it's time to dig deep... let's study AMD's documentation a bit.