The Phenom / PhenomII memory controller: ganged vs unganged mode benchmarked

Written by Gionatan Danti. Posted in Hardware analysis


What AMD says about the ganged vs unganged question

AMD provides excellent documentation that can be downloaded for free. Let's examine some extracts from the “BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors”. Section 2.8 contains some considerations on ganged vs unganged mode. If you are interested in checking the document, I suggest reading these sections in particular:

  • 2.8 - DRAM Controllers (DCTs)
  • 2.8.5 - Ganged or Unganged Mode Considerations
  • 2.8.8 - DRAM Data Burst Mapping
  • 2.12.2 - DRAM Considerations for ECC

 

In short, the documentation indicates that:

  1. In ganged mode, we have a 128-bit-wide logical DIMM that maps the first 64 bits onto physical DDR channel A and the last 64 bits onto DDR channel B. A single 128-bit operation is therefore split between the two memory channels; on the other hand, the DCTs cannot operate independently. In other words, the physical address space is interleaved between the two DIMMs in 64-bit steps

  2. In unganged mode, each DCT can act independently and has its own 64-bit-wide address space. In this mode the processor can be programmed to interleave the single physical address space across the two normalized address spaces associated with the two memory channels; however, the finest possible interleaving unit is the cache line size (64 bytes)

  3. AMD officially suggests enabling unganged mode to benefit from increased parallelism

  4. Some CPU models (for example, the 8- and 12-core Magny-Cours G34 processors) can only use unganged mode.

I drew a diagram that should help explain the differences between ganged and unganged modes:

Ganged vs Unganged physical address subdivision

As you can see, in ganged mode the physical address space is spread across the two memory channels with a 64-bit granularity: this means that two consecutive 64-bit accesses will read from two different memory channels and, more importantly, that a 128-bit access can use both channels.

On the other hand, in unganged mode a (relatively) large portion of the physical address space is bound to a single memory channel. In the diagram above this portion is 64 bytes long, but K10 processors can be programmed to use an even more coarse-grained interleaving scheme. However, the normal interleaving unit in unganged mode is 64 bytes (as shown in the diagram), as longer units can cause a tangible performance loss.
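To make the two subdivision schemes concrete, here is a minimal Python sketch of the address-to-channel mapping described above. It is only a simplified model of the article's description: the function names are hypothetical, and the real channel selection is configured through the registers documented in the BKDG, which this sketch does not attempt to reproduce.

```python
CACHE_LINE_BYTES = 64  # Phenom cache line size, as stated in the article


def channel_for(addr, interleave_bytes):
    """Return 'A' or 'B' for a physical byte address.

    Assumed model: two channels, with the address space alternating
    between them every `interleave_bytes` bytes.
    """
    return "A" if (addr // interleave_bytes) % 2 == 0 else "B"


def ganged_channel(addr):
    """Ganged mode: interleaving in 64-bit (8-byte) steps."""
    return channel_for(addr, 8)


def unganged_channel(addr, interleave_bytes=CACHE_LINE_BYTES):
    """Unganged mode: interleaving unit is a cache line or coarser."""
    return channel_for(addr, interleave_bytes)


def channels_touched(start, length, channel_fn):
    """List the channels used by an access of `length` bytes at `start`."""
    return sorted({channel_fn(a) for a in range(start, start + length)})


# Print the channel chosen by each mode for a few sample addresses.
for addr in (0x00, 0x08, 0x10, 0x40, 0x48):
    print(hex(addr), ganged_channel(addr), unganged_channel(addr))
```

In this model a 16-byte (128-bit) access at address 0x0 touches both channels in ganged mode, while a whole 64-byte line stays on one channel in unganged mode. Passing a larger `interleave_bytes` value mimics the coarser interleaving schemes mentioned above.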

From what we have seen, one might think that neither approach is ideal: the usual register and operand size is 64 bits (8 bytes), so it appears that both the ganged and unganged methods will read such a 64-bit entity over a single memory channel only, effectively wasting bandwidth. A byte-interleaved (or bit-interleaved) mode should give us a great performance boost, right? Simply stated: no. The key point is that processors do not move data chunks of arbitrary length in and out of memory, but use a fixed-size scheme: they move data to and from main memory only on a cache line basis. On Phenom processors the cache line is 64 bytes long, so these processors move data to and from main memory only in 64-byte chunks. This means that if we try to read a single byte at address 0x0, the entire cache line (64 bytes) will be fetched by the processor! While this can seem counterproductive, it has its reasons, mainly related to spatial locality and cache design. It is beyond the scope of this article to explain why processors behave in this manner, but in short this design gives a good performance boost (because it exploits the spatial locality of code and data) and permits the creation of very dense caches.
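The cache line arithmetic can be sketched in a few lines of Python. This is just an illustration of the rule stated above (memory traffic happens in whole 64-byte lines); the helper names are hypothetical and the sketch ignores everything else a real cache does.

```python
CACHE_LINE_BYTES = 64  # Phenom cache line size, as stated in the article


def cache_line_base(addr):
    """Round an address down to the start of its cache line."""
    return addr & ~(CACHE_LINE_BYTES - 1)


def lines_fetched(addr, size):
    """Base addresses of every cache line touched by a `size`-byte access."""
    first = cache_line_base(addr)
    last = cache_line_base(addr + size - 1)
    return list(range(first, last + CACHE_LINE_BYTES, CACHE_LINE_BYTES))


def bytes_transferred(addr, size):
    """Bytes actually moved from main memory for the access (on a miss)."""
    return len(lines_fetched(addr, size)) * CACHE_LINE_BYTES


# Reading one byte at 0x0 still moves a full 64-byte line.
print(bytes_transferred(0x0, 1))
# An 8-byte read that straddles a line boundary moves two lines.
print(lines_fetched(0x3C, 8))
```

Note that the request size never appears in the amount of data moved, except through the number of lines it touches: this is why a byte-interleaved or bit-interleaved channel layout would not help.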

As memory operations happen in 64-byte chunks, it appears that ganged mode will always win: it can spread each 64-byte operation across the two memory channels, while unganged mode will use only a single memory channel. In reality, however, unganged mode rarely suffers from this problem, because normally there are many outstanding memory requests to be completed, and so many outstanding cache lines to be fetched from or stored to main memory. While ganged mode will be faster when operating on a single cache line, unganged mode can theoretically operate on two cache lines at any given moment (with some restrictions). This parallelism is possible because the memory controller incorporates an 8-entry-deep memory controller queue (the “MCQ” box in the drawing above), for a total of 8 outstanding cache line requests.
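As a rough illustration of why queued requests matter, the following Python sketch splits up to 8 pending cache line requests between the two channels, assuming the 64-byte unganged interleaving discussed earlier. The function name and the simple truncation to the queue depth are assumptions made for this example, not a description of how the real MCQ schedules requests.

```python
CACHE_LINE_BYTES = 64  # Phenom cache line size, as stated in the article
MCQ_DEPTH = 8          # outstanding cache line requests, as stated above


def split_queue_by_channel(line_addrs):
    """Group pending cache line requests by the channel that serves them.

    Assumed model: unganged mode with 64-byte interleaving, and only the
    first MCQ_DEPTH requests are considered pending.
    """
    queues = {"A": [], "B": []}
    for addr in line_addrs[:MCQ_DEPTH]:
        channel = "A" if (addr // CACHE_LINE_BYTES) % 2 == 0 else "B"
        queues[channel].append(addr)
    return queues


# Eight consecutive cache lines: both channels get work.
sequential = [i * CACHE_LINE_BYTES for i in range(8)]
print(split_queue_by_channel(sequential))

# Eight lines with a 128-byte stride: every request lands on channel A.
strided = [i * 2 * CACHE_LINE_BYTES for i in range(8)]
print(split_queue_by_channel(strided))
```

In this model a sequential stream keeps both channels busy, while an unlucky access pattern can leave one channel idle. This is one way to read the "with some restrictions" caveat: the two DCTs can only work in parallel when the pending cache lines actually map to different channels.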

However, simply stating that unganged mode has the potential to be often on par with ganged mode is not enough: if that were all, we could simply use ganged mode and forget about unganged mode. The point is that unganged mode has the potential to be faster than ganged mode. Why? Because main memory accesses don't happen immediately: a DRAM chip requires many nanoseconds to be accessed. After this initial access time the data can be transferred quite quickly, but the initial access steps can be very slow (from a processor standpoint). By starting two memory operations at the same time, the memory controller can hide, at least partially, the latency involved in the setup steps of the second operation. Obviously this is not always true, but it is a real possibility, and so it can be an advantage of the unganged over the ganged method. Moreover, in unganged mode the memory controller can theoretically write to and read from memory at the same time: this should help memory copy routines and multitasking operating systems, where many processes can read from and write to memory at the same time.
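A deliberately crude timing model can show how both effects play out. Everything in the following Python sketch is an assumption made for illustration: the 50 ns access time and 10 ns transfer time are invented numbers, requests on a channel are served strictly one after another, and unganged mode is given its best case (requests alternating perfectly between the two channels). Real DRAM controllers overlap commands, exploit open rows and multiple banks, so these figures are not measurements.

```python
import math


def ganged_time_ns(n_lines, access_ns, transfer_ns):
    """Time to serve n cache lines over one logical 128-bit channel.

    Each line pays the full access latency but only half the transfer
    time, because the data is split across both physical channels.
    """
    return n_lines * (access_ns + transfer_ns / 2)


def unganged_time_ns(n_lines, access_ns, transfer_ns):
    """Time to serve n cache lines over two independent 64-bit channels.

    Each line pays the full access latency and the full transfer time,
    but two lines (one per channel) are in flight at once.
    """
    lines_per_channel = math.ceil(n_lines / 2)
    return lines_per_channel * (access_ns + transfer_ns)


ACCESS_NS = 50    # hypothetical initial DRAM access time
TRANSFER_NS = 10  # hypothetical time to move 64 bytes over one channel

for n in (1, 2, 8):
    print(n,
          ganged_time_ns(n, ACCESS_NS, TRANSFER_NS),
          unganged_time_ns(n, ACCESS_NS, TRANSFER_NS))
```

With these made-up numbers, ganged mode wins for a single isolated cache line (55 ns against 60 ns), while unganged mode wins as soon as several lines are pending (240 ns against 440 ns for 8 lines), because the access latency of the second channel's request overlaps with the first. The model overstates the gap, but it shows why the access latency, not the transfer time, is the term worth hiding.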

Summarizing the whole point, we can state that:

  • ganged mode has the potential to be faster than unganged mode because it uses a more fine-grained interleaving scheme

  • unganged mode has the potential to be faster than ganged mode because it can start two memory operations at the same time, effectively hiding at least part of the latency involved in the second operation. This mode also permits reading from and writing to memory at the same time, with the intrinsic advantages that this possibility implies.

So, we don't have a “magic setting” that will always give us the best possible performance. We should run some benchmarks to understand which applications and scenarios benefit from one method rather than the other.

Comments   

 
#1 Julián Fernández 2012-07-21 00:40
This was quality reading. Thanks mate.
 
 
#2 Iz 2013-01-24 23:14
Thank you for sharing these insights. I found them most useful indeed.
 
 
#3 asd 2014-03-26 19:17
Your graphs are misleading. You should ALWAYS show the full range in any graph (i.e. starting at 0 value), so the magnitude of the gains can be seen at first glance. This is statistics 101.
At least you labeled your axis.
 
 
#4 Gionatan Danti 2014-03-26 19:29
Quoting asd:
Your graphs are misleading. You should ALWAYS show the full range in any graph (i.e. starting at 0 value), so the magnitude of the gains can be seen at first glance. This is statistics 101.
At least you labeled your axis.


Yes, you are right.

When the differences are small, old OpenOffice Calc versions tend to create graphs which don't start from 0.

I realized that only after the graphs were published, and I preferred to leave them unmodified.

Regards.
 
