Synthetic benchmarks: MEMBENCH sequential memory speed

In order to obtain some more information, I wrote a small program that uses some assembly language routines to read, write and copy memory chunks. You can download the source code from the Assyoma web site. As comparison points, I included the speed of the glibc memset() and memcpy() routines.
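The real benchmark uses hand-written assembly loops, but the overall structure is easy to show in plain C. The following is only a rough sketch of the read/write/copy measurements; the buffer size, iteration count and timing method are my choices for illustration, not the actual MEMBENCH code:

/* Rough sketch of a sequential memory bandwidth test (NOT the original
 * MEMBENCH source). Build with: gcc -O2 bench.c -o bench */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (64 * 1024 * 1024)  /* 64 MiB: much larger than the L3 cache */
#define ITERATIONS 16

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst)
        return 1;
    memset(src, 1, BUF_SIZE);  /* touch every page up front */
    memset(dst, 2, BUF_SIZE);

    /* Read test: sum the buffer so the compiler cannot skip the loads. */
    volatile unsigned long sink = 0;
    double t0 = seconds_now();
    for (int i = 0; i < ITERATIONS; i++) {
        const unsigned long *p = (const unsigned long *)src;
        unsigned long sum = 0;
        for (size_t j = 0; j < BUF_SIZE / sizeof(*p); j++)
            sum += p[j];
        sink += sum;
    }
    double t1 = seconds_now();
    printf("read:  %.0f MB/s\n", ITERATIONS * (BUF_SIZE / 1e6) / (t1 - t0));

    /* Write test: plain memset(), every store goes through the cache. */
    t0 = seconds_now();
    for (int i = 0; i < ITERATIONS; i++)
        memset(dst, i, BUF_SIZE);
    t1 = seconds_now();
    printf("write: %.0f MB/s\n", ITERATIONS * (BUF_SIZE / 1e6) / (t1 - t0));

    /* Copy test (mixed read/write): plain memcpy(). */
    t0 = seconds_now();
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);
    t1 = seconds_now();
    printf("copy:  %.0f MB/s\n", ITERATIONS * (BUF_SIZE / 1e6) / (t1 - t0));

    free(src);
    free(dst);
    return 0;
}

Let's see the single process results first: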

MEMBENCH sequential memory speed - single process

As you can see, the ganged mode has a slight advantage in memory reads, while the situation is reversed in the mixed read/write (copy) tests. What happens when we launch two concurrent membench processes?

MEMBENCH sequential memory speed - two processes

The ganged mode maintains a very small margin in the read tests, but in the other cases the unganged mode scores higher. The previous graph was obtained by running two identical benchmark instances: when the first instance was reading, the second was also reading. But what happens if we let the first instance execute the read test while the second instance is writing, and vice versa?

MEMBENCH sequential memory speed - concurrent read/write operations

The unganged mode has a noticeable advantage here.

A little note about these sequential tests: the astute reader might have noticed that, especially in the single process tests, the measured memory bandwidth seems quite low, almost as if the processor was using only one memory channel. I can guarantee that the processor was correctly utilizing both memory channels: using only a single DIMM caused an immediate, tangible performance drop.

So, why do we see these low numbers here? Principally for three reasons:

  • due to the exclusive nature of the L1 and L2 caches and the “mostly exclusive” nature of the L3 cache, read speed has an upper bound equal to the per-core L3 bandwidth, measured (in my tests and in other, independent ones) at about 9 GB/s (for a total, aggregate bandwidth of ~36 GB/s).
  • in the write test, the numbers are so low because of cache thrashing: as we are constantly writing a great amount of data (that we will never reuse) into the cache, we force the L3 cache to evict a cache line each time we want to write something to memory. This is terrible for memory efficiency because at some point each write command will generate the following chain of events:
    1. send the write command to the L1 cache
    2. realize that we have no space in the L1 cache
    3. evict an L1 cache line to L2
    4. realize that we have no space in the L2 cache
    5. evict an L2 cache line to L3
    6. realize that we have no space in the L3 cache
    7. evict an L3 cache line to DRAM
    8. wait for the DRAM to complete the operation
    9. finally, write the desired data into the L1 cache

    As you can see, this disrupts the normal flow of writes. To avoid these stalls, we can use the MOVNT instructions: they do not go through the cache but through the write-combining buffers. In internal tests I noticed that, by using these instructions, I can reach read-like speeds when writing sequential data streams (see the sketch after this list). However, they are special instructions, useful only in selected cases.

  • main memory has a very high first access latency, so 100% utilization is difficult to reach with only a single process. By using multiple processes, the memory controller can open (and keep open) more memory pages, effectively lowering the total impact of first access latency.
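To make the MOVNT point above more concrete, here is a minimal sketch of a cache-bypassing fill loop built on the SSE2 intrinsic _mm_stream_si128(), which compiles down to a MOVNTDQ instruction. This is only an illustration of the technique, not the actual MEMBENCH write routine; the buffer size and alignment handling are my assumptions. Build with: gcc -O2 -msse2.

#include <emmintrin.h>  /* SSE2: _mm_stream_si128() / MOVNTDQ */
#include <stdlib.h>

/* Fill a buffer with non-temporal stores. The stores go through the
 * write-combining buffers and bypass L1/L2/L3 entirely, so the eviction
 * chain described above never happens. */
static void fill_streaming(void *dst, char byte, size_t size)
{
    __m128i v = _mm_set1_epi8(byte);
    __m128i *p = dst;

    for (size_t i = 0; i < size / sizeof(__m128i); i++)
        _mm_stream_si128(&p[i], v);  /* MOVNTDQ: cache-bypassing store */

    _mm_sfence();  /* make the streaming stores globally visible */
}

int main(void)
{
    size_t size = 64 * 1024 * 1024;
    void *buf;

    /* MOVNTDQ requires 16-byte aligned addresses. */
    if (posix_memalign(&buf, 16, size))
        return 1;

    fill_streaming(buf, 0, size);  /* in effect, a cache-bypassing memset() */
    free(buf);
    return 0;
}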