Synthetic benchmarks: STREAM
Let's begin our session with a classic memory test: the STREAM benchmark (you can download it at http://www.cs.virginia.edu/stream/). It can be run in single-threaded or multi-threaded mode; in the latter case, it uses the OpenMP library to spread the calculations over the available cores. The benchmark was configured to use 16,000,000-entry arrays, with each entry occupying 8 bytes (double-precision floating point). As the test uses three arrays, the total memory used was about 366 MB, well beyond the processor's 8.5 MB of total cache.
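As a quick check of the memory figure, the arithmetic can be sketched in a few lines of Python. The constant and function names below are illustrative and not part of STREAM itself; the values come from the configuration described above, with 1 MB counted as 2^20 bytes.

```python
# Working-set size of the STREAM run described in the text.
ARRAY_ENTRIES = 16_000_000   # entries per array, as configured for this test
BYTES_PER_ENTRY = 8          # one double-precision floating point value
NUM_ARRAYS = 3               # STREAM operates on three arrays

def stream_footprint_bytes(entries=ARRAY_ENTRIES,
                           entry_size=BYTES_PER_ENTRY,
                           arrays=NUM_ARRAYS):
    """Return the total number of bytes occupied by the benchmark arrays."""
    return entries * entry_size * arrays

total_bytes = stream_footprint_bytes()
total_mib = total_bytes / 2**20      # convert bytes to units of 2^20 bytes
cache_mib = 8.5                      # total processor cache quoted in the text

print(f"Working set: {total_bytes} bytes, about {total_mib:.1f} MB")
print(f"That is roughly {total_mib / cache_mib:.0f} times the cache size")
```

With a working set this much larger than the cache, nearly every access has to go to main memory, which is what a memory bandwidth test needs.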
In single-threaded mode, ganged and unganged modes perform roughly on par, but in the multi-threaded scenario unganged mode scores higher. Why? Probably because running more threads amplifies the advantage of being able to read from and write to main memory at the same time. The advantage that unganged mode shows in the single-threaded Copy test also seems to confirm this point.
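To make the read/write mix concrete, here is a minimal Python sketch of the four STREAM kernels (Copy, Scale, Add, Triad). It is written for clarity only: the real benchmark is compiled C code, and the function names and the `bytes_moved` helper are illustrative assumptions. Every kernel reads from at least one array while writing to another, which is the access pattern discussed above.

```python
# The four STREAM kernels, expressed over plain Python lists.
# Each one reads one or two arrays and writes a third.

def copy(a, c):
    for j in range(len(a)):
        c[j] = a[j]                      # 1 read, 1 write

def scale(c, b, scalar):
    for j in range(len(c)):
        b[j] = scalar * c[j]             # 1 read, 1 write

def add(a, b, c):
    for j in range(len(a)):
        c[j] = a[j] + b[j]               # 2 reads, 1 write

def triad(b, c, a, scalar):
    for j in range(len(b)):
        a[j] = b[j] + scalar * c[j]      # 2 reads, 1 write

BYTES_PER_ENTRY = 8  # double precision

# Number of arrays touched per iteration (reads plus writes).
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}

def bytes_moved(kernel, entries):
    """Bytes counted for one pass of the given kernel over `entries` elements."""
    return ARRAYS_TOUCHED[kernel] * BYTES_PER_ENTRY * entries

# Tiny demonstration on three-element arrays.
a = [1.0, 2.0, 3.0]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0, 0.0]
copy(a, c)            # c becomes a copy of a
scale(c, b, 3.0)      # b = 3 * c
add(a, b, c)          # c = a + b
triad(b, c, a, 3.0)   # a = b + 3 * c
```

Dividing `bytes_moved(kernel, entries)` by the measured time of a pass gives the bandwidth figure for that kernel. Copy is the simplest case, with exactly one read stream and one write stream.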