Branch prediction

On a superpipelined processor, accurate branch prediction is critical: each time the processor mispredicts a branch, the long pipeline fills with instructions from the wrong path, and flushing them wastes time and power.
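As a rough illustration of why pipeline depth matters here, the sketch below multiplies the number of mispredicted branches by a flush penalty. The function name, the branch count, the 5% miss rate and the two penalty values are all hypothetical, chosen only to show the trend; they are not measurements of any real CPU.

```python
def misprediction_overhead(branches, mispredict_rate, flush_penalty_cycles):
    """Cycles lost refilling the pipeline after wrong guesses (first-order model)."""
    return branches * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: one million branches, 5% of them mispredicted.
BRANCHES = 1_000_000
MISS_RATE = 0.05

# Hypothetical flush penalties for a shorter and a longer pipeline.
shorter = misprediction_overhead(BRANCHES, MISS_RATE, flush_penalty_cycles=12)
longer = misprediction_overhead(BRANCHES, MISS_RATE, flush_penalty_cycles=18)

print(f"shorter pipeline: {shorter:.0f} cycles lost")
print(f"longer pipeline:  {longer:.0f} cycles lost")
```

With the same predictor accuracy, the lost cycles grow in direct proportion to the flush penalty, which is why a longer pipeline needs a better predictor just to stand still.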

While AMD has not officially disclosed Bulldozer's pipeline length, the misprediction latency suggests that it is about 50% longer than Phenom II's. To maintain acceptable performance, AMD significantly improved its branch prediction logic. How do things add up? Anand ran an AIDA64 Queen test on several systems with only one core enabled. This lets us evaluate single-threaded branch prediction efficiency:

Bulldozer AIDA 64 Queen single-thread

You can see that Bulldozer is noticeably slower than K10, not to mention Sandy Bridge. However, taking the increased pipeline length into account, the branch prediction logic seems slightly improved over the K10 design.
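The reasoning above can be made explicit with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that run time splits into a part unaffected by mispredictions and a part that scales with (misprediction rate × flush penalty), and that the unaffected part takes the same time on both chips. The function name and all input values are hypothetical; only the 1.5× penalty ratio echoes the "about 50% longer" estimate in the text.

```python
def implied_rate_ratio(slowdown, penalty_ratio, mispredict_share):
    """
    Estimate new_mispredict_rate / old_mispredict_rate.

    slowdown         -- new run time / old run time (1.2 means 20% slower)
    penalty_ratio    -- new flush penalty / old flush penalty (1.5 means 50% longer)
    mispredict_share -- fraction of the OLD run time spent recovering from mispredictions

    Model (old run time normalised to 1):
        new_time = (1 - share) + share * penalty_ratio * rate_ratio
    Solving new_time = slowdown for rate_ratio gives the expression below.
    """
    return (slowdown - (1.0 - mispredict_share)) / (mispredict_share * penalty_ratio)


# Hypothetical inputs: half of the old run time is misprediction recovery,
# the new chip has a 1.5x longer penalty and is 20% slower overall.
ratio = implied_rate_ratio(slowdown=1.20, penalty_ratio=1.5, mispredict_share=0.5)
print(f"implied misprediction rate ratio: {ratio:.2f}")
```

A result below 1 means the newer design mispredicts less often than the older one, even though it is slower overall. In this toy setup an unchanged predictor would have produced a 25% slowdown, so a 20% slowdown implies a somewhat better predictor.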

What happens if we throw more threads at the problem? A review gives us the answer:

Bulldozer AIDA 64 Queen multi-thread

Providing the Bulldozer architecture with more threads significantly improves aggregate performance. But wait a moment: this test is branch-bound, not ALU-bound, and Bulldozer has only a single decoder per module, yet aggregate performance is 58% better than simply quadrupling the single-thread result. Why are the multithreaded results so much better? The answer is that by working on two threads at a time, the decoder can better mask some of the cache/memory latencies, resulting in better performance. Something very similar happens on the Intel processor with HT enabled, but to a lesser extent: L2 is very fast on Sandy Bridge, so the gain is “only” about 34%.
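For clarity, this is how a "gain over simple quadrupling" figure is computed. The scores in the sketch are hypothetical placeholders picked so that the arithmetic reproduces the 58% and 34% figures quoted above; they are not the actual benchmark results.

```python
def gain_over_linear(single_score, multi_score, units):
    """Fractional gain of the multi-thread score over units * single-thread score."""
    return multi_score / (units * single_score) - 1.0


# Hypothetical scores (arbitrary units), chosen to match the quoted percentages.
# units=4 stands for four modules (Bulldozer) or four cores (Sandy Bridge),
# each of which runs two threads in the multi-thread test.
bulldozer = gain_over_linear(single_score=1000, multi_score=6320, units=4)
sandy_bridge = gain_over_linear(single_score=1000, multi_score=5360, units=4)

print(f"Bulldozer gain over 4x single-thread:    {bulldozer:.0%}")
print(f"Sandy Bridge gain over 4x single-thread: {sandy_bridge:.0%}")
```

Anything above 0% here is throughput that comes from running a second thread on each module or core, rather than from simply having four of them.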

Altogether, while branch prediction is not at Sandy Bridge's level, it is clear that Bulldozer is a noticeable improvement over K10 in this area. However, branch-rich code can be held back by the high L2 latency.