We are at the end of our journey into the Bulldozer architecture. So, why did this brand-new architecture fail to live up to our expectations? To recap:
- The poor physical characteristics of GlobalFoundries' silicon led to excessive heat generation, significantly impairing Bulldozer's clock scaling. A Bulldozer chip with a 4.2 GHz default clock would have left a different impression;
- The low-bandwidth, high-latency L2 cache leads to mediocre performance with branch-rich code, which makes up a large part of common applications. Moreover, the write-through policy of the L1 cache amplifies this problem in store-intensive workloads;
- The revamped FPU sorely lacks an additional store port and more write bandwidth.
What will AMD do for the next Bulldozer iteration, aka Piledriver? It is difficult to say now. At a bare minimum, I expect an improved silicon die capable of running at a default clock above 4 GHz. It would also be nice to have a larger Write Coalescing Cache and faster L2 caches.
AMD had a brilliant intuition in equipping each Bulldozer module with two streamlined integer cores. If they manage to evolve this design successfully, eliminating all the current bottlenecks, they will have a very competitive core on their hands. However, Bulldozer really needs to be polished.
Have a nice day!