Floating point units

If in the integer core department AMD chose to integrate a greater number of somewhat simplified cores, in the floating point department it took the opposite route: in the current design a single, beefed-up FPU is shared between two integer cores. This gives the FX-8150 a total of four 128 bit FPUs serving eight integer cores. Compared to the previous Phenom II X6 processors, this is a regression in FPU count: the old flagship had six 128 bit FPUs, against the four of the new FX-8150. On the other hand, the new 128 bit FPUs are very powerful: where the previous FPUs integrated 1x FADD/SSE, 1x FMUL/SSE and 1x store ports, the new ones include 2x FMAC/x87, 2x integer MMX/SSE/AVX and 1x store ports. Interestingly, Bulldozer returns to a co-processor style FPU, where loads and stores are forwarded by/to the integer cores.

Sandy Bridge's FPU is a different beast: while in AMD processors the floating point unit is clearly separated from the arithmetic/logic units, in Intel's design the FP resources are located on the same ports as the integer ones. In terms of capabilities, this design is 256 bit wide and has 1x FADD/SSEADD/AVXADD, 1x FMUL/SSEMUL/AVXMUL and 1x store port.

Critically, the powerful Bulldozer FPU continues to have only one store port. This choice may result in low FPU throughput in write-intensive workloads, as the two threads forwarded by the integer cores will be served at a combined maximum rate of one x87/SSE store operation per clock, giving a per-thread store rate of 0.5 x87/SSE stores per clock. Add this to the previous discussion about L1 write throughput and you can see that the new Bulldozer FPU can be starved for store rate:

Bulldozer floating point bandwidth

The above chart shows that in both compute and store intensive workloads, a Bulldozer FPU is going to perform similarly to its older counterpart. The situation is particularly critical for single-thread (per core) performance: as x87 stores are forwarded to the integer cores, the low per-core L1 write bandwidth (remember that it is half of what is found on K10 processors) can significantly impair 80 and 128 bit results, which are going to be stored at half speed.
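
The store-port arithmetic discussed above can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not a measurement: the function name is made up for illustration, and it assumes that the single FPU store port is split evenly between the threads that are active on a module.

```python
def per_thread_store_rate(store_ports: int, active_threads: int) -> float:
    """Stores per clock available to each thread, assuming an even split.

    Hypothetical helper: real hardware arbitration is more complex.
    """
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return store_ports / active_threads


# One Bulldozer module: a single FPU store port, one or two threads.
one_thread = per_thread_store_rate(store_ports=1, active_threads=1)
two_threads = per_thread_store_rate(store_ports=1, active_threads=2)

print(one_thread, two_threads)  # 1.0 0.5
```

The model only covers the FPU store port; the per-core L1 write bandwidth limit mentioned in the text is a separate ceiling that it does not capture.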

On the other hand, this is the worst possible situation for the Bulldozer FPU. Real-world results should be higher, because:

  • first, the new FPU has robust FMAC capabilities that, when exploited, can double its compute performance;
  • second, the two FMAC/x87 pipelines are largely symmetrical, meaning that it is easier to reach peak speed. The previous design, by contrast, had an asymmetrical FADD/FMUL implementation;
  • third, in multi-threaded scenarios the FPU forwards stores to both integer cores, effectively using both L1 caches and having access to the full per-module write bandwidth;
  • fourth, each Bulldozer module has a 4KB Write Coalescing Cache (WCC); this means that, at least in some cases, L1 store bandwidth is not going to be a problem;
  • finally, remember that Bulldozer works at higher clock speeds.
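
To put rough numbers on the first two points, here is a small Python sketch of theoretical peak double-precision throughput per clock. It is an idealized model under stated assumptions: each 128 bit pipe processes two 64 bit lanes per clock, a fused multiply-add counts as two floating point operations per lane, and the K10 figure assumes a perfect one-to-one mix of adds and multiplies so that both its FADD and FMUL pipes stay busy. The function and constant names are illustrative only.

```python
LANES_DP_128 = 128 // 64  # two double-precision lanes per 128 bit pipe


def peak_dp_flops_per_clock(fpus: int, pipes: int, flops_per_lane: int) -> int:
    """Idealized peak: every pipe issues a full-width operation each clock."""
    return fpus * pipes * LANES_DP_128 * flops_per_lane


# Per FPU: K10 (FADD + FMUL) vs Bulldozer (2x FMAC), with and without FMA.
k10_fpu = peak_dp_flops_per_clock(fpus=1, pipes=2, flops_per_lane=1)
bd_fpu_plain = peak_dp_flops_per_clock(fpus=1, pipes=2, flops_per_lane=1)
bd_fpu_fma = peak_dp_flops_per_clock(fpus=1, pipes=2, flops_per_lane=2)

# Per chip: six K10 FPUs (Phenom II X6) vs four Bulldozer FPUs (FX-8150).
phenom_x6 = peak_dp_flops_per_clock(fpus=6, pipes=2, flops_per_lane=1)
fx8150_plain = peak_dp_flops_per_clock(fpus=4, pipes=2, flops_per_lane=1)
fx8150_fma = peak_dp_flops_per_clock(fpus=4, pipes=2, flops_per_lane=2)

print(k10_fpu, bd_fpu_plain, bd_fpu_fma)    # 4 4 8
print(phenom_x6, fx8150_plain, fx8150_fma)  # 24 16 32
```

In this model a Bulldozer FPU only matches a K10 FPU on plain add/multiply code, and the four-FPU chip needs FMA-aware software to pull ahead of the six-FPU one on paper. The symmetry argument is visible too: the K10 peak depends on a balanced add/multiply mix, while either Bulldozer pipe can take either operation.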

To evaluate real-world performance, we are going to look at the Overclockersclub.com results first. They ran the Sandra 2011 Whetstone test (SSE3) and two AIDA 64 benchmarks, Mandel (SSE2/3/4) and SinJulia (x87 extended precision).

Bulldozer Sandra whetstone

In the Sandra test, we see that the four Bulldozer FPUs fare quite well, scoring on par with the six floating point units integrated into the Phenom II X6 die. In other words, the new FPU is 50% faster clock-to-clock than the old one in Sandra's Whetstone test. This is enough to overcome the HT-less Sandy Bridge CPU, but again we see Hyper-Threading giving Intel processors an impressive performance boost, enabling them to surpass both AMD chips.
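
The 50% figure follows directly from the unit counts. A minimal sketch, assuming both chips score the same, run at the same clock, and scale linearly with the number of FPUs (the helper name is hypothetical):

```python
from fractions import Fraction


def per_fpu_speedup(old_fpus: int, new_fpus: int,
                    score_ratio: Fraction = Fraction(1)) -> Fraction:
    """Per-FPU speedup implied by an aggregate new/old score ratio."""
    return score_ratio * Fraction(old_fpus, new_fpus)


# Equal Whetstone scores: four Bulldozer FPUs vs six K10 FPUs.
speedup = per_fpu_speedup(old_fpus=6, new_fpus=4)
print(float(speedup))  # 1.5
```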

Will the AIDA 64 Mandel test confirm this situation?

Bulldozer AIDA 64 Mandel

It seems not: while Bulldozer FPUs remain about 30% faster than K10 ones, they cannot overcome the sheer brute force of the six FPUs. Sure, when you factor in that the new AMD processor can work at a higher frequency, the performance gap between the two AMD products should be negligible; however, per-thread speed clearly proves that this shared FPU design may have insufficient horsepower to concurrently execute instructions from two threads at K10-like speed.
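
A quick sketch shows why a higher clock could plausibly close this gap. It assumes that the 30% per-FPU advantage is measured at equal clocks and that throughput scales linearly with both FPU count and frequency; both are simplifications, and the function names are made up for illustration.

```python
def aggregate_ratio(new_fpus: int, old_fpus: int,
                    per_fpu_speedup: float, clock_ratio: float = 1.0) -> float:
    """New-chip throughput relative to the old chip (1.0 means parity)."""
    return (new_fpus * per_fpu_speedup * clock_ratio) / old_fpus


def break_even_clock_ratio(new_fpus: int, old_fpus: int,
                           per_fpu_speedup: float) -> float:
    """Clock advantage the new chip needs to reach parity."""
    return old_fpus / (new_fpus * per_fpu_speedup)


# Four Bulldozer FPUs, each about 30% faster, against six K10 FPUs.
same_clock = aggregate_ratio(new_fpus=4, old_fpus=6, per_fpu_speedup=1.3)
needed = break_even_clock_ratio(new_fpus=4, old_fpus=6, per_fpu_speedup=1.3)

print(round(same_clock, 3))  # 0.867
print(round(needed, 3))      # 1.154
```

Under these assumptions the FX-8150 would trail by roughly 13% at equal clocks and would need a clock advantage of about 15% to draw level.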

On the other hand, as AIDA uses customized benchmark routines optimized for each processor type, the comparison with Sandy Bridge is difficult. At least, we see that Sandy Bridge, even with a single store port, is way faster than both the Bulldozer and Phenom FPUs: this can be due in part to the 36-entry store queue (against the 24-entry AMD queue) and in part to the 2x 128 bit L1 write ports with the consequent high write bandwidth; however, the low Phenom II results seem to suggest other causes as well (perhaps better optimization for Intel processors?).

Finally, let's see how Bulldozer fares in x87 extended-precision operations:

Bulldozer AIDA 64 SinJulia

This last FPU benchmark is the worst for the new Bulldozer architecture: not only are its FPUs slightly slower than K10 ones, but Phenom II X6 has six of them against the four integrated into FX-8150. This results in a massive performance decrease compared to the previous AMD flagship processor, and Bulldozer's higher clock speed cannot really close the gap. Sandy Bridge CPUs are even faster than Phenom, but not by much this time.

The x87 EP result perfectly reflects the theoretical estimate: the single store port, coupled with very low L1 write bandwidth, prevents the new FPU from showing its full potential. When you factor in that FX-8150 has only four FPUs, it is way slower than the previous Phenom II X6. While you can argue that Sandy Bridge also has only one store port, it is backed by greater store resources (larger store buffers and faster caches).

Ultimately, Bulldozer's FPUs are not bad, but the chip really needs more of them. Alternatively, each FPU needs a second store port to guarantee better utilization of floating point resources.