Bulldozer's cache arrangement is very different from previous AMD designs: while K8 and K10 processors used write-back, exclusive caches, the new Bulldozer architecture uses a write-through, private L1 cache and a mostly-inclusive L2 cache shared at the module level. As in previous Phenom processors, the integrated northbridge includes an L3 victim cache, this time 8 MB in size.

So, the total L2+L3 cache size grows from 9 MB (Thuban die) to 16 MB (Orochi die).
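For readers who want to see where those two totals come from, here is a minimal sketch of the arithmetic. The per-core and per-module L2 sizes and the Thuban L3 size are not stated in the text above; they are the commonly published figures and should be read as assumptions of this sketch. The names `THUBAN`, `OROCHI` and `total_cache_mb` are made up for illustration.

```python
# L2 + L3 totals per die, in MB.
# Assumption: 512 KB of L2 per Thuban core, 2 MB of L2 per Orochi module,
# and a 6 MB L3 on Thuban. Only the 8 MB Orochi L3 comes from the article.
THUBAN = {"l2_units": 6, "l2_mb_per_unit": 0.5, "l3_mb": 6}  # 6 cores
OROCHI = {"l2_units": 4, "l2_mb_per_unit": 2.0, "l3_mb": 8}  # 4 modules


def total_cache_mb(die):
    """Sum of all L2 slices plus the shared L3, in MB."""
    return die["l2_units"] * die["l2_mb_per_unit"] + die["l3_mb"]


thuban_total = total_cache_mb(THUBAN)
orochi_total = total_cache_mb(OROCHI)
print(thuban_total, orochi_total)
```

Under these assumptions the sums match the 9 MB and 16 MB quoted above.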

However, cache speed is as important as size, if not more so. So how do Bulldozer's caches compare to other AMD and Intel designs? Let's discuss read / write bandwidth first. Folks @ used AIDA64 to measure read / write cache speed.

Bulldozer AIDA 64 cache speed

The above graph depicts cache speed at the single-core level (it is a single-threaded benchmark). Starting with L1 read speed, the new Bulldozer processor shows very good performance, matching both Phenom II and Sandy Bridge results. If you consider that a Bulldozer module includes two cores, each with its own private L1, the total aggregate L1 read speed can go beyond that of any other processor in the desktop market today. Note: I saw some SiSoft Sandra 2011 SP5 results suggesting that aggregate L1 read bandwidth is lower than on Phenom II X6 processors. However, I have the strong impression that the Sandra 2011 cache test incorrectly uses only four cores, not all eight. Moreover, AMD's own documentation shows that the per-core L1 cache has the same 2x128-bit load capability found on previous Phenom processors.
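To put the 2x128-bit load figure in perspective, here is a back-of-the-envelope sketch of the theoretical peak L1 read bandwidth it implies. The 3.6 GHz clock is an assumed value chosen for illustration, and the helper name `peak_l1_read_gbs` is hypothetical. These are upper bounds from the load width alone, not measured results, and real benchmarks will land below them.

```python
def peak_l1_read_gbs(loads_per_cycle, load_width_bits, clock_ghz, cores=1):
    """Theoretical peak L1 read bandwidth in GB/s (bytes/cycle x GHz x cores)."""
    bytes_per_cycle = loads_per_cycle * load_width_bits // 8
    return bytes_per_cycle * clock_ghz * cores


CLOCK_GHZ = 3.6  # assumption for illustration, not taken from the article

per_core = peak_l1_read_gbs(2, 128, CLOCK_GHZ)             # one core
per_module = peak_l1_read_gbs(2, 128, CLOCK_GHZ, cores=2)  # one module, two L1s
whole_chip = peak_l1_read_gbs(2, 128, CLOCK_GHZ, cores=8)  # all eight cores

print(f"per core:   {per_core:.1f} GB/s")
print(f"per module: {per_module:.1f} GB/s")
print(f"eight cores: {whole_chip:.1f} GB/s")
```

The point of the sketch is only that aggregate bandwidth scales with the number of private L1 caches, which is why a test that exercises four cores instead of eight would report roughly half the aggregate figure.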

Moving to L1 write, however, we see a very depressing result: Bulldozer's L1 write speed is way slower than that of any other processor! How can this happen? Remember that Bulldozer's L1 cache uses a write-through design, while the Phenom II L1 cache is a write-back one. A write-through cache immediately propagates each write to the next cache level, L2 in this case. So, while good for simplicity, consistency and safety, write-through caches are generally slower than write-back caches, which have no requirement to immediately propagate writes to the next cache level. In other words, Bulldozer's L1 write speed is going to be limited by L2 write speed, which is exactly what you see above.
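The difference between the two policies is easy to see by counting how many writes reach L2 for the same stream of stores. The following is a deliberately simplified toy model, not a description of either processor: it assumes 64-byte cache lines, assumes the whole working set fits in a write-back L1 (so each dirty line is written out exactly once), and ignores any buffering between L1 and L2. The function names are made up for this sketch.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def l2_writes_write_through(store_addresses):
    """Write-through L1: every store is forwarded to L2."""
    return len(store_addresses)


def l2_writes_write_back(store_addresses):
    """Write-back L1: a dirty line reaches L2 once, when it is evicted.

    Simplification: the working set fits in L1, so the count is just
    the number of distinct lines touched by the stores.
    """
    return len({addr // LINE_SIZE for addr in store_addresses})


# 8-byte stores walking sequentially over a 4 KB buffer.
stores = list(range(0, 4096, 8))

print("stores issued:      ", len(stores))
print("write-through -> L2:", l2_writes_write_through(stores))
print("write-back    -> L2:", l2_writes_write_back(stores))
```

In this toy case the write-through L1 sends eight times as many writes to L2 as the write-back one, which is why its write speed ends up tied to L2's.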

Please keep in mind that this is a worst-case scenario. Aware of the slow L1 write behavior, AMD equipped its new processor with an intermediate, 4 KB Write Coalescing Cache (WCC) that sits between the L1 and L2 caches. The WCC serves to coalesce multiple writes into a single L1-L2 write transaction, improving bus efficiency and overall speed. This means that Bulldozer's real-world L1 write speed can be noticeably better. On the other hand, limiting the WCC to only 4 KB means that workloads with a significant number of writes are not going to be handled at full speed anyway.
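To illustrate why the 4 KB size matters, here is a toy model of a write-coalescing buffer. It is only a sketch under stated assumptions: 64-byte lines, a fully associative buffer with LRU replacement, and one L2 transaction per evicted line. The real WCC's organization and replacement policy are not described in the text above, and the names `wcc_l2_transactions` and `sweep` are hypothetical.

```python
from collections import OrderedDict


def wcc_l2_transactions(stores, wcc_bytes=4096, line_size=64):
    """Count L2 write transactions behind a small coalescing buffer."""
    capacity = wcc_bytes // line_size
    buffer = OrderedDict()  # line number -> None, oldest entry first
    transactions = 0
    for addr in stores:
        line = addr // line_size
        if line in buffer:
            buffer.move_to_end(line)    # store coalesced into a buffered line
            continue
        if len(buffer) == capacity:
            buffer.popitem(last=False)  # evict the LRU line: one L2 write
            transactions += 1
        buffer[line] = None
    return transactions + len(buffer)   # drain what is left at the end


def sweep(working_set_bytes, passes=4, store_bytes=8):
    """Sequential stores over a buffer, repeated `passes` times."""
    return list(range(0, working_set_bytes, store_bytes)) * passes


small = sweep(4096)  # working set fits in the 4 KB buffer
large = sweep(8192)  # working set is twice the buffer size

print(len(small), "stores ->", wcc_l2_transactions(small), "L2 transactions")
print(len(large), "stores ->", wcc_l2_transactions(large), "L2 transactions")
```

In this model, repeated writes to a 4 KB working set collapse into very few L2 transactions, while an 8 KB working set thrashes the buffer and only benefits from coalescing within each line.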

Single-threaded L2 cache reads and writes are more or less on par with Phenom II, and so way slower than Sandy Bridge. However, Bulldozer's L2 cache is probably dual-ported, in which case it can accept two concurrent read / write requests, one from each of the two integer cores. This means that in a multi-threaded environment, Bulldozer is going to read and write L2 data only a little slower than Sandy Bridge processors.

Bulldozer's L3 is based on a classical crossbar design. Read speed is way better than Phenom II but slower than Sandy Bridge, while L3 write speed is comparable to the Phenom II design, which is slow.

As for cache latencies, AMD states that L1 has a 4-cycle latency, while L2 is at about 18-20 cycles (these numbers include the 4-cycle L1 latency). Compared to previous Phenom II processors, L1 latency has grown by 1 cycle, while L2 latency has grown by 6-8 cycles (Phenom II L2 latency was 3+9 cycles).
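The latency deltas follow directly from the figures quoted above, as this small sketch shows. The cycle counts are the ones given in the text; the variable names are made up for illustration.

```python
# Load-to-use latencies in cycles, as quoted in the text.
PHENOM_II_L1 = 3
PHENOM_II_L2 = 3 + 9          # L1 latency plus 9 additional cycles
BULLDOZER_L1 = 4
BULLDOZER_L2 = (18, 20)       # quoted as a range, L1 latency included

l1_increase = BULLDOZER_L1 - PHENOM_II_L1
l2_increase = tuple(cycles - PHENOM_II_L2 for cycles in BULLDOZER_L2)

print("L1 increase:", l1_increase, "cycle")
print("L2 increase:", l2_increase[0], "to", l2_increase[1], "cycles")
```

Keep in mind that these are cycle counts, so the wall-clock penalty also depends on the clock speed each processor actually runs at.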

To summarize: Bulldozer's cache architecture is a mixed bag, showing very competitive read results but also slow write speed (mostly a consequence of its write-through L1) and high latency.

This means that in store-intensive applications, Bulldozer's execution units are going to be held back by the low L1 write bandwidth, especially considering L2's low read speed and high latency.