Speculating on Big Kepler (GK100 / GK110)

With a reasonably accurate estimate of the size of Kepler's basic building blocks, we can now speculate about GK100 / GK110. Below I present some possible scenarios, ordered from the least to the most likely.

1) Simply double up GK104: while GK104's relatively small size might suggest a GK100 / GK110 chip that is simply 2x GK104, this is the least likely scenario. Such a very large chip (~550 mm2) would be overkill for graphics (256 TMUs, 64 ROPs, etc.) and weak in the compute department (1/24th FP64 rate, low per-ALU load/store/L1 bandwidth). To build a GP-GPU capable product, Nvidia's engineers would have to reintroduce ganged 64-bit ALU operations, further increasing die size, with the final chip area probably exceeding 600 mm2. The combination of a very large chip and a low FP64 rate makes this choice almost impossible. Moreover, such a complex chip would throw TDP out of the window.

2) Double up GK104 shader resources but increase ROPs and memory bus by only 50%: while improbable, this is already a more reasonable choice. With 48 ROPs and a 384-bit memory bus, final chip size should be around 490 mm2. However, the low FP64 rate and low per-ALU load/store/L1 bandwidth would continue to hamper GP-GPU workloads, so Tesla chips would have to differ somehow (e.g. by reintroducing ganged 64-bit ALU operations), with a final die size closer to 550 mm2. Heat would remain a very big problem. On the other hand, such a graphics processor would be a monster for graphics tasks ;)

3) Double up the GPC count (from 4 to 8) while decreasing per-SMX SPs and TMUs to 128 and 8 respectively, and increase ROPs and memory bus by 50%: in my view, this is the most likely scenario. Reducing the per-SMX SP count by 33% (from 192 to 128) would increase per-ALU load/store/L1 bandwidth by 50%, bringing it to 0.5 B/FLOP and 1 B/FLOP for 32-bit and 64-bit operations respectively (compared with the 1 B/FLOP rate of current Fermi-based Tesla, this seems a reasonable choice). It would also benefit the LSU/ALU ratio, raising it to 1/4 from GK104's current 1/6 (in comparison, GF100 has a 1/2 LSU/ALU ratio). At the same time, doubling the GPC count would deliver a chip with 33% more total SPs (2048) and 2x the raster/triangle throughput of GK104 while, compared to GF110, it would ensure a 2x advantage in FLOPs throughput per clock (4x the SPs, offset by Fermi's 2x shader hot clock). All in all, I think this is a very reasonable configuration for GK100 / GK110. Moreover, it aligns perfectly with the GF100/GF104 evolution. Estimated die size, also accounting for ganged 64-bit ALU operations, should stay at around 500 mm2.
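The B/FLOP figures in the third scenario can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch under assumptions the text does not state explicitly: each LSU moves 4 bytes per clock, each ALU retires one FMA (2 FLOPs) per clock, an SMX keeps 32 LSUs (16 on a Fermi SM), and FP64 runs at half the FP32 rate. The function and parameter names are hypothetical.

```python
# Per-SM load/store bandwidth per FLOP, per clock.
# ASSUMPTIONS (not from the article): 4 bytes per LSU per clock,
# 2 FLOPs (one FMA) per ALU per clock, 32 LSUs per SMX, half-rate FP64.

def bytes_per_flop(alus, lsus=32, bytes_per_lsu=4, flops_per_alu=2, rate=1.0):
    """Return load/store bytes available per FLOP for one SM/SMX.

    rate scales ALU throughput, e.g. 0.5 for half-rate FP64.
    """
    bytes_per_clock = lsus * bytes_per_lsu
    flops_per_clock = alus * flops_per_alu * rate
    return bytes_per_clock / flops_per_clock

gk104_fp32 = bytes_per_flop(192)            # GK104 SMX: 192 SPs
big_fp32 = bytes_per_flop(128)              # speculated SMX: 128 SPs
big_fp64 = bytes_per_flop(128, rate=0.5)    # assumed half-rate FP64
fermi_fp32 = bytes_per_flop(32, lsus=16)    # Fermi SM: 32 SPs, 16 LSUs

print(f"GK104 FP32:      {gk104_fp32:.2f} B/FLOP")
print(f"Scenario 3 FP32: {big_fp32:.2f} B/FLOP")
print(f"Scenario 3 FP64: {big_fp64:.2f} B/FLOP")
print(f"Fermi FP32:      {fermi_fp32:.2f} B/FLOP")
print(f"Gain over GK104: {big_fp32 / gk104_fp32:.2f}x")
```

Under these assumptions, going from 192 to 128 SPs per SMX gives the 50% per-ALU bandwidth gain mentioned above, and the speculated FP64 figure lands on the same 1 B/FLOP as Fermi's FP32 figure.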

The following table recaps this latest GK100 / GK110 incarnation:

| Resource type | Value | Compared to GK104 | Compared to GF110 |
|---|---|---|---|
| GPCs | 8 | 2x | 2x |
| SMXs | 16 | 2x | 1x |
| SPs per SMX | 128 | 0.67x | 4x |
| TMUs per SMX | 8 | 0.5x | 2x |
| Total SPs | 2048 | 1.33x | 4x |
| Total TMUs | 128 | 1x | 2x |
| ROPs | 48 | 1.5x | 1x |
| Memory bus width | 384-bit | 1.5x | 1x |
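
The ratios in the table can be cross-checked with a few lines of Python. This is a minimal sketch: the GK104 and GF110 figures below are the ones implied by the table's own ratios (GF110's SMs are counted in the SMX row), and the dictionary keys and helper names are hypothetical.

```python
# Cross-check of the comparison table. Totals are derived from
# per-SM figures rather than typed in, so a mismatch would show up.

def config(gpcs, sms, sps_per_sm, tmus_per_sm, rops, bus_bits):
    """Build one chip configuration, deriving total SPs and TMUs."""
    return {
        "GPCs": gpcs,
        "SMXs": sms,
        "SPs per SMX": sps_per_sm,
        "TMUs per SMX": tmus_per_sm,
        "Total SPs": sms * sps_per_sm,
        "Total TMUs": sms * tmus_per_sm,
        "ROPs": rops,
        "Memory bus width": bus_bits,
    }

def ratios(new, old):
    """Ratio of each resource in `new` relative to `old`, rounded to 2 places."""
    return {key: round(new[key] / old[key], 2) for key in new}

scenario3 = config(gpcs=8, sms=16, sps_per_sm=128, tmus_per_sm=8, rops=48, bus_bits=384)
gk104 = config(gpcs=4, sms=8, sps_per_sm=192, tmus_per_sm=16, rops=32, bus_bits=256)
gf110 = config(gpcs=4, sms=16, sps_per_sm=32, tmus_per_sm=4, rops=48, bus_bits=384)

vs_gk104 = ratios(scenario3, gk104)
vs_gf110 = ratios(scenario3, gf110)

for key, value in scenario3.items():
    print(f"{key:<18}{value:>6}{vs_gk104[key]:>8}x{vs_gf110[key]:>8}x")
```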