Intel Hyperthreading: a complex core for two threads
Modern x86 processors have a remarkably robust execution core, especially considering that they are commodity hardware. For example, the latest iteration of the Intel Core micro-architecture is a 4-way machine, with 3x ALUs, 2x AGUs and 2x 256-bit FPUs. And do not forget the frequency: we are in the 3.5 GHz range here. This powerful core, while well suited to complex, heavy computation, is likely to be overkill for typical use: its execution resources will often stall, waiting for slower devices (e.g. memory, disks, etc.) to catch up. Moreover, typical code simply does not have enough instruction-level parallelism to fully utilize this little silicon beast.
Hyperthreading, first implemented on Pentium 4-class CPUs, enables the processor to simultaneously track two different instruction streams (threads), masking some system latency and increasing the number of machine instructions available to issue to the execution core. In short, you can think of hyperthreading as a technology that enables one CPU core to present itself as two “virtual” cores, each asking for a compute thread. We speak of “virtual” cores because none of the execution units are replicated: they are simply shared between the two threads.
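On a Linux system, the pairing of “virtual” cores on one physical core can be observed directly. What follows is only a minimal sketch: it assumes the kernel exposes the usual `thread_siblings_list` files under `/sys/devices/system/cpu/`, and the function names `parse_cpu_list` and `physical_core_groups` are illustrative, not part of any library.

```python
import glob


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,4' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def physical_core_groups(
    pattern="/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list",
):
    """Return the distinct groups of logical CPUs that share one physical core.

    The default path is an assumption about a typical Linux sysfs layout.
    """
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as handle:
            groups.add(tuple(parse_cpu_list(handle.read())))
    return sorted(groups)


if __name__ == "__main__":
    for group in physical_core_groups():
        print("logical CPUs sharing one core:", group)
```

With hyperthreading enabled, each group would be expected to contain two logical CPUs; with it disabled or absent, each group contains one.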
A picture illustrates how the core's resources are divided between the two threads:
As you can see, some resources are duplicated (e.g. the ITLB and the register rename logic), some are time-shared (e.g. the decode block), but the great majority are either statically (e.g. the uop queue) or dynamically (e.g. the L2 cache) partitioned. This approach enables the processor to execute instructions from two different threads, raising back-end utilization and aggregate performance. The added silicon real estate is very small:
Intel stated that, on the 0.13 µm P4 core, die size increased by only about 5%. Considering that modern microprocessors pack more “uncore logic” (e.g. L3 cache, multiple memory controllers, etc.) on a single die, we can expect hyperthreading to cost about 2-3% in silicon area today.
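The reasoning behind that estimate can be sketched as simple arithmetic. The sketch assumes that the roughly 5% overhead applies only to the area occupied by the cores and that the uncore logic does not grow at all; the core shares used below are hypothetical values chosen for illustration, not measured figures.

```python
def die_overhead(core_overhead, core_share_of_die):
    """Fraction of the whole die added by a feature that enlarges only the cores."""
    return core_overhead * core_share_of_die


if __name__ == "__main__":
    # Hypothetical shares of the die occupied by the cores themselves.
    for share in (0.4, 0.5, 0.6):
        overhead = die_overhead(0.05, share)
        print(f"cores = {share:.0%} of die -> overhead = {overhead:.1%} of die")
```

Under these assumptions, cores that occupy 40-60% of the die give a total overhead of 2-3%, which matches the estimate above.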
Considering that this technology can improve aggregate performance by about 10-20% (with an observed maximum of ~30-40%), it is a very good trade-off between speed and die space. Moreover, when only a single thread is executing, it has access to all of the core's resources.
However, it also has some drawbacks: as some resources are shared between the two threads, per-thread performance can be significantly impaired, sometimes to the point where aggregate performance is lower than in the single-threaded case. For example, sharing the uop queue and L2 cache can lead to uop/cache thrashing, with bad consequences for speed [1]. Other times, it simply fails to produce any noticeable speedup [2].
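The gap between aggregate and per-thread performance is easy to underestimate, so here is a small worked example. It is a simplification that assumes two identical threads sharing the core evenly; the function name `per_thread_speed` is illustrative.

```python
def per_thread_speed(aggregate_gain, threads=2):
    """Average speed of each thread, relative to one thread running alone on the core.

    aggregate_gain is the change in total throughput, e.g. 0.20 for +20%.
    """
    return (1.0 + aggregate_gain) / threads


if __name__ == "__main__":
    # Gains taken from the ranges quoted in the text, plus a loss and a no-gain case.
    for gain in (-0.10, 0.0, 0.10, 0.20, 0.40):
        speed = per_thread_speed(gain)
        print(f"aggregate {gain:+.0%} -> each thread runs at {speed:.0%} of solo speed")
```

In other words, even with a healthy 20% aggregate gain, each of the two threads runs at about 60% of the speed it would reach alone on the core, which matters for latency-sensitive work.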
So, in a nutshell: hyperthreading tries to increase aggregate system performance by concurrently supplying two threads to a single, heavyweight execution core. This means that the two threads effectively share the same execution units.