SSE2 – or SIMD, if you prefer

In contrast to X86-64, which “simply” extends the number and the size of the available registers, the SSE / SSE2 extensions introduce not only some new 128 bit registers, but also a new programming paradigm: the SIMD (single instruction, multiple data) paradigm. What does this mean? The key is to understand that a single operation (e.g. an add) can be applied to several data elements at once.

For example, think of an application that has to XOR two 2-element integer arrays, called A and B, where each array element is a 64 bit integer: normally, it has to compute A0 XOR B0 and A1 XOR B1, for a total of 2 XOR operations. Using the SIMD paradigm, instead, you can “pack” the two 64 bit values of A into a single 128 bit register, pack the two 64 bit values of B into another 128 bit register, and then execute a single XOR operation between these two 128 bit registers. In this manner you issue only a single XOR operation, but it processes both element pairs (four array elements instead of two) at once.

Hopefully, the next picture will help you understand the point:

SIMD / SSE2 principle

In short: a single XOR (or, in SSE2 terms, PXOR) operation gives us two useful results.
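The A / B example above can be sketched in C with the SSE2 intrinsics from <emmintrin.h>. This is only a minimal sketch: the function name xor_pair_u64 is made up for illustration, and the `__SSE2__` check assumes a GCC or Clang style compiler that predefines this macro when SSE2 code generation is enabled. On other setups the plain scalar fallback is compiled instead.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] XOR b[i] for two 64 bit elements. */
void xor_pair_u64(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
#if defined(__SSE2__)
    /* Pack A0, A1 into one 128 bit register and B0, B1 into another. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* One PXOR produces both 64 bit results. */
    __m128i vr = _mm_xor_si128(va, vb);
    /* Unpack the two results back to memory. */
    _mm_storeu_si128((__m128i *)out, vr);
#else
    /* Scalar fallback: two separate XOR operations. */
    out[0] = a[0] ^ b[0];
    out[1] = a[1] ^ b[1];
#endif
}
```

Calling xor_pair_u64(A, B, R) fills R[0] and R[1] in one go. The unaligned load / store variants are used here so that the arrays do not need any special alignment.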

It is worth noting that a 128 bit SSE register can also be split into smaller parts: for example, you can pack 4 x 32 bit or 8 x 16 bit variables into a 128 bit register. In the latter case (8 x 16 bit elements), the hypothetical XOR operation will give you 8 (16 bit wide) useful results at once.
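As a rough illustration of the 8 x 16 bit case, the sketch below processes eight 16 bit elements per instruction. The helper names xor8_u16 and add8_u16 are hypothetical, and the `__SSE2__` guard is again an assumption about a GCC or Clang style compiler. Note that for a bitwise operation like XOR the element width does not matter to the hardware, while for an add it does: `_mm_add_epi16` keeps each 16 bit lane separate, so a carry never leaks into the neighbouring element.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: XOR of eight 16 bit elements. */
void xor8_u16(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));  /* PXOR */
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] ^ b[i]);
#endif
}

/* Hypothetical helper: wrapping add of eight 16 bit elements. */
void add8_u16(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PADDW: eight independent 16 bit adds, no carry between lanes. */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
#endif
}
```

With a[0] = 65535 and b[0] = 1, add8_u16 leaves out[0] at 0 and does not touch the sum stored in out[1], which is exactly what "8 separate results" means in practice.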

The potential performance advantages should be clear here (up to 2X for 64 bit values, 4X for 32 bit values and 8X for 16 bit values), but there are some pitfalls:

  • the required variable packing / unpacking can cost many clock cycles, to the point that it can outweigh the speedup from the SSE2 parallel calculations

  • to use these SSE2 extensions effectively, your code should be vectorizable, in the sense that it must perform the same identical operation on a large set of packable data. While this can generally be done for code that operates on data streams (e.g. images, vectors, videos, etc.), it is nearly impossible for general purpose code. This means that on general code the SSE2 instructions are often of no help.
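To give an idea of what "vectorizable" code looks like, here is a minimal sketch of a data stream loop: the same XOR is applied to every byte of two buffers, 16 bytes per iteration, with a scalar loop for the leftover bytes. The function name xor_stream is hypothetical and the `__SSE2__` guard assumes a GCC or Clang style compiler. The load and store calls are the "packing / unpacking" step mentioned in the first pitfall: here they are cheap because the data is already contiguous in memory, but gathering scattered values into a register would cost much more.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] XOR b[i] for n bytes. */
void xor_stream(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: 16 bytes (one 128 bit register) per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(va, vb));
    }
#endif
    /* Scalar tail: the bytes that do not fill a whole register
       (or all of them, when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] ^ b[i]);
}
```

For a 37 byte buffer, for example, the main loop handles the first 32 bytes in two iterations and the tail loop handles the remaining 5.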

Some other things to note:

    • SSE2 was preceded by SSE, but the latter was limited to 4 x 32 bit floating point variables (single precision processing) and was somewhat limited in its flexibility. SSE2 added many more instructions (including double precision floating point and integer operations on the 128 bit registers) and more flexibility, to the point that it can replace many old X87 floating point functions (something that cannot be said about SSE);

    • while the original X86 specification for SSE2 defines 8 x 128 bit registers called XMM0-7, the X86-64 extensions added another 8 x 128 bit registers, for a total of 16 x 128 bit registers (XMM0-15). This means that X86-64 processors with SSE2 capabilities should be less prone to register pressure problems on the XMM registers;

    • while only some 32 bit X86 processors have SSE2 (most notably the Pentium 4), all X86-64 processors have SSE2 capabilities. This means that from the high performance Intel Core i7 / AMD Phenom II processors to the low power, low performance VIA Nano / Intel Atom, all 64 bit processors support the SSE2 extensions. So much so that, when targeting X86-64 processors, the GCC compiler by default uses SSE2 instructions instead of X87 floating point instructions for floating point code.
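If you want to check what your own compiler does, the following small sketch can help. It assumes a GCC or Clang style compiler, which predefines the `__SSE2__` macro when SSE2 code generation is enabled, and a C99 <float.h> that provides `FLT_EVAL_METHOD`. The two function names are made up for illustration.

```c
#include <float.h>

/* Hypothetical helper: 1 if the compiler was allowed to emit SSE2 code. */
int built_with_sse2(void)
{
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: how floating point expressions are evaluated.
   0 means "in the precision of their own type" (typical of SSE2 math),
   2 means "in long double precision" (typical of X87 math). */
int float_eval_method(void)
{
    return FLT_EVAL_METHOD;
}
```

On a typical X86-64 build with GCC you should expect built_with_sse2() to return 1 and float_eval_method() to return 0, while a 32 bit X87 build normally reports 2 for the latter.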