DES performance on John the Ripper

The standard, vanilla version of John the Ripper lets us test processor performance using six encryption/authentication methods: four versions of DES (Traditional DES, BSDI DES, Kerberos AFS DES and LM DES), FreeBSD's MD5 and OpenBSD's Blowfish.

Since this article focuses on X86-64 and SSE2 performance across different processor generations, we had to concentrate on the implementations and protocols that actively use these features. So, I decided to analyze Traditional DES, BSDI DES and LM DES performance. Before we look at the numbers, let me repeat a little disclaimer: the benchmarks published here have little or nothing to do with everyday usage, such as office and Internet applications. So, don't be fooled into thinking that X86-64 and SSE2 can always give you the sort of speedup that the graphs below show.

Let's start with the Traditional DES numbers. Here, the bars are absolute performance, while the two lines are the relative performance improvements when switching from X86 to X86-64 (green line) and from X86/X86-64 to SSE2 code (yellow line). The 64 bit capable processors were tested with both a 32 bit and a 64 bit OS/application.
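To make the meaning of the two lines concrete, here is a minimal sketch of how such relative improvements can be computed from raw benchmark scores. The function name and the scores are placeholders chosen for illustration (they are not readings from the graphs); the only assumption is that each build reports a throughput figure where higher is better.

```python
def relative_speedup(baseline_score, improved_score):
    """Return how many times faster the improved build is than the baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return improved_score / baseline_score


# Placeholder throughput scores (higher is better). NOT measured values.
scores = {"x86": 1_000_000, "x86-64": 3_000_000, "sse2": 6_000_000}

# Green line: gain from moving plain X86 code to X86-64.
green_line = relative_speedup(scores["x86"], scores["x86-64"])

# Yellow line: gain from SSE2, measured against whichever plain build
# is running (32 bit or 64 bit).
yellow_vs_x86 = relative_speedup(scores["x86"], scores["sse2"])
yellow_vs_x86_64 = relative_speedup(scores["x86-64"], scores["sse2"])

print(f"X86 -> X86-64: {green_line:.1f}X")
print(f"X86 -> SSE2:   {yellow_vs_x86:.1f}X")
print(f"X86-64 -> SSE2: {yellow_vs_x86_64:.1f}X")
```

Note that the same SSE2 score yields two different yellow-line values, depending on which baseline it is divided by.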

Traditional DES benchmark

So, what can we see? Looking at plain, 32 bit X86 code, the P4 does not seem so bad, but keep in mind that it is clocked at 3.0 GHz, while the Core2 (which had the same results) is clocked at only 2.0 GHz. The Core i7 processor is well ahead, thanks to its improved design and clock speed.

Switching to a 64 bit OS/application gives us a very interesting picture: Core2 and Core i7 both run about 3X as fast as they do with the 32 bit version. Why do we see such great gains? Evidently, John the Ripper uses large data types that benefit greatly from the wider registers.

Examining the SSE2 line, we can draw a similar conclusion: in the 32 bit tests, SSE2 gives a relative improvement starting at 3X for the P4 and reaching an impressive >5.5X on the Core2/i7 processors. On the other hand, when compared to X86-64 code, it gives us “only” a 2X advantage.

What does this mean? Well, think about register width: the green line shows us that by using 64 bit instead of 32 bit registers, we get a 3X boost. So, if we use the very large 128 bit SSE2 registers instead of 32 bit registers, the total performance gain should be phenomenal, and it is. When comparing the SSE2 version against the X86-64 version we see “only” a 2X boost because the registers are just 2X as wide: the application uses SSE2 to process 2 x 64 bit pieces of data at once.
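The register-width argument can be sketched with a small idealized model. It assumes a bitslice-style layout, which is how John the Ripper's DES code is usually described: each bit position in a register belongs to a different candidate password, so one logical instruction works on as many candidates as the register has bits. The function name and the batch size of 128 are arbitrary choices for the illustration.

```python
import math


def ops_per_gate(candidates, register_bits):
    """Register-wide logical operations needed to apply one gate
    (AND, OR, XOR...) to every candidate, one candidate per bit lane."""
    return math.ceil(candidates / register_bits)


CANDIDATES = 128  # arbitrary batch size for the illustration

# Operations needed per gate for each register width.
ideal = {width: ops_per_gate(CANDIDATES, width) for width in (32, 64, 128)}

# Idealized speedups: fewer operations for the same amount of work.
speedup_64_vs_32 = ideal[32] / ideal[64]
speedup_128_vs_64 = ideal[64] / ideal[128]
speedup_128_vs_32 = ideal[32] / ideal[128]

for width, ops in ideal.items():
    print(f"{width:3d} bit registers: {ops} op(s) per gate")
```

This pure-width model predicts 2X for each doubling, which matches the SSE2 vs X86-64 result. The measured X86 to X86-64 gain is larger than 2X, so something beyond width must help there; a plausible contributor (my assumption, not something these benchmarks isolate) is that 64 bit mode also doubles the number of general purpose registers.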

One last thing to note: why do the SSE2 extensions give us “only” a 3X boost on the P4, while on Core2/i7 we see a larger speedup? The point is that the P4 has a slower SSE2 implementation: it internally breaks a single 128 bit operation into 2 x 64 bit operations, while the other processors can operate on the full 128 bit register at once.
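The difference between the two approaches can be illustrated with a toy model that follows the description above. Python integers stand in for registers, and the "internal operation" count is a simplification of my own: real hardware scheduling is far more involved, so treat this only as a picture of why the split costs throughput without changing the result.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def xor128_full(a, b):
    """128 bit XOR done in one step, as on a full-width SSE2 unit.
    Returns (result, internal operations used)."""
    return (a ^ b) & MASK128, 1


def xor128_split(a, b):
    """The same 128 bit XOR done as two 64 bit halves, as described for the P4.
    Returns (result, internal operations used)."""
    low = (a & MASK64) ^ (b & MASK64)                    # first 64 bit operation
    high = ((a >> 64) & MASK64) ^ ((b >> 64) & MASK64)   # second 64 bit operation
    return (high << 64) | low, 2


a = 0x0123456789ABCDEF_FEDCBA9876543210
b = 0xFFFF0000FFFF0000_0000FFFF0000FFFF

full_result, full_ops = xor128_full(a, b)
split_result, split_ops = xor128_split(a, b)

print("same result:", full_result == split_result)
print("internal ops:", full_ops, "vs", split_ops)
```

Both paths produce the same 128 bit value; the split path simply needs twice as many internal steps to get there.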

So, these results are quite interesting, but are they confined to the Traditional DES test? Let's see the results for BSDI DES:

BSDI DES benchmark

In this benchmark, we have a very similar picture. The only significant thing to note is that the P4 is now slower in the simple X86 test as well, with the Core2 performing 2X better. This is a testament to the great IPC (instructions per clock) rate of Core2 processors: while clocked at only 2.0 GHz, it beats a 3.0 GHz Pentium 4 hands down.
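To put a rough number on that per-clock advantage, we can normalize the speed ratio by the clock ratio. The sketch below uses only the approximate figures quoted in the text (parity in Traditional DES, 2X in BSDI DES, 2.0 GHz vs 3.0 GHz); the function name is mine, and per-clock throughput on one benchmark is only a loose stand-in for IPC.

```python
def per_clock_ratio(speed_ratio, clock_a_ghz, clock_b_ghz):
    """Per-clock advantage of processor A over processor B.

    speed_ratio is A's raw benchmark score divided by B's score.
    """
    return speed_ratio * clock_b_ghz / clock_a_ghz


CORE2_GHZ = 2.0
P4_GHZ = 3.0

# Traditional DES, plain X86: Core2 and P4 scored about the same (ratio 1.0).
trad_des = per_clock_ratio(1.0, CORE2_GHZ, P4_GHZ)

# BSDI DES, plain X86: Core2 scored about 2X the P4.
bsdi_des = per_clock_ratio(2.0, CORE2_GHZ, P4_GHZ)

print(f"Traditional DES: Core2 does {trad_des:.1f}X the work per clock")
print(f"BSDI DES:        Core2 does {bsdi_des:.1f}X the work per clock")
```

So even where the raw scores are equal, the Core2 is doing about 1.5X the work per clock cycle, and in BSDI DES about 3X.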

Let's see the LM DES benchmark now:

LM DES benchmark

This time we have some differences: while the switch from X86 to X86-64 shows the usual gains (proving that the LM DES test uses large data types), the use of SSE2 brings smaller than expected gains. This means that the LM DES code is less vectorizable than the other DES implementations examined. In other words, it benefits little from parallel SIMD computing, while it benefits greatly from wider registers. This is also why SSE2 still performs very well against plain X86 (with its 32 bit only registers): most of that gain comes from the extra register width rather than from SIMD parallelism.

So, by comparing X86 vs X86-64 and SSE2 enabled code vs non-SSE2 code, we learned a few interesting things about the maximum performance gains enabled by these extensions. It is also quite interesting to read the same results with an eye to comparing the different processor microarchitectures: NetBurst (P4), Core (Core2) and Nehalem (Core i7).