Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

128-bit Encryption using Intel SSE2 Extensions

rmauldin
Beginner
Anybody need some source code to encrypt basic text messages using Intel SSE2 extensions? Here we use the 128-bit XOR intrinsic _mm_xor_si128() to obtain higher performance. You must create an input.txt file in Notepad or another ASCII editor and store it in the same directory as the compiled program. Tell me how it works for you. And for those of you who are beginners and don't understand how the XOR is encrypting the data, here is a sample data pattern:
    11110010   char to encrypt
xor 01011010   key
    10101000   encrypted char
xor 01011010   key
    11110010   decrypted char

I attached the source code for anyone interested, since it didn't look right pasted into this HTML window.
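For anyone who can't grab the attachment, here is a rough sketch of the idea (the function and buffer names are made up and it assumes the input length is padded to a multiple of 16 bytes; the attached program may differ in details):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// XOR "encryption" with a 128-bit key, 16 bytes at a time. XOR is its own
// inverse, so running this twice with the same key decrypts. Assumes len is
// a multiple of 16; names are illustrative, not from the attached file.
void xor_crypt_sse2(unsigned char *buf, std::size_t len, const unsigned char key[16])
{
    __m128i k = _mm_loadu_si128(reinterpret_cast<const __m128i *>(key));
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf + i));
        block = _mm_xor_si128(block, k);    // the 128-bit XOR the post is about
        _mm_storeu_si128(reinterpret_cast<__m128i *>(buf + i), block);
    }
}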

Intel_C_Intel
Employee

Dear Ryan,

Thank you for sharing your code with the community. Please allow me to make one comment, however, because I find it important that our readers realize that in many cases you don't have to resort to intrinsics or inline assembly programming. If you had expressed both your encryption and decryption loops (which are the same) simply as:

for (k = 0; k < 16; k++) {
    result[k] = c1[k] ^ Key128[k];
}

then automatic vectorization would have generated exactly the same code and execution times (in fact, I believe this application is too I/O bound for the computation to contribute much to the overall time).
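Spelled out as a compilable routine over a whole buffer (the names and the multiple-of-16 length assumption are illustrative, not taken from the attached source), the scalar version might look like:

#include <cstddef>

// Plain byte-wise XOR over a buffer, no intrinsics; the inner loop is the one
// the compiler can report as vectorized. Assumes len is a multiple of 16.
void xor_buffer(unsigned char *out, const unsigned char *in, std::size_t len,
                const unsigned char key[16])
{
    for (std::size_t i = 0; i < len; i += 16) {
        for (int k = 0; k < 16; k++) {
            out[i + k] = in[i + k] ^ key[k];
        }
    }
}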

=> icl -QxP vec.cpp
vec.cpp(85) : (col. 1) remark: LOOP WAS VECTORIZED.
vec.cpp(144) : (col. 1) remark: LOOP WAS VECTORIZED.

Your timing on a sample file:

Starting Encryption of 'Input.txt'...
Encryption Time = 3 seconds
Starting Decryption of 'Encrypted.txt'...
Decryption Time = 5 seconds

Vectorized timing on the same sample file:

Starting Encryption of 'Input.txt'...
Encryption Time = 3 seconds
Starting Decryption of 'Encrypted.txt'...
Decryption Time = 5 seconds

Hope you find this insightful.
Aart Bik
http://www.aartbik.com/

rmauldin
Beginner

Thanks for the reply. That is a great response, Bik. I had also created another version of the program that just XORs char by char and was comparing the results. I just needed both programs to process a lot of data to see a large difference in run times. And you are correct: when using the Intel C++ compiler, both programs had the same runtimes because of automatic vectorization. However, when compiled with the Microsoft Visual Studio 2003 C++ compiler, with the exact same libraries, the program ran in around 23 seconds. I was just proving the effectiveness of SSE/SSE2 intrinsics and why they are a very necessary part of the Intel architecture.

Another interesting compilation was with the Microsoft Visual Studio 2005 Beta 2 C++ compiler. It reduced the code size from 280 KB down to 52 KB and ran in 4 seconds. I assume that the SSE/SSE2 intrinsics were taken advantage of and that the I/O routines were a lot faster and better optimized. Also, I compiled the same program with the Beta 2 C++ compiler on an equivalent AMD Athlon processor and it ran in about 40 seconds, due to no use of optimized intrinsics and no XMM registers to do quick, large computations.

I was pretty impressed with the Intel hardware after finishing the project, although I know the only intrinsic getting used was the XOR calculation. It is still a pretty cool way to benchmark the hardware with a real-world program of my own, since I didn't hand-write assembly or optimize my own I/O libraries. I was very impressed by the difference in run times, though. It just shows how valuable DSP-style features are inside Intel's hardware.

I had also talked with Don Campbell from VistaScape in Atlanta, GA. I asked whether they preferred Intel hardware for processing all of the security camera data in their security systems. He said that they only use Intel, because of the DSP features in the hardware that speed up computation. He also mentioned offloading many of the computations onto the GPU on the graphics card, in parallel with what is happening on the main CPU, because of its ability to do vectorized calculations.
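As a side note, the times quoted in this thread are whole seconds from time.h; a sketch of that kind of harness (encrypt_file() is a hypothetical stand-in for whichever version, intrinsic or char-by-char, is being measured) might be:

#include <cstdio>
#include <ctime>

// Whole-second wall-clock timing, as reported in this thread.
// encrypt_file() is a hypothetical placeholder for the routine under test.
void time_run(void (*encrypt_file)(const char *in, const char *out))
{
    std::time_t start = std::time(0);
    encrypt_file("Input.txt", "Encrypted.txt");
    std::time_t end = std::time(0);
    std::printf("Encryption Time = %ld seconds\n", static_cast<long>(end - start));
}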

Thanks again Bik,
Ryan Mauldin

Intel_C_Intel
Employee
> Athlon processor and it ran in about 40 seconds. Due to no use of optimized intrinsics and no XMM registers to do quick large computation. I was pretty impressed with the Intel hardware after finishing ...

To be fair, though: the Intel C++ compiler of course produces code that is optimized specifically for Intel's own processors and does not really care about other x86 implementations.

The P4 does a very good job executing DSP-style code written especially for this platform with deep knowledge of its internals. On general code (large code size, many conditional branches), however, it doesn't do that well, which is why Intel will drop the NetBurst-powered architecture over the next 1-2 years.

Regards, Clemens
rmauldin
Beginner

Not trying to be unfair... again, on the compilation using Visual Studio 2005 Beta 2 (not the Intel compiler), the program compiles and runs in 4 seconds on the Intel machine, and with the same compiler it runs 10x slower on my AMD machine from Alienware, with the same speed RAM and the same 7200 RPM hard drive. It is very hard to judge the difference between AMD and Intel, since they target the same instruction set but are very different architectures. Obviously part of the explanation is the compiler's ability to vectorize automatically on Intel hardware. I did not adjust any compilation settings and used the default settings on every compiler. Try it yourself...

Thanks, Ryan Mauldin

rmauldin
Beginner

Also, another thought comes to mind... my AMD machine may be faster at a lot of everyday tasks I can think of offhand, like opening and editing documents in Microsoft Word (sometimes I wait around for those on the Intel machine). When it comes to throughput, however, the Intel machine blows it away, especially on ripping CDs and any heavy file system usage. I'm not saying that one machine is purely better than the other. The next machine I buy, laptop or desktop, will be an Intel though.

Ryan

Intel_C_Intel
Employee
That's really strange; are both running at the same clock speed?

I am currently writing DSP code for image analysis (glyph recognition and things like that, you know), and my Duron-800 (64 KB L2 cache) performs at about 0.75x per MHz compared to the Intel machine. That is despite the fact that the code on the P4 is optimized for the P4 while on the Duron it is generic P2 code; when the Northwood P4 executes that same generic code, the Duron is actually even a bit faster per MHz.
But you're right, it's possible to achieve quite high throughput when writing code that respects the weaknesses of the NetBurst architecture, and that's what we do - we use tons of CeleronD boxes *g*

Furthermore, typical benchmarks for encoding workloads show Intel ahead one time and AMD ahead the next - maybe something in your setup is off, since I doubt any company could sell a product that is 10 times slower than the competing product.

However, who is faster is such a typical flame topic, so peace on earth ;)

Regards, Clemens
rmauldin
Beginner

Compiler optimization is a beast, isn't it.

Unimportant section:
I really have more programs installed and running in the background on the Intel machine, since it is the machine I do most of my work on. And yes, the AMD Athlon machine is slower at 1.8 GHz: 2.5/1.8 means the Intel machine is about 1.3889x faster in clock rate alone. I'm really not positive about the number of instructions being executed on either machine, nor am I sure about the time spent solidly executing the code on either one; all I have is a rounded integer number of seconds provided by time.h. Nor am I positive about the number of pipeline stages on either processor, or whether the Microsoft compiler handled branch prediction and delay slots better on the Intel hardware than on the AMD.

Given that the Intel machine was purchased a year after the AMD machine, I will try to make this more fair. Let's bring in Moore's law and say that over the 12 months I could have improved this AMD machine at (1.03)^n. So we are really looking at (1.03)^12 -> 1.426, and 1.426/1.3889 => a hypothetically improved AMD machine about 1.0267x ahead of the Intel machine's clock advantage after 12 months. Now, using the Visual Studio .NET 2005 Beta 2 test times (Intel machine: 4 seconds for an encode and 4 seconds for a decode; AMD: 37 seconds for an encode and 39 seconds for a decode), I will take 37 seconds as AMD's shortest time and apply the improvement: 37 sec * (1/1.426) = 25.9 seconds. So, with Intel still at 4 seconds compared to AMD's adjusted 25.9 seconds, I still see Intel getting the job done on only 4 MB of data about 6.486x faster.

Of course there is always the argument about who brought in the NetBurst architecture that slows the machine down, or who has 3DNow! with its speed improvements, and so on. Bottom line: only on this XOR test did Intel win. I have never said that Intel will win at all tasks or instructions, nor do I even know what hardware optimizations were made between these two computers given their different ages, and I am by no means a professional at benchmarking. I would assume both machines have improved at industry rates to be able to compete. Arguing about which hardware is better was never my intention. I think the real difference in speed came from Microsoft's ability to optimize for the Intel machine and not the AMD machine; hundreds of factors, including time, play a huge role in compiler optimizations. I think I've spent way too much time on this. I could be wrong, I could be right.
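The arithmetic above can be reproduced in a few lines (the inputs are the figures quoted in this thread, not new measurements; small differences from the quoted 25.9 s and 6.486x come from rounding the yearly factor):

#include <cmath>
#include <cstdio>

// Back-of-the-envelope check of the numbers in this post: 2.5 GHz vs 1.8 GHz,
// 37 s AMD encode, 4 s Intel encode, ~3% improvement per month.
int main()
{
    double clock_ratio   = 2.5 / 1.8;             // ~1.3889x clock advantage
    double yearly_factor = std::pow(1.03, 12.0);  // ~1.426 after 12 months at ~3%/month
    double adjusted_amd  = 37.0 / yearly_factor;  // adjusted AMD encode time, ~26 s
    std::printf("clock ratio     = %.4f\n", clock_ratio);
    std::printf("yearly factor   = %.4f\n", yearly_factor);
    std::printf("adjusted AMD    = %.1f s\n", adjusted_amd);
    std::printf("Intel advantage = %.2fx\n", adjusted_amd / 4.0);  // ~6.5x
    return 0;
}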

Important Section:
I had originally just wanted to add AMD into the mix to show the 32-second time gap between the two architectures with the Microsoft Beta compiler, but as it ends up, I'm sure it's just that Microsoft doesn't optimize for the AMD processor as well as it does for the Intel, or maybe they hadn't even had time to in this release. Also, again, the Microsoft Visual Studio 2003 C++ compiler ran in around 27 seconds on the Intel architecture, versus 4 seconds with the Beta 2 compiler on the same machine. The Intel compiler gives a 9-second decode on my Intel machine. What a huge difference on the same stinkin' machine. Would you rather wait around 27 seconds for 4 MB of data to be encoded by your crappy program, or 4 seconds, and let the compiler optimize your code? Enough said. Forget I ever mentioned AMD, please.

Thanks for giving me the chance to clear this up,
Ryan

rmauldin
Beginner

>> Moore's law and say that over the 12 months I could have improved this AMD machine at (1.03)^n...

Just to make things a little clearer here: I am saying that the industry as a whole improves at about 3% per month, where n is the number of months. The 1.03 is rounded; to get the exact monthly factor we need the 24th root of 2, i.e. 2^(1/24).

So, when Moore's Law states that in 2 years the industry will double (specifically in transistor count on a chip) and we want the increase per month, we are looking for a number X that we can raise to the nth power, where n is the number of months. Since 24 months gives a 2x increase, the monthly rate must satisfy X^24 = 2. With some basic algebra we get the following...

X^24 = 2
24 ln(X) = ln(2)
ln(X) = ln(2) / 24
X = e^(ln(2)/24) = 2^(1/24)

X = 1.029302237, which rounds to 1.03, and indeed
(1.029302237)^24 = 2 (approximately).

So you can see why I used this calculation to get a monthly rate increase of around 3%: multiplying by (1.03)^n, where n is the number of months, gives the amount of improvement to apply. If the AMD's 1.8 GHz processor were improved over a year's time, (1.03)^12 = 1.42576, and 1.42576 * 1.8 GHz = 2.57 GHz (close to the Intel's 2.5 GHz), giving us approximately comparable clock speeds. I then multiplied the AMD's program time by the inverse of 1.42576 to divide out the time the program should take on a processor of the same per-clock efficiency but with the faster clock rate. I just want to make it clear that I'm not making up numbers out of thin air to multiply by.
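Numerically, the derivation above can be checked in a few lines:

#include <cmath>
#include <cstdio>

// Check of the derivation above: X^24 = 2, so X = e^(ln(2)/24) = 2^(1/24).
int main()
{
    double x = std::exp(std::log(2.0) / 24.0);        // 2^(1/24) ~= 1.029302237
    std::printf("X    = %.9f\n", x);
    std::printf("X^24 = %.6f\n", std::pow(x, 24.0));  // ~= 2.0
    std::printf("X^12 = %.5f\n", std::pow(x, 12.0));  // ~= 1.41421 (sqrt(2)); the rounded 1.03 gives ~1.426
    return 0;
}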
