We have integrated with the Intel IPP Crypto library, version 8.2.090. We use it for AES encryption and decryption in CTR mode (for SRTP). Our target platform is a VMware system running on an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, so the application is linked with the single-threaded (st) versions of libippcpv8.so.8.2 (optimized for the Xeon platform) and libippcore.so.8.2.
Note: Our application already implements its own thread pool, so we do not need the multi-threaded (mt) version of the libraries.
We compared the performance of the Intel IPP crypto implementation of 128-bit AES CTR mode encryption and decryption with the aes_icm_ctr implementation in libsrtp (https://github.com/cisco/libsrtp/tree/master/crypto/cipher).
We observe that the IPP crypto implementation is **slower** than the open source implementation. The 128-bit ippsAESEncryptCTR() takes ~125% of the time taken by the corresponding function in the open source libsrtp code base, aes_icm_encrypt_ismacryp(), and the 128-bit ippsAESDecryptCTR() takes ~175% of the time taken by that same function.
Do you have any suggestions on how we can get the Intel IPP crypto library to perform better on our target platform for AES in CTR mode?
Below is how the routines are being invoked and these are the only methods that we are attempting to profile.
#define SRTP_IPP_AES_CTR_BIT_LEN 16
retStatus = ippsAESEncryptCTR(
retStatus = ippsAESDecryptCTR(
Could you provide absolute numbers in CPU clocks per byte? I don't understand what 125% or 175% means here: 1.25x, or 2.25x? I'm also curious why you see different ratios for encryption and decryption, since in CTR mode AES uses the same code for both. Also take into account that IPP functions include mitigations against known kinds of attacks.
ippsAESEncryptCTR() takes 3958 usec to encrypt 32000 bytes of payload data, while ippsAESDecryptCTR() takes 4888 usec to decrypt 32000 bytes of encrypted data. This is on a VMware VM configured to use 1 core of an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
On the same platform, the open source version (the implementation in libsrtp - https://github.com/cisco/libsrtp/tree/master/crypto/cipher) takes 3465 usec to encrypt 32000 bytes of payload and 3867 usec to decrypt 32000 bytes of encrypted data.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping : 4
microcode : 0x416
cpu MHz : 2600.000
cache size : 20480 KB
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc up arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dtherm
bogomips : 5200.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
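The timings above can be converted to the clocks-per-byte figure that was requested. This is a minimal sketch, assuming the nominal 2600 MHz reported in cpuinfo is the clock actually seen inside the VM. Under that assumption the IPP numbers come to roughly 322 (encrypt) and 397 (decrypt) cycles per byte, against roughly 282 and 314 for libsrtp. The function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* cycles/byte = (microseconds * cycles-per-microsecond) / bytes.
 * cpu_mhz is cycles per microsecond (assumed: nominal clock from cpuinfo). */
double cycles_per_byte(double elapsed_usec, double cpu_mhz, double bytes)
{
    assert(bytes > 0.0);
    return elapsed_usec * cpu_mhz / bytes;
}

/* Ratio of two timings; a value above 1.0 means `a` is slower than `b`. */
double time_ratio(double a_usec, double b_usec)
{
    assert(b_usec > 0.0);
    return a_usec / b_usec;
}

/* Prints the four measurements from this thread as cycles per byte. */
void print_cycles_report(void)
{
    const double mhz = 2600.0;   /* from "cpu MHz" in cpuinfo */
    const double bytes = 32000.0;

    printf("IPP encrypt:     %.2f cycles/byte\n", cycles_per_byte(3958.0, mhz, bytes));
    printf("IPP decrypt:     %.2f cycles/byte\n", cycles_per_byte(4888.0, mhz, bytes));
    printf("libsrtp encrypt: %.2f cycles/byte\n", cycles_per_byte(3465.0, mhz, bytes));
    printf("libsrtp decrypt: %.2f cycles/byte\n", cycles_per_byte(3867.0, mhz, bytes));
    printf("encrypt ratio IPP/libsrtp: %.3f\n", time_ratio(3958.0, 3465.0));
    printf("decrypt ratio IPP/libsrtp: %.3f\n", time_ratio(4888.0, 3867.0));
}
```

Calling `print_cycles_report()` from `main` prints all six values. Reporting the absolute cycles-per-byte figure alongside the ratio avoids the ambiguity of a bare percentage.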
Below is output from our performance system (ps_ippcp, which you can find in the "tools" subfolder of the standard installation). As you can see, on a single thread of a 3.5 GHz AVX CPU it takes 14 usec to encrypt/decrypt the 32K vector. Since you have a 2.6 GHz CPU, I would expect 14*3.5/2.6 = ~19-20 usec, not 4000-5000:
CPU Processor supporting Advanced Vector Extensions instruction set 4x3.49 GHz Max cache size 8192 K
OS Linux (2.6.32-279.el6.x86_64 x86_64)
Library ippCP AVX (e9) 8.2.1 (r44077) Oct 9 2014
Start Thu Oct 9 18:57:30 2014
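The clock-scaling estimate above can be written out as a small calculation. This is a rough sketch that assumes run time scales inversely with clock frequency and ignores virtualization overhead and microarchitectural differences. The function names are illustrative only.

```c
#include <assert.h>

/* Scale a reference timing to a different clock frequency, assuming
 * time is inversely proportional to frequency (a simplification). */
double scale_usec_by_clock(double ref_usec, double ref_ghz, double target_ghz)
{
    assert(target_ghz > 0.0);
    return ref_usec * ref_ghz / target_ghz;
}

/* How many times slower a measured timing is than the expected one. */
double slowdown_factor(double measured_usec, double expected_usec)
{
    assert(expected_usec > 0.0);
    return measured_usec / expected_usec;
}
```

For example, `scale_usec_by_clock(14.0, 3.5, 2.6)` gives about 18.8 usec, and `slowdown_factor(3958.0, 18.8)` shows the reported encrypt time is on the order of 200 times the estimate. A gap that large points to something other than the cipher kernel itself.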
What kind of linking do you use, static or dynamic? Which kind of library, single-threaded or multi-threaded? Do you call ippInit() if linking statically? Could you provide the output of the ippcpGetLibVersion() function?