Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Efficiency of ippsSMS4EncryptCBC

huang__zhongqiang
1,496 Views

I was testing the IPP SMS4 functions on an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz.

ippsSMS4EncryptCBC takes 1.3s to encrypt 100MB of data, while ippsSMS4DecryptCBC takes only 0.25s to decrypt the ciphertext.

SMS4 is a symmetric cipher, so why is encryption so much slower than decryption in IPP crypto?

The source file is compiled with gcc, not icc. Does that matter?

0 Kudos
13 Replies
Igor_A_Intel
Employee
1,496 Views

hi,

CBC decryption has no feedback dependency, while CBC encryption does.

This allows several blocks to be decrypted simultaneously.

This property is general for CBC mode.

If one compares AES-CBC encryption and decryption, the general picture looks the same: decryption is several times faster.

 

regards, Igor

0 Kudos
Igor_A_Intel
Employee
1,496 Views

hi zhongqiang,

Which IPP version do you use? (Also: which operating system? Which architecture, ia32 or Intel64? Static or dynamic linking?)

The best reply is to provide the output of ippcpGetLibVersion():

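    #include <stdio.h>
    #include "ippcp.h"  /* IPP Crypto header */
    int main(void) {
        const IppLibraryVersion* lib = ippcpGetLibVersion();  /* version of the linked library */
        printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);
        return 0;
    }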

regards, Igor

0 Kudos
huang__zhongqiang
1,496 Views

Hi, Igor

The output is:

ippCP AVX (e9) 2018.0.1 (r57267) 2018.0.1.57267

My OS is 

Linux algo 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The linking arguments are:

cc -I/opt/intel/ippcp/include -O3   -c -o sm4test.o sm4test.c
cc -I/opt/intel/ippcp/include -O3 -g sm4test.o /opt/intel/ippcp/lib/intel64/libippcp.a  -o test

0 Kudos
Chao_Y_Intel
Moderator
1,496 Views

Hi, 
I just checked the sm4test.c file in "SM4_CBC.7z"; it does not include any IPP calls. Is something missing there?

Also, could you submit a support ticket on our support site: https://www.intel.com/supporttickets? Our support team can then reproduce the issue with your test code and investigate. Here are some steps:  https://software.intel.com/sites/default/files/managed/97/ce/SubmittingS...

Thanks,
Chao

 

0 Kudos
huang__zhongqiang
1,496 Views

Hi, Chao Y,

My support ticket is 03236416.

"SM4_CBC.7z" is the code I downloaded from the Internet and did some slight modifications.

The sample code (my code and ipp code) has been uploaded to the support site. Many thanks. 

 

 

0 Kudos
huang__zhongqiang
1,496 Views

Igor Astakhov (Intel) wrote:

hi,

CBC decryption has no feedback dependency, while CBC encryption does.

This allows several blocks to be decrypted simultaneously.

This property is general for CBC mode.

If one compares AES-CBC encryption and decryption, the general picture looks the same: decryption is several times faster.

regards, Igor

 

Thank you for your reply.

I downloaded SM4 source code from the internet and made some modifications. That code takes 0.88s to encrypt 100MB of data on an Intel Xeon E3-1230.

I would like to use IPP Crypto to optimize SM4, but found that IPP is a lot slower. Is there a high-throughput (> 400MBps on the E3-1230) SM4 encryption in IPP crypto?

 

0 Kudos
Igor_A_Intel
Employee
1,496 Views

hi Zhongqiang,

Best performance is not the only criterion for crypto functionality. In addition to performance, the main criterion is that all IPP crypto functions are safe and mitigated against all known attacks (a cache-timing attack with cache-line-size granularity was published around 2005, and one with 16-bit granularity, MemJam, in 2017). Your implementation is a well-known one that uses big precalculated tables; it is not safe against the first kind of attack or any of the later ones.

Reading from

uint32_t Sbox_final0_rest[256]

uint32_t Sbox_final1_rest[256]

uint32_t Sbox_final2_rest[256]

uint32_t Sbox_final3_rest[256]

directly depends on the round key, and the accesses are not uniform across your tables. Therefore the round key can easily be recovered by a cache-timing attack and, as you know, the secret key and the round keys are mutually reversible. Please take a look at the attached doc.

 

regards, Igor

0 Kudos
huang__zhongqiang
1,496 Views

Hi, Igor

Did you mean that, for security reasons, the non-linear substitution should not be implemented as a fixed lookup table?

However, according to the disassembly of ippsSMS4EncryptCBC, IPP crypto defines SMS4_Sbox (the original S-box table, of type uint32_t [256]).

Sbox_final_res is almost equivalent to SMS4_Sbox, whose reads also depend on the round key, so aren't the IPP crypto functions unsafe as well?

0 Kudos
Sergey_K_Intel4
Employee
1,496 Views

a) Exactly, the best way is to avoid lookup operations.

b) The latest IPP implementation of SM4 uses an S-box (if AES-NI is disabled), but it accesses the SM4 S-box uniformly, so the access pattern does not depend on the particular input index.

c) The IPP implementation of SM4 does not contain large S-boxes. It uses the "standard" short 256-byte SM4 S-box:

const __ALIGN64 Ipp8u SMS4_Sbox[16*16] = {
   0xD6,0x90,0xE9,0xFE,0xCC,0xE1,0x3D,0xB7,0x16,0xB6,0x14,0xC2,0x28,0xFB,0x2C,0x05,
   0x2B,0x67,0x9A,0x76,0x2A,0xBE,0x04,0xC3,0xAA,0x44,0x13,0x26,0x49,0x86,0x06,0x99,
   0x9C,0x42,0x50,0xF4,0x91,0xEF,0x98,0x7A,0x33,0x54,0x0B,0x43,0xED,0xCF,0xAC,0x62,
   0xE4,0xB3,0x1C,0xA9,0xC9,0x08,0xE8,0x95,0x80,0xDF,0x94,0xFA,0x75,0x8F,0x3F,0xA6,
   0x47,0x07,0xA7,0xFC,0xF3,0x73,0x17,0xBA,0x83,0x59,0x3C,0x19,0xE6,0x85,0x4F,0xA8,
   0x68,0x6B,0x81,0xB2,0x71,0x64,0xDA,0x8B,0xF8,0xEB,0x0F,0x4B,0x70,0x56,0x9D,0x35,
   0x1E,0x24,0x0E,0x5E,0x63,0x58,0xD1,0xA2,0x25,0x22,0x7C,0x3B,0x01,0x21,0x78,0x87,
   0xD4,0x00,0x46,0x57,0x9F,0xD3,0x27,0x52,0x4C,0x36,0x02,0xE7,0xA0,0xC4,0xC8,0x9E,
   0xEA,0xBF,0x8A,0xD2,0x40,0xC7,0x38,0xB5,0xA3,0xF7,0xF2,0xCE,0xF9,0x61,0x15,0xA1,
   0xE0,0xAE,0x5D,0xA4,0x9B,0x34,0x1A,0x55,0xAD,0x93,0x32,0x30,0xF5,0x8C,0xB1,0xE3,
   0x1D,0xF6,0xE2,0x2E,0x82,0x66,0xCA,0x60,0xC0,0x29,0x23,0xAB,0x0D,0x53,0x4E,0x6F,
   0xD5,0xDB,0x37,0x45,0xDE,0xFD,0x8E,0x2F,0x03,0xFF,0x6A,0x72,0x6D,0x6C,0x5B,0x51,
   0x8D,0x1B,0xAF,0x92,0xBB,0xDD,0xBC,0x7F,0x11,0xD9,0x5C,0x41,0x1F,0x10,0x5A,0xD8,
   0x0A,0xC1,0x31,0x88,0xA5,0xCD,0x7B,0xBD,0x2D,0x74,0xD0,0x12,0xB8,0xE5,0xB4,0xB0,
   0x89,0x69,0x97,0x4A,0x0C,0x96,0x77,0x7E,0x65,0xB9,0xF1,0x09,0xC5,0x6E,0xC6,0x84,
   0x18,0xF0,0x7D,0xEC,0x3A,0xDC,0x4D,0x20,0x79,0xEE,0x5F,0x3E,0xD7,0xCB,0x39,0x48
};

 

0 Kudos
huang__zhongqiang
1,496 Views
Hi Kirillov, thanks for the explanation. The precalculated Sbox_final_res table is indeed a lot larger (16x) than the standard S-box, and it does not support 'uniform access'. My goal is to reach 400MBps on the E3-1230 (my code still needs a 3x improvement). Does IPP crypto have any solution?
0 Kudos
Sergey_K_Intel4
Employee
1,496 Views

I'm not sure it's possible. Let's convert your requirement (400MB/s at 3.3GHz) into other units. It corresponds to 3.3e9 / (400 * 1e6) ≈ 8 cycles/byte. That's your goal.

Imagine you have an AES128-CBC cipher instead of SM4-CBC. What performance would you expect from AES128-CBC encryption based on an AES-NI implementation? Suppose it is about 3-4 cycles/byte. (Recall that CBC encryption allows only block-by-block processing.)

Both AES and SM4 have a 16-byte block. But AES128 takes 11 rounds per block encryption, whereas SM4 takes 32. From my point of view, this means that SM4-CBC encryption could not perform better than 3*(32/11) ≈ 9 cycles/byte. This estimate assumes that AES and SM4 have similarly efficient implementations (that is, both map directly onto AES-NI). Unfortunately, that is not true: AES-NI was designed specifically for AES, not for SM4. Although AES-NI can be used to improve SM4 performance (recall that IPP SM4-CBC decryption takes 0.25s per 100MB), it can't change the situation dramatically.

That is why I think SM4-CBC encryption at 400MB/s on a 3.3GHz CPU is not realistic.

0 Kudos
huang__zhongqiang
1,496 Views

You may be right. My idea was to shorten the critical dependency path of SM4-CBC, but I have made no progress.

Anyway, thank you all for your patience and help. I've learned something from this post.

I'll cancel the support request and get in touch with you if I have any further questions :)

0 Kudos
Lee__Mike
Beginner
1,496 Views
ippsSMS4EncryptCBC takes 1.3s to encrypt 100MB of data, while ippsSMS4DecryptCBC takes only 0.25s to decrypt the ciphertext. Was this tested on all cores or on a single core?
0 Kudos