I'm trying to use the Intel IPP library to improve application performance, but it isn't working very well. Intel IPP performance is unstable: run time differs by up to 2x between runs under identical conditions. With SSE vector instructions performance is stable; with AVX2/AVX-512 vector instructions it is unstable.
Comparing the hot assembly code of a fast run vs. a slow run, instructions such as vfmadd213ps are randomly and unusually slow.
- Is this a hardware problem or a software problem? How can I analyze the cause?
My test environment is as follows:
Platform: Linux
CPU: Intel(R) Xeon(R) Platinum 8458P @ 2.7GHz (Intel Sapphire Rapids architecture)
Compiler version: gcc 7.3.0
Intel IPP version: intel-oneapi-ipp 2021.7.0
I have a simple example that reproduces the unstable performance.
- Data read/write library: rwblib.h, rwblib.c
/* rwblib.h */
#include <stddef.h> /* size_t, so the header compiles on its own */
void wbdata(const char* variable, const char* datatype, void* data, size_t datatypelen, size_t alllen);
void rbdata(const char* filename, void* data, size_t alllen);
/* rwblib.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rwblib.h"

/* Write alllen bytes from data to a file whose name encodes the metadata. */
void wbdata(const char* variable, const char* datatype, void* data, size_t datatypelen, size_t alllen)
{
    char filename[512] = {0};
    /* %zu is the portable conversion for size_t. */
    snprintf(filename, sizeof(filename), "variable-%s-datatype-%s-datatypelen-%zu-alllen-%zu", variable, datatype, datatypelen, alllen);
    FILE *fp_w = fopen(filename, "wb");
    if (fp_w == NULL)
    {
        /* perror appends ": <reason>" itself, so the message has no newline. */
        perror("error in open file");
        return;
    }
    size_t elements_written = fwrite(data, 1, alllen, fp_w);
    if (elements_written != alllen)
    {
        perror("error in write file");
        fclose(fp_w);
        return;
    }
    fclose(fp_w);
}

/* Read exactly alllen bytes from filename into data. */
void rbdata(const char* filename, void* data, size_t alllen)
{
    FILE *fp_r = fopen(filename, "rb");
    if (fp_r == NULL)
    {
        perror("error in open file");
        return;
    }
    size_t elements_read = fread(data, 1, alllen, fp_r);
    if (elements_read != alllen)
    {
        /* A short read at end of file does not set errno, so perror is not used here. */
        fprintf(stderr, "error in read file: read %zu of %zu bytes\n", elements_read, alllen);
        fclose(fp_r);
        return;
    }
    fclose(fp_r);
}
- ippsAddProduct_32fc input data: variable-core-vis1-datatype-ippcf32-datatypelen-8-alllen-256 and variable-core-vis2-datatype-ippcf32-datatypelen-8-alllen-256 (see attachment).
- Test program: ippsAddProduct_32fc.c
#include <stdio.h>
#include <stdlib.h>
#include <ippcore.h>
#include <ippvm.h>
#include <ipps.h>
#include "rwblib.h"

/* CPU feature masks, one per IPP dispatch target. */
#define PX_FM ( ippCPUID_MMX | ippCPUID_SSE | ippCPUID_SSE2 )
#define M7_FM ( PX_FM | ippCPUID_SSE3 )
#define U8_FM ( M7_FM | ippCPUID_SSSE3 )
#define N8_FM ( U8_FM | ippCPUID_MOVBE )
#define Y8_FM ( U8_FM | ippCPUID_SSE41 | ippCPUID_SSE42 )
#define E9_FM ( Y8_FM | ippCPUID_AVX | ippAVX_ENABLEDBYOS | ippCPUID_F16C )
#define L9_FM ( E9_FM | ippCPUID_MOVBE | ippCPUID_AVX2 | ippCPUID_PREFETCHW )
#define N0_FM ( L9_FM | ippCPUID_AVX512F | ippCPUID_AVX512CD | ippCPUID_AVX512PF | ippCPUID_AVX512ER | ippAVX512_ENABLEDBYOS )
#define K0_FM ( L9_FM | ippCPUID_AVX512F | ippCPUID_AVX512CD | ippCPUID_AVX512VL | ippCPUID_AVX512BW | ippCPUID_AVX512DQ | ippAVX512_ENABLEDBYOS )

int main(int argc, char *argv[])
{
    Ipp32fc src1[32] = {0};
    Ipp32fc src2[32] = {0};
    Ipp32fc des[32] = {0};
    IppStatus status;

    /* The repeat count is mandatory; without it argv[1] would be NULL. */
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <repeat-count>\n", argv[0]);
        return 1;
    }

    /* Each input file holds 32 Ipp32fc values of 8 bytes each (256 bytes). */
    rbdata("variable-core-vis1-datatype-ippcf32-datatypelen-8-alllen-256", src1, 256);
    rbdata("variable-core-vis2-datatype-ippcf32-datatypelen-8-alllen-256", src2, 256);

    /* Replace K0_FM with L9_FM or Y8_FM to test AVX2 or SSE. */
    status = ippSetCpuFeatures(K0_FM);
    printf("SetCpuFeatures = K0_FM, Vector instructions using avx512\n");

    int num = atoi(argv[1]);
    for(int j=0; j<num; j++)
    {
        for(int i=0; i<1000000000; i++)
        {
            status = ippsAddProduct_32fc(src1, src2, des, 32);
        }
    }
    printf("Complete %ld ippsAddProduct_32fc calculations\n", (long)num*1000000000);
    return 0;
}
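One thing that can differ between otherwise identical runs is where the stack arrays `src1`, `src2` and `des` land in memory, since the stack address may change from run to run. The following is a minimal sketch, not a confirmed diagnosis, of helpers that report each buffer's offset within a cache line so the offsets can be compared against the measured times. The names `cache_line_offset`, `crosses_cache_line` and `report_buffer`, and the 64-byte line size, are assumptions made for illustration and are not part of IPP or of the test program.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed cache-line size in bytes; 64 is an assumption, not a measured value. */
#define CACHE_LINE_BYTES 64u

/* Byte offset of p within its cache line (0 means line-aligned). */
unsigned cache_line_offset(const void *p)
{
    return (unsigned)((uintptr_t)p % CACHE_LINE_BYTES);
}

/* Returns 1 if an access of `width` bytes starting at p spans two cache lines. */
int crosses_cache_line(const void *p, unsigned width)
{
    assert(width > 0 && width <= CACHE_LINE_BYTES);
    return cache_line_offset(p) + width > CACHE_LINE_BYTES;
}

/* Hypothetical reporting helper: prints one line per buffer. */
void report_buffer(const char *name, const void *p)
{
    printf("%s: addr=%p offset=%u split32=%d split64=%d\n",
           name, (void *)p, cache_line_offset(p),
           crosses_cache_line(p, 32), crosses_cache_line(p, 64));
}
```

Calling `report_buffer("src1", src1)`, and likewise for `src2` and `des`, before the timing loop prints the offsets for that run. If slow runs line up with split accesses, allocating the buffers with 64-byte alignment (for example with C11 `aligned_alloc`) would be a reasonable next experiment.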
- The compilation commands are as follows:
gcc -O2 -fPIC -shared rwblib.c -o librwb.so
gcc -O2 -I./ -I/opt/czq/spack-1.0.0/var/spack/environments/ipp-2021-7/.spack-env/view/ipp/latest/include ippsAddProduct_32fc.c -L/opt/czq/spack-1.0.0/var/spack/environments/ipp-2021-7/.spack-env/view/ipp/latest/lib/intel64 -lippcore -lipps -L./ -lrwb -o ippsAddProduct_32fc
- SetCpuFeatures = K0_FM (AVX-512 vector instructions). Pinned to one core with numactl and run 5-10 times: performance is unstable, ranging from 7.92 s to 16.44 s (about a 2x difference).
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.09
user 8.06
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.15
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.15
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.16
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 16.44
user 16.38
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 7.92
user 7.88
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 7.93
user 7.90
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.14
sys 0.00
- SetCpuFeatures = L9_FM (AVX2 vector instructions). Pinned to one core with numactl and run 5-10 times: performance is unstable, ranging from 8.82 s to 17.53 s (about a 2x difference).
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.51
user 17.46
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.91
user 8.88
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.50
user 17.46
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.82
user 8.80
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.53
user 17.48
sys 0.00
- SetCpuFeatures = Y8_FM (SSE vector instructions). Pinned to one core with numactl and run 5-10 times: with SSE only, performance is stable (13.06 s to 13.36 s).
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.29
user 13.25
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.36
user 13.32
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.15
user 13.12
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.08
user 13.04
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.06
user 13.02
sys 0.00
- Comparing the hot assembly code of a fast run vs. a slow run, instructions such as vfmadd213ps are randomly and unusually slow. (For the complete hot assembly code, see the attachment.)
I'm not sure what is causing the unstable performance. Is it a problem with the AVX vector instructions? What methods or tools should I use to analyze this problem?
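One way to narrow this down might be to time fixed-size batches inside the process instead of the whole run, which would show whether a slow run is slow from the first batch or changes speed partway through. The following is a minimal sketch under that assumption. `time_batches`, `batch_fn` and `count_batch` are hypothetical names; `count_batch` is only a placeholder for a function that would run a fixed number of `ippsAddProduct_32fc` calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* One batch of work; ctx carries whatever state the batch needs. */
typedef void (*batch_fn)(void *ctx);

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Placeholder batch: counts how often it was called.
 * In the real test this would loop over ippsAddProduct_32fc. */
void count_batch(void *ctx)
{
    long *calls = (long *)ctx;
    (*calls)++;
}

/* Runs `batch` nbatches times, prints each batch time, stores the
 * fastest and slowest batch, and returns the summed time in seconds. */
double time_batches(batch_fn batch, void *ctx, int nbatches,
                    double *min_s, double *max_s)
{
    assert(batch != NULL && nbatches > 0);
    double total = 0.0, lo = 0.0, hi = 0.0;
    for (int b = 0; b < nbatches; b++)
    {
        double t0 = now_seconds();
        batch(ctx);
        double dt = now_seconds() - t0;
        if (b == 0 || dt < lo) lo = dt;
        if (b == 0 || dt > hi) hi = dt;
        total += dt;
        printf("batch %d: %.6f s\n", b, dt);
    }
    *min_s = lo;
    *max_s = hi;
    return total;
}
```

If the per-batch times are uniform within a run but differ between runs, the cause is more likely something fixed at process start than something that drifts during execution.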