Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Intel IPP library performance is unstable, 2x performance difference between 2 runs.

Xiaoqiang

I'm trying to use the Intel IPP library to improve application performance, but it isn't working very well. Performance is unstable, with up to a 2x difference between runs under the same conditions. With SSE vector instructions, performance is stable; with AVX2/AVX-512 vector instructions, it is unstable.

Comparing the hot assembly code of a fast run against that of a slow run, instructions such as vfmadd213ps are randomly and unusually slow.

  • Is this a hardware problem or a software problem? How can I analyze the cause?

My test environment is as follows:

Platform: Linux
CPU: Intel(R) Xeon(R) Platinum 8458P @ 2.7GHz (Intel Sapphire Rapids architecture)
Compiler version: gcc 7.3.0
Intel IPP version: intel-oneapi-ipp 2021.7.0

I have a simple example that reproduces the unstable performance.

  • Data read/write library: rwblib.h, rwblib.c
/* rwblib.h */
#include <stddef.h>  /* size_t */
void wbdata(const char* variable, const char* datatype, void* data, size_t datatypelen, size_t alllen);
void rbdata(const char* filename, void* data, size_t alllen);
/* rwblib.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rwblib.h"

void wbdata(const char* variable, const char* datatype, void* data, size_t datatypelen, size_t alllen)
{
        char filename[512] = {0};
        /* %zu is the portable conversion for size_t */
        snprintf(filename, sizeof(filename), "variable-%s-datatype-%s-datatypelen-%zu-alllen-%zu", variable, datatype, datatypelen, alllen);
        FILE *fp_w = fopen(filename, "wb");
        if (fp_w == NULL)
        {
                perror("error in open file");  /* perror appends ": <reason>" and a newline itself */
                return;
        }

        size_t elements_written = fwrite(data, 1, alllen, fp_w);
        if (elements_written != alllen)
        {
                perror("error in write file");
                fclose(fp_w);
                return;
        }
        fclose(fp_w);
}

void rbdata(const char* filename, void* data, size_t alllen)
{
         FILE *fp_r = fopen(filename, "rb");
         if (fp_r == NULL)
         {
                 perror("error in open file");
                 return;
         }

         size_t elements_read = fread(data, 1, alllen, fp_r);
         if (elements_read != alllen)
         {
                 perror("error in read file");
                 fclose(fp_r);
                 return;
         }
         fclose(fp_r);
}
  • Input data for ippsAddProduct_32fc: variable-core-vis1-datatype-ippcf32-datatypelen-8-alllen-256 and variable-core-vis2-datatype-ippcf32-datatypelen-8-alllen-256 (see attachment).
  • Test program: ippsAddProduct_32fc.c
#include <stdio.h>
#include <stdlib.h>
#include <ippcore.h>
#include <ippvm.h>
#include <ipps.h>
#include "rwblib.h"

#define PX_FM ( ippCPUID_MMX | ippCPUID_SSE | ippCPUID_SSE2 )
#define M7_FM ( PX_FM | ippCPUID_SSE3 )
#define U8_FM ( M7_FM | ippCPUID_SSSE3 )
#define N8_FM ( U8_FM | ippCPUID_MOVBE )
#define Y8_FM ( U8_FM | ippCPUID_SSE41 | ippCPUID_SSE42 )
#define E9_FM ( Y8_FM | ippCPUID_AVX | ippAVX_ENABLEDBYOS | ippCPUID_F16C )
#define L9_FM ( E9_FM | ippCPUID_MOVBE | ippCPUID_AVX2 | ippCPUID_PREFETCHW )
#define N0_FM ( L9_FM | ippCPUID_AVX512F | ippCPUID_AVX512CD | ippCPUID_AVX512PF | ippCPUID_AVX512ER | ippAVX512_ENABLEDBYOS )
#define K0_FM ( L9_FM | ippCPUID_AVX512F | ippCPUID_AVX512CD | ippCPUID_AVX512VL | ippCPUID_AVX512BW | ippCPUID_AVX512DQ | ippAVX512_ENABLEDBYOS )

int main(int argc, char *argv[])
{
        Ipp32fc src1[32] = {0};
        Ipp32fc src2[32] = {0};
        Ipp32fc des[32] = {0};
        IppStatus status;

        rbdata("variable-core-vis1-datatype-ippcf32-datatypelen-8-alllen-256", src1, 256);
        rbdata("variable-core-vis2-datatype-ippcf32-datatypelen-8-alllen-256", src2, 256);

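        if (argc < 2)
        {
                fprintf(stderr, "usage: %s <num>\n", argv[0]);
                return 1;
        }

        /* Restrict IPP dispatch to the chosen feature set; report any warning or error. */
        status = ippSetCpuFeatures(K0_FM);
        if (status != ippStsNoErr)
        {
                fprintf(stderr, "ippSetCpuFeatures: %s\n", ippGetStatusString(status));
        }
        printf("SetCpuFeatures = K0_FM, Vector instructions using avx512\n");

        int num = atoi(argv[1]);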
        for(int j=0; j<num; j++)
        {
                for(int i=0; i<1000000000; i++)
                {
                        status = ippsAddProduct_32fc(src1, src2, des, 32);
                }
        }
        printf("Complete %ld ippsAddProduct_32fc calculations\n", (long)num*1000000000);
        return 0;
}
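One factor the test program above does not control is where src1, src2, and des land in memory: they are stack arrays, so their addresses can differ from run to run (for example under address-space layout randomization). Whether that explains the variation reported here is only an assumption, not something the measurements confirm, but it is cheap to check. A minimal sketch in plain C follows; offset_in_block, straddles_block, and report_placement are hypothetical helper names, and the 64-byte cache line and 4096-byte page are assumed sizes for this CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Byte offset of p inside a block of the given size (block must be > 0). */
size_t offset_in_block(const void *p, size_t block)
{
    return (size_t)((uintptr_t)p % block);
}

/* Nonzero if an access of len bytes starting at p crosses a block boundary. */
int straddles_block(const void *p, size_t len, size_t block)
{
    return offset_in_block(p, block) + len > block;
}

/* Print where a buffer sits relative to cache lines (assumed 64 bytes)
 * and pages (assumed 4096 bytes). */
void report_placement(const char *name, const void *p)
{
    printf("%s: %p, cache-line offset %zu, page offset %zu\n",
           name, p, offset_in_block(p, 64), offset_in_block(p, 4096));
}
```

Calling report_placement("src1", src1), and likewise for src2 and des, at the start of main and comparing the printed offsets between fast and slow runs would show whether timing correlates with placement. If it does, allocating the buffers with ippsMalloc_32fc, which is documented to return 64-byte-aligned memory, instead of placing them on the stack is one way to pin the placement down.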
  • The compilation command is as follows:
gcc -O2 -fPIC -shared rwblib.c -o librwb.so
gcc -O2 -I./ -I/opt/czq/spack-1.0.0/var/spack/environments/ipp-2021-7/.spack-env/view/ipp/latest/include ippsAddProduct_32fc.c -L/opt/czq/spack-1.0.0/var/spack/environments/ipp-2021-7/.spack-env/view/ipp/latest/lib/intel64 -lippcore -lipps -L./ -lrwb -o ippsAddProduct_32fc
  • SetCpuFeatures = K0_FM (AVX-512 vector instructions). Pinned to one core with numactl and run 5-10 times: performance is unstable, with up to a 2x difference.
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.09
user 8.06
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.15
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.15
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.16
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 16.44
user 16.38
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 7.92
user 7.88
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 7.93
user 7.90
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = K0_FM, Vector instructions using avx512
Complete 1000000000 ippsAddProduct_32fc calculations
real 10.19
user 10.14
sys 0.00
  • SetCpuFeatures = L9_FM (AVX2 vector instructions). Pinned to one core with numactl and run 5-10 times: performance is unstable, with up to a 2x difference.
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.51
user 17.46
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.91
user 8.88
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.50
user 17.46
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 8.82
user 8.80
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = L9_FM, Vector instructions using avx2
Complete 1000000000 ippsAddProduct_32fc calculations
real 17.53
user 17.48
sys 0.00
  • SetCpuFeatures = Y8_FM (SSE vector instructions). Pinned to one core with numactl and run 5-10 times: with SSE only, performance is stable.
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.29
user 13.25
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.36
user 13.32
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.15
user 13.12
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.08
user 13.04
sys 0.00
[root@master ipp-2021-7]# numactl -C 0 -m 0 time -p ./ippsAddProduct_32fc 1
SetCpuFeatures = Y8_FM, Vector instructions using sse
Complete 1000000000 ippsAddProduct_32fc calculations
real 13.06
user 13.02
sys 0.00
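To put a number on "stable" versus "unstable", the real times listed above can be reduced to a slowest/fastest ratio per instruction set. A small sketch; spread and print_spreads are hypothetical helper names, and the arrays simply copy the timings from the logs above.

```c
#include <stddef.h>
#include <stdio.h>

/* Ratio of the slowest run to the fastest run; n must be at least 1. */
double spread(const double *t, size_t n)
{
    double lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return hi / lo;
}

/* "real" seconds copied from the runs listed in this post. */
const double avx512_runs[] = { 8.09, 10.19, 10.19, 10.19, 16.44, 7.92, 7.93, 10.19 };
const double avx2_runs[]   = { 17.51, 8.91, 17.50, 8.82, 17.53 };
const double sse_runs[]    = { 13.29, 13.36, 13.15, 13.08, 13.06 };

/* Print the slowest/fastest ratio for each instruction set. */
void print_spreads(void)
{
    printf("avx512: %.2fx\n", spread(avx512_runs, sizeof avx512_runs / sizeof avx512_runs[0]));
    printf("avx2:   %.2fx\n", spread(avx2_runs, sizeof avx2_runs / sizeof avx2_runs[0]));
    printf("sse:    %.2fx\n", spread(sse_runs, sizeof sse_runs / sizeof sse_runs[0]));
}
```

The ratios come out at roughly 2.08x for AVX-512, 1.99x for AVX2, and 1.02x for SSE. The unstable timings also do not vary continuously: the AVX-512 runs fall into three groups (about 7.9 s, 10.2 s, and 16.4 s) and the AVX2 runs into two (about 8.9 s and 17.5 s).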
  • Comparing the hot assembly code of a fast run against that of a slow run, instructions such as vfmadd213ps are randomly and unusually slow. (For the complete hot assembly code, see the attachment.)

Xiaoqiang_0-1753170335034.png

 

I'm not sure what's causing the unstable performance. Is it a problem with the AVX vector instructions? What methods or tools should I use to analyze this problem?
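One way to narrow this down, without assuming anything about the cause, might be to time the loop in fixed-size batches inside the process rather than timing the whole run with time. If every batch of a slow run is uniformly slow, the difference is probably fixed at process start (memory placement, for example); if batches switch between fast and slow within a single run, something is changing at run time (core frequency, for example). A minimal sketch using POSIX clock_gettime follows; kernel_fn, time_batches, elapsed_seconds, and count_call are hypothetical names, and the kernel is passed as a callback so that the ippsAddProduct_32fc call can be wrapped in it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Callback type for the code under test. */
typedef void (*kernel_fn)(void *ctx);

/* Seconds between two timestamps taken with clock_gettime. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run kernel(ctx) in nbatches batches of batch_len calls each and store
 * the wall-clock time of every batch in out[0..nbatches-1]. */
void time_batches(kernel_fn kernel, void *ctx,
                  size_t nbatches, size_t batch_len, double *out)
{
    for (size_t b = 0; b < nbatches; b++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < batch_len; i++)
            kernel(ctx);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        out[b] = elapsed_seconds(t0, t1);
    }
}

/* Stand-in kernel for illustration only: counts how often it is called.
 * In the real test this would wrap the ippsAddProduct_32fc call. */
void count_call(void *ctx)
{
    ++*(size_t *)ctx;
}
```

Printing the out array for a fast run and a slow run side by side shows which of the two patterns applies. Pairing this with hardware counters, for example perf stat -e cycles,instructions on the same binary, could then help separate a lower clock frequency (same cycle count, longer time) from extra stalls (more cycles for the same instructions).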
