
Incorrect results when using AVX2 with IPP 7.0

Hi Intel compiler engineers

Is there any known risk in using an AVX2-based dynamic library together with IPP 7.0?

Here's my problem.

A dynamic library built with /QaxCORE-AVX2 and /O2 works fine with its unit-test program. The DLL uses intrinsics such as _mm256_mul_ps, _mm256_add_ps, _mm256_exp_ps, etc. However, after integrating this AVX2-based dynamic library into a project that already used some functions from IPP 7.0, the intrinsics started returning wrong results. I printed the results of _mm256_add_ps and found that the upper 128 bits were all ZERO.

Is this phenomenon expected? If so, how can I make it right?

Any suggestions would be appreciated; I am new to this.

Zhongqi Zhang


Hi Zhongqi,

Which function are you using?  As I recall, IPP 7.0 mainly has AVX code, with no AVX2 optimization.  You may take a look at the IPP 7.0 bug list from

For your question: first, if possible, I would suggest upgrading to IPP 2017; get the install package from => Get This Library for Free

Secondly, in theory IPP is a ready, self-contained binary: it does not influence, and is not influenced by, the compiler options /QaxCORE-AVX2 and /O2 or the results of any external intrinsics.  So if there is an issue, perhaps something changed in the input or output.  Have you located what causes the problem?

And if possible, please provide us with a test case so we can see what goes wrong.

Best Regards,



Hi Zhongqi.

I'm not sure, but you have probably hit some unexpected side effect of VZEROUPPER in the IPP AVX code. As mentioned above, though, we need a reproducer to confirm this issue.

Thanks for your feedback and for using IPP.


Hi Ying and Andrey,

Thank you for replying to my questions. Here is the updated information.

We have removed every dependency on IPP 7.0. However, the problem (the upper 128 bits of each __m256 element are all zero) still exists. So, for now, it seems the problem has nothing to do with IPP. Unfortunately, we cannot reproduce it in a single, simple test case (this is the most difficult part). The only way is to integrate the intrinsics into our main project, and that source code cannot be shared on the internet due to our company's policy. Here is a code block that illustrates what the failing source code looks like.

void failureTest()
{
    /* 320 and 96 floats, 32-byte aligned; _mm_malloc takes a size
       in bytes, so the element counts are scaled by sizeof(float) */
    float* pRaw = (float*)_mm_malloc(4 * 8 * 10 * sizeof(float), 32);
    float* pMat = (float*)_mm_malloc(4 * 8 * 3 * sizeof(float), 32);

    __m256 mmRaw;
    __m256 mmMat;

    float fNum = 1.0f;
    for (int i = 0; i < 320; ++i)
        pRaw[i] = fNum;
    for (int i = 0; i < 96; ++i)
        pMat[i] = fNum++;

    for (int i = 0; i < 10; i++)
    {
        mmRaw = _mm256_load_ps(pRaw + i * 8);
        for (int j = 0; j < 10; ++j)
        {
            mmMat = _mm256_loadu_ps(pMat + j);
            __m256 fv1 = _mm256_add_ps(mmMat, mmRaw);
            __m256 fv2 = _mm256_mul_ps(fv1, fv1);
            mmMat = _mm256_add_ps(mmRaw, fv2);
            mmMat = _mm256_exp_ps(mmMat);  /* SVML intrinsic (Intel compiler) */
        }
        _mm256_store_ps(pRaw + i * 8, mmMat);
    }

    _mm_free(pRaw);
    _mm_free(pMat);
}


We can work around this problem in two ways.

First, declare all the __m256 variables as 'volatile'. With this method, however, performance is not acceptable.

Second, replace _mm256_exp_ps with a substitute found in : on that website there is a function called 'exp256_ps'. This method shows convincing performance and accuracy, but we still have no clue about what went wrong in the original code.

Best Regards

Zhongqi Zhang