Intel® C++ Compiler

Performance speed of oneAPI 2021.02 and Parallel Studio 2018

edeltech
Beginner

I compiled the following source with oneAPI 2021.02 and with Parallel Studio 2018.1, and measured the execution time of each build.

Compile command: icc -o test test.c -mkl -lpthread -lrt


As a result, Parallel Studio 2018 performed 4x faster.
Why is oneAPI 2021 so slow?
Is the compile option wrong?
How can I speed up the execution time of oneAPI?

=====================================

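#define _GNU_SOURCE /* for cpu_set_t, CPU_ZERO/CPU_SET and pthread_setaffinity_np */
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include <time.h>
#include <sched.h>
#include <pthread.h>
#include <mkl.h> /* MKL_Complex8 */

#define MAX_MULTI_PATH_REF_NUM 2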
#define MAX_MULTI_PATH_LOOP_CNT 11
#define MAX_LOOP_INDEX 30000
#define MAX_CORE_NUM 21
#define MULTIPATH_U_V_TYPE_CNT 26
#define MAX_ANTENNA_ARRAY_NUM 2048
#define PI 3.141592653589793f

static pthread_t ul_test_threads[MAX_CORE_NUM];
float f_test_out[MAX_LOOP_INDEX*MAX_CORE_NUM];
MKL_Complex8 arr_m_source_xp[MAX_ANTENNA_ARRAY_NUM], arr_m_source_yp[MAX_ANTENNA_ARRAY_NUM];


static void stat_req_recv(long l_index);

static void *test_for_thread(void * const thread_idx){
    long ll_thread_index;
    cpu_set_t m_cpu_mask;
    int i_result;
    long l_result;

    ll_thread_index = (int64_t)thread_idx;
    /* Note: every thread is pinned to CPU 2, so all threads share one core. */
    CPU_ZERO(&m_cpu_mask);
    CPU_SET(2,&m_cpu_mask);

    i_result = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &m_cpu_mask);

    stat_req_recv(ll_thread_index);

    return NULL;
}

static void stat_req_recv(long l_index)
{
    struct timespec m_timetag_start, m_timetag_end;
    float f_target_v, f_multipath_v, f_pi_factor, f_center_u, f_center_v, f_multipath_offset_angle, f_beta, f_weight_coef;
    int k, m, n, i_multipath_proc_cnt;
    MKL_Complex8 arr_m_beam_weight_out[MAX_MULTI_PATH_LOOP_CNT*MAX_MULTI_PATH_REF_NUM*MAX_ANTENNA_ARRAY_NUM];
    f_pi_factor = PI/180.f;
    f_center_v = 0.2588f;
    f_center_u = 0.01f;
    f_multipath_offset_angle = -0.5f;
    f_beta = 2*PI;
    for (m=0;m<3;m++)
    {
        clock_gettime(CLOCK_REALTIME, &m_timetag_start);

        i_multipath_proc_cnt = m;
        for(k=0;k<MAX_MULTI_PATH_LOOP_CNT;k++){

            f_target_v = f_center_v + sinf(0.2*(i_multipath_proc_cnt/MULTIPATH_U_V_TYPE_CNT)*f_pi_factor);
            f_multipath_v = f_center_v + sinf(((0.2*(i_multipath_proc_cnt%MULTIPATH_U_V_TYPE_CNT)) + f_multipath_offset_angle)*f_pi_factor);

            if(((i_multipath_proc_cnt/MULTIPATH_U_V_TYPE_CNT) == 0) && ((i_multipath_proc_cnt%MULTIPATH_U_V_TYPE_CNT) == 0))
            {
                f_target_v = f_center_v;
                f_multipath_v = f_center_v - sinf(0.2f*f_pi_factor);
            }
            for(n=0;n<MAX_ANTENNA_ARRAY_NUM;n++)
            {

                f_weight_coef = f_beta*((f_target_v*arr_m_source_yp[n].imag) + (f_center_u*arr_m_source_xp[n].imag));

                arr_m_beam_weight_out[n+(2*k)*MAX_ANTENNA_ARRAY_NUM].real = cosf(f_weight_coef);
                arr_m_beam_weight_out[n+(2*k)*MAX_ANTENNA_ARRAY_NUM].imag = sinf(f_weight_coef);

                f_weight_coef = f_beta*((f_multipath_v*arr_m_source_yp[n].imag) + (f_center_u*arr_m_source_xp[n].imag));

                arr_m_beam_weight_out[n+(2*k+1)*MAX_ANTENNA_ARRAY_NUM].real = cosf(f_weight_coef);
                arr_m_beam_weight_out[n+(2*k+1)*MAX_ANTENNA_ARRAY_NUM].imag = sinf(f_weight_coef);
            }
            i_multipath_proc_cnt = i_multipath_proc_cnt + 1;
        }

        clock_gettime(CLOCK_REALTIME, &m_timetag_end);
        /* Elapsed time in microseconds; tv_sec is included so the value stays correct across a second boundary. */
        printf("Thread = %ld, Index = %d, Test for time = %ld\n", l_index, m,
               (long)(((m_timetag_end.tv_sec - m_timetag_start.tv_sec)*1000000000L + (m_timetag_end.tv_nsec - m_timetag_start.tv_nsec))/1000));
    }

}

int32_t main(void)
{
    int i, j, k, m, i_status, i_status_check;
    unsigned int ui_mod;

    pthread_attr_t m_test_for_threads_attr[MAX_CORE_NUM];
    struct sched_param m_test_for_threads_param[MAX_CORE_NUM];

    k = 0;
    for(k=0;k<MAX_ANTENNA_ARRAY_NUM;k++)
    {
        arr_m_source_xp[k].real = (0.1)*k;
        arr_m_source_xp[k].imag = (0.2)*k;
        arr_m_source_yp[k].real = (0.4)*k;
        arr_m_source_yp[k].imag = (0.3)*k;
    }


    /* Create one SCHED_FIFO thread per index; the loop counter k is passed as the thread argument. */
    for(k=0;k<MAX_CORE_NUM;k++)
    {
        i_status = pthread_attr_init(&m_test_for_threads_attr[k]);
        m_test_for_threads_param[k].sched_priority = 98;
        i_status = pthread_attr_setinheritsched(&m_test_for_threads_attr[k], PTHREAD_EXPLICIT_SCHED);
        i_status = pthread_attr_setschedpolicy(&m_test_for_threads_attr[k], SCHED_FIFO);
        i_status = pthread_attr_setschedparam(&m_test_for_threads_attr[k],&m_test_for_threads_param[k]);
        i_status = pthread_create(&ul_test_threads[k],&m_test_for_threads_attr[k],test_for_thread,(void*)k);
    }


    for(k=0;k<MAX_CORE_NUM;k++)
    {
        pthread_join(ul_test_threads[k],NULL);
    }

    return 0;
}

VidyalathaB_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

 

>>Why is oneAPI 2021 so slow?... How can I speed up the execution time of oneAPI?

 

Could you please let us know the CPU model on which you are running your code and also your OS details?

 

>>As a result, Parallel Studio 2018 performed 4x faster.

 

Please share with us the outputs of the provided source code that you are getting for both parallel studio 2018 and oneAPI 2021.

 

 i_status = pthread_create(&ul_test_threads[k],&m_test_for_threads_attr[k],test_for_thread,(void*)k);

 

It would help us if you could clarify the value of the argument that you are passing to the test_for_thread function, so that we can check the execution time.

 

>>Is the compile option wrong?

 

That should be enough.

 

Could you please confirm whether your issue is with respect to icc compiler or mkl library (as far as we can see, you are not calling any MKL functions and only use the MKL_Complex8 datatype)?

 

Regards,

Vidya.

 

edeltech
Beginner

Dear,

 

Thank you for your support. Sorry for the late reply.

 

Could you please let us know the CPU model on which you are running your code and also your OS details?

>> CPU : Intel(R) Xeon(R) Gold 6238T CPU @ 1.90GHz

>> OS : Redhat 7.7

 

Please share with us the outputs of the provided source code that you are getting for both parallel studio 2018 and oneAPI 2021.

>> The results are similar, but not always the same.

>> parallel studio 2018 output

Thread = 0, Index = 0, Test for time = 252
Thread = 0, Index = 1, Test for time = 102
Thread = 0, Index = 2, Test for time = 98
Thread = 1, Index = 0, Test for time = 81
Thread = 1, Index = 1, Test for time = 70
Thread = 1, Index = 2, Test for time = 60
Thread = 6, Index = 0, Test for time = 78
Thread = 6, Index = 1, Test for time = 62
Thread = 6, Index = 2, Test for time = 73
Thread = 5, Index = 0, Test for time = 165
Thread = 5, Index = 1, Test for time = 165
Thread = 5, Index = 2, Test for time = 209
Thread = 7, Index = 0, Test for time = 166
Thread = 7, Index = 1, Test for time = 166
Thread = 7, Index = 2, Test for time = 201
Thread = 4, Index = 0, Test for time = 198
Thread = 4, Index = 1, Test for time = 166
Thread = 4, Index = 2, Test for time = 166
Thread = 8, Index = 0, Test for time = 202
Thread = 8, Index = 1, Test for time = 166
Thread = 8, Index = 2, Test for time = 230
Thread = 9, Index = 0, Test for time = 282
Thread = 9, Index = 1, Test for time = 210
Thread = 9, Index = 2, Test for time = 176
Thread = 2, Index = 0, Test for time = 68
Thread = 2, Index = 1, Test for time = 209
Thread = 2, Index = 2, Test for time = 171
Thread = 10, Index = 0, Test for time = 298
Thread = 10, Index = 1, Test for time = 234
Thread = 10, Index = 2, Test for time = 165
Thread = 11, Index = 0, Test for time = 192
Thread = 11, Index = 1, Test for time = 164
Thread = 11, Index = 2, Test for time = 165
Thread = 3, Index = 0, Test for time = 76
Thread = 3, Index = 1, Test for time = 196
Thread = 3, Index = 2, Test for time = 209
Thread = 12, Index = 0, Test for time = 449
Thread = 12, Index = 1, Test for time = 189
Thread = 12, Index = 2, Test for time = 166
Thread = 14, Index = 0, Test for time = 279
Thread = 14, Index = 1, Test for time = 165
Thread = 14, Index = 2, Test for time = 166
Thread = 15, Index = 0, Test for time = 197
Thread = 15, Index = 1, Test for time = 225
Thread = 15, Index = 2, Test for time = 166
Thread = 13, Index = 0, Test for time = 166
Thread = 13, Index = 1, Test for time = 200
Thread = 13, Index = 2, Test for time = 165
Thread = 16, Index = 0, Test for time = 291
Thread = 16, Index = 1, Test for time = 167
Thread = 16, Index = 2, Test for time = 166
Thread = 18, Index = 0, Test for time = 211
Thread = 18, Index = 1, Test for time = 212
Thread = 18, Index = 2, Test for time = 165
Thread = 17, Index = 0, Test for time = 166
Thread = 17, Index = 1, Test for time = 194
Thread = 17, Index = 2, Test for time = 165
Thread = 19, Index = 0, Test for time = 237
Thread = 19, Index = 1, Test for time = 207
Thread = 19, Index = 2, Test for time = 165
Thread = 20, Index = 0, Test for time = 206
Thread = 20, Index = 1, Test for time = 215
Thread = 20, Index = 2, Test for time = 195

 

>> oneAPI 2021 output

Thread = 2, Index = 0, Test for time = 664
Thread = 2, Index = 1, Test for time = 1099
Thread = 2, Index = 2, Test for time = 1084
Thread = 1, Index = 0, Test for time = 1587
Thread = 1, Index = 1, Test for time = 1150
Thread = 1, Index = 2, Test for time = 886
Thread = 5, Index = 0, Test for time = 355
Thread = 5, Index = 1, Test for time = 309
Thread = 5, Index = 2, Test for time = 294
Thread = 6, Index = 0, Test for time = 338
Thread = 6, Index = 1, Test for time = 278
Thread = 6, Index = 2, Test for time = 318
Thread = 7, Index = 0, Test for time = 632
Thread = 7, Index = 1, Test for time = 505
Thread = 7, Index = 2, Test for time = 288
Thread = 8, Index = 0, Test for time = 330
Thread = 8, Index = 1, Test for time = 312
Thread = 8, Index = 2, Test for time = 277
Thread = 9, Index = 0, Test for time = 459
Thread = 9, Index = 1, Test for time = 526
Thread = 9, Index = 2, Test for time = 284
Thread = 10, Index = 0, Test for time = 347
Thread = 10, Index = 1, Test for time = 293
Thread = 10, Index = 2, Test for time = 278
Thread = 11, Index = 0, Test for time = 371
Thread = 11, Index = 1, Test for time = 278
Thread = 11, Index = 2, Test for time = 645
Thread = 12, Index = 0, Test for time = 900
Thread = 12, Index = 1, Test for time = 760
Thread = 12, Index = 2, Test for time = 777
Thread = 3, Index = 0, Test for time = 1958
Thread = 3, Index = 1, Test for time = 816
Thread = 3, Index = 2, Test for time = 793
Thread = 13, Index = 0, Test for time = 939
Thread = 13, Index = 1, Test for time = 743
Thread = 13, Index = 2, Test for time = 520
Thread = 14, Index = 0, Test for time = 767
Thread = 14, Index = 1, Test for time = 778
Thread = 14, Index = 2, Test for time = 805
Thread = 15, Index = 0, Test for time = 516
Thread = 15, Index = 1, Test for time = 665
Thread = 15, Index = 2, Test for time = 778
Thread = 16, Index = 0, Test for time = 882
Thread = 16, Index = 1, Test for time = 875
Thread = 16, Index = 2, Test for time = 392
Thread = 17, Index = 0, Test for time = 685
Thread = 17, Index = 1, Test for time = 709
Thread = 17, Index = 2, Test for time = 755
Thread = 18, Index = 0, Test for time = 971
Thread = 18, Index = 1, Test for time = 750
Thread = 18, Index = 2, Test for time = 418
Thread = 19, Index = 0, Test for time = 627
Thread = 19, Index = 1, Test for time = 756
Thread = 19, Index = 2, Test for time = 746
Thread = 20, Index = 0, Test for time = 992
Thread = 20, Index = 1, Test for time = 769
Thread = 20, Index = 2, Test for time = 793
Thread = 0, Index = 0, Test for time = 6095
Thread = 0, Index = 1, Test for time = 783
Thread = 0, Index = 2, Test for time = 794
Thread = 4, Index = 0, Test for time = 2566
Thread = 4, Index = 1, Test for time = 760
Thread = 4, Index = 2, Test for time = 558

 

Could you please confirm whether your issue is with respect to icc compiler or mkl library

>> The code does not call any MKL functions.

 

Best Regards,

Hwang

VidyalathaB_Intel
Moderator

Hi,

 

Thanks for providing the details.

We are looking into this issue. We will get back to you soon.

 

Regards,

Vidya.

 

Jie_L_Intel
Employee

Could you try the new 2022.1 version on Red Hat 8.2 or 8.3?

Red Hat 7.7 is not on the list of operating systems supported by the oneAPI compiler in the Base Toolkit. See the supported OS matrix: https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-too...



Jie_L_Intel
Employee

[Attached images: 2019.PNG, 2022.PNG]

 

I tried icc 19.1 and the icc classic compiler from the latest oneAPI 2022.1 release. Running on Ubuntu 18.04, there is not much performance difference between them, as the two VTune pictures show. Build options: "icc test.c -g -o test-icc2019 -lpthread" (for icc 19.1) and "icc test.c -g -o test2022 -lpthread" (for oneAPI 2022.1).

The VTune hotspot analysis shows an obvious issue: the sin/cos functions come from libm rather than from the Intel-provided libimf and the vectorized math library, libsvml. Adding the "-O3" option can bring in the vectorized math functions, so with "icc test.c -g -O3 -o test2022 -lpthread" you can see the resulting VTune hotspots in the picture below.

[Attached image: 2022-O3.PNG]

 

 
