
I compiled the following source with oneAPI 2021.02 and with Parallel Studio 2018.1, and measured the execution time of each build.

Compile command: icc -o test test.c -mkl -lpthread -lrt

The Parallel Studio 2018 build ran about 4x faster.

Why is oneAPI 2021 so slow?

Is the compile option wrong?

How can I make the oneAPI build run faster?

=====================================

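#define _GNU_SOURCE /* cpu_set_t, CPU_ZERO, CPU_SET, pthread_setaffinity_np */

#include <stdio.h>

#include <stdint.h>

#include <math.h>

#include <time.h>

#include <sched.h>

#include <pthread.h>

#include <mkl.h> /* MKL_Complex8 */

#define MAX_MULTI_PATH_REF_NUM 2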

#define MAX_MULTI_PATH_LOOP_CNT 11

#define MAX_LOOP_INDEX 30000

#define MAX_CORE_NUM 21

#define MULTIPATH_U_V_TYPE_CNT 26

#define MAX_ANTENNA_ARRAY_NUM 2048

#define PI 3.141592653589793f

static uint64_t ul_test_threads[MAX_CORE_NUM];

float f_test_out[MAX_LOOP_INDEX*MAX_CORE_NUM];

MKL_Complex8 arr_m_source_xp[MAX_ANTENNA_ARRAY_NUM], arr_m_source_yp[MAX_ANTENNA_ARRAY_NUM];

static void stat_req_recv(long l_index);

static void *test_for_thread(void * const thread_idx){

long ll_thread_index;

cpu_set_t m_cpu_mask;

int i_result;

long l_result;

ll_thread_index = (int64_t)thread_idx;

CPU_ZERO(&m_cpu_mask);

CPU_SET(2,&m_cpu_mask);

i_result = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &m_cpu_mask);

stat_req_recv(ll_thread_index);

return NULL;

}

static void stat_req_recv(long l_index)

{

struct timespec m_timetag_start, m_timetag_end;

float f_target_v, f_multipath_v, f_pi_factor, f_center_u, f_center_v, f_multipath_offset_angle, f_beta, f_weight_coef;

int k, m, n, i_multipath_proc_cnt;

MKL_Complex8 arr_m_beam_weight_out[MAX_MULTI_PATH_LOOP_CNT*MAX_MULTI_PATH_REF_NUM*MAX_ANTENNA_ARRAY_NUM];

f_pi_factor = PI/180.f;

f_center_v = 0.2588f;

f_center_u = 0.01f;

f_multipath_offset_angle = -0.5f;

f_beta = 2*PI;

for (m=0;m<3;m++)

{

clock_gettime(CLOCK_REALTIME, &m_timetag_start);

i_multipath_proc_cnt = m;

for(k=0;k<MAX_MULTI_PATH_LOOP_CNT;k++){

f_target_v = f_center_v + sinf(0.2*(i_multipath_proc_cnt/MULTIPATH_U_V_TYPE_CNT)*f_pi_factor);

f_multipath_v = f_center_v + sinf(((0.2*(i_multipath_proc_cnt%MULTIPATH_U_V_TYPE_CNT)) + f_multipath_offset_angle)*f_pi_factor);

if(((i_multipath_proc_cnt/MULTIPATH_U_V_TYPE_CNT) == 0) && ((i_multipath_proc_cnt%MULTIPATH_U_V_TYPE_CNT) == 0))

{

f_target_v = f_center_v;

f_multipath_v = f_center_v - sinf(0.2f*f_pi_factor);

}

for(n=0;n<MAX_ANTENNA_ARRAY_NUM;n++)

{

f_weight_coef = f_beta*((f_target_v*arr_m_source_yp[n].imag) + (f_center_u*arr_m_source_xp[n].imag));

arr_m_beam_weight_out[n+(2*k)*MAX_ANTENNA_ARRAY_NUM].real = cosf(f_weight_coef);

arr_m_beam_weight_out[n+(2*k)*MAX_ANTENNA_ARRAY_NUM].imag = sinf(f_weight_coef);

f_weight_coef = f_beta*((f_multipath_v*arr_m_source_yp[n].imag) + (f_center_u*arr_m_source_xp[n].imag));

arr_m_beam_weight_out[n+(2*k+1)*MAX_ANTENNA_ARRAY_NUM].real = cosf(f_weight_coef);

arr_m_beam_weight_out[n+(2*k+1)*MAX_ANTENNA_ARRAY_NUM].imag = sinf(f_weight_coef);

}

i_multipath_proc_cnt = i_multipath_proc_cnt + 1;

}

clock_gettime(CLOCK_REALTIME, &m_timetag_end);

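/* Elapsed microseconds: include tv_sec so that an interval crossing a second boundary is not reported as negative. */

printf("Thread = %ld, Index = %d, Test for time = %ld\n", l_index, m, (long)(((m_timetag_end.tv_sec - m_timetag_start.tv_sec)*1000000000L + (m_timetag_end.tv_nsec - m_timetag_start.tv_nsec))/1000));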

}

}

int32_t main(void)

{

int i, j, k, m, i_status, i_status_check;

unsigned int ui_mod;

pthread_attr_t m_test_for_threads_attr[MAX_CORE_NUM];

struct sched_param m_test_for_threads_param[MAX_CORE_NUM];

k = 0;

for(k=0;k<MAX_ANTENNA_ARRAY_NUM;k++)

{

arr_m_source_xp[k].real = (0.1)*k;

arr_m_source_xp[k].imag = (0.2)*k;

arr_m_source_yp[k].real = (0.4)*k;

arr_m_source_yp[k].imag = (0.3)*k;

}

for(k=0;k<MAX_CORE_NUM;k++)

{

i_status = pthread_attr_init(&m_test_for_threads_attr[k]);

m_test_for_threads_param[k].__sched_priority = 98;

i_status = pthread_attr_setinheritsched(&m_test_for_threads_attr[k], PTHREAD_EXPLICIT_SCHED);

i_status = pthread_attr_setschedpolicy(&m_test_for_threads_attr[k], SCHED_FIFO);

i_status = pthread_attr_setschedparam(&m_test_for_threads_attr[k],&m_test_for_threads_param[k]);

i_status = pthread_create(&ul_test_threads[k],&m_test_for_threads_attr[k],test_for_thread,(void*)k);

}

for(k=0;k<MAX_CORE_NUM;k++)

{

pthread_join(ul_test_threads[k],NULL);

}

return 0;

}
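When a region is timed with `clock_gettime`, both the `tv_sec` and the `tv_nsec` fields have to enter the difference; a difference of `tv_nsec` alone goes negative whenever the interval crosses a second boundary. A minimal sketch of such a helper follows. The names `elapsed_us` and `time_call_us` are hypothetical, and the choice of `CLOCK_MONOTONIC` is an assumption made here because that clock is not affected by wall-clock adjustments.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: elapsed time between two timestamps, in microseconds.
 * Both tv_sec and tv_nsec are used, so an interval that crosses a
 * second boundary is still reported correctly. */
int64_t elapsed_us(struct timespec start, struct timespec end)
{
    int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
               + (int64_t)(end.tv_nsec - start.tv_nsec);
    return ns / 1000;
}

/* Hypothetical wrapper: time a single call of work() in microseconds.
 * CLOCK_MONOTONIC is an assumption; the posted listing uses CLOCK_REALTIME. */
int64_t time_call_us(void (*work)(void))
{
    struct timespec t0, t1;

    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_us(t0, t1);
}
```

With a start of 1 s + 999000000 ns and an end of 2 s + 1000000 ns, `elapsed_us` returns 2000, whereas the `tv_nsec` difference on its own would be negative.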


Hi,

Thanks for reaching out to us.

*>>Why is oneAPI 2021 so slow?... How can I speed up the execution time of oneAPI?*

Could you please let us know the CPU model on which you are running your code and also your OS details?

*>>As a result, Parallel Studio 2018 performed 4x faster.*

Please share with us the outputs of the provided source code that you are getting for both parallel studio 2018 and oneAPI 2021.

` i_status = pthread_create(&ul_test_threads[k],&m_test_for_threads_attr[k],test_for_thread,(void*)k);`

It would help if you could tell us which value you pass as the argument to the **test_for_thread** function, so that we can reproduce the execution times.

*>>Is the compile option wrong?*

That should be enough.

Could you please confirm whether the issue concerns the icc compiler or the MKL library? As far as we can see, the code does not call any MKL functions and uses only the MKL_Complex8 data type.

Regards,

Vidya.
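Because the listing uses only the `MKL_Complex8` type and no MKL routines, one way to separate compiler effects from library effects is to build it without `-mkl` at all. The sketch below assumes that `MKL_Complex8` is a plain struct of two `float` members named `real` and `imag`, which is how the listing accesses it. The stand-in name `complex8_t` and the helper `make_complex8` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MKL_Complex8: two single-precision fields,
 * accessed as .real and .imag exactly as in the posted listing. */
typedef struct {
    float real;
    float imag;
} complex8_t;

/* Compile-time checks: the stand-in should be two floats with no padding. */
static_assert(sizeof(complex8_t) == 2 * sizeof(float), "unexpected padding");
static_assert(offsetof(complex8_t, imag) == sizeof(float), "imag must follow real");

/* Hypothetical convenience constructor. */
complex8_t make_complex8(float re, float im)
{
    complex8_t z = { .real = re, .imag = im };
    return z;
}
```

Replacing the MKL header with `typedef complex8_t MKL_Complex8;` would let the listing be built without `-mkl`; if the gap between the two compiler versions persists in that build, that would suggest MKL is not involved.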


Dear,

Thank you for your support. Sorry for the late reply.

Could you please let us know the CPU model on which you are running your code and also your OS details?

>> CPU : Intel(R) Xeon(R) Gold 6238T CPU @ 1.90GHz

>> OS : Redhat 7.7

Please share with us the outputs of the provided source code that you are getting for both parallel studio 2018 and oneAPI 2021.

>> The results are similar, but not always the same.

>> parallel studio 2018 output

Thread = 0, Index = 0, Test for time = 252

Thread = 0, Index = 1, Test for time = 102

Thread = 0, Index = 2, Test for time = 98

Thread = 1, Index = 0, Test for time = 81

Thread = 1, Index = 1, Test for time = 70

Thread = 1, Index = 2, Test for time = 60

Thread = 6, Index = 0, Test for time = 78

Thread = 6, Index = 1, Test for time = 62

Thread = 6, Index = 2, Test for time = 73

Thread = 5, Index = 0, Test for time = 165

Thread = 5, Index = 1, Test for time = 165

Thread = 5, Index = 2, Test for time = 209

Thread = 7, Index = 0, Test for time = 166

Thread = 7, Index = 1, Test for time = 166

Thread = 7, Index = 2, Test for time = 201

Thread = 4, Index = 0, Test for time = 198

Thread = 4, Index = 1, Test for time = 166

Thread = 4, Index = 2, Test for time = 166

Thread = 8, Index = 0, Test for time = 202

Thread = 8, Index = 1, Test for time = 166

Thread = 8, Index = 2, Test for time = 230

Thread = 9, Index = 0, Test for time = 282

Thread = 9, Index = 1, Test for time = 210

Thread = 9, Index = 2, Test for time = 176

Thread = 2, Index = 0, Test for time = 68

Thread = 2, Index = 1, Test for time = 209

Thread = 2, Index = 2, Test for time = 171

Thread = 10, Index = 0, Test for time = 298

Thread = 10, Index = 1, Test for time = 234

Thread = 10, Index = 2, Test for time = 165

Thread = 11, Index = 0, Test for time = 192

Thread = 11, Index = 1, Test for time = 164

Thread = 11, Index = 2, Test for time = 165

Thread = 3, Index = 0, Test for time = 76

Thread = 3, Index = 1, Test for time = 196

Thread = 3, Index = 2, Test for time = 209

Thread = 12, Index = 0, Test for time = 449

Thread = 12, Index = 1, Test for time = 189

Thread = 12, Index = 2, Test for time = 166

Thread = 14, Index = 0, Test for time = 279

Thread = 14, Index = 1, Test for time = 165

Thread = 14, Index = 2, Test for time = 166

Thread = 15, Index = 0, Test for time = 197

Thread = 15, Index = 1, Test for time = 225

Thread = 15, Index = 2, Test for time = 166

Thread = 13, Index = 0, Test for time = 166

Thread = 13, Index = 1, Test for time = 200

Thread = 13, Index = 2, Test for time = 165

Thread = 16, Index = 0, Test for time = 291

Thread = 16, Index = 1, Test for time = 167

Thread = 16, Index = 2, Test for time = 166

Thread = 18, Index = 0, Test for time = 211

Thread = 18, Index = 1, Test for time = 212

Thread = 18, Index = 2, Test for time = 165

Thread = 17, Index = 0, Test for time = 166

Thread = 17, Index = 1, Test for time = 194

Thread = 17, Index = 2, Test for time = 165

Thread = 19, Index = 0, Test for time = 237

Thread = 19, Index = 1, Test for time = 207

Thread = 19, Index = 2, Test for time = 165

Thread = 20, Index = 0, Test for time = 206

Thread = 20, Index = 1, Test for time = 215

Thread = 20, Index = 2, Test for time = 195

>> oneAPI 2021 output

Thread = 2, Index = 0, Test for time = 664

Thread = 2, Index = 1, Test for time = 1099

Thread = 2, Index = 2, Test for time = 1084

Thread = 1, Index = 0, Test for time = 1587

Thread = 1, Index = 1, Test for time = 1150

Thread = 1, Index = 2, Test for time = 886

Thread = 5, Index = 0, Test for time = 355

Thread = 5, Index = 1, Test for time = 309

Thread = 5, Index = 2, Test for time = 294

Thread = 6, Index = 0, Test for time = 338

Thread = 6, Index = 1, Test for time = 278

Thread = 6, Index = 2, Test for time = 318

Thread = 7, Index = 0, Test for time = 632

Thread = 7, Index = 1, Test for time = 505

Thread = 7, Index = 2, Test for time = 288

Thread = 8, Index = 0, Test for time = 330

Thread = 8, Index = 1, Test for time = 312

Thread = 8, Index = 2, Test for time = 277

Thread = 9, Index = 0, Test for time = 459

Thread = 9, Index = 1, Test for time = 526

Thread = 9, Index = 2, Test for time = 284

Thread = 10, Index = 0, Test for time = 347

Thread = 10, Index = 1, Test for time = 293

Thread = 10, Index = 2, Test for time = 278

Thread = 11, Index = 0, Test for time = 371

Thread = 11, Index = 1, Test for time = 278

Thread = 11, Index = 2, Test for time = 645

Thread = 12, Index = 0, Test for time = 900

Thread = 12, Index = 1, Test for time = 760

Thread = 12, Index = 2, Test for time = 777

Thread = 3, Index = 0, Test for time = 1958

Thread = 3, Index = 1, Test for time = 816

Thread = 3, Index = 2, Test for time = 793

Thread = 13, Index = 0, Test for time = 939

Thread = 13, Index = 1, Test for time = 743

Thread = 13, Index = 2, Test for time = 520

Thread = 14, Index = 0, Test for time = 767

Thread = 14, Index = 1, Test for time = 778

Thread = 14, Index = 2, Test for time = 805

Thread = 15, Index = 0, Test for time = 516

Thread = 15, Index = 1, Test for time = 665

Thread = 15, Index = 2, Test for time = 778

Thread = 16, Index = 0, Test for time = 882

Thread = 16, Index = 1, Test for time = 875

Thread = 16, Index = 2, Test for time = 392

Thread = 17, Index = 0, Test for time = 685

Thread = 17, Index = 1, Test for time = 709

Thread = 17, Index = 2, Test for time = 755

Thread = 18, Index = 0, Test for time = 971

Thread = 18, Index = 1, Test for time = 750

Thread = 18, Index = 2, Test for time = 418

Thread = 19, Index = 0, Test for time = 627

Thread = 19, Index = 1, Test for time = 756

Thread = 19, Index = 2, Test for time = 746

Thread = 20, Index = 0, Test for time = 992

Thread = 20, Index = 1, Test for time = 769

Thread = 20, Index = 2, Test for time = 793

Thread = 0, Index = 0, Test for time = 6095

Thread = 0, Index = 1, Test for time = 783

Thread = 0, Index = 2, Test for time = 794

Thread = 4, Index = 0, Test for time = 2566

Thread = 4, Index = 1, Test for time = 760

Thread = 4, Index = 2, Test for time = 558

Could you please confirm whether your issue is with respect to icc compiler or mkl library

>> It does not call any MKL functions.

Best Regards,

Hwang
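The samples posted above vary widely within each build (from 60 to 449 in the Parallel Studio 2018 output alone), so a single pair of numbers is a fragile basis for a ratio such as "4x". One option is to compare a robust statistic per build. The sketch below computes a median; the function names are hypothetical, and the sample values in the note after the code are copied from the Thread 0 lines of the two outputs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for long values (avoids overflow from subtraction). */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a;
    long y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: median of n timing samples.
 * Sorts a private copy, so the caller's array keeps its order.
 * For even n the lower of the two middle values is returned.
 * Returns -1 if n is 0 or the copy cannot be allocated. */
long median_sample(const long *samples, size_t n)
{
    long *copy;
    long med;

    if (n == 0) return -1;
    assert(samples != NULL);

    copy = malloc(n * sizeof *copy);
    if (copy == NULL) return -1;

    memcpy(copy, samples, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_long);
    med = copy[(n - 1) / 2];
    free(copy);
    return med;
}
```

For the Thread 0 samples, `median_sample` gives 102 for {252, 102, 98} and 794 for {6095, 783, 794}, which damps the 6095 outlier on the first iteration.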


Hi,

Thanks for providing the details.

We are looking into this issue and will get back to you soon.

Regards,

Vidya.


Could you try the new version 2022.1 on RedHat 8.2 or 8.3?

Red Hat 7.7 is not on the list of operating systems supported by the oneAPI compiler in the Base Toolkit. See the supported OS matrix: https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-too...


I tried icc 19.1 and the icc classic compiler from the latest oneAPI 2022.1 release. Running on Ubuntu 18.04, there is not much performance difference between the two, as shown by the two pictures from VTune. Build options: "icc test.c -g -o test-icc2019 -lpthread" (for icc 19.1) and "icc test.c -g -o test2022 -lpthread" (for oneAPI 2022.1).

The VTune hotspot analysis shows an obvious issue: the sin/cos functions come from libm rather than from the Intel-provided libimf or the vectorized math library, libsvml. Adding the "-O3" option can bring in the vectorized math routines; the picture below shows the VTune hotspots for a build with "icc test.c -g -O3 -o test2022 -lpthread".
