Strange OpenMP concurrency behavior

wpoely86 · ‎03-15-2011

Hi,

I've got a strange problem with OpenMP in VTune on Linux. I've written a simple program that creates 3 lists of matrices and fill 2 of them with random numbers. The matrices are then multiplied and stored in the 3th list. Using OpenMP, I have parallelized the loop that runs over the list and multiplieds the matrices. It all works fine but when I run in VTune a Concurrency analysis, I get a strange results. Take a look at the attachment: you see that the program starts good and uses all 4 cores on my i7 920 (HT disabled). But after 20s, it stops using the thirth core for some reason. I have tried this on a dual xeon E5440 and there i have the same results: after a while it stops using some cores. These machines are dedicated, nothing else is running on them.

As a check, i make every thread that runs, save it's thread id to an array so i could see if there were only 3 threads running, but surprise: it seems like all 4 threads are running normally. I use a static schedule with chunk_size of 1.

Can it be that VTune is wrong? Or am i missing something?

Regards,

Ward

Peter_W_Intel · ‎03-15-2011

Please verify example code, I used Intel C++ Comoser and found below erros.

source /opt/intel/composerxe-2011.0.084/bin/compilervars.sh intel64
[root@NHM02 problem_report]# icpc -g -openmp -openmp-report -mkl openmp-test.cpp -o openmp-test.icc
openmp-test.cpp(49): error: identifier "teller" is undefined
cout << "Starting multiplication: " << teller << endl;
^

openmp-test.cpp(57): error: identifier "dgemm_" is undefined
dgemm_(&trans,&trans,&M,&M,&M,α,matrixlist[i+1][0],&M,matrixlist[i+2][0],&M,β,matrixlist[0],&M);
^

compilation aborted for openmp-test.cpp (code 2)

wpoely86 · ‎03-16-2011

I'm sorry. I've added the wrong version of the file. The one in attachment should work just fine. You need to compile with the MKL to have dgemm.

I've also added the results of a vtune run in a zip file.

Peter_W_Intel · ‎03-24-2011

(I used my example - use OpenMP in Fortran code to count Primes)

This is not strange behaviour, since main thread also takesOpenMP work - please see ompprime() in main thread, in screen shot. My case, 16 logic cores - so expect 16 of OMP workers, actually OMP workers are 15, plus main thread.

This is not a bug from the tool.

Regards, Peter

robert-reed · ‎03-27-2011

You realize of course that Intel Math KernelLibrary (MKL) uses OpenMP to parallelize all its BLAS level three functions, including DGEMM? I suspect what's happening here is that the nested parallelism from the OpenMP loop you're using to call DGEMMis forcing one of the threads to wait--there's a whole bunch of wait time tied up with the OpenMP runtime but the level of CPU activity does not decline in the time line when the measured concurrency level drops.

What do you get if you comment out your OMP directive line and just let DGEMM work? Do you see better concurrency?

Peter_W_Intel · ‎03-27-2011

Robert is right!

If disabled "-openmp", the result willbe little better - since function "dgemm" is implemented by OpenMP in MKL (CPU time consumes less, but wrose concurrency)

Please submit thequestion to MKL forum to ask why average concurrency is low!

Regards, Peter

=================================================================================

Here are results: (8 cores + HT)

[root@NHM02 mkl_openmp]# amplxe-cl -report summary
Using result path `/root/intel/amplxe/projects/mkl_openmp/r000cc'
Executing actions 74 % Generating a report

Summary
-------

Average Concurrency: 6.709
Elapsed Time: 235.506
CPU Time: 1699.170
Wait Time: 2157.317
Executing actions 99 % done

[root@NHM02 mkl_no_openmp]# amplxe-cl -report summary
Using result path `/root/intel/amplxe/projects/mkl_no_openmp/r000cc'
Executing actions 74 % Generating a report

Summary
-------

Average Concurrency: 2.766
Elapsed Time: 231.508
CPU Time: 850.580
Wait Time: 1199.194
Executing actions 99 % done

===================================================================================
[root@NHM02 mkl_openmp]# amplxe-cl -report hotspots -group-by thread
Using result path `/root/intel/amplxe/projects/mkl_openmp/r000cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x6602) 232.650 0.080 141.331 7.880 83.359 0
OMP Worker Thread #1 (0x660d) 107.160 0.100 15.830 7.861 83.369 0
OMP Worker Thread #5 (0x6611) 104.820 0.080 13.560 7.950 83.230 0
OMP Worker Thread #4 (0x6610) 104.710 0.080 13.870 7.850 82.910 0
OMP Worker Thread #2 (0x660e) 104.690 0.110 13.560 7.839 83.181 0
OMP Worker Thread #3 (0x660f) 103.930 0.090 13.200 7.810 82.830 0
OMP Worker Thread #8 (0x6614) 99.590 0.070 8.480 7.770 83.270 0
OMP Worker Thread #10 (0x6616) 96.740 0.040 6.120 7.490 83.090 0
OMP Worker Thread #12 (0x6618) 96.620 0.050 6.790 6.750 83.030 0
OMP Worker Thread #6 (0x6612) 95.930 0.090 5.360 7.380 83.100 0
OMP Worker Thread #15 (0x661b) 94.600 0.040 4.410 6.830 83.320 0
OMP Worker Thread #9 (0x6615) 92.810 0 3.830 6.920 82.060 0
OMP Worker Thread #7 (0x6613) 92.300 0 2.580 6.450 83.270 0
OMP Worker Thread #14 (0x661a) 92.190 0 2.820 6.260 83.110 0
OMP Worker Thread #13 (0x6619) 90.480 0 2.060 5.670 82.750 0
OMP Worker Thread #11 (0x6617) 89.950 0 2.440 4.830 82.680 0
Executing actions 99 % done

[root@NHM02 mkl_no_openmp]# amplxe-cl -report hotspots -group-by thread
Using result path `/root/intel/amplxe/projects/mkl_no_openmp/r000cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x66c0) 230.930 0.390 230.540 0 0 0
OMP Worker Thread #4 (0x66d1) 88.550 0.300 88.250 0 0 0
OMP Worker Thread #2 (0x66cf) 88.540 0.480 88.060 0 0 0
OMP Worker Thread #1 (0x66ce) 88.530 0.370 88.160 0 0 0
OMP Worker Thread #7 (0x66d4) 88.520 0.280 88.240 0 0 0
OMP Worker Thread #3 (0x66d0) 88.510 0.280 88.230 0 0 0
OMP Worker Thread #6 (0x66d3) 88.510 0.330 88.180 0 0 0
OMP Worker Thread #5 (0x66d2) 88.490 0.279 88.211 0 0 0
Executing actions 99 % done

wpoely86 · ‎03-28-2011

I'm aware that the MKL uses OpenMP. My code was compiled with -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -liomp5 -lpthread to make sure that the dgemm was not run in parallel.

VTune still sometimes gives us strange results, threads that suddenly stop doing any work, ... I still don't always understand what VTune is trying to tell me, but after some other benchmarks and tests, we are sure that our program performances as expected.

Peter_W_Intel · ‎03-28-2011

Now I use your options to reproduce this problem. But every is OK.

[root@NHM02 problem_report]# icpc -g -openmp -openmp-report -mkl -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -liomp5 -lpthread openmp-test.cpp -o openmp-test2.icc
openmp-test.cpp(52) (col. 1): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.

[root@NHM02 problem_report]# amplxe-cl -collect concurrency -- ./openmp-test2.icc
Starting multiplication: 0
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/compiler/lib/intel64/libiomp5.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_core.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_sequential.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_intel_lp64.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_mc3.so may be incorrect

Note: please ignore above warnings, some hot function names in mkl and openmp will be replaced with mkl & openmp module names.

[root@NHM02 problem_report]# amplxe-cl -report summary
Using result path `/home/peter/problem_report/r006cc'
Executing actions 74 % Generating a report

Summary
-------

Average Concurrency: 6.071
Elapsed Time: 234.339
CPU Time: 1668.400
Wait Time: 2300.318
Executing actions 99 % done

[root@NHM02 problem_report]# amplxe-cl -report hotspots -group-by thread
Using result path `/home/peter/problem_report/r006cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x7bb4) 232.600 0.390 145.411 6.889 79.909 0
OMP Worker Thread #1 (0x7bbf) 104.190 0.380 16.969 6.811 80.030 0
OMP Worker Thread #4 (0x7bc2) 103.160 0.370 15.900 6.850 80.040 0
OMP Worker Thread #5 (0x7bc3) 103.040 0.340 15.809 6.871 80.020 0
OMP Worker Thread #2 (0x7bc0) 102.950 0.370 15.770 6.890 79.920 0
OMP Worker Thread #3 (0x7bc1) 102.610 0.350 15.420 6.840 80.000 0
OMP Worker Thread #8 (0x7bc6) 95.130 0.050 8.600 6.620 79.860 0
OMP Worker Thread #11 (0x7bc9) 93.440 0.050 7.070 6.670 79.650 0
OMP Worker Thread #9 (0x7bc7) 93.150 0 6.780 6.550 79.820 0
OMP Worker Thread #13 (0x7bcb) 93.080 0.030 6.560 6.620 79.870 0
OMP Worker Thread #12 (0x7bca) 92.920 0.040 6.900 6.050 79.930 0
OMP Worker Thread #6 (0x7bc4) 92.360 0 5.590 6.730 80.040 0
OMP Worker Thread #7 (0x7bc5) 91.430 0 5.060 6.390 79.980 0
OMP Worker Thread #10 (0x7bc8) 91.260 0 5.110 6.210 79.940 0
OMP Worker Thread #14 (0x7bcc) 89.070 0.030 3.380 5.650 80.010 0
OMP Worker Thread #15 (0x7bcd) 88.010 0 3.070 5.140 79.800 0
Executing actions 99 % done

Now I have all expected results: Average Concurrency is 6 (my CPUhas 8-cores),most of CPU (per thread) is 80s around, workloads onall thread are balanced.

Peter_W_Intel · ‎03-28-2011

Is it possible that we used different Compiler and VTune Amplifier XE?

My info-
[root@NHM02 problem_report]# amplxe-cl -version
Intel VTune Amplifier XE 2011 Update 2 (build 139491) Command Line Tool
Copyright (C) 2009-2011 Intel Corporation. All rights reserved.

[root@NHM02 problem_report]# icpc --version
icpc (ICC) 12.0.0 20101006
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.

MKL version comes with Compose XE 2011 12.0.84

Regards, Peter

wpoely86 · ‎03-31-2011

Dear Peter,

I'm using the same versions as you are. Thanks for your anwsers so far, they have been helpfull.

I'm now trying to understand the output of a more complex program. Only parts of this program run in parallel using openmp. amplxe gives me quite high poor cpu times but i'm wondering if this is because not everything is run in parallel? The highest waiting time is in libiomp5.so in the start_thread <- clone. Is this normal?

Also, i'm wondering about hyperthreading. I have got 4 cores + 4 HT. Should the optimal concurrency then be 4 or 8?

Best regards,

Ward

TimP · ‎03-31-2011

If you could find the documentation on MKL_DYNAMIC, that might answer one of your questions. Forum search appears to be broken, so I can't find it today. MKL is highly optimized to keep the floating point arithmetic units busy, so there is no advantage in running more than 1 MKL thread per core.
I don't know what you mean by "high poor CPU times." Please don't rely on advertising literature for your terminology. It's entirely normal to see very large clock tick per instruction rates in short bits of your application; they aren't even worth investigating unless they consume a significant fraction of your elapsed time.
Parallel computation, both SIMD and threaded, normally increases clock per instruction rates. You wouldn't do parallel development if you bought the advertising about minimizing clocks per instructions.
OpenMP is expected to incur measurable time to start up a thread pool, if that's what you are discussing. libiomp5 keeps a thread pool alive for a time set by KMP_BLOCKTIME so as to reduce this overhead for subsequent parallel regions.
I don't know the current marketing approved way to use the libiompprof5 library, which provides a tabulation of barrier, imbalance, etc. for each Intel compiled parallel region (in file guide.gvs, in current directory, when run outside Amplifier).