- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I've got a strange problem with OpenMP in VTune on Linux. I've written a simple program that creates 3 lists of matrices and fill 2 of them with random numbers. The matrices are then multiplied and stored in the 3th list. Using OpenMP, I have parallelized the loop that runs over the list and multiplieds the matrices. It all works fine but when I run in VTune a Concurrency analysis, I get a strange results. Take a look at the attachment: you see that the program starts good and uses all 4 cores on my i7 920 (HT disabled). But after 20s, it stops using the thirth core for some reason. I have tried this on a dual xeon E5440 and there i have the same results: after a while it stops using some cores. These machines are dedicated, nothing else is running on them.
As a check, i make every thread that runs, save it's thread id to an array so i could see if there were only 3 threads running, but surprise: it seems like all 4 threads are running normally. I use a static schedule with chunk_size of 1.
Can it be that VTune is wrong? Or am i missing something?
Regards,
Ward
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Please verify example code, I used Intel C++ Comoser and found below erros.
source /opt/intel/composerxe-2011.0.084/bin/compilervars.sh intel64
[root@NHM02 problem_report]# icpc -g -openmp -openmp-report -mkl openmp-test.cpp -o openmp-test.icc
openmp-test.cpp(49): error: identifier "teller" is undefined
cout << "Starting multiplication: " << teller << endl;
^
openmp-test.cpp(57): error: identifier "dgemm_" is undefined
dgemm_(&trans,&trans,&M,&M,&M,α,matrixlist[i+1][0],&M,matrixlist[i+2][0],&M,β,matrixlist[0],&M);
^
compilation aborted for openmp-test.cpp (code 2)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is not strange behaviour, since main thread also takesOpenMP work - please see ompprime() in main thread, in screen shot. My case, 16 logic cores - so expect 16 of OMP workers, actually OMP workers are 15, plus main thread.
This is not a bug from the tool.
Regards, Peter

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What do you get if you comment out your OMP directive line and just let DGEMM work? Do you see better concurrency?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If disabled "-openmp", the result willbe little better - since function "dgemm" is implemented by OpenMP in MKL (CPU time consumes less, but wrose concurrency)
Please submit thequestion to MKL forum to ask why average concurrency is low!
Regards, Peter
=================================================================================
Here are results: (8 cores + HT)
[root@NHM02 mkl_openmp]# amplxe-cl -report summary
Using result path `/root/intel/amplxe/projects/mkl_openmp/r000cc'
Executing actions 74 % Generating a report
Summary
-------
Average Concurrency: 6.709
Elapsed Time: 235.506
CPU Time: 1699.170
Wait Time: 2157.317
Executing actions 99 % done
[root@NHM02 mkl_no_openmp]# amplxe-cl -report summary
Using result path `/root/intel/amplxe/projects/mkl_no_openmp/r000cc'
Executing actions 74 % Generating a report
Summary
-------
Average Concurrency: 2.766
Elapsed Time: 231.508
CPU Time: 850.580
Wait Time: 1199.194
Executing actions 99 % done
===================================================================================
[root@NHM02 mkl_openmp]# amplxe-cl -report hotspots -group-by thread
Using result path `/root/intel/amplxe/projects/mkl_openmp/r000cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x6602) 232.650 0.080 141.331 7.880 83.359 0
OMP Worker Thread #1 (0x660d) 107.160 0.100 15.830 7.861 83.369 0
OMP Worker Thread #5 (0x6611) 104.820 0.080 13.560 7.950 83.230 0
OMP Worker Thread #4 (0x6610) 104.710 0.080 13.870 7.850 82.910 0
OMP Worker Thread #2 (0x660e) 104.690 0.110 13.560 7.839 83.181 0
OMP Worker Thread #3 (0x660f) 103.930 0.090 13.200 7.810 82.830 0
OMP Worker Thread #8 (0x6614) 99.590 0.070 8.480 7.770 83.270 0
OMP Worker Thread #10 (0x6616) 96.740 0.040 6.120 7.490 83.090 0
OMP Worker Thread #12 (0x6618) 96.620 0.050 6.790 6.750 83.030 0
OMP Worker Thread #6 (0x6612) 95.930 0.090 5.360 7.380 83.100 0
OMP Worker Thread #15 (0x661b) 94.600 0.040 4.410 6.830 83.320 0
OMP Worker Thread #9 (0x6615) 92.810 0 3.830 6.920 82.060 0
OMP Worker Thread #7 (0x6613) 92.300 0 2.580 6.450 83.270 0
OMP Worker Thread #14 (0x661a) 92.190 0 2.820 6.260 83.110 0
OMP Worker Thread #13 (0x6619) 90.480 0 2.060 5.670 82.750 0
OMP Worker Thread #11 (0x6617) 89.950 0 2.440 4.830 82.680 0
Executing actions 99 % done
[root@NHM02 mkl_no_openmp]# amplxe-cl -report hotspots -group-by thread
Using result path `/root/intel/amplxe/projects/mkl_no_openmp/r000cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x66c0) 230.930 0.390 230.540 0 0 0
OMP Worker Thread #4 (0x66d1) 88.550 0.300 88.250 0 0 0
OMP Worker Thread #2 (0x66cf) 88.540 0.480 88.060 0 0 0
OMP Worker Thread #1 (0x66ce) 88.530 0.370 88.160 0 0 0
OMP Worker Thread #7 (0x66d4) 88.520 0.280 88.240 0 0 0
OMP Worker Thread #3 (0x66d0) 88.510 0.280 88.230 0 0 0
OMP Worker Thread #6 (0x66d3) 88.510 0.330 88.180 0 0 0
OMP Worker Thread #5 (0x66d2) 88.490 0.279 88.211 0 0 0
Executing actions 99 % done
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
VTune still sometimes gives us strange results, threads that suddenly stop doing any work, ... I still don't always understand what VTune is trying to tell me, but after some other benchmarks and tests, we are sure that our program performances as expected.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Now I use your options to reproduce this problem. But every is OK.
[root@NHM02 problem_report]# icpc -g -openmp -openmp-report -mkl -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -liomp5 -lpthread openmp-test.cpp -o openmp-test2.icc
openmp-test.cpp(52) (col. 1): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
[root@NHM02 problem_report]# amplxe-cl -collect concurrency -- ./openmp-test2.icc
Starting multiplication: 0
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/compiler/lib/intel64/libiomp5.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_core.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_sequential.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_intel_lp64.so may be incorrect
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.084/mkl/lib/intel64/libmkl_mc3.so may be incorrect
Note: please ignore above warnings, some hot function names in mkl and openmp will be replaced with mkl & openmp module names.
[root@NHM02 problem_report]# amplxe-cl -report summary
Using result path `/home/peter/problem_report/r006cc'
Executing actions 74 % Generating a report
Summary
-------
Average Concurrency: 6.071
Elapsed Time: 234.339
CPU Time: 1668.400
Wait Time: 2300.318
Executing actions 99 % done
[root@NHM02 problem_report]# amplxe-cl -report hotspots -group-by thread
Using result path `/home/peter/problem_report/r006cc'
Executing actions 74 % Generating a report
Thread CPU Time Idle:CPU Time Poor:CPU Time Ok:CPU Time Ideal:CPU Time Over:CPU Time
Thread (0x7bb4) 232.600 0.390 145.411 6.889 79.909 0
OMP Worker Thread #1 (0x7bbf) 104.190 0.380 16.969 6.811 80.030 0
OMP Worker Thread #4 (0x7bc2) 103.160 0.370 15.900 6.850 80.040 0
OMP Worker Thread #5 (0x7bc3) 103.040 0.340 15.809 6.871 80.020 0
OMP Worker Thread #2 (0x7bc0) 102.950 0.370 15.770 6.890 79.920 0
OMP Worker Thread #3 (0x7bc1) 102.610 0.350 15.420 6.840 80.000 0
OMP Worker Thread #8 (0x7bc6) 95.130 0.050 8.600 6.620 79.860 0
OMP Worker Thread #11 (0x7bc9) 93.440 0.050 7.070 6.670 79.650 0
OMP Worker Thread #9 (0x7bc7) 93.150 0 6.780 6.550 79.820 0
OMP Worker Thread #13 (0x7bcb) 93.080 0.030 6.560 6.620 79.870 0
OMP Worker Thread #12 (0x7bca) 92.920 0.040 6.900 6.050 79.930 0
OMP Worker Thread #6 (0x7bc4) 92.360 0 5.590 6.730 80.040 0
OMP Worker Thread #7 (0x7bc5) 91.430 0 5.060 6.390 79.980 0
OMP Worker Thread #10 (0x7bc8) 91.260 0 5.110 6.210 79.940 0
OMP Worker Thread #14 (0x7bcc) 89.070 0.030 3.380 5.650 80.010 0
OMP Worker Thread #15 (0x7bcd) 88.010 0 3.070 5.140 79.800 0
Executing actions 99 % done
Now I have all expected results: Average Concurrency is 6 (my CPUhas 8-cores),most of CPU (per thread) is 80s around, workloads onall thread are balanced.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Is it possible that we used different Compiler and VTune Amplifier XE?
My info-
[root@NHM02 problem_report]# amplxe-cl -version
Intel VTune Amplifier XE 2011 Update 2 (build 139491) Command Line Tool
Copyright (C) 2009-2011 Intel Corporation. All rights reserved.
[root@NHM02 problem_report]# icpc --version
icpc (ICC) 12.0.0 20101006
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
MKL version comes with Compose XE 2011 12.0.84
Regards, Peter
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm using the same versions as you are. Thanks for your anwsers so far, they have been helpfull.
I'm now trying to understand the output of a more complex program. Only parts of this program run in parallel using openmp. amplxe gives me quite high poor cpu times but i'm wondering if this is because not everything is run in parallel? The highest waiting time is in libiomp5.so in the start_thread <- clone. Is this normal?
Also, i'm wondering about hyperthreading. I have got 4 cores + 4 HT. Should the optimal concurrency then be 4 or 8?
Best regards,
Ward
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I don't know what you mean by "high poor CPU times." Please don't rely on advertising literature for your terminology. It's entirely normal to see very large clock tick per instruction rates in short bits of your application; they aren't even worth investigating unless they consume a significant fraction of your elapsed time.
Parallel computation, both SIMD and threaded, normally increases clock per instruction rates. You wouldn't do parallel development if you bought the advertising about minimizing clocks per instructions.
OpenMP is expected to incur measurable time to start up a thread pool, if that's what you are discussing. libiomp5 keeps a thread pool alive for a time set by KMP_BLOCKTIME so as to reduce this overhead for subsequent parallel regions.
I don't know the current marketing approved way to use the libiompprof5 library, which provides a tabulation of barrier, imbalance, etc. for each Intel compiled parallel region (in file guide.gvs, in current directory, when run outside Amplifier).

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page