Effect of turning off HT for SPMD-style HPC applications (Sandy Bridge and OpenMP)

andy-nisbet · ‎10-12-2011

Hello,
I am working on optimizing an iterative SPMD HPC application. Could anyone point me towards any performance evaluation studies on this for Intel multicore processors (preferably Sandy Bridge)?

I'm using OpenMP and was surprised by the size of the slowdown (~34%) going from the sequential build to the 1-thread parallel build of my application, which uses a parallel region around a dynamically scheduled for loop. The machine was "quiet", with no load other than a login/X Windows session running. I'm using the Intel tools.

Specifically, will more on-chip resources (internal registers, load/store queues) become available to each thread executing on a single core if HT is disabled?

I'm principally interested in optimizing single application performance with one thread per physical core.

Thanks,

Andy

Vladimir_P_1234567890 · ‎10-12-2011

hello Andy,

What is the difference between the sequential and the 1-thread "parallel for" versions of the application? Is it just the /Qopenmp compiler option?

thanks,

--Vladimir

andy-nisbet · ‎10-12-2011

Hello,
All I have done is insert the omp pragmas. The important compiler options include -restrict -openmp and -openmp-report2 -O2 -g; this is using icpc 12.0.4.

I should have mentioned that I am on Linux and using C++, but the code is "straight numerical": it uses AVX vector intrinsics and Ipp64 (aligned) memory arrays that are not aliased (hence the use of restrict in the compiler options and in the function parameter declarations, omitted here). If it helps, the computation in the for loop is a sparse matrix-dense vector multiplication using a custom storage format optimised for vectorization.

#pragma omp parallel shared(...)
{

#pragma omp for schedule (dynamic,32)
for(...) {
// numerical code

}

}
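
For readers who want something compilable with the same pragma layout, here is a minimal sketch. It assumes plain CSR storage (the hypothetical `CsrMatrix` struct and `spmv` function below), not the custom vectorized format described in the post, which is not shown in the thread. Built without OpenMP enabled, the pragmas are ignored and the loop runs serially, which gives a like-for-like baseline for the 1-thread comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR matrix; a stand-in for the custom format, which is not shown.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<std::size_t> col_idx;  // column index of each stored value
    std::vector<double> values;        // stored nonzeros
};

// Computes y = A * x using the same parallel / dynamic-for layout as the post.
void spmv(const CsrMatrix& a, const std::vector<double>& x, std::vector<double>& y) {
    assert(y.size() == a.rows);
    const long n = static_cast<long>(a.rows);  // signed index for older OpenMP versions
    #pragma omp parallel shared(a, x, y)
    {
        #pragma omp for schedule(dynamic, 32)
        for (long i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k) {
                sum += a.values[k] * x[a.col_idx[k]];
            }
            y[i] = sum;  // each row is written by exactly one thread
        }
    }
}
```

Each row is independent, so no synchronization is needed beyond the implicit barrier at the end of the `omp for`.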

jimdempseyatthecove · ‎10-12-2011

Andy,

Have you heard of the phrase "have your cake and eat it too"?

If your application is coded in C++ you might want to take a look at a threading toolkit I wrote called QuickThread (available for free download at www.quickthreadprogramming.com).

On your Sandy Bridge...

Assume in parts of your program you have a floating point intensive parallel for loop that you wish to schedule restricting the thread team to one thread per core (4 cores). Assume following this you have an integer intensive for loop for which you wish to use all HT threads (8 hardware threads). Assume further that next year you will update your processor to the next gen Sandy/Ivy Bridge with possibly 8 cores/16 threads, and you would like not to reprogram for the processor change:

parallel_for(OneEach_L1$, YourFloatingPointFunction, iBegin,iEnd [,optional args]);
parallel_for(AllThreads$, YourIntegerFunction, iBegin, iEnd [,optional args]);

(AllThreads$ is default and can be omitted from the argument list).
(optional args could be one or more arrays, etc...)

Assume later you have Ivy Bridge with 4 processors, each with 8 cores and 16 threads.
Assume you wish to partition the work by processor, then within each processor's partition, parallel for using a team of one thread per core within the given processor (using the alternate lambda function format):

// slice rows by socket
parallel_for(OneEach_L3$, iRowBegin, iRowEnd,
[&](int i, int j) {
// slice this socket's col's by core
parallel_for(OneEach_Within_L3$+L1$, YourFloatingPointFunction, i,j [,optional args]); });

Note, the above will work on a 1 socket system too.

FYI, I am in the process of updating the software on the website. If you have any issues, please report via the email address listed on the web site.

Jim Dempsey

Olga_M_Intel · ‎10-13-2011

Quoting Andy Nisbet

using a custom storage format optimised for vectorization.

Andy,

Have you checked that your code was really vectorized? ( '-vec-report' option)

Also, you could check if there is any difference in performance if you use the '-fast' option instead of '-O2'.

andy-nisbet · ‎10-13-2011

Howdy,
I have not tried -fast yet; I am only timing as below, but I will try it later on (got to teach in a bit). The mySpM.V function has the omp pragmas in it. The code *is* correctly vectorized via direct use of AVX vector intrinsics. I will delve into the architecture optimization guide and use Amplifier to try to pinpoint (no pun intended for you Pin users) the microarchitecture performance issue and work backwards from there. My guess was that, if inserting the omp pragmas causes such a big performance hit, there might be a problem with internal register/resource usage. My arrays are all allocated using ippMalloc.

startTime ..
for iterations ... {
mySpM.V function
}
endTime

Thanks,

Andy
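
The timing scheme above could be sketched as follows. This is only an illustration: `time_iterations` is a hypothetical helper, and `std::chrono::steady_clock` is an assumption, since the thread does not say which timer is used. The kernel is passed in as a callable so the same harness can time both the serial and the 1-thread OpenMP builds.

```cpp
#include <cassert>
#include <chrono>

// Runs `kernel` `iterations` times and returns the elapsed wall-clock seconds.
template <typename Kernel>
double time_iterations(int iterations, Kernel kernel) {
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        kernel();  // e.g. one sparse matrix-vector product
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling the kernel once before timing it keeps one-off costs, such as OpenMP thread-team creation, out of the measured interval.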

Andrey_C_Intel1 · ‎10-13-2011

Andy,

Have you followed the "common sense" parallelization rule: use parallel at as outer a level as possible? That is, if you run a small parallel region millions of times then you will likely get a huge overhead. You may consider running the iterations inside a single parallel region, though this may require more effort to keep the code correct.

You are using ippMalloc. That means you are using the Intel IPP library, which is already parallelized with OpenMP. So you may need to check how many threads your application uses. There is a danger of over-subscription if IPP runs its parallel regions inside your parallel regions. Or there is another possibility: IPP's parallel regions may be serialized in your "parallel" 1-thread application (because by default OpenMP dynamic should be set to FALSE), and run in parallel in your "serial" application. This would explain the slowdown.

About HT. Usually for FP-intensive calculations it is better to disable HT, especially for vectorised code. Otherwise two threads on a single core will compete for CPU resources and you can end up with additional overhead instead of a speed increase. HT is suitable when two threads perform different activities, e.g. one thread does FP operations while another thread reads memory. But if both threads do FP operations, things will be serialized in hardware, I think. You had better check in practice whether you can get a speedup from HT in your parallel application, but I'd recommend starting with HT disabled.

And one more note. How long does your application run? If it takes fractions of a second then 34% overhead for 1-thread parallel compared to the serial version may be OK. Of course, if the application runs for minutes or longer, then the parallel overhead should be much less in a "good" parallelization.

Regards,
Andrey
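
The "single parallel region" suggestion above might look like the sketch below. It is only an illustration: the kernel (doubling a vector) and the name `iterate_in_one_region` are hypothetical stand-ins for the real SpM.V loop. The thread team is created once, and the implicit barrier at the end of each `omp for` keeps successive iterations ordered.

```cpp
#include <cassert>
#include <vector>

// Applies out[i] = 2 * in[i] `iterations` times, ping-ponging between x and y,
// inside ONE parallel region. Returns the vector holding the final result.
std::vector<double>& iterate_in_one_region(int iterations,
                                           std::vector<double>& x,
                                           std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double* in = x.data();
    double* out = y.data();
    #pragma omp parallel firstprivate(in, out)
    {
        for (int it = 0; it < iterations; ++it) {
            #pragma omp for schedule(dynamic, 32)
            for (long i = 0; i < n; ++i) {
                out[i] = 2.0 * in[i];
            }
            // The implicit barrier above means every thread has finished this
            // pass, so each thread can safely swap its own private pointers.
            double* tmp = in;
            in = out;
            out = tmp;
        }
    }
    // An odd number of passes leaves the last result in y, an even number in x.
    return (iterations % 2 == 0) ? x : y;
}
```

The extra care mentioned in the post shows up here as the per-thread pointer swap: anything done between the work-sharing loops is executed by every thread, so it must either be private or be wrapped in `single`/`master`.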

andy-nisbet · ‎10-13-2011

Hello,
for the issue I describe, I am using the environment variables NTHREADS=1 OMP_NUM_THREADS=1. I also have the environment variable set for thread affinity. When I execute, it is clear that the threads (KMP_AFFINITY verbose is on) are bound to appropriate processor cores (due to the information printed out). Execution time for the SpM.V is ~40s, albeit after multiple iterations (see earlier post on timing). I have set

omp_set_dynamic(0); earlier in the program. 

If this is wrong then please advise.

I am using ippMalloc to generate aligned data arrays; is there a better way? Perhaps I have misunderstood, but surely this does not mean that accesses to an array allocated by ippMalloc are automatically parallelised? My code is using a custom sparse matrix storage format optimised for vectorization using AVX intrinsic operations.

I am examining my results for SpM.V on a few matrices where the performance is not what I expect, and/or the performance is at a tipping point where it is better/worse than OSKI/MKL. I have run my codes on > 30 sparse matrices with NTHREADS and OMP_NUM_THREADS set to exploit 1, 2, 3, 4, 6 and 8 threads, and I do get speedup (better than MKL for matrices that "fit my custom format").

The issue I am concerned about is the significant slowdown (in this instance) caused by adding in the omp parallel section and the dynamically scheduled for loop. I have not examined this specific performance issue (yet) across all the matrices in my test set.

I'll try with HT disabled, but I think I need to check with amplifier to get microarchitectural information.

Thanks,

Andy
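
On the aligned-allocation question, one alternative that avoids any dependency on IPP is sketched below. It assumes a POSIX system (the poster is on Linux) and uses `posix_memalign`; the helper names `alloc_avx_aligned` and `is_avx_aligned` are hypothetical. It is not claimed to be faster than ippMalloc, only independent of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX, not standard C++)

// Allocates `count` doubles aligned to 32 bytes, the width of an AVX register.
// Returns nullptr on failure; release the memory with free().
double* alloc_avx_aligned(std::size_t count) {
    assert(count > 0);
    void* p = nullptr;
    if (posix_memalign(&p, 32, count * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(p);
}

// True if `p` can be used with aligned 256-bit loads and stores.
bool is_avx_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

A check like `is_avx_aligned` can also be asserted on any pointer handed to aligned AVX intrinsics, whichever allocator produced it.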

Andrey_C_Intel1 · ‎10-13-2011

OK, just using ippMalloc does not imply any implicit parallelization. I thought you might be using other IPP routines...

As to the slowdown, it is unlikely to be caused by the OpenMP parallelization itself; it may rather be different (or missing) optimizations because of the OpenMP constructs. So you can check the vectorization reports if you are using compiler vectorization, or play with the compiler optimization options (-O3, -fast, etc. Be careful with -fast, as on Linux it links everything statically, so you may want to consider trying "fast without static", that is "-xHOST -O3 -ipo -no-prec-div").

Or get help from Amplifier. Or check that everything is OK with data alignment when you apply vector intrinsics. No more ideas at the moment...

Hope this helps,
Regards,
Andrey

andy-nisbet · ‎10-13-2011

Hello Andrey,
OK, thanks. I did note that the use of restrict in the function arguments for the array pointers (i.e. specifying that they do not alias each other) made a big difference to the achieved performance of my code (and that of OSKI). I will of course look at the generated assembler; perhaps the restrict (no aliasing) information has been hidden from the compiler by the insertion of the omp pragmas. Sorry, I forgot to mention that I am already using -xavx. I'm trying to fit all of this stuff in alongside a heavy teaching load.

Thanks,

Andy