I have a program that is decomposed into two parts:
One loop that allocates the data: it does 4 iterations, one per socket.
One loop that does the computation on the data: it does 48 iterations; each thread should work on a slice of the data, ideally a slice that resides on its local socket.
My machine is a 4-socket Xeon with 12 cores per socket (48 cores total). I'm using ICC 15.0.1 20141023.
To have good scalability, I need to distribute the data evenly across my sockets.
To that end, I have found that "KMP_AFFINITY=scatter" does exactly what I need.
The problem is that this does not work well for my second loop, the one that does the computation. I'd like each computation to run on the socket where its data was allocated (something like the "owner computes" rule).
I thought that OpenMP 4.0's proc_bind(spread) for the first loop, followed by proc_bind(close) for the second, would put the threads where I need them. But in my experience, adding proc_bind(spread) or not makes no difference (I checked with sched_getcpu() to see where each thread runs), while KMP_AFFINITY=scatter does exactly what I need.
Am I right to assume that proc_bind(spread) should do the same thing as "KMP_AFFINITY=scatter"?
Do you think I'm going the wrong way trying to use OpenMP to pin my threads in a custom manner for my two loop nests?
I hope you can help me with this,