Software Archive

CPU_MASK in COI

Jianbin_F_
Beginner

Hey Guys, 

I am experimenting with the CPU_MASK mechanism in COI (in both MPSS 3.5.2 and MPSS 3.6), but it is not working as I expected. Suppose we have 224 threads on a Phi and divide them into 4 partitions, so each partition has 56 threads.

Partition 1: thread 1 -- thread 56

Partition 2: thread 57 -- thread 112

Partition 3: thread 113 -- thread 168

Partition 4: thread 169 -- thread 224. 

Now I would like to use 2 out of the 4 partitions, e.g., Partition 2 and Partition 4. I create 2 pipelines and use 2 CPU_MASKs for task binding. That is, I would like to bind Pipeline 1 to Partition 2, and Pipeline 2 to Partition 4. 

Also, I prepare a sink kernel, and both pipelines run the same kernel. When the kernel is sequential, I observe that the pipelines run on the corresponding partitions (threads). However, when the kernel is parallelized with OpenMP, each pipeline using two or more threads, the pipeline-to-partition bindings no longer work as expected: in some cases, all the OpenMP threads are forced onto the same partition. 

I wonder whether the masking from the COI pipelines and the masking from OpenMP interfere with each other. 
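
For reference, my source-side setup looks roughly like the sketch below (simplified, error handling omitted; the core numbers are placeholders corresponding to Partitions 2 and 4 above, assuming 4 hardware threads per core and the COIPipeline mask calls from the COI source API):

#include <stdint.h>
#include <source/COIPipeline_source.h>

/* proc is a COIPROCESS created earlier, e.g. with COIProcessCreateFromFile(). */
static void create_bound_pipelines(COIPROCESS proc,
                                   COIPIPELINE* pipe1, COIPIPELINE* pipe2)
{
    COI_CPU_MASK mask2, mask4;

    COIPipelineClearCPUMask(&mask2);
    COIPipelineClearCPUMask(&mask4);

    /* Partition 2: cores 14..27, 4 hardware threads each (threads 57--112). */
    for (uint32_t core = 14; core <= 27; core++)
        for (uint8_t thr = 0; thr < 4; thr++)
            COIPipelineSetCPUMask(proc, core, thr, &mask2);

    /* Partition 4: cores 42..55 (threads 169--224). */
    for (uint32_t core = 42; core <= 55; core++)
        for (uint8_t thr = 0; thr < 4; thr++)
            COIPipelineSetCPUMask(proc, core, thr, &mask4);

    /* Pipeline 1 gets Partition 2, Pipeline 2 gets Partition 4. */
    COIPipelineCreate(proc, mask2, 0, pipe1);
    COIPipelineCreate(proc, mask4, 0, pipe2);
}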

Jianbin

TimP
Honored Contributor III

I suppose OpenMP needs the masks to be set by its own mechanisms.  KMP_PLACE_THREADS includes functionality which appears to cover what you say you want, if you can run your partitions in separate shell sessions (e.g. under MPI) or maybe using OMP_NESTED. According to your correction, you didn't mean to say 2 thread pools were bound to the same partition.

James_C_Intel2
Employee

I suppose OpenMP needs the masks to be set by its own mechanisms. 

That should not be necessary. Unless you explicitly tell it not to, the Intel OpenMP runtime respects the affinity mask it gets using sched_getaffinity, and uses only the logicalCPUs which are enabled in that mask.

Requiring each level of the software to individually reiterate resource information that has been decided by a higher-level resource manager would be the way to madness!
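
As a quick sanity check (just a sketch), the sink code can print the mask it actually inherits before the first OpenMP region runs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: list the logical CPUs in the affinity mask the sink process
   inherited, before any OpenMP threads are created. */
void print_inherited_mask(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf("%d ", cpu);
        printf("\n");
    }
}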

However, I am concerned that the resource allocations being set are not sensible anyway. If the enumeration used when setting these CPU_MASKs in COI is the same as the kernel's logicalCPU mapping, then you need to be very careful, since logicalCPUs 0:3 are not on the same physical core. Rather, logicalCPU zero is the last core, thread 0; logicalCPUs 1:4 are core 0, threads 0:3; and so on, up to the last three logicalCPUs, which are on the final core together with logicalCPU zero. 

You also likely want to avoid that core, since it is where the COI daemon and OS services like to run. (I believe that by default COI sets the affinity masks to do that automagically, which, together with the initial point about OpenMP respecting the incoming affinity, is why you normally see 240 OpenMP threads in offload mode and 244 in native.)
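
In code form, that enumeration is roughly the following (a sketch, assuming 4 hardware threads per core):

/* KNC logicalCPU enumeration as described above; e.g. ncores = 57
   gives logical CPUs 0..227. */
int logical_cpu(int core, int thread, int ncores)
{
    if (core == ncores - 1 && thread == 0)
        return 0;                     /* last core, thread 0 -> logicalCPU 0 */
    if (core == ncores - 1)
        return 4 * core + thread;     /* last core, threads 1:3 -> the last three logicalCPUs */
    return 4 * core + thread + 1;     /* cores 0..ncores-2 -> logicalCPUs 1..4*(ncores-1) */
}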

Jianbin_F_
Beginner

Tim Prince wrote:

I suppose OpenMP needs the masks to be set by its own mechanisms.  KMP_PLACE_THREADS includes functionality which appears to cover what you say you want, if you can run your partitions in separate shell sessions (e.g. under MPI) or maybe using OMP_NESTED. If you really mean to bind 2 MPI pools sharing the same partition, why are you surprised at the result?

Hi Tim, 

Sorry, I made a typo there. It should be "bind Pipeline 1 to Partition 2, and bind Pipeline 2 to Partition 4".

 

Jianbin_F_
Beginner

James Cownie (Intel) wrote:

I suppose OpenMP needs the masks to be set by its own mechanisms. 

That should not be necessary. Unless you explicitly tell it not to, the Intel OpenMP runtime respects the affinity mask it gets using sched_getaffinity, and uses only the logicalCPUs which are enabled in that mask.

Requiring each level of the software to individually reiterate resource information that has been decided by a higher-level resource manager would be the way to madness!

However, I am concerned that the resource allocations being set are not sensible anyway. If the enumeration used when setting these CPU_MASKs in COI is the same as the kernel's logicalCPU mapping, then you need to be very careful, since logicalCPUs 0:3 are not on the same physical core. Rather, logicalCPU zero is the last core, thread 0; logicalCPUs 1:4 are core 0, threads 0:3; and so on, up to the last three logicalCPUs, which are on the final core together with logicalCPU zero. 

You also likely want to avoid that core, since it is where the COI daemon and OS services like to run. (I believe that by default COI sets the affinity masks to do that automagically, which, together with the initial point about OpenMP respecting the incoming affinity, is why you normally see 240 OpenMP threads in offload mode and 244 in native.)

Hi, I noted that point. Actually, the Phi I am using has 57 cores and 228 threads, so I use the 224 threads numbered 1 to 224 and avoid threads 0, 225, 226, and 227. This should not be the problem. 

 

Rajiv_D_Intel
Employee

Affinities of OpenMP threads don't always get set correctly when you run concurrent parallel regions.

If you want to use COI directly, then I recommend running a function (skeleton attached) in each partition on the MIC before you start using OpenMP for actual computation.

Alternatively, you can use the 16.0 compiler offload streaming feature to create the streams and then use environment variables to control affinities of the streams:

export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY="norespect,none"
export OFFLOAD_STREAM_AFFINITY=compact
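
For illustration, the host-side stream usage looks roughly like this (a sketch only; kernel_a and kernel_b are placeholder names, and the _Offload_stream calls are as described in the 16.0 compiler documentation):

#include <offload.h>

/* Placeholder sink functions; in a real program they are compiled for the
   coprocessor and contain the OpenMP work. */
__attribute__((target(mic))) void kernel_a(void);
__attribute__((target(mic))) void kernel_b(void);

int main(void)
{
    /* Two streams on device 0, each requesting 56 threads. */
    _Offload_stream s1 = _Offload_stream_create(0, 56);
    _Offload_stream s2 = _Offload_stream_create(0, 56);

    #pragma offload target(mic:0) stream(s1)   /* runs asynchronously in stream s1 */
    kernel_a();

    #pragma offload target(mic:0) stream(s2)   /* runs asynchronously in stream s2 */
    kernel_b();

    /* Wait until both streams have drained. */
    while (!_Offload_stream_completed(0, s1) || !_Offload_stream_completed(0, s2))
        ;

    _Offload_stream_destroy(0, s1);
    _Offload_stream_destroy(0, s2);
    return 0;
}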
 

Jianbin_F_
Beginner

Rajiv Deodhar (Intel) wrote:

Affinities of OpenMP threads don't always get set correctly when you run concurrent parallel regions.

If you want to use COI directly, then I recommend running a function (skeleton attached) in each partition on the MIC before you start using OpenMP for actual computation.

Alternatively, you can use the 16.0 compiler offload streaming feature to create the streams and then use environment variables to control affinities of the streams:

export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY="norespect,none"
export OFFLOAD_STREAM_AFFINITY=compact
 

Rajiv, I actually also tried hStreams, which uses the function you sent me (i.e., set_affinity) when creating a stream (i.e., initializing a partition). However, I ran into the same issue: I create 2 streams and bind them to Partition 2 and Partition 4, but the binding does not work as I expected.  

Jianbin_F_
Beginner

Rajiv Deodhar (Intel) wrote:

Affinities of OpenMP threads don't always get set correctly when you run concurrent parallel regions.

If you want to use COI directly, then I recommend running a function (skeleton attached) in each partition on the MIC before you start using OpenMP for actual computation.

Alternatively, you can use the 16.0 compiler offload streaming feature to create the streams and then use environment variables to control affinities of the streams:

export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY="norespect,none"
export OFFLOAD_STREAM_AFFINITY=compact
 

Thanks for attaching the skeleton code, Rajiv. Now I would like to use COI directly. Basically, I created two pipelines and bound them to two separate groups of cores/threads with masks. However, they are not working as expected. Could anybody help me with the debugging? I wrote the code based on the coi_simple example in the COI tutorial. Thanks a lot!

Regards,

Jianbin

Rajiv_D_Intel
Employee

Can you post your code as-is instead of as a .rar file?

Jianbin_F_
Beginner

Rajiv Deodhar (Intel) wrote:

Can you post your code as-is instead of as a .rar file?

Here you go!

Rajiv_D_Intel
Employee

Your sink-side function set_affinity does not have the correct signature. All sink-side functions should be declared like this:

COINATIVELIBEXPORT
void xxx(uint32_t  in_BufferCount,
         void**    in_ppBufferPointers,
         uint64_t* in_pBufferLengths,
         void*     in_pMiscData,
         uint16_t  in_MiscDataLength,
         void*     in_pReturnValue,
         uint16_t  in_ReturnValueLength)

That's why I think the mask value you use on the sink is not what is sent from the CPU.

Secondly, in each pipeline you are setting only two threads. Thus, even if the mask value went correctly from CPU to MIC, each pipeline would use only two threads. Is that your intent?
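
For example, a sink-side sketch with that signature could look like the following (it assumes, purely for illustration, that the source sends the desired logical-CPU numbers as int32_t values in the MiscData payload; error handling omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <sink/COIPipeline_sink.h>

COINATIVELIBEXPORT
void set_affinity(uint32_t  in_BufferCount,
                  void**    in_ppBufferPointers,
                  uint64_t* in_pBufferLengths,
                  void*     in_pMiscData,
                  uint16_t  in_MiscDataLength,
                  void*     in_pReturnValue,
                  uint16_t  in_ReturnValueLength)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);

    /* Assumed layout: the source passes the logical-CPU numbers to enable
       as an array of int32_t in the misc data. */
    int32_t* cpus = (int32_t*)in_pMiscData;
    uint16_t n    = in_MiscDataLength / sizeof(int32_t);
    for (uint16_t i = 0; i < n; i++)
        CPU_SET(cpus[i], &mask);

    /* Restrict the calling pipeline thread; OpenMP threads created later
       inherit this mask only if the runtime is allowed to respect it. */
    sched_setaffinity(0, sizeof(mask), &mask);
}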

Jianbin_F_
Beginner

Rajiv Deodhar (Intel) wrote:

Your sink-side function set_affinity does not have the correct signature. All sink-side functions should be declared like this:

COINATIVELIBEXPORT
void xxx(uint32_t  in_BufferCount,
         void**    in_ppBufferPointers,
         uint64_t* in_pBufferLengths,
         void*     in_pMiscData,
         uint16_t  in_MiscDataLength,
         void*     in_pReturnValue,
         uint16_t  in_ReturnValueLength)

That's why I think the mask value you use on the sink is not what is sent from the CPU.

Secondly, in each pipeline you are setting only two threads. Thus, even if the mask value went correctly from CPU to MIC, each pipeline would use only two threads. Is that your intent?

Yes, I intended to do so. Basically, I would like to test whether the pipeline binding is working as I expected. 

I changed the code, and it is 'better' now, but the output still differs from what it should be. Basically, I bind Pipeline 0 to Core 3 (thread 9) and Core 4 (thread 13), and Pipeline 1 to Core 15 (thread 61) and Core 16 (thread 65). As you can see from the smc output, each pipeline uses 4 cores, which differs from what it should be. Could you help me check the code further?
