I have been experimenting with MPI/OpenMP process pinning. When I use the non-blocking API (Iallreduce) and set I_MPI_ASYNC_PROGRESS=enable with the command below, the application spends much more time in libiomp5.so (kmp_hyper_barrier_release), and vmlinux also gets a little hotter, compared with I_MPI_ASYNC_PROGRESS=disable. Is there an issue with my configuration? VTune shows that all threads are pinned to the expected cores; the only difference is that core 67 is used by the MPI communication thread.
mpirun -n 2 -ppn 1 -genv OMP_PROC_BIND=true -genv I_MPI_ASYNC_PROGRESS= -genv I_MPI_ASYNC_PROGRESS_PIN=67 -genv I_MPI_PIN_PROCS=0-66 -genv OMP_NUM_THREADS=67 -genv I_MPI_PIN_DOMAIN=sock -genv I_MPI_FABRICS=ofi -f ./hostfile python train_imagenet_cpu.py --arch alex --batchsize 256 --loaderjob 68 --epoch 100 --train_root /home/jiangzho/imagenet/ILSVRC2012_img_train --val_root /home/jiangzho/imagenet/ILSVRC2012_img_val --communicator naive /home/jiangzho/train.txt /home/jiangzho/val.txt
I have root-caused why libiomp5.so got much hotter. With the command above, I am trying to pin the MPI communication thread to core 67 and the OpenMP threads to cores 0-66. VTune shows that the MPI communication thread is indeed pinned to core 67 and that OpenMP has 67 threads, but OMP_thread66 is pinned to core 67 as well. That thread shares its core with the communication thread, so it lags behind and slows the whole run, which is why libiomp5.so shows so much spin time. I still haven't figured out how to make the pinning work correctly…
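To double-check the pinning independently of VTune, a small script along these lines could list which cores each thread of the process is allowed to run on. This is only a sketch: it assumes Linux and reads the `Cpus_allowed_list` field from `/proc/<pid>/task/<tid>/status`, and the helper names (`parse_cpu_list`, `thread_affinities`, `threads_sharing_core`) are made up for illustration and are not part of Intel MPI or the OpenMP runtime.

```python
import os


def parse_cpu_list(text):
    """Expand a Linux CPU list string such as '0-3,8' into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def thread_affinities(pid="self"):
    """Return {tid: (thread name, allowed cores)} read from /proc (Linux only)."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        name, cores = "", set()
        try:
            with open(f"{task_dir}/{tid}/status") as status:
                for line in status:
                    if line.startswith("Name:"):
                        name = line.split(":", 1)[1].strip()
                    elif line.startswith("Cpus_allowed_list:"):
                        cores = parse_cpu_list(line.split(":", 1)[1])
        except FileNotFoundError:
            continue  # the thread exited while we were scanning
        result[int(tid)] = (name, cores)
    return result


def threads_sharing_core(affinities, core):
    """Return the sorted thread ids whose allowed cores include `core`."""
    return sorted(tid for tid, (_, cores) in affinities.items() if core in cores)
```

Calling `threads_sharing_core(thread_affinities(pid), 67)` with the PID of one training rank would list every thread allowed on core 67; if more than the communication thread shows up there, that would match what VTune reports.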
Any ideas? Thanks.