Hello,
Could you please take a look at this problem? My machine has 16 CPUs and 4 MICs (47 cores each), and I run my program with 8 MPI processes (mpi_comm_size = 8). I want to use MKL routines in automatic offload (AO) mode. As you can see in the attached test code, I tried three different methods.
METHOD-1: I assign each of the 4 MICs to one of the first 4 MPI processes and let the other processes run without a MIC. In this case the program works as expected, and I got the following performance result when solving zgemm for 5k*5k complex dense matrices (a simplified sketch of this setup follows the table).
CPU_ID    0     1     2     3     4      5      6      7
time(s)   1.67  1.93  1.97  1.93  13.85  12.94  12.94  12.93
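In essence, the setup looks something like the sketch below; this is simplified, and the device mapping, matrix initialization, and timing code are illustrative rather than a copy of the attached file:

```c
#include <stdio.h>
#include <mpi.h>
#include <mkl.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nmic = mkl_mic_get_device_count();   /* 4 coprocessors on this system */

    if (rank < nmic) {
        /* Ranks 0..3: enable Automatic Offload and send all offloadable
           work to "their" coprocessor; give the other coprocessors none. */
        mkl_mic_enable();
        for (int d = 0; d < nmic; ++d)
            mkl_mic_set_workdivision(MKL_TARGET_MIC, d, d == rank ? 1.0 : 0.0);
    }
    /* Ranks 4..7 never call mkl_mic_enable(), so their zgemm stays on the host. */

    const MKL_INT n = 5000;
    MKL_Complex16 *a = mkl_malloc((size_t)n * n * sizeof(MKL_Complex16), 64);
    MKL_Complex16 *b = mkl_malloc((size_t)n * n * sizeof(MKL_Complex16), 64);
    MKL_Complex16 *c = mkl_malloc((size_t)n * n * sizeof(MKL_Complex16), 64);
    MKL_Complex16 one = {1.0, 0.0}, zero = {0.0, 0.0};
    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = one; b[i] = one; c[i] = zero; }

    double t = MPI_Wtime();
    cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, &one, a, n, b, n, &zero, c, n);
    t = MPI_Wtime() - t;
    printf("rank %d: zgemm took %.2f s\n", rank, t);
    fflush(stdout);

    mkl_free(a); mkl_free(b); mkl_free(c);
    MPI_Finalize();
    return 0;
}
```

With this mapping, ranks 0-3 offload their zgemm to a dedicated card while ranks 4-7 compute on the host, which matches the roughly 2 s versus 13 s split in the table above.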
METHOD-2: Now, this is the problematic situation. I want all 8 MPI processes to share the 4 MICs equally, expecting each process to take about 4 seconds for the same zgemm problem as in METHOD-1 (a short setup sketch follows the error output below). However, this method does not work: it gives error messages either right away or after solving its first zgemm problem,
*** glibc detected *** ../../../bin/test: malloc(): memory corruption: 0x00007f59fc000010 ***
or
CPU_ID    0     1     2     3     4    5    6    7
time(s)   101   10    101   95    26   25   14   14
*** glibc detected *** ../../../bin/test: free(): corrupted unsorted chunks: 0x0000000009f47270 ***
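The intent of METHOD-2, roughly sketched (mapping two ranks per card via rank % 4 is an assumption; the attached code may differ in the details):

```c
#include <mkl.h>

/* METHOD-2 setup sketch: every rank enables Automatic Offload and routes
   all of its offloadable work to one shared coprocessor.
   Assumed mapping: rank % number_of_mics, so two ranks share each card. */
static void setup_method2(int rank)
{
    int nmic = mkl_mic_get_device_count();   /* 4 on this system */
    int my_mic = rank % nmic;                /* ranks 0 and 4 share MIC 0, etc. */

    mkl_mic_enable();
    for (int d = 0; d < nmic; ++d)
        mkl_mic_set_workdivision(MKL_TARGET_MIC, d, d == my_mic ? 1.0 : 0.0);
}
```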
METHOD-3: If I replace mkl_mic_set_workdivision() with mkl_mic_set_resource_limit(), the program does not crash, but there is no response at all; I see that the CPU and MIC usage is almost zero.
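Roughly, the METHOD-3 setup changes to something like this (the 0.5 fraction is an assumption for two ranks sharing each card; the attached code may use a different value):

```c
#include <mkl.h>

/* METHOD-3 setup sketch: instead of steering the work division explicitly,
   enable AO and cap the fraction of coprocessor resources (threads/memory)
   that this process may use. */
static void setup_method3(void)
{
    mkl_mic_enable();
    mkl_mic_set_resource_limit(0.5);   /* use at most half of a card's resources */
}
```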
Please take a look at the attached piece of my code and give me some advice.
Thank you.
As I interpret your description, you have one host system with 16 cores total across the host processors, and 4 coprocessor cards installed in that host with 47(?) cores per coprocessor. If so, you probably meant 57 cores per coprocessor rather than 47.
You are the only person using the system at the time you run your tests, correct? The resource reservation process doesn't work if more than one user is using the coprocessors at that time.
For case two, try replacing 'mkl_mic_set_workdivision(MKL_TARGET_MIC, _mpi_comm_rank, 1.0);' with 'mkl_mic_set_workdivision(MKL_TARGET_MIC, _mpi_comm_rank, MKL_MIC_AUTO_WORKDIVISION);'. I believe using 1.0 tells MKL to allocate all the memory on a coprocessor to each process running on it. Since there is no swap space on the coprocessor, you will run into memory allocation problems; exactly which problems show up depends on the variability of the relative execution times of the MPI ranks.
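These mkl_mic_* calls also return a status code, so checking it per rank may help pinpoint where a request is being rejected. A minimal sketch, assuming a return value of 0 indicates success (the helper name is illustrative):

```c
#include <stdio.h>
#include <mkl.h>

/* Request automatic work division on the coprocessor this rank uses,
   and report the status code if the request is not accepted. */
static void set_auto_workdivision(int mic_index)
{
    int rc = mkl_mic_set_workdivision(MKL_TARGET_MIC, mic_index,
                                      MKL_MIC_AUTO_WORKDIVISION);
    if (rc != 0)
        fprintf(stderr, "mkl_mic_set_workdivision(MIC %d) failed, rc=%d\n",
                mic_index, rc);
}
```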
I'm not seeing the problem with case three. Do you think you can collect more information on this? Maybe use micsmc to see how many cores and how much memory are being used on the coprocessors, if indeed the program even gets to the point where the library offloads work? Check /var/log/mpssd to see whether the MPSS reports any problems? Add calls to some of the mkl_mic_get routines, along with print and flush statements, to see what the memory and process limits are being set to?
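For example, something along these lines, printed from every rank before the first zgemm call (a sketch; the query routines shown are mkl_mic_get_workdivision and mkl_mic_get_resource_limit):

```c
#include <stdio.h>
#include <mkl.h>

/* Print what MKL has actually configured for each coprocessor,
   flushing so the output survives a later crash. */
static void dump_mic_settings(int rank)
{
    int nmic = mkl_mic_get_device_count();
    double limit = -1.0;
    mkl_mic_get_resource_limit(&limit);
    printf("rank %d: %d coprocessors, resource limit %.2f\n", rank, nmic, limit);

    for (int d = 0; d < nmic; ++d) {
        double wd = -1.0;
        mkl_mic_get_workdivision(MKL_TARGET_MIC, d, &wd);
        printf("rank %d: MIC %d workdivision %.2f\n", rank, d, wd);
    }
    fflush(stdout);
}
```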
The case-2 approach using the mkl_mic_set_workdivision() function still does not work. However, after rebooting the system, the case-3 approach using mkl_mic_set_resource_limit() works. Thanks for the help.
