Solved: Intel Micro Benchmark latest version 3.2.4 --> integer overflow

phonlawat_k_ · ‎04-22-2014

I'm new to this forum, and this is my first time using Intel Micro Benchmark to test FDR InfiniBand performance.
I am using Intel Micro Benchmark 3.2.4 and have a problem with Allgather at 1024 processes and a 4 MB message size.

# Benchmarking Allgather
# #processes = 1024
# ( 256 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.01 0.01 0.01
1 1000 104.29 104.40 104.33
2 1000 114.27 114.37 114.31
4 1000 136.66 136.75 136.71
8 1000 177.69 177.80 177.73
16 1000 238.50 238.66 238.60
32 1000 333.43 333.62 333.54
64 1000 697.89 698.26 698.05
128 1000 854.70 855.05 854.87
256 1000 930.07 930.48 930.27
512 1000 1090.33 1090.75 1090.52
1024 875 1642.99 1643.97 1643.47
2048 812 2324.69 2325.83 2325.25
4096 812 4665.67 4668.12 4666.91
8192 812 6475.55 6477.72 6476.66
16384 812 10368.64 10369.84 10369.25
32768 533 18721.53 18725.46 18723.87
65536 294 33836.31 33844.84 33840.93
131072 154 65191.53 65219.06 65204.86
262144 78 128504.25 128580.44 128542.72
524288 38 270415.55 270791.08 270577.44
1048576 14 598741.36 601155.43 600013.42
2097152 7 1259113.14 1267490.28 1263382.81
2 total processes killed (some possibly by mpirun during cleanup)

Gatherv and Allgatherv (whose results are quite similar) report an integer overflow:

#----------------------------------------------------------------
# Benchmarking Gatherv
# #processes = 512
# ( 768 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.19 0.11
1 1000 62.69 91.45 78.84
2 1000 60.23 76.77 73.33
4 1000 70.90 87.93 85.35
8 1000 54.29 76.14 70.08
16 1000 62.76 81.96 76.67
32 1000 64.86 84.97 80.23
64 1000 73.83 93.81 88.77
128 1000 75.35 100.01 93.76
256 1000 83.21 112.15 103.90
512 1000 101.91 134.05 119.33
1024 1000 134.34 167.31 148.18
2048 1000 221.45 263.69 239.97
4096 1000 377.56 463.34 406.43
8192 1000 3299.10 3680.48 3571.74
16384 1000 2861.94 3275.25 3193.59
32768 808 4740.91 5192.17 5009.50
65536 623 3426.82 9136.07 7408.62
131072 320 11377.57 14969.00 12842.39
262144 55 90220.56 93014.00 92067.95
524288 55 77019.98 79891.46 78945.73
1048576 40 129610.60 136606.38 134520.82
2097152 20 239053.26 266906.06 259315.17
4194304 int-overflow.; The production rank*size caused int overflow for given sample

After hitting this problem, I checked dmesg, which showed nothing related to it. I also saw a comment below Intel Micro Benchmark 3.2.4 that suggests a size_t cast in IMB_mem_manager.c --> r_len = c_info->num_procs*(size_t)init_size;. After changing this file and recompiling Intel Micro Benchmark, I still have the same problem. I also tried other versions (3.2.2, 3.2.3 and 4.0.0 beta) and got the same result.

P.S. Sorry for my poor writing; I hope to improve it.

Thank you very much.

phonlawat_k_ · ‎04-22-2014

Here are more details; I have also attached the log file for the 64-node test case from Intel Micro Benchmark.

Compiler: Intel compiler composer_xe_2013_sp1.1.106 with Open MPI 1.6.5

Test case: 64 nodes, 1280 cores, Allgather, Allgatherv

I was able to fix the integer overflow, but the "2 total processes killed" problem from mpirun still occurs. Here is some output from the log file:

MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
0 /lib64/libc.so.6() [0x38fbc329a0]
1 /lib64/libc.so.6(memcpy+0x15b) [0x38fbc89aab]
2 /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(+0xfd161) [0x7f90f2105161]
3 /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(ompi_datatype_sndrcv+0x52f) [0x7f90f2074fff]
4 /usr/mpi/gcc/openmpi-1.6.5/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgatherv_intra_neighborexchange+0x8d) [0x7f90ac11a17d]
5 /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_Allgatherv+0x80) [0x7f90f20759c0]
6 IMB-MPI1(IMB_allgatherv+0x122) [0x40e062]
7 IMB-MPI1(IMB_init_buffers_iter+0x588) [0x408648]
8 IMB-MPI1(main+0x4b7) [0x404667]
9 /lib64/libc.so.6(__libc_start_main+0xfd) [0x38fbc1ed1d]
10 IMB-MPI1() [0x4040e9]
===================
mpirun noticed that process rank 1021 with PID 14100 on node prod-0053 exited on signal 11 (Segmentation fault).

Thank you

phonlawat_k_ · ‎04-24-2014

Thanks for the lack of response; maybe it is due to my poor writing and unclear meaning. Anyway, I was able to figure it out. Thanks again for zero responses.

James_T_Intel · ‎05-05-2014

I apologize your concern wasn't addressed sooner. If you're still watching and would like to work with us to resolve this, let's see what we can do. How are you compiling the Intel® MPI Benchmarks?

phonlawat_k_ · ‎05-08-2014

OK, thank you. Here is all of my information: I compile Intel Micro Benchmark 3.2.4 with Open MPI 1.6.5 and Intel compiler composer_xe_2013_sp1.1.106, using the command make IMB-MPI1. My problem is an integer overflow in Allgather, Allgatherv, and Gather, because the product of the message size and the number of processes exceeds the range of an int. I found a workaround by changing the int variables to long, and that problem went away. However, when I use more than 512 processes, a new error ("mpirun noticed that process rank") appears in the trace file from Intel Micro Benchmark, so for now I limit the number of processes to 512 or fewer and the message size to no more than 4 MB. I guess this problem would not occur with Intel MPI. Have you ever run Intel Micro Benchmark with Intel MPI, Open MPI, or something similar on a large-scale cluster over InfiniBand?

P.S. Sorry for my poor writing.

James_T_Intel · ‎05-09-2014

The integer overflow notice is due to limitations on message sizes. The Intel® MPI Library currently has a maximum message size limit of 2 GB. This is due to how message counts are represented within the MPI standard (as 32-bit signed integers). The Intel® MPI Benchmarks include a safety check to ensure that messages are not over that limit. In the MPI 3 standard, this limit can be circumvented by using MPI_Count. Version 5 of the Intel® MPI Library will support this. If you want to try it now, the Beta is available; visit http://bit.ly/sw-dev-tools-2015-beta for more information.

phonlawat_k_ · ‎05-09-2014

Thank you. This is a good time to try the beta version.