Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI Benchmarks 3.2.4 (latest version): integer overflow

phonlawat_k_
Beginner

I'm new to this forum, and this is my first time using the Intel MPI Benchmarks (IMB) to test FDR InfiniBand performance.
I am using IMB 3.2.4, and I have a problem with Allgather at 1024 processes and a 4 MB message size:

# Benchmarking Allgather 
# #processes = 1024 
# ( 256 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.01         0.01         0.01
            1         1000       104.29       104.40       104.33
            2         1000       114.27       114.37       114.31
            4         1000       136.66       136.75       136.71
            8         1000       177.69       177.80       177.73
           16         1000       238.50       238.66       238.60
           32         1000       333.43       333.62       333.54
           64         1000       697.89       698.26       698.05
          128         1000       854.70       855.05       854.87
          256         1000       930.07       930.48       930.27
          512         1000      1090.33      1090.75      1090.52
         1024          875      1642.99      1643.97      1643.47
         2048          812      2324.69      2325.83      2325.25
         4096          812      4665.67      4668.12      4666.91
         8192          812      6475.55      6477.72      6476.66
        16384          812     10368.64     10369.84     10369.25
        32768          533     18721.53     18725.46     18723.87
        65536          294     33836.31     33844.84     33840.93
       131072          154     65191.53     65219.06     65204.86
       262144           78    128504.25    128580.44    128542.72
       524288           38    270415.55    270791.08    270577.44
      1048576           14    598741.36    601155.43    600013.42
      2097152            7   1259113.14   1267490.28   1263382.81
2 total processes killed (some possibly by mpirun during cleanup)

Gatherv and Allgatherv (the results are quite similar) report an integer overflow:

#----------------------------------------------------------------
# Benchmarking Gatherv 
# #processes = 512 
# ( 768 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.07         0.19         0.11
            1         1000        62.69        91.45        78.84
            2         1000        60.23        76.77        73.33
            4         1000        70.90        87.93        85.35
            8         1000        54.29        76.14        70.08
           16         1000        62.76        81.96        76.67
           32         1000        64.86        84.97        80.23
           64         1000        73.83        93.81        88.77
          128         1000        75.35       100.01        93.76
          256         1000        83.21       112.15       103.90
          512         1000       101.91       134.05       119.33
         1024         1000       134.34       167.31       148.18
         2048         1000       221.45       263.69       239.97
         4096         1000       377.56       463.34       406.43
         8192         1000      3299.10      3680.48      3571.74
        16384         1000      2861.94      3275.25      3193.59
        32768          808      4740.91      5192.17      5009.50
        65536          623      3426.82      9136.07      7408.62
       131072          320     11377.57     14969.00     12842.39
       262144           55     90220.56     93014.00     92067.95
       524288           55     77019.98     79891.46     78945.73
      1048576           40    129610.60    136606.38    134520.82
      2097152           20    239053.26    266906.06    259315.17
      4194304 int-overflow.; The production rank*size caused int overflow for given sample

After hitting this problem, I checked dmesg, but it shows nothing related to it. I then saw a comment posted under Intel MPI Benchmarks 3.2.4 that suggests using size_t in IMB_mem_manager.c: r_len = c_info->num_procs*(size_t)init_size;. I changed that line and recompiled IMB, but the problem remains. I also tried versions 3.2.2, 3.2.3, and 4.0.0 beta, with the same result.

P.S. Sorry for my poor writing; I hope to improve it.

Thank you very much. 

phonlawat_k_
Beginner

Some additional details: I've attached the IMB log file for the 64-node test case.

   Compiler: Intel compiler composer_xe_2013_sp1.1.106; MPI: Open MPI 1.6.5

   Test case: 64 nodes, 1280 cores, Allgather and Allgatherv

I was able to fix the integer overflow, but the "2 total processes killed" problem still occurs. Here is part of the log file:

MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
MXM: Got signal 11 (Segmentation fault)
==== backtrace ====
    0  /lib64/libc.so.6() [0x38fbc329a0]
    1  /lib64/libc.so.6(memcpy+0x15b) [0x38fbc89aab]
    2  /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(+0xfd161) [0x7f90f2105161]
    3  /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(ompi_datatype_sndrcv+0x52f) [0x7f90f2074fff]
    4  /usr/mpi/gcc/openmpi-1.6.5/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgatherv_intra_neighborexchange+0x8d) [0x7f90ac11a17d]
    5  /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_Allgatherv+0x80) [0x7f90f20759c0]
    6  IMB-MPI1(IMB_allgatherv+0x122) [0x40e062]
    7  IMB-MPI1(IMB_init_buffers_iter+0x588) [0x408648]
    8  IMB-MPI1(main+0x4b7) [0x404667]
    9  /lib64/libc.so.6(__libc_start_main+0xfd) [0x38fbc1ed1d]
   10  IMB-MPI1() [0x4040e9]
===================
mpirun noticed that process rank 1021 with PID 14100 on node prod-0053 exited on signal 11 (Segmentation fault).

    Thank you

   

phonlawat_k_
Beginner

Thanks for the lack of response; maybe it's my poor writing or my ambiguous wording. Anyway, I managed to figure it out. Thanks again for the zero responses.

James_T_Intel
Moderator

I apologize that your concern wasn't addressed sooner.  If you're still watching and would like to work with us to resolve this, let's see what we can do.  How are you compiling the Intel® MPI Benchmarks?

phonlawat_k_
Beginner

OK, thank you. For reference: I compile Intel MPI Benchmarks 3.2.4 with Open MPI 1.6.5 and the Intel compiler (composer_xe_2013_sp1.1.106), using the command make IMB-MPI1. My problem is an integer overflow in Allgather, Allgatherv, and Gather, because the product of the message size and the number of processes exceeds the range of int. I found a workaround by changing the int variables to long, and the overflow went away. However, with more than 512 processes a new error ("mpirun noticed that process rank ...") appears in the IMB trace file, so for now I limit runs to at most 512 processes and message sizes of at most 4 MB. I suspect this problem would not occur with the Intel MPI Library. Have you ever run the Intel MPI Benchmarks with Intel MPI, Open MPI, or a similar library on a large-scale InfiniBand cluster?

P.S. Sorry for my poor writing.

James_T_Intel
Moderator

The integer overflow notice is due to a limit on message sizes.  The Intel® MPI Library currently has a maximum message size of 2 GB, because counts in the MPI standard are represented as 32-bit integers.  The Intel® MPI Benchmarks include a safety check to ensure that messages do not exceed that limit.  The MPI-3 standard provides a way around this limit through the MPI_Count type.  Version 5 of the Intel® MPI Library will support this.  If you want to try it now, it is currently in beta; visit http://bit.ly/sw-dev-tools-2015-beta for more information.

phonlawat_k_
Beginner

Thank you. This is a good time to try the beta version.
