Brian_S_3

Error using more than 1 node

HELP!
I have some code that solves Ax=b using an iterative method. I read in a partitioned file and load each process's own portion of A and b, then iterate toward a solution. Everything is fine until I run on more than one node. I have a 64-partition example that runs well on 1 node (20 physical cores), but when I run it on 40, 60, or 80 cores the behavior is erroneous: I get wrong results or MPI hangs. I am going crazy over this. The unpartitioned Ax=b works, and so do 8 and 16 partitions. 32 partitions work consistently on 2 nodes but take extra time for some reason. 64 partitions will hang or hit the iteration limit and give incorrect results.

I'm using the Eigen matrix library, doing a simple MPI_Isend and MPI_Recv multiple times, and doing an MPI_Allreduce.

I've tried MPICH, Open MPI, and Intel MPI. The code compiles without warnings under all of them, but all suffer from the same issue. Open MPI seems to fare worse on my machine and generates some extra warnings at run time.

I'm using a Cray CS300-LC Linux cluster:

2560 compute cores (2.8 GHz Intel Xeon E5-2680 v2), 128 nodes

15,360 coprocessor cores (Intel Xeon Phi 5110P), two per node

8 Terabytes of RAM

FDR InfiniBand Network

 

Any help is appreciated!

 
 

Thanks,

Brian

James_T_Intel

The Intel® Trace Analyzer and Collector has a message checking capability. I would recommend using it to examine your code's MPI calls. See https://software.intel.com/en-us/articles/intel-trace-analyzer-and-collector-for-linux-intel-mpi-cor... for details on how to use it.
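
A cheap check that can complement a message checker: hangs and wrong answers in an Isend/Recv exchange can come from one rank sending a different number of values than its neighbor expects, so it may be worth comparing every partition's send and receive counts offline before launching the job. The sketch below is only an illustration. `ExchangePlan` and `find_mismatches` are hypothetical names, and it assumes the per-neighbor counts can be extracted from the partitioned input file, which the original post does not confirm.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one partition's exchange plan (an assumption,
// not taken from the original code):
//   send_count[q] = number of values this rank sends to rank q
//   recv_count[q] = number of values this rank expects from rank q
struct ExchangePlan {
    std::map<int, int> send_count;
    std::map<int, int> recv_count;
};

// Returns one message per inconsistency found. An empty result means every
// planned send has a matching receive of the same length.
std::vector<std::string> find_mismatches(const std::vector<ExchangePlan>& plans) {
    std::vector<std::string> problems;
    const int n = static_cast<int>(plans.size());

    for (int p = 0; p < n; ++p) {
        // Every send from p to q must be matched by a receive of equal length on q.
        for (const auto& kv : plans[p].send_count) {
            const int q = kv.first;
            const int sent = kv.second;
            if (q < 0 || q >= n) {
                problems.push_back("rank " + std::to_string(p) +
                                   " sends to unknown rank " + std::to_string(q));
                continue;
            }
            const auto it = plans[q].recv_count.find(p);
            const int expected = (it == plans[q].recv_count.end()) ? 0 : it->second;
            if (expected != sent) {
                problems.push_back("rank " + std::to_string(p) + " sends " +
                                   std::to_string(sent) + " values to rank " +
                                   std::to_string(q) + ", which expects " +
                                   std::to_string(expected));
            }
        }
        // A blocking receive with no send entry on the other side would never complete.
        for (const auto& kv : plans[p].recv_count) {
            const int q = kv.first;
            if (q < 0 || q >= n) {
                problems.push_back("rank " + std::to_string(p) +
                                   " receives from unknown rank " + std::to_string(q));
                continue;
            }
            if (kv.second != 0 &&
                plans[q].send_count.find(p) == plans[q].send_count.end()) {
                problems.push_back("rank " + std::to_string(p) +
                                   " expects data from rank " + std::to_string(q) +
                                   ", which never sends to it");
            }
        }
    }
    return problems;
}
```

An empty result only says the plan itself is consistent. It does not rule out other causes, such as modifying a send buffer before the corresponding MPI_Isend has completed, which is the kind of error a runtime message checker is better placed to catch.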
