I have some code that solves Ax=b with a simple iterative method. I read in a partitioned file and load each process's own piece of A and b, then iterate to solve. Everything is great until I run on more than one node. I have a 64-partition example that runs fine on 1 node (20 physical cores), but on 40, 60, or 80 cores I get erroneous behavior (wrong results, or MPI hangs). I am going crazy over this. The unpartitioned Ax=b works great, 8 and 16 partitions work great, 32 works consistently on 2 nodes but for some reason takes extra time, and 64 either hangs or hits the iteration limit and gives incorrect results.
I'm using the Eigen matrix library, doing a simple MPI_Isend/MPI_Recv exchange multiple times per iteration, and an MPI_Allreduce.
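For what it's worth, "works on 1 node, hangs or gives wrong answers on more" is the classic signature of an incomplete nonblocking exchange: MPI_Isend returns immediately, so the send buffer must not be modified (and every request must be completed with MPI_Wait/MPI_Waitall) before the next iteration. A minimal sketch of the safe pattern, using an illustrative ring exchange (the neighbor setup, buffer names, and residual are placeholders, not your actual code):

```c
/* Sketch: nonblocking neighbor exchange + Allreduce per iteration.
 * All names (left/right ring neighbors, recv_l/recv_r, the fake
 * residual) are illustrative stand-ins for the real solver's data. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* illustrative neighbors */
    int right = (rank + 1) % size;

    double send_l = rank, send_r = rank, recv_l, recv_r;
    double local_res = 1.0, global_res = 1.0;

    for (int iter = 0; iter < 50 && global_res > 1e-8; ++iter) {
        MPI_Request reqs[4];
        /* Post receives first, then sends, to avoid ordering stalls. */
        MPI_Irecv(&recv_l, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_r, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&send_r, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&send_l, 1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);

        /* Complete ALL requests before reusing any send or recv buffer.
         * Skipping or partially skipping this step often still "works"
         * within one node (shared-memory transport completes sends
         * eagerly) and only breaks over InfiniBand at larger scale. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        /* ... local SpMV / update using recv_l, recv_r ... */
        local_res *= 0.5;   /* stand-in for the real residual norm */
        MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        /* global_res is identical on every rank, so every rank makes
         * the same stop/continue decision -- diverging here is another
         * common cause of hangs. */
    }

    MPI_Finalize();
    return 0;
}
```

Also worth checking that your tags and source ranks match exactly on both sides once more than one partition lands per node; a wildcard MPI_ANY_SOURCE receive that happened to match correctly at small scale can pair with the wrong message at 64.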
I've tried MPICH, Open MPI, and Intel MPI; they all compile without warnings, but all suffer from the same issue. Open MPI seems to suffer worst on my machine, generating some extra warnings at run time.
I'm using a Cray CS300-LC Linux Cluster:
2560 compute cores (2.8 GHz Intel Xeon E5-2680 v2), 128 nodes
15,360 coprocessor cores (Intel Xeon Phi 5110P), two per node
8 Terabytes of RAM
FDR InfiniBand Network
Any help is appreciated!
The Intel® Trace Analyzer and Collector has a Message Checking capability. I would recommend using it to examine your code. See https://software.intel.com/en-us/articles/intel-trace-analyzer-and-collector-for-linux-intel-mpi-cor... for details on how to use it.
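If memory serves, when the Intel MPI and ITAC environments are loaded, the checker can be enabled directly from the launcher (binary name and rank count below are just placeholders for your job):

```shell
# Assumes Intel MPI + Intel Trace Analyzer and Collector are in the
# environment (e.g. via their setvars/psxevars scripts). -check_mpi
# preloads the correctness-checking library at run time and reports
# deadlocks, unsafe buffer reuse, and send/receive mismatches.
mpirun -check_mpi -n 64 ./solver partitioned_input
```

Running the failing 64-partition case this way on 2+ nodes should point at the first rank pair where the exchange goes wrong.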