MMagg
Beginner
186 Views

Wrong result in MPI Allreduce


I have encountered a strange problem in a module of an astrophysical simulation program. Essentially, the sums from an MPI_Allreduce come back as garbage if the code is run on more than approximately 1100 cores. I dumped the input data of the Allreduce call into binary files to be able to reproduce the issue. The code and the input data are attached. Here is the basic issue:

The code works with two arrays of length N whose element type is a struct "buffer":

struct buffer
{
double com[3];
double mom[3];
double mass;
} * in, *out;

These are allocated and initialized:

in = malloc(N * sizeof(struct buffer));
out = malloc(N * sizeof(struct buffer));
memset(in, 0, N * sizeof(struct buffer));
memset(out, 0, N * sizeof(struct buffer));

I read the input from binary files that I previously dumped on each task:

sprintf(fname, "data/in.%d", task);
in_file = fopen(fname, "rb");
fread(in, sizeof(double), 7*N, in_file);
fclose(in_file);
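
When reproducing a problem from dumped files, it may help to rule out a short or missing file first, since the snippet above does not check the return values of fopen or fread. A minimal sketch of a checked read follows; the helper name read_doubles is hypothetical and not part of the original code:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): read exactly `count`
   doubles from the binary file `path` into `dst`.
   Returns 0 on success, -1 if the file cannot be opened or holds fewer
   than `count` doubles. */
static int read_doubles(const char *path, double *dst, size_t count)
{
  FILE *f = fopen(path, "rb");
  if (f == NULL)
    return -1;

  /* fread returns the number of complete items actually read. */
  size_t got = fread(dst, sizeof(double), count, f);
  fclose(f);

  return got == count ? 0 : -1;
}
```

With the names from the post, the call would look like read_doubles(fname, (double *)in, 7 * (size_t)N), aborting the run if it returns -1.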

And then they are all summed up:

  MPI_Allreduce(in, out, 7 * N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

The result of this summation is correct if I run with 1100 tasks, but it is garbage for the last three elements of the array if I run with 1200 tasks. Is there an error in the way we implemented this? Should I expect this to work, and if not, what should I do instead?
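
One way to pin down which entries are wrong is to build a serial reference from the dumped per-task inputs and compare it with the Allreduce output. The sketch below is only an illustration: the helper names accumulate and first_mismatch are hypothetical, and the relative tolerance is an assumption, needed because the reduction may add the contributions in a different order than a serial loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add one task's input (n doubles) into a running
   serial reference sum. Call once per dumped input file. */
static void accumulate(double *sum, const double *in, size_t n)
{
  assert(sum != NULL && in != NULL);
  for (size_t i = 0; i < n; i++)
    sum[i] += in[i];
}

/* Hypothetical helper: return the index of the first element where `got`
   differs from `ref` by more than rel_tol times the larger magnitude,
   or n if all elements agree. NaN entries count as mismatches. */
static size_t first_mismatch(const double *got, const double *ref,
                             size_t n, double rel_tol)
{
  for (size_t i = 0; i < n; i++)
    {
      double diff  = got[i] - ref[i];
      double a     = got[i] < 0 ? -got[i] : got[i];
      double b     = ref[i] < 0 ? -ref[i] : ref[i];
      double scale = a > b ? a : b;
      if (diff < 0)
        diff = -diff;
      /* Written as !(<=) so that a NaN difference is reported. */
      if (!(diff <= rel_tol * scale))
        return i;
    }
  return n;
}
```

With the names from the post, the arrays would be passed as (const double *)out and a reference array of 7 * N doubles; dividing a returned index by 7 gives the struct element and the remainder gives the field.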

Many thanks for the help in advance!

Further details:
- The error occurs on several machines with Intel MPI 2019, but not with 2018 or earlier.
- I was not sure whether I can or should expect the data to be contiguous. I compared memory addresses and found that the data is where I expect it to be, i.e. &in[i].mass lies 56*i bytes after &in[0].mass, and the same holds for out, even in the cases where the results of the Allreduce are wrong.
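
The contiguity question can also be settled at compile time rather than by comparing addresses at run time. The following is a minimal sketch, assuming a C11 compiler; the helper mass_stride_bytes is hypothetical and only mirrors the address comparison described above.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct in the question. */
struct buffer
{
  double com[3];
  double mom[3];
  double mass;
};

/* C11 compile-time checks: the struct must be exactly 7 doubles with no
   padding, otherwise treating N structs as 7*N MPI_DOUBLEs would not be
   valid. */
static_assert(sizeof(struct buffer) == 7 * sizeof(double),
              "struct buffer contains padding");
static_assert(offsetof(struct buffer, mom) == 3 * sizeof(double),
              "unexpected offset of mom");
static_assert(offsetof(struct buffer, mass) == 6 * sizeof(double),
              "unexpected offset of mass");

/* Hypothetical helper: byte distance between the mass field of element i
   and the mass field of element 0. */
static size_t mass_stride_bytes(const struct buffer *b, size_t i)
{
  return (size_t)((const char *)&b[i].mass - (const char *)&b[0].mass);
}
```

If these assertions compile, the array of structs is laid out as a flat sequence of doubles, which is consistent with the 56-byte stride reported above on platforms with 8-byte doubles.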

1 Solution
AbhishekD_Intel
Moderator
158 Views

Hi Mattis,

 

This is a known issue with MPI_Allreduce in the earlier 2019 releases. For details, please see the link below (Known Issues and Limitations -> 2019 Update 3):

https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-library-release-notes-linux....

 

It appears that this issue has been fixed in 2019 Update 7; see the link below (What's New -> 2019 Update 7):

https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-library-release-notes-linux....

 

Could you please run it with the latest Intel MPI (2019 Update 8) and let us know whether you still face the same issue?

 

 

Warm Regards,

Abhishek

 


3 Replies

MMagg
Beginner
145 Views

Dear Abhishek

It appears my issue was exactly this. I tested it with MPI 2019.7 and it works there. I do not have 2019.8 available. I wasn't aware of this issue before, and probably should have looked at the version logs.

Many thanks for your help.

Best regards,
Mattis

AbhishekD_Intel
Moderator
140 Views

Dear Mattis,

Thanks for the confirmation. Glad to know that the provided details helped. We won't be monitoring this thread anymore. Kindly raise a new thread if you need further assistance.

 


Warm Regards,

Abhishek

 

 

 
