Hi,
I am working with Intel MPI version 4.1.0.024 and I have run into a problem with the MPI_Barrier() function (maybe a bug).
In the attached code I create a new process with MPI_Comm_spawn. Then I merge the resulting intercommunicator into a
single intracommunicator with MPI_Intercomm_merge and call MPI_Barrier() on the new
communicator.
The problem is that some processes never return from MPI_Barrier(), so they do not continue executing.
I have tested the code with other MPI implementations and it works fine.
Is there any solution?
Thanks,
Iván Cores.
The code is:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main( int argc, char *argv[] )
{
    MPI_Comm parentcomm, intercomm;

    printf("Starting ...\n");
    MPI_Init( &argc, &argv );
    MPI_Comm_get_parent( &parentcomm );

    if (parentcomm == MPI_COMM_NULL)
    {
        char *newHost;
        newHost = (char *)malloc(sizeof(char) * 255);

        // Open the file
        // Read host for new process from source file
        // For these tests:
        memcpy(newHost, "compute-0-0");

        // Host for new process
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", newHost);

        // Create 1 more process
        int errcodes[1];
        MPI_Comm_spawn( "~/testSpawn/spawn_example", MPI_ARGV_NULL, 1, info, 0, MPI_COMM_WORLD, &intercomm, errcodes );

        char hostname[256];
        gethostname(hostname,255);
        printf(" I'm the parent %s.\n", hostname);

        // Merge between the intercomm and the intracomm
        MPI_Comm comm_new_and_old;
        MPI_Intercomm_merge(intercomm, 0, &comm_new_and_old);

        int npesNEW = -1;
        int myidNEW = -1;
        MPI_Comm_size(comm_new_and_old, &npesNEW);
        MPI_Comm_rank(comm_new_and_old, &myidNEW);
        printf(" Im %d of %d.\n", myidNEW, npesNEW);

        // PROBLEMATIC BARRIER.
        MPI_Barrier(comm_new_and_old);
        printf(" After barrier %d\n", myidNEW);

        MPI_Comm_free(&comm_new_and_old);
    }
    else
    {
        char hostname2[256];
        gethostname(hostname2,255);
        printf(" I'm the spawned %s.\n", hostname2);

        // Merge between the intercomm and the intracomm
        MPI_Comm comm_new_and_old;
        MPI_Intercomm_merge(parentcomm, 1, &comm_new_and_old);

        int npesNEW = -1;
        int myidNEW = -1;
        MPI_Comm_size(comm_new_and_old, &npesNEW);
        MPI_Comm_rank(comm_new_and_old, &myidNEW);
        printf(" Im %d of %d (New).\n", myidNEW, npesNEW);

        // PROBLEMATIC BARRIER.
        MPI_Barrier(comm_new_and_old);
        printf(" After barrier (New proc.)\n");

        MPI_Comm_free(&comm_new_and_old);
    }

    MPI_Finalize();
    return 0;
}
Hi Ivan,
I had to make some modifications to your program (memcpy needs the length argument, and changing names of the host and the executable to launch), but with those modifications I was able to compile and run with no problems using 4.1.0.024. What compiler are you using?
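For readers following along, here is a minimal sketch of the memcpy fix mentioned above, assuming the only goal is to copy the host name into the 255-byte buffer. The helper name copy_host_name is hypothetical and does not appear in the original program. It uses snprintf rather than memcpy so that the copy is bounded and always NUL-terminated.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the original post): copy a host name
 * into a fixed-size buffer.
 * Returns 0 on success, -1 on bad arguments or if the name does not fit. */
int copy_host_name(char *dst, size_t cap, const char *src)
{
    if (dst == NULL || src == NULL || cap == 0)
        return -1;

    /* snprintf writes at most cap - 1 characters plus the terminating NUL
     * and returns the length the full string would have needed. */
    int n = snprintf(dst, cap, "%s", src);
    if (n < 0 || (size_t)n >= cap)
        return -1; /* encoding error, or the name was truncated */

    return 0;
}
```

In the posted program this would stand in for the memcpy line, for example copy_host_name(newHost, 255, "compute-0-0"). If memcpy is kept instead, the call needs its third argument, for example memcpy(newHost, "compute-0-0", strlen("compute-0-0") + 1).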
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi James,
Thanks for your answer. I apologize for the memcpy problem; it was a last-second change to simplify the code, and I did not check it. As for the compiler, we use icc version 12.1.5.
We are using a new cluster (Intel Sandy Bridge with InfiniBand). Could it be a problem with the configuration?
Sincerely,
Iván Cores.
Hi Ivan,
I see. When I run the initial program with multiple ranks and use DAPL, I am able to reproduce the issue. I'm going to investigate this some more and I'll let you know when I've got more information.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi Ivan,
I apologize for not responding sooner. From my investigations, I believe there is a bug we will need to correct. Running the correctness checking library shows that the intercommunicators are invalid, even for a simple example I have. They are "working" at small rank counts, but there is definitely a problem somewhere.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi Ivan,
The root problem is that some parameters were being obtained directly by the spawned processes, rather than from the spawning processes. This led to inconsistencies in the full job. The developers have corrected this and the fix should be available in the next release.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi Ivan,
We are planning to release the update this summer.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools