Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Barrier bug with defined communicators.

ivcores
Beginner

Hi,

I am working with the Intel MPI Library version 4.1.0.024 and I have detected a problem with the MPI_Barrier() function (possibly a bug).

In the attached code I create a new process via the MPI_Comm_spawn function. Then I merge the intercommunicator and the parent communicator with the MPI_Intercomm_merge function and call MPI_Barrier() on the resulting communicator.

The problem is that some processes do not continue execution: they remain blocked inside MPI_Barrier().

I have tested the code with other MPI implementations and it works fine.

Is there any solution?

Thanks,

Iván Cores.

The code is:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    MPI_Comm parentcomm, intercomm;
    printf("Starting ...\n");
    MPI_Init(&argc, &argv);

    MPI_Comm_get_parent(&parentcomm);

    if (parentcomm == MPI_COMM_NULL)
    {
        // Host for the new process. In the real code it is read from a
        // file; for this test it is hardcoded.
        char *newHost = (char *)malloc(sizeof(char) * 255);
        memcpy(newHost, "compute-0-0", strlen("compute-0-0") + 1);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", newHost);

        // Create 1 more process
        int errcodes[1];
        MPI_Comm_spawn("~/testSpawn/spawn_example", MPI_ARGV_NULL, 1, info, 0, MPI_COMM_WORLD, &intercomm, errcodes);

        char hostname[256];
        gethostname(hostname, 255);
        printf("  I'm the parent %s.\n", hostname);

        // Merge the intercommunicator and the intracommunicator
        MPI_Comm comm_new_and_old;
        MPI_Intercomm_merge(intercomm, 0, &comm_new_and_old);

        int npesNEW = -1;
        int myidNEW = -1;
        MPI_Comm_size(comm_new_and_old, &npesNEW);
        MPI_Comm_rank(comm_new_and_old, &myidNEW);
        printf("  I'm %d of %d.\n", myidNEW, npesNEW);

        // PROBLEMATIC BARRIER.
        MPI_Barrier(comm_new_and_old);
        printf("  After barrier %d\n", myidNEW);

        MPI_Comm_free(&comm_new_and_old);
        MPI_Info_free(&info);
        free(newHost);
    }
    else
    {
        char hostname2[256];
        gethostname(hostname2, 255);
        printf("      I'm the spawned %s.\n", hostname2);

        // Merge the intercommunicator and the intracommunicator
        MPI_Comm comm_new_and_old;
        MPI_Intercomm_merge(parentcomm, 1, &comm_new_and_old);

        int npesNEW = -1;
        int myidNEW = -1;
        MPI_Comm_size(comm_new_and_old, &npesNEW);
        MPI_Comm_rank(comm_new_and_old, &myidNEW);
        printf("      I'm %d of %d (New).\n", myidNEW, npesNEW);

        // PROBLEMATIC BARRIER.
        MPI_Barrier(comm_new_and_old);
        printf("      After barrier (New proc.)\n");

        MPI_Comm_free(&comm_new_and_old);
    }
    MPI_Finalize();
    return 0;
}
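
For reference, I build and launch the test roughly like this (assuming the source file is named spawn_example.c; the spawned binary has to exist at the hardcoded path ~/testSpawn/spawn_example and the host name has to match a node in the cluster):

mpiicc spawn_example.c -o spawn_example
mpirun -np 2 ./spawn_example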

James_T_Intel
Moderator

Hi Ivan,

I had to make some modifications to your program (memcpy needs the length argument, and changing names of the host and the executable to launch), but with those modifications I was able to compile and run with no problems using 4.1.0.024.  What compiler are you using?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

ivcores
Beginner

Hi James,

Thanks for your answer. I apologize for the problem with the memcpy; it was a last-second change to simplify the code that I made without checking. As for the compiler, we use icc version 12.1.5.

We are using a new cluster (Intel Sandy Bridge with InfiniBand). Could this be a problem with the configuration?

Sincerely,

Iván Cores.

James_T_Intel
Moderator

Hi Ivan,

I made a few more changes to the code so there are fewer hardcoded values.  Try running the attached example with I_MPI_DEBUG=5 and send me the output.
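
For example, using the binary name from your post (adjust as needed), the launch would look something like:

mpirun -np 2 -env I_MPI_DEBUG 5 ./spawn_example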

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

ivcores
Beginner

Hi James,

I ran your code with I_MPI_DEBUG=5 and attached the output file.

I think it is a problem with the InfiniBand controller, but I don't know whether I should change the I_MPI_FABRICS_LIST parameter.
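
If it helps, I could try taking InfiniBand out of the picture by forcing TCP, along the lines of mpirun -np 2 -env I_MPI_FABRICS shm:tcp ./spawn_example, although I am not sure that is the right knob.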

Sincerely,

Iván Cores.

James_T_Intel
Moderator

Hi Ivan,

I see. Running the initial program with multiple ranks and using DAPL, I am able to reproduce the issue.  I'm going to investigate this some more and will let you know when I have more information.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

ivcores
Beginner

Hi James,

Is there anything new about the issue?

Sincerely,

Iván Cores.

James_T_Intel
Moderator

Hi Ivan,

I apologize for not responding sooner.  From my investigation, I believe there is a bug that we will need to correct.  Running the correctness checking library shows that the intercommunicators are invalid, even for a simple example of my own.  They "work" at small rank counts, but there is definitely a problem somewhere.
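
For reference, the correctness checking library comes with the Intel® Trace Analyzer and Collector; with that installed, it can be enabled at launch time along these lines (binary name as before):

mpirun -check_mpi -np 2 ./spawn_example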

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James_T_Intel
Moderator

Hi Ivan,

The root problem is that some parameters were being obtained directly by the spawned processes rather than from the spawning processes, which led to inconsistencies across the full job.  The developers have corrected this, and the fix should be available in the next release.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

ivcores
Beginner

Hi James,

Thank you so much for your response. I hope the next release will be available soon.

Sincerely,

Iván Cores.

James_T_Intel
Moderator

Hi Ivan,

We are planning to release the update this summer.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
