
Program runs well in IPoIB mode, but fails in DAPL mode once the data size exceeds a certain amount

ZJING2
Beginner
1,421 Views

Hello,

We successfully installed the InfiniBand network and get reasonable transfer speeds between the servers. We also installed Intel MPI. To provide a DAPL transport, we installed DAPL-ND, which is distributed with OFED, as the DAPL provider. We run the Intel MPI program with the command "mpiexec -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm:dapl -genv I_MPI_DAPL_PROVIDER=ND0 -n 2 -ppn 1 -hosts 11.4.12.11,11.4.12.12 MPIPassing_3.exe".

However, we hit a serious problem in DAPL mode. We wrote a test program to measure the Intel MPI DAPL speed. Its main job is to send the same picture data over and over with the standard send (MPI_Send). One process on one server sends the picture data repeatedly, and another process on a second server receives each picture in turn with the standard receive (MPI_Recv).

When we run the program, we set the picture number, which is the number of times the same picture is sent. When the picture number is 50 or more (picture number >= 50) and the picture size is 314MB, the program produces the error below.

dapls_ib_mr_register() NDRegister: (Dat_Status type 0x40000 subtype 0) @ line 1685 flags 0x7 len 329117696 vaddr 000002457531A000
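
For reference, the `len 329117696` value in that message can be converted to MiB to check it against the picture size. The sketch below is arithmetic only. The helper names `bytes_to_mib` and `fits_in_int_count` are made up for illustration, and the second one assumes DataType is a one-byte type, which the uchar buffer suggests but the post does not state.

```cpp
#include <cstdint>

// Hypothetical helper: convert a byte count to mebibytes (1 MiB = 2^20 bytes).
constexpr double bytes_to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Length reported by dapls_ib_mr_register() in the error line above.
constexpr std::uint64_t kFailedRegistrationBytes = 329117696ULL;

// True if the byte count still fits in the `int` count argument of MPI_Send,
// assuming a one-byte datatype (an assumption, not stated in the post).
constexpr bool fits_in_int_count(std::uint64_t bytes) {
    return bytes <= 2147483647ULL;  // INT_MAX for a 32-bit int
}
```

`bytes_to_mib(kFailedRegistrationBytes)` comes to about 313.9, so the memory registration that failed appears to cover one whole picture buffer, and `fits_in_int_count` returns true, so a 32-bit count overflow does not look like the cause.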

At the same time, the connection to the receiving process on the receiving server is dropped. We have no idea why this error occurs or why we lose the connection. The strangest thing is that the program runs normally with the same picture size and the same picture number in IPoIB mode, using the command "mpiexec -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm:tcp -genv I_MPI_TCP_NETMASK=ib -n 2 -ppn 1 -hosts 11.4.12.11,11.4.12.12 MPIPassing_3.exe". Even with the picture number set to 100, it still works well. Because the program runs well in IPoIB mode, we conclude that our code is correct and has no memory leak.

However, whenever we run the program in DAPL mode with a picture number of 50 or more and a picture size of 314MB, the error mentioned above appears. In DAPL mode with a 314MB picture and picture number < 50, it runs successfully. It also runs well in DAPL mode with a 43MB picture and picture number = 100.
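
To compare these three cases, it may help to look at the total volume each one moves. The sketch below is a rough calculation, not a diagnosis. It takes the large picture size from the `len` field of the error line and assumes "43MB" means 43 MiB. The names `total_bytes` and `to_gib` are illustrative.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Large picture: byte length taken from the error line in the post.
constexpr std::uint64_t kLargePicture = 329117696ULL;
// Small picture: "43MB" in the post, assumed here to mean 43 MiB.
constexpr std::uint64_t kSmallPicture = 43ULL * kMiB;

// Total bytes transferred when `count` pictures of `picture_bytes` are sent.
constexpr std::uint64_t total_bytes(std::uint64_t picture_bytes, std::uint64_t count) {
    return picture_bytes * count;
}

// Convert a byte count to gibibytes (1 GiB = 2^30 bytes).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

Under those assumptions, 49 large pictures come to about 15.0 GiB, 50 to about 15.3 GiB, and 100 small pictures to about 4.2 GiB. The failing case is the one with the largest cumulative volume. Whether that total is close to the servers' RAM or to a memory registration limit of the provider is not stated anywhere in the thread.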

So we wonder: is there a limit on the size of the data that can be sent? Does the amount of RAM affect that limit? Or did we do something wrong without realizing it? Why does the sending server lose its connection to the receiving server with a 314MB picture and a picture number of 50 or more?
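
One way to probe the size question is to split each picture into smaller pieces and send them one after another. The sketch below only plans the pieces and makes no MPI calls. `Chunk` and `plan_chunks` are hypothetical names, and the chunk size is an arbitrary experiment parameter, not a documented Intel MPI or DAPL-ND limit.

```cpp
#include <cstddef>
#include <vector>

// One contiguous piece of a larger message.
struct Chunk {
    std::size_t offset;  // byte offset into the picture buffer
    std::size_t length;  // number of bytes in this piece
};

// Split `total_bytes` into pieces of at most `chunk_bytes` each.
// Returns an empty plan if `chunk_bytes` is zero.
std::vector<Chunk> plan_chunks(std::size_t total_bytes, std::size_t chunk_bytes) {
    std::vector<Chunk> chunks;
    if (chunk_bytes == 0) {
        return chunks;
    }
    for (std::size_t offset = 0; offset < total_bytes; offset += chunk_bytes) {
        const std::size_t remaining = total_bytes - offset;
        const std::size_t length = remaining < chunk_bytes ? remaining : chunk_bytes;
        chunks.push_back({offset, length});
    }
    return chunks;
}
```

Each chunk would then go to MPI_Send as the buffer pointer plus `offset` with a count of `length`, and the receiver would post matching receives in the same order. If the chunked run passes where the single 314MB send fails, that would point toward a per-message or per-registration limit. It would be a test of the hypothesis, not a confirmed fix.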

Finally, one important detail: we run the Intel MPI commands on the sending server.

Can someone help? Any advice would be appreciated, and we look forward to your replies. The critical part of our code is shown below.

Our code for the Send Node process is like this:

MPI_Barrier(MPI_COMM_WORLD);
for (int i = 0; i < ComTimes; i++)
{
    // sendMessage() is equivalent to MPI_Send()
    pProcessNode->sendMessage(pPic->PicData, Buf_Size, DataType, RECV_NODE, CommTag);
    CommTag--;
}

And the code for the Recv Node process is like this:

MPI_Barrier(MPI_COMM_WORLD);
for (int i = 0; i < ComTimes; i++)
{
    // A fresh receive buffer is allocated for every picture.
    uchar *pRecvData = new uchar[Buf_Size];
    pProcessNode->recvMessage(pRecvData, Buf_Size, DataType, SEND_NODE, CommTag, RecvMode);
    CommTag--;
    if (i == 0)
    {
        // Timing starts once the first picture has arrived.
        Start = MPI_Wtime();
    }
    delete[] pRecvData;  // array form, to match new[]
}
End = MPI_Wtime();
TimeTotal = End - Start;
std::cout << "Standard total time is:" << TimeTotal << std::endl;
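
The receive loop above allocates a new buffer for every picture. With RDMA-style transports such as DAPL, a buffer generally has to be registered with the adapter before it can be used, and the reported error comes from a memory registration call. One experiment, sketched below, is to allocate the receive buffer once and reuse it. `ReceiveFn` and `receive_all` are hypothetical names, and the callback stands in for recvMessage() or MPI_Recv(). Whether per-iteration allocation actually contributes to the failure is not established in this thread.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for recvMessage()/MPI_Recv(): fills `buffer` with one message.
using ReceiveFn = std::function<void(unsigned char* buffer, std::size_t size)>;

// Receive `count` messages of `size` bytes each into a single reused buffer.
// Returns the number of messages received.
int receive_all(int count, std::size_t size, const ReceiveFn& receive_one) {
    // Allocated once, before the loop; released automatically on return.
    std::vector<unsigned char> buffer(size);
    int received = 0;
    for (int i = 0; i < count; ++i) {
        receive_one(buffer.data(), buffer.size());
        ++received;
    }
    return received;
}
```

In the real program the callback would wrap the recvMessage() call with the current CommTag. Because the buffer address never changes, the provider has only one region to register for the whole run, which makes it easier to tell a per-buffer limit from a cumulative one.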

5 Replies
idata
Employee
468 Views

shanghua: Thank you very much for contacting the Intel® communities. We will do our best to help you fix this problem.

In order to deliver the most accurate response, could you please provide the model of the server board?

If you have any further questions, please let me know.

Regards,

Alberto R

ZJING2
Beginner
468 Views

Thank you for your answer.

It's 0:26 here. I didn't work yesterday and I'm at home rather than at the office, so I can't give you the server board model right now. I'm sorry about that, but I promise I will go to the office this morning to look it up, and I will send it as soon as I can. I expect to be able to provide it at about 9 o'clock.

In the meantime, I found another strange phenomenon that I think you should know about. To show it more clearly, I describe my findings, with the essential pictures, in the attachment called "weird_question.pdf".

Please read "weird_question.pdf". I also enclose three pictures of the error output: "number=49.png", "number=50.png", and "number=51.png" show the error information produced when I set the picture number to 49, 50, and 51 respectively.

We're looking forward to hearing from you.

ZJING2
Beginner
468 Views

As mentioned before, we use two servers to run the Intel MPI program. One sends the picture over and over, while the other receives each picture in turn.

We used the DirectX Diagnostic Tool and AIDA64 Extreme to get the board model information and put it in two attachments. "Sending-ResponsibleServer.pdf" records the board information for the sending server, and "Receving-ResponsibleServer.pdf" records the board information for the receiving server.

We look forward to hearing from you. Please also read our previous reply, where we describe a new finding about our RAM that may be critical for you.

idata
Employee
468 Views

shanghua: You are very welcome. Thank you very much for providing all those details. To give you the most accurate response and try to fix this problem, we need the model and the manufacturer of the server board. That information is fundamental for this type of scenario.

If you have any questions, please let me know.

Regards,

Alberto R

idata
Employee
468 Views

shanghua: I just wanted to check whether you saw the information we requested previously and whether you need further assistance on this matter.

If you have any questions, please let me know.

Regards,

Alberto R
