
host-device bandwidth problem

King_Crimson
Beginner

Dear forum,

I'm testing the host-device bandwidth using the DAPL fabric and Intel MPI (Isend/Irecv/Wait). 1.5 GB of data is repeatedly sent back and forth. The initial results are:

host to device: ~5.6 GB/sec
device to host: ~5.8 GB/sec
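
The benchmark is roughly the following ping-pong (a simplified sketch, not the exact code; the sizes, ranks, and iteration count are illustrative):

/* Simplified sketch of the benchmark: rank 0 runs on the host,
   rank 1 on the MIC. Only host -> device is shown; the reverse
   direction is measured the same way with the roles swapped. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES (1536UL * 1024 * 1024)   /* ~1.5 GB */
#define NITER  20

int main(int argc, char **argv)
{
    int rank, peer;
    char *buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;

    buf = malloc(NBYTES);
    memset(buf, 0, NBYTES);   /* touch the pages up front */

    for (int i = 0; i < NITER; i++) {
        MPI_Barrier(MPI_COMM_WORLD);   /* align both ranks before timing */
        double t0 = MPI_Wtime();
        if (rank == 0) {               /* host -> device */
            MPI_Isend(buf, (int)NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else {
            MPI_Irecv(buf, (int)NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("iter %d: %.2f GB/sec\n", i,
                   NBYTES / (MPI_Wtime() - t0) / 1e9);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}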

Problem 1: The first send/receive is much slower than the rest. Its bandwidth is:

host to device: ~2.6 GB/sec
device to host: ~2.5 GB/sec

I immediately thought of Linux's deferred memory allocation that Jim pointed out in this post, so I memset the array prior to the send/receive, but to little avail. So, is it because of the overhead of Intel MPI's first send/receive?

Problem 2: When I increased the data size to 2 GB, the following message was displayed:

[mic_name]:SCM:3be5:19664b40: 9659192 us(9659192 us!!!):  DAPL ERR reg_mr Cannot allocate memory

The program still completes without a problem, though. So what causes that error message?

 

Thanks for any advice.

Artem_R_Intel1
Employee

Hello,

Which version of the Intel MPI Library do you use? Do you use any specific MPI environment variables?

Regarding the 2nd problem, I'd recommend checking the system limits (the 'max locked memory' parameter) on both the host and the MIC ('ulimit -l'). If you haven't already, try setting it to 'unlimited' ('ulimit -l unlimited').
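
If it's easier, the current limit can also be checked from within a process (a small sketch; this reads the same limit that 'ulimit -l' shows, in bytes rather than kbytes):

/* Sketch: query the 'max locked memory' limit of the calling process. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("max locked memory: unlimited\n");
    else
        printf("max locked memory: %lu bytes\n", (unsigned long)rl.rlim_cur);
    return 0;
}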

Just in case, please also specify the OS/MPSS/OFED/DAPL versions you use.

King_Crimson
Beginner

Hi Artem,

Thanks for your reply.

Artem R. (Intel) wrote:

Which version of Intel MPI Library do you use? Do you use any specific MPI environment variables?

The version I'm using is 5.1.0.079.

All the env vars are:

export I_MPI_DEBUG=7
export I_MPI_MIC=on
export I_MPI_FABRICS=dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=1
export KMP_AFFINITY=granularity=fine,balanced

Artem R. (Intel) wrote:

Regarding to the 2nd problem I'd recommend to check system limits ('max locked memory' parameter) both on host and MIC ('ulimit -l'). If not yet done try to set it to 'unlimited' ('ulimit -l unlimited').

The locked memory limit was set to unlimited on both the host and the MIC prior to the test. Another symptom: when this error message occurs, the bandwidth drops sharply to ~1.3 GB/sec.

Artem R. (Intel) wrote:

Just in case, also please specify OS/MPSS/OFED/DAPL versions you use.

OS: Scientific Linux 6.3, kernel 2.6.32-431.11.2.el6.x86_64

MPSS: 3.5.1

OFED: OFED-3.5-2-MIC

DAPL: 2.1.2-1

Artem_R_Intel1
Employee

Hi,

Problem 1:

The performance drop on the 1st iteration may be explained by the initial memory registration. As far as I understand, you use the same buffer for subsequent iterations (correct?). If so, you can try to malloc/free the buffer on each iteration; this should stabilize the performance. Or skip the 1st iteration, depending on what you would like to measure (treat it as a warm-up iteration).
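
A sketch of the warm-up variant (do_one_transfer() is a hypothetical placeholder for your Isend/Irecv/Wait exchange, and the counts are illustrative):

/* Sketch: start timing only after a few warm-up transfers, so the
   one-time memory registration cost is excluded from the result. */
#include <mpi.h>

#define NWARMUP 2     /* untimed warm-up iterations */
#define NITER   20    /* timed iterations */

void do_one_transfer(void);   /* placeholder: the Isend/Irecv/Wait exchange */

double measure_bandwidth(unsigned long nbytes)
{
    double t0 = 0.0;
    for (int i = 0; i < NWARMUP + NITER; i++) {
        if (i == NWARMUP) {
            MPI_Barrier(MPI_COMM_WORLD);   /* start the clock after warm-up */
            t0 = MPI_Wtime();
        }
        do_one_transfer();
    }
    return (double)NITER * nbytes / (MPI_Wtime() - t0) / 1e9;   /* GB/sec */
}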

Problem 2:

Make sure that 'max locked memory' is set to 'unlimited' for the root account too; as far as I know, some SCIF-related kernel modules need to be initialized with an unlimited 'max locked memory'. If that wasn't done, a host reboot may be required.

An additional question concerns this setting:

export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=1

Why do you use it? As far as I know, it may significantly affect performance.

King_Crimson
Beginner

Hi Artem,

Thanks again.

I set I_MPI_DAPL_DIRECT_COPY_THRESHOLD to a low value because I would like to avoid the eager protocol and ensure a direct copy without intermediate buffering. I guess it should be set to a higher value if the messages involved were smaller.
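
For a workload dominated by small messages, I would instead raise the threshold, e.g. something like the following (an illustrative value, not a tuned one):

export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=65536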
