Dear forum,
I'm testing the host-device bandwidth using the dapl fabric and Intel MPI (Isend/Irecv/Wait). 1.5 GB of data is repeatedly sent back and forth. The initial result is:
host to device: ~5.6 GB/sec device to host: ~5.8 GB/sec
Problem 1: The first send-receive appears to be extremely slow. Its bandwidth is:
host to device: ~2.6 GB/sec device to host: ~2.5 GB/sec
I immediately thought of Linux's deferred memory allocation, which Jim pointed out in this post, so I memset the array prior to the send/receive, but to little avail. So... is it because of the overhead of Intel MPI's first send/receive?
Problem 2: When I increased the data size to 2 GB, the following message was displayed:
[mic_name]:SCM:3be5:19664b40: 9659192 us(9659192 us!!!): DAPL ERR reg_mr Cannot allocate memory
The program completes without a problem, though. So what causes that error message?
Thanks for any advice.
Hello,
Which version of Intel MPI Library do you use? Do you use any specific MPI environment variables?
Regarding the 2nd problem, I'd recommend checking the system limits ('max locked memory' parameter) on both the host and the MIC ('ulimit -l'). If not yet done, try setting it to 'unlimited' ('ulimit -l unlimited').
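For example, a quick check from the shell (the `mic0` hostname below is an assumption for illustration; use whatever name your coprocessor has):

```shell
# Current locked-memory limit for this shell, in kB (or "unlimited"):
ulimit -l

# Lift it for the current session:
# ulimit -l unlimited

# And verify on the coprocessor side as well:
# ssh mic0 'ulimit -l'
```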
Just in case, please also specify the OS/MPSS/OFED/DAPL versions you use.
Hi Artem,
Thanks for your reply.
Artem R. (Intel) wrote:
Which version of Intel MPI Library do you use? Do you use any specific MPI environment variables?
The version I'm using is 5.1.0.079.
All the env vars are:
export I_MPI_DEBUG=7
export I_MPI_MIC=on
export I_MPI_FABRICS=dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=1
export KMP_AFFINITY=granularity=fine,balanced
Artem R. (Intel) wrote:
Regarding the 2nd problem, I'd recommend checking the system limits ('max locked memory' parameter) on both the host and the MIC ('ulimit -l'). If not yet done, try setting it to 'unlimited' ('ulimit -l unlimited').
Max locked memory had been set to unlimited on both the host and the MIC prior to the test. Another symptom: when this error message occurs, the bandwidth drops sharply to ~1.3 GB/sec.
Artem R. (Intel) wrote:
Just in case, also please specify OS/MPSS/OFED/DAPL versions you use.
OS: Scientific Linux 6.3, kernel 2.6.32-431.11.2.el6.x86_64
MPSS: 3.5.1
OFED: OFED-3.5-2-MIC
DAPL: 2.1.2-1
Hi,
Problem 1:
The performance drop in the 1st iteration may be explained by the initial memory registration. As far as I understand, subsequent iterations reuse the same buffer (correct?). If so, you can try performing malloc/free in each iteration; this should even out the performance. Or skip the 1st iteration, treating it as a warm-up, depending on what you would like to measure.
Problem 2:
Make sure that 'max locked memory' is set to 'unlimited' for the root account too; as far as I know, some scif-related kernel modules must be initialized with unlimited 'max locked memory'. If it was not, a host reboot may be required.
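To make the limit persistent across logins and reboots (rather than per-shell via ulimit), it is typically raised in /etc/security/limits.conf; a sketch, assuming PAM-based limits are in effect on your system:

```
# /etc/security/limits.conf -- lift the locked-memory cap
# for root (needed by the scif kernel modules) and for all users
root    soft    memlock    unlimited
root    hard    memlock    unlimited
*       soft    memlock    unlimited
*       hard    memlock    unlimited
```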
An additional question is about:
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=1
Why do you use it? As far as I know, it may significantly affect performance.
Hi Artem,
Thanks again.
I set it to a low value because I want to avoid the eager protocol and ensure a direct copy without buffering. I guess it should be set to a higher value if the messages involved were smaller.