Hi,
I'm trying to run an application with 64 processes (4 nodes). With ofa I get this type of error:
[42] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x4678680 len=4857680
RTC entry - addr=0x4678680 len=4857680 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
internal ABORT - process 42
[44] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x4b059e0 len=4857680
RTC entry - addr=0x4b059e0 len=4857680 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
internal ABORT - process 44
[54] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x44bda20 len=4857680
RTC entry - addr=0x44bda20 len=4857680 cnt=1
With dapl I got:
mn85:7977:4b489740: 22385958 us(22385958 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22387532 us(22387532 us!!!): reg_mr Cannot allocate memory
mn85:7980:b41fa740: 22387325 us(22387325 us!!!): reg_mr Cannot allocate memory
mn85:797c:cc9d3740: 22386986 us(22386986 us!!!): reg_mr Cannot allocate memory
mn85:797b:70f7740: 22387216 us(22387216 us!!!): reg_mr Cannot allocate memory
mn85:7974:b074b740: 22387763 us(22387763 us!!!): reg_mr Cannot allocate memory
mn85:7979:4ca6c740: 22389086 us(22389086 us!!!): reg_mr Cannot allocate memory
mn85:797e:c15a2740: 22388867 us(22388867 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22390126 us(2594 us): reg_mr Cannot allocate memory
mn85:7976:5bee3740: 22389524 us(22389524 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391260 us(1134 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391539 us(279 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391908 us(369 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22392231 us(323 us): reg_mr Cannot allocate memory
mn82:7b6a:8d5c2740: 22402315 us(22402315 us!!!): reg_mr Cannot allocate memory
mn82:7b6a:8d5c2740: 22402582 us(267 us): reg_mr Cannot allocate memory
Any suggestions?
Thank you in advance,
Hi José Luis,
Have a look at this post: http://software.intel.com/en-us/forums/topic/329053
I had a similar problem some time ago, and I managed to solve it.
Let me know if it also works for you.
By the way, here are some other modifications that I had to make:
I found that there is a limit on the maximum amount of memory that can be registered with the InfiniBand hardware. In our case we have Mellanox hardware (the mlx4_core driver), and the people at Mellanox recommend setting the value of:
(2^log_num_mtt)*(2^log_mtts_per_seg)*PAGE_SIZE
to at least twice the physical memory available on the nodes (link1, link2). You can check the values of these parameters with:
getconf PAGE_SIZE
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
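To put the three values together, here is a rough sketch that evaluates the formula above and compares the result with the node's physical memory. The function name max_reg_bytes is just something I made up for illustration, and reading MemTotal from /proc/meminfo is my assumption about how to get the physical memory on a Linux node:

```shell
#!/bin/sh
# Maximum registerable memory in bytes:
#   (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
# max_reg_bytes is a hypothetical helper name, not part of any Mellanox tool.
max_reg_bytes() {
    # $1 = log_num_mtt, $2 = log_mtts_per_seg, $3 = page size in bytes
    echo $(( (1 << $1) * (1 << $2) * $3 ))
}

params=/sys/module/mlx4_core/parameters
# Only run the check on a node where the mlx4_core parameters are readable.
if [ -r "$params/log_num_mtt" ] && [ -r "$params/log_mtts_per_seg" ]; then
    limit=$(max_reg_bytes "$(cat "$params/log_num_mtt")" \
                          "$(cat "$params/log_mtts_per_seg")" \
                          "$(getconf PAGE_SIZE)")
    # Assumption: MemTotal in /proc/meminfo (reported in kB) is the physical RAM.
    ram=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) * 1024 ))
    echo "max registerable: $limit bytes, physical RAM: $ram bytes"
    if [ "$limit" -lt $(( 2 * ram )) ]; then
        echo "limit is below twice the physical memory"
    fi
fi
```

For example, log_num_mtt=24 with log_mtts_per_seg=3 and 4 KiB pages gives 2^39 bytes (512 GiB) of registerable memory.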
Mellanox only recommends changing log_num_mtt. To do so, edit the file /etc/modprobe.conf and add this line at the end of the file: options mlx4_core log_num_mtt=24. Then restart the InfiniBand network by doing the following on all the nodes of the cluster:
Stop opensm service: /etc/init.d/opensmd stop
Restart IB: /etc/init.d/openibd restart
Start opensm: /etc/init.d/opensmd start
Check the changes: cat /sys/module/mlx4_core/parameters/log_num_mtt
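The value 24 is what worked on our cluster; a different amount of memory may call for a different value. As a sketch only, the smallest log_num_mtt that satisfies the "twice the physical memory" recommendation can be found by searching upward from 0. The helper name min_log_num_mtt and the example figures (64 GiB of RAM, log_mtts_per_seg=3, 4 KiB pages) are assumptions for illustration, not values from our nodes:

```shell
#!/bin/sh
# Smallest log_num_mtt for which
#   (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE >= 2 * RAM
# min_log_num_mtt is a hypothetical helper name.
min_log_num_mtt() {
    # $1 = physical RAM in bytes, $2 = log_mtts_per_seg, $3 = page size in bytes
    target=$(( 2 * $1 ))
    n=0
    while [ $(( (1 << $n) * (1 << $2) * $3 )) -lt "$target" ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

# Illustrative values: 64 GiB of RAM, log_mtts_per_seg=3, 4 KiB pages.
echo "options mlx4_core log_num_mtt=$(min_log_num_mtt $(( 64 * 1024 * 1024 * 1024 )) 3 4096)"
```

This only prints the line that would go into /etc/modprobe.conf; it does not modify any file or restart any service.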
Hi José Luis,
There must have been some problems with a previous post I made. In that post I just wrote this link:
http://software.intel.com/en-us/forums/topic/329053
In case it could be helpful.
Regards
Dear Ivan,
Thank you for your help.
I changed log_num_mtt to 24 and the app is running fine now.
saludos,
José Luis
Dear José Luis,
I am glad to know that it also worked for you.
Saludos,
Iván