Hi All,
In SNC2 mode we have 2 MCDRAM nodes (nodes 2 and 3), while in SNC4 mode we have 4 MCDRAM nodes (nodes 4, 5, 6, 7). I want to bind the application's memory to the MCDRAM nodes, not to the DDR nodes, in both SNC2 and SNC4.
For SNC2 I use: numactl -m 2,3 <application>
For SNC4 I use: numactl -m 4,5,6,7 <application>
Application: Intel Caffe
Number of threads: 16/32/64/128
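For reference, the same binding can be requested from inside the process via libnuma. Below is a minimal sketch (assuming libnuma and its headers are installed; the node string matches the SNC4 command above) of the programmatic equivalent of numactl -m 4,5,6,7:

/* Programmatic equivalent of "numactl -m 4,5,6,7 <application>":
 * bind all further allocations of this process to the MCDRAM nodes.
 * Compile with: icc bind-mcdram.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main()
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    /* Nodes 4,5,6,7 are the MCDRAM nodes in flat SNC4; use "2,3" for SNC2 */
    struct bitmask *mcdram = numa_parse_nodestring("4,5,6,7");
    numa_set_membind(mcdram);          /* same policy that numactl -m establishes */
    numa_bitmask_free(mcdram);

    /* Allocations touched from here on should come from the bound nodes */
    size_t n = 1L << 28;               /* 256M ints = 1 GiB */
    int *A = malloc(n * sizeof(int));
    for (size_t i = 0; i < n; i++)
        A[i] = (int)i;                 /* first touch commits the pages */
    printf("%d\n", A[1]);
    free(A);
    return 0;
}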
In SNC2, node 2 is used first and then memory allocation spills over to node 0 (DDR). I expect the allocation to use nodes 2 and 3, not nodes 2 and 0. Similarly, in SNC4 I observe that nodes 4 and 0 are used for memory, not nodes 4, 5, 6, 7 as requested by the numactl binding above.
Because of this, I see a performance difference, since the other MCDRAM (HBM) nodes are not being used for memory allocation. This does not make sense to me. Can anyone suggest why this may be happening?
Thanks.
The command that you are running is correct. You can test your approach with this code:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int *A = malloc(sizeof(int) * (1L << 30));   /* 4 GiB buffer */
    while (0 == 0) {                             /* loop forever so the process can be inspected */
        int i;
#pragma omp parallel for
        for (i = 0; i < (1 << 30); i++)
            A[i] = i;                            /* touch every element */
    }
    printf("%d", A[1]);
}
As you can see below, it loads NUMA nodes 4,5,6,7 (the MCDRAM nodes):
[u7474@c006-n004 ~]$ icc -qopenmp test-numa.c
[u7474@c006-n004 ~]$ numactl -m 4,5,6,7 ./a.out &
[1] 22624
[u7474@c006-n004 ~]$ numastat -p 22624

Per-node process memory usage (in MBs) for PID 22624 (a.out)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                      0.64            0.09            0.00
----------------  --------------- --------------- ---------------
Total                        0.64            0.09            0.00

                           Node 3          Node 4          Node 5
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.15            0.00
Stack                        0.00           24.52            0.25
Private                      0.50         1081.85         1080.68
----------------  --------------- --------------- ---------------
Total                        0.50         1106.52         1080.93

                           Node 6          Node 7           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.16
Stack                        0.22            2.22           27.21
Private                    960.00          976.35         4100.11
----------------  --------------- --------------- ---------------
Total                      960.22          978.57         4127.48
[u7474@c006-n004 ~]$
I think the problem you are seeing is rooted in the fact that you are using Caffe from the Intel Distribution for Python. That build of Caffe uses BLAS functions from MKL, and those functions have a mind of their own when it comes to memory allocation: they ignore your numactl settings and apply their own allocation policy.
The good news is that MKL's functions are aware of MCDRAM and generally do the right thing when they allocate working datasets and scratch spaces.
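For background, explicit MCDRAM allocation on KNL typically goes through the memkind library's hbwmalloc interface, which draws from the high-bandwidth nodes independently of the numactl policy. Here is a minimal sketch of that mechanism (assuming memkind is installed; this illustrates MCDRAM-aware allocation in general, not MKL's exact internals):

/* Allocate a buffer from MCDRAM explicitly via hbwmalloc.
 * Compile with: icc test-hbw.c -lmemkind */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main()
{
    if (hbw_check_available() != 0) {       /* 0 means high-bandwidth memory is present */
        fprintf(stderr, "No high-bandwidth memory available\n");
        return 1;
    }
    size_t n = 1L << 28;                    /* 256M ints = 1 GiB */
    int *A = hbw_malloc(n * sizeof(int));   /* comes from MCDRAM, not DDR */
    if (A == NULL) {
        fprintf(stderr, "hbw_malloc failed\n");
        return 1;
    }
    for (size_t i = 0; i < n; i++)
        A[i] = (int)i;                      /* touch pages to commit them */
    printf("%d\n", A[1]);
    hbw_free(A);
    return 0;
}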
Hi Andrey,
I tested your code, and the correct MCDRAM nodes are being used.
Intel Caffe uses Intel MKL, but not the Intel Distribution for Python. I have browsed the Caffe code, and it does not allocate memory to specific nodes.
So for Intel Caffe, is there no way to utilize all MCDRAM nodes in flat SNC2/SNC4 mode? Any suggestions, please?
Thanks.
