I am measuring memory copy performance on a NUMA machine with four Xeon(R) CPU E5-4620 processors. When I copy data within local memory, I get almost 10GB/s. However, when I copy data from remote memory, performance is much worse, only around 1GB/s. I use memcpy() to copy the data, and each copy is one page (4KB).
I wonder whether Intel processors provide special instructions for inter-processor data movement. I know Intel uses QPI for inter-processor communication. Does it expose an interface to programmers? Is the performance above the best I can get?
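For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a page-at-a-time memcpy() timing loop. It is not the exact program used above: the names copy_bandwidth_gbs, alloc_touched and COPY_CHUNK are made up for illustration, and the sketch assumes Linux's default first-touch policy, so that the memset() in alloc_touched decides which node the pages land on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_CHUNK 4096  /* one 4KB page per memcpy() call, as in the post */

/* Allocate a page-aligned buffer and write to every page so the pages are
 * physically backed before timing starts. Returns NULL on failure. */
static char *alloc_touched(size_t total)
{
    void *p = NULL;
    if (posix_memalign(&p, COPY_CHUNK, total) != 0)
        return NULL;
    memset(p, 1, total);
    return p;
}

/* Copy `total` bytes from src to dst in COPY_CHUNK pieces and return the
 * observed bandwidth in GB/s (1 GB = 1e9 bytes). */
static double copy_bandwidth_gbs(char *dst, const char *src, size_t total)
{
    /* Calling through a volatile pointer keeps the compiler from
     * inlining or eliding the copies being timed. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    struct timespec t0, t1;

    assert(total % COPY_CHUNK == 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < total; off += COPY_CHUNK)
        copy(dst + off, src + off, COPY_CHUNK);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)total / secs / 1e9;
}
```

To compare local and remote memory, one option is to call these from a small main() and run the binary under numactl, for example with --cpunodebind=0 --membind=0 for the local case and --cpunodebind=0 --membind=1 for the remote case. Note that this places both the source and the destination on the same node, which may differ from the setup described above.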
TimP (Intel) wrote: No, I run my program on 64-bit Linux, and I have switched to the Intel compiler. I have tried -static-intel, but it doesn't seem to link the Intel library statically. With or without -static-intel, the compiled executable is exactly the same (I ran cmp on the two executables). If I add -static as a linker option, I get the same error when I run the executable under VTune. BTW, I copy 1GB of memory, and the memory is aligned to the page size. Da
I guess you're looking at 32-bit gcc, which appears to optimize for short or non-aligned copies. If you use icc -static-intel, with long enough copies, you should be able to collect data to view __intel_fast_memcpy in assembly view. It might be interesting to see whether the results change with alignment.
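One way to check whether the results change with alignment is to time the same copy with the source and destination shifted by a few bytes from a page boundary. The following is only a rough sketch: offset_copy_gbs and ALIGN are hypothetical names, and it measures whatever memcpy() the program is linked against, which may or may not be __intel_fast_memcpy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ALIGN 4096  /* base alignment of both buffers (one page) */

/* Time `reps` copies of `len` bytes where the source starts src_off bytes
 * and the destination dst_off bytes past a page boundary.
 * Returns bandwidth in GB/s (1 GB = 1e9 bytes), or -1.0 if allocation fails. */
static double offset_copy_gbs(size_t len, size_t src_off, size_t dst_off, int reps)
{
    /* Volatile function pointer so the timed copies are not optimized away. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    void *src = NULL, *dst = NULL;
    struct timespec t0, t1;

    assert(src_off < ALIGN && dst_off < ALIGN);
    if (posix_memalign(&src, ALIGN, len + ALIGN) != 0)
        return -1.0;
    if (posix_memalign(&dst, ALIGN, len + ALIGN) != 0) {
        free(src);
        return -1.0;
    }
    /* Touch both buffers so page faults are not part of the timing. */
    memset(src, 1, len + ALIGN);
    memset(dst, 0, len + ALIGN);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        copy((char *)dst + dst_off, (const char *)src + src_off, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(src);
    free(dst);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)len * (double)reps / secs / 1e9;
}
```

Comparing, say, offset_copy_gbs(len, 0, 0, reps) against offset_copy_gbs(len, 1, 3, reps) for the same len would show whether misalignment matters for the memcpy() in use.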
iliyapolak wrote: No, 16 threads/8 cores per CPU. What do you mean by earlier memory speed degradation? The local memory copy speed is as expected.
>>>So each CPU has 8 cores and there are 4 CPUs in the machine.>>>
Do you mean 8 threads/4 cores per CPU?
Have you experienced earlier memory speed degradation?