<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic @zhengda1936 in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</link>
    <description>@zhengda1936

Please follow this call instruction 4005f7: e8 54 00 00 00 callq 400650 &amp;lt;_intel_fast_memcpy&amp;gt;
It would be interesting to see the exact machine code implementation of that function.</description>
    <pubDate>Tue, 18 Dec 2012 06:13:53 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2012-12-18T06:13:53Z</dc:date>
    <item>
      <title>The best method for inter-processor data communication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975772#M5584</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I measure the performance of memory copy in a NUMA machine with 4&amp;nbsp;Xeon(R) CPU E5-4620 processors. When I copy data in the local memory, I can get up to almost 10GB/s. However, when I copy data from remote memory, I get much worse performance, only around 1GB/s. I use memcpy() to copy data and each copy is a page size (4KB).&lt;/P&gt;
&lt;P&gt;I wonder if Intel processors provide special instructions for inter-processor data movement. I know Intel uses QPI for inter-processor communication. Does it expose an interface to programmers? Is the performance above the best I can get?&lt;/P&gt;
&lt;P&gt;Thanks,&lt;BR /&gt;Da&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Dec 2012 21:01:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975772#M5584</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-14T21:01:22Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I wonder if Intel</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975773#M5585</link>
      <description>&amp;gt;&amp;gt;...I wonder if Intel processors provides special instructions for inter-processor data movement...

Please take a look at the Instruction Set Reference located at: www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

&amp;gt;&amp;gt;...I get much worse performance, only around 1GB/s...

Access to &lt;STRONG&gt;local&lt;/STRONG&gt; memory is always faster; however, a 10x drop in performance is significant. Is the penalty really that large when accessing &lt;STRONG&gt;remote&lt;/STRONG&gt; memory?</description>
      <pubDate>Sat, 15 Dec 2012 07:21:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975773#M5585</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-15T07:21:19Z</dc:date>
    </item>
    <item>
      <title>The best option is to</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975774#M5586</link>
<description>The best option is to disassemble the memcpy() function and look at its machine-code implementation. A rep prefix combined with the movsd instruction is typically used to copy memory in large quantities.
I think that inter-processor communication at the lowest level is managed by the hardware itself. I do not know whether there is a programming interface exposed to the programmer for managing and controlling inter-processor communication programmatically.
Regarding documentation, I recommend reading the Intel chipset documentation, which probably contains some information about inter-processor communication.</description>
      <pubDate>Sat, 15 Dec 2012 08:39:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975774#M5586</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-15T08:39:23Z</dc:date>
    </item>
    <item>
      <title>It may be worthwhile to check</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975775#M5587</link>
<description>It may be worthwhile to check your memcpy() version and your data alignments.  At one time, the __intel_fast_memcpy() substitution made by Intel compilers could be a great help.  Any up-to-date memcpy() ought to take advantage of SIMD nontemporal instructions in cases where alignment is compatible.  rep movsd should be used only where alignment requires it.  The memcpy() supplied by early 64-bit Linux distros was extremely poor.  You might want to experiment with 16-, 32-, and 64-byte alignments for both source and destination.
corei7-2 CPUs were supposed to be designed to improve performance of rep mov loops such as 32-bit gcc might create, but there would still be an advantage in setting alignment so as to use simd instructions.   Some past CPUs performed poorly with rep mov loops.
In connection with iliyapolak's remark, it would be interesting to use a profiler such as VTune or oprofile to show which instructions are actually used in your slow case.
I'm not sure what causes might be suspect for a slowdown such as you quote on that platform; more than a 2x penalty for remote memory would be disappointing.  You should check whether the RAM is compatible and properly distributed among the slots.</description>
      <pubDate>Sun, 16 Dec 2012 00:41:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975775#M5587</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-12-16T00:41:40Z</dc:date>
    </item>
    <item>
      <title>Sorry for a question not</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975776#M5588</link>
      <description>Sorry for a question not related to the subject.

&amp;gt;&amp;gt;...NUMA machine with 4 Xeon(R) CPU E5-4620 processors...

Are these NUMA computers expensive? How much would the cheapest computer that supports the NUMA architecture cost? Thanks in advance.

Note: I'm asking because I couldn't find an answer on the web.</description>
      <pubDate>Sun, 16 Dec 2012 02:18:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975776#M5588</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-16T02:18:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...NUMA machine with 4 Xeon</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975777#M5589</link>
<description>&amp;gt;&amp;gt;...NUMA machine with 4 Xeon(R) CPU E5-4620 processors...&amp;gt;&amp;gt;&amp;gt;
Although the following is not an Intel-based chipset, you can estimate the price from the motherboard and the CPUs.
Please follow this link: http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&amp;amp;pid=670&amp;amp;SKU=600000180
And this link: http://www.pcsuperstore.com/products/11113480-Tyan-S8812WGM3NR.html

For Intel-based chipset motherboards, please follow these links: http://www.supermicro.com/products/motherboard/Xeon/C600/X9QR7-TF.cfm
&lt;A href="http://www.alvio.com/xABK_PID1237628_supermicro-computer_mbd-x9qr7-tf-o_romley-quad-socket-sas2-ipmi-20-retail_amd-socket-f-1207-motherboards.html" target="_blank"&gt;http://www.alvio.com/xABK_PID1237628_supermicro-computer_mbd-x9qr7-tf-o_romley-quad-socket-sas2-ipmi-20-retail_amd-socket-f-1207-motherboards.html&lt;/A&gt;
A whole system can easily reach a price of $2500-3000.
For complete solutions, follow this link:
&lt;A href="http://www.supermicro.com/xeon_mp/" target="_blank"&gt;www.supermicro.com/xeon_mp/&lt;/A&gt;</description>
      <pubDate>Sun, 16 Dec 2012 07:28:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975777#M5589</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-16T07:28:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;You should check whether</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975778#M5590</link>
      <description>&amp;gt;&amp;gt;&amp;gt;You should check whether the RAM is compatible and properly distributed among the slots.&amp;gt;&amp;gt;&amp;gt;
The lower hardware layer, i.e. the chipset's memory controller or the on-die memory controller, should also be considered as a possible cause of the poor memory performance.
Regarding the internal implementation of memcpy(), it is possible that the compiler wrongly emits the rep movsb instruction instead of rep movsd.</description>
      <pubDate>Sun, 16 Dec 2012 08:00:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975778#M5590</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-16T08:00:30Z</dc:date>
    </item>
    <item>
      <title>Thank you for all your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975779#M5591</link>
      <description>Thank you for all your suggestions.

I started by checking the assembly code of a small test program:
#include &amp;lt;string.h&amp;gt;

char src[4096];
char dest[4096];

int main()
{
	memcpy(dest, src, sizeof(dest));
}

gcc compiles the code into the following assembly code:
0000000000000000 &amp;lt;main&amp;gt;:
   0:	bf 00 00 00 00       	mov    $0x0,%edi
   5:	be 00 00 00 00       	mov    $0x0,%esi
   a:	b9 00 02 00 00       	mov    $0x200,%ecx
   f:	f3 48 a5             	rep movsq %ds:(%rsi),%es:(%rdi)
  12:	c3                   	retq 
The code is pretty straightforward and as expected.

The Intel compiler compiles it into:
00000000004005c0 &amp;lt;main&amp;gt;:
  4005c0:       55                      push   %rbp
  4005c1:       48 89 e5                mov    %rsp,%rbp
  4005c4:       48 83 e4 80             and    $0xffffffffffffff80,%rsp
  4005c8:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
  4005cf:       bf 03 00 00 00          mov    $0x3,%edi
  4005d4:       e8 c7 00 00 00          callq  4006a0 &amp;lt;__intel_new_proc_init&amp;gt;
  4005d9:       0f ae 1c 24             stmxcsr (%rsp)
  4005dd:       bf 00 9b 60 00          mov    $0x609b00,%edi
  4005e2:       be 00 ab 60 00          mov    $0x60ab00,%esi
  4005e7:       81 0c 24 40 80 00 00    orl    $0x8040,(%rsp)
  4005ee:       ba 00 10 00 00          mov    $0x1000,%edx
  4005f3:       0f ae 14 24             ldmxcsr (%rsp)
  4005f7:       e8 54 00 00 00          callq  400650 &amp;lt;_intel_fast_memcpy&amp;gt;
  4005fc:       33 c0                   xor    %eax,%eax
  4005fe:       48 89 ec                mov    %rbp,%rsp
  400601:       5d                      pop    %rbp
  400602:       c3                      retq   
  400603:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400608:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40060f:       00
So the Intel compiler uses _intel_fast_memcpy.

From a performance perspective, the executable compiled by the Intel compiler isn't any faster than the one compiled by gcc. I tried using VTune to profile the compiled code; it shows that _intel_fast_memcpy takes most of the time, but it doesn't show me which instructions in _intel_fast_memcpy are time-consuming.</description>
      <pubDate>Mon, 17 Dec 2012 22:48:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975779#M5591</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-17T22:48:07Z</dc:date>
    </item>
    <item>
      <title>I know this question is more</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975780#M5592</link>
<description>I know this question is more related to topics in other sections: how do I profile code in an external library? As I said, I can't see the instructions in _intel_fast_memcpy. I tried to link the C library into my program statically, but then I got this error:
$ amplxe-cl -collect hotspots ./rand-memcpy 1 8
Error: Binary file of the analysis target does not contain symbols required for profiling. See documentation for more details.
Error: Valid pthread_setcancelstate symbol is not found in the static binary of the analysis target.
So how do I profile _intel_fast_memcpy?

Thanks,
Da</description>
      <pubDate>Mon, 17 Dec 2012 22:58:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975780#M5592</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-17T22:58:45Z</dc:date>
    </item>
    <item>
      <title>I guess you're looking at 32</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975781#M5593</link>
      <description>I guess you're looking at 32-bit gcc, which appears to optimize for short or non-aligned copies.  If you use icc -static-intel, with long enough copies, you should be able to collect data to view __intel_fast_memcpy in assembly view.  It might be interesting to see whether the results change with alignment.</description>
      <pubDate>Tue, 18 Dec 2012 02:32:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975781#M5593</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-12-18T02:32:57Z</dc:date>
    </item>
    <item>
      <title>@zhengda1936</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</link>
      <description>@zhengda1936

Please follow this call instruction 4005f7: e8 54 00 00 00 callq 400650 &amp;lt;_intel_fast_memcpy&amp;gt;
It would be interesting to see the exact machine code implementation of that function.</description>
      <pubDate>Tue, 18 Dec 2012 06:13:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T06:13:53Z</dc:date>
    </item>
    <item>
      <title>Guys,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975783#M5595</link>
      <description>Guys,

1. Please take a look at / read the original post again, because I really don't understand these continued pushes to disassemble / debug the &lt;STRONG&gt;memcpy&lt;/STRONG&gt; function currently used in his tests.
2. The user is &lt;STRONG&gt;on a NUMA system&lt;/STRONG&gt;, and this is a "different world" ( I don't have access to any such system at the moment ).

The user clearly described that:
...
&lt;STRONG&gt;When I copy data in the local memory&lt;/STRONG&gt;, &lt;STRONG&gt;I can get&lt;/STRONG&gt; up to almost &lt;STRONG&gt;10GB/s&lt;/STRONG&gt;. However, &lt;STRONG&gt;when I copy data !!! from !!! remote memory&lt;/STRONG&gt;, &lt;STRONG&gt;I get&lt;/STRONG&gt; much worse performance, only around &lt;STRONG&gt;1GB/s&lt;/STRONG&gt;
...

He uses the &lt;STRONG&gt;same memcpy&lt;/STRONG&gt; function &lt;STRONG&gt;in both cases&lt;/STRONG&gt; and possibly experiences some hardware issue ( I could be wrong here ), and that has to be taken into account. When he reads data from the remote memory, it looks like he simply switches the Source and Destination pointers in the &lt;STRONG&gt;same memcpy&lt;/STRONG&gt; function.

A question to &lt;STRONG&gt;zhengda1936&lt;/STRONG&gt;,

Could you post C/C++ source codes of your test-case, please?</description>
      <pubDate>Tue, 18 Dec 2012 15:01:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975783#M5595</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-18T15:01:10Z</dc:date>
    </item>
    <item>
      <title>Quote:TimP (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975784#M5596</link>
      <description>&lt;BLOCKQUOTE&gt;TimP (Intel) wrote:&lt;BR /&gt;&lt;P&gt;I guess you're looking at 32-bit gcc, which appears to optimize for short or non-aligned copies.  If you use icc -static-intel, with long enough copies, you should be able to collect data to view __intel_fast_memcpy in assembly view.  It might be interesting to see whether the results change with alignment.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
No, I run my program in a 64-bit Linux, and I have switched to the Intel compiler. 

I have tried -static-intel, but it seems it doesn't link the Intel library statically. With or without -static-intel, the compiled executable is exactly the same (I ran cmp on the two versions of the executable). If I add -static as a linker option, I get the same error when I run the executable under VTune.

BTW, I copy 1GB of memory, and the memory is aligned to the page size.

Da</description>
      <pubDate>Tue, 18 Dec 2012 20:16:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975784#M5596</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:16:16Z</dc:date>
    </item>
    <item>
      <title>If it helps, I post my code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975785#M5597</link>
<description>If it helps, I post my code here. Basically, it allocates 1GB of memory aligned to the page size on a NUMA node, and tries to copy the memory to a specified NUMA node.

#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sys/time.h&amp;gt;
#include &amp;lt;sys/types.h&amp;gt;
#include &amp;lt;sys/resource.h&amp;gt;
#include &amp;lt;numa.h&amp;gt;
#include &amp;lt;numaif.h&amp;gt;

#define NUM_THREADS 64
#define PAGE_SIZE 4096
#define ENTRY_SIZE PAGE_SIZE
#define ARRAY_SIZE 1073741824

off_t *offset;
unsigned int nentries;
int nthreads;
struct timeval global_start;
char *array;
char *dst_arr;

void permute_offset(off_t *offset, int num)
{
	int i;
	for (i = num - 1; i &amp;gt;= 1; i--) {
		int j = random() % i;
		off_t tmp = offset[j];
		offset[j] = offset[i];
		offset[i] = tmp;
	}
}

float time_diff(struct timeval time1, struct timeval time2)
{
	return time2.tv_sec - time1.tv_sec
			+ ((float)(time2.tv_usec - time1.tv_usec))/1000000;
}

void *rand_read(void *arg)
{
	int i, j, start_i, end_i;
	ssize_t read_bytes = 0;
	struct timeval start_time, end_time;

	start_i = (long) arg;
	end_i = start_i + nentries / nthreads;
	gettimeofday(&amp;amp;start_time, NULL);
	for (j = 0; j &amp;lt; 8; j++) {
		for (i = start_i; i &amp;lt; end_i; i++) {
			memcpy(dst_arr + offset[i], array + offset[i], ENTRY_SIZE);
			read_bytes += ENTRY_SIZE;
		}
	}
	gettimeofday(&amp;amp;end_time, NULL);
	printf("read %ld bytes, start at %f seconds, takes %f seconds\n",
			read_bytes, time_diff(global_start, start_time),
			time_diff(start_time, end_time));

	pthread_exit((void *) read_bytes);
}

int main(int argc, char *argv[])
{
	int ret;
	int i;
	struct timeval start_time, end_time;
	ssize_t read_bytes = 0;
	pthread_t threads[NUM_THREADS];
	/* the number of entries the array can contain. */
	int node;

	if (argc != 3) {
		fprintf(stderr, "read node_id num_threads\n");
		exit(1);
	}

	nentries = ARRAY_SIZE / ENTRY_SIZE;
	node = atoi(argv[1]);
	offset = valloc(sizeof(*offset) * nentries);
	for(i = 0; i &amp;lt; nentries; i++) {
		offset[i] = ((off_t) i) * ENTRY_SIZE;
	}
	permute_offset(offset, nentries);

#if 0
	int ncpus = numa_num_configured_cpus();
	printf("there are %d cores in the machine\n", ncpus);
	for (i = 0; i &amp;lt; ncpus; i++) {
		printf("cpu %d belongs to node %d\n",
			i, numa_node_of_cpu(i));
	}
#endif
	/* bind to node 0. */
	nodemask_t nodemask;
	nodemask_zero(&amp;amp;nodemask);
	nodemask_set_compat(&amp;amp;nodemask, 0);
	unsigned long maxnode = NUMA_NUM_NODES;
	if (set_mempolicy(MPOL_BIND,
				(unsigned long *) &amp;amp;nodemask, maxnode) &amp;lt; 0) {
		perror("set_mempolicy");
		exit(1);
	}
	printf("run on node 0\n");
	if (numa_run_on_node(0) &amp;lt; 0) {
		perror("numa_run_on_node");
		exit(1);
	}

	array = valloc(ARRAY_SIZE);
	/* we need to avoid the cost of page faults. */
	for (i = 0; i &amp;lt; ARRAY_SIZE; i += PAGE_SIZE)
		array[i] = 0;
	dst_arr = valloc(ARRAY_SIZE);
	/* we need to avoid the cost of page faults. */
	for (i = 0; i &amp;lt; ARRAY_SIZE; i += PAGE_SIZE)
		dst_arr[i] = 0;

	printf("run on node %d\n", node);
	if (numa_run_on_node(node) &amp;lt; 0) {
		perror("numa_run_on_node");
		exit(1);
	}

	nthreads = atoi(argv[2]);
	if (nthreads &amp;gt; NUM_THREADS) {
		fprintf(stderr, "too many threads\n");
		exit(1);
	}

	ret = setpriority(PRIO_PROCESS, getpid(), -20);
	if (ret &amp;lt; 0) {
		perror("setpriority");
		exit(1);
	}

	gettimeofday(&amp;amp;start_time, NULL);
	global_start = start_time;
	for (i = 0; i &amp;lt; nthreads; i++) {
		ret = pthread_create(&amp;amp;threads[i], NULL,
				rand_read, (void *) (long) (nentries / nthreads * i));
		if (ret) {
			perror("pthread_create");
			exit(1);
		}
	}

	for (i = 0; i &amp;lt; nthreads; i++) {
		ssize_t size;
		ret = pthread_join(threads[i], (void **) &amp;amp;size);
		if (ret) {
			perror("pthread_join");
			exit(1);
		}
		read_bytes += size;
	}
	gettimeofday(&amp;amp;end_time, NULL);
	printf("read %ld bytes, takes %f seconds\n",
			read_bytes, end_time.tv_sec - start_time.tv_sec
			+ ((float)(end_time.tv_usec - start_time.tv_usec))/1000000);
}</description>
      <pubDate>Tue, 18 Dec 2012 20:25:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975785#M5597</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:25:08Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;The user is on a NUMA</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975786#M5598</link>
      <description>&amp;gt;&amp;gt;&amp;gt;The user is on a NUMA system and this is a "different world" ( I don't have access to any such system at the moment )&amp;gt;&amp;gt;&amp;gt;

It would be very helpful if @zhengda1936 could post his hardware configuration. I'm sure that he has a quad-CPU motherboard, probably manufactured by TYAN or Supermicro.

&amp;gt;&amp;gt;&amp;gt; Please take a look / read the original post again because I really don't understand these continued pushes to disassemble / debug a memcpy function currently used in his tests&amp;gt;&amp;gt;&amp;gt;

I think that disassembling the memcpy() function and revealing its exact machine-code implementation could give us some insight into what is going on under the hood. As I stated earlier in my post, there is a possibility that the 'rep movsb' instruction is used by the compiler.
I do not exclude the possibility of some hardware related issue.

&amp;gt;&amp;gt;&amp;gt;When I copy data in the local memory, I can get up to almost 10GB/s. However, when I copy data !!! from !!! remote memory, I get much worse performance, only around 1GB/s&amp;gt;&amp;gt;&amp;gt;

It is obvious that in a NUMA architecture one can expect memory-transfer speed degradation when, for example, CPU 0 accesses non-local (remote) memory from its relative "point of view".</description>
      <pubDate>Tue, 18 Dec 2012 20:37:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975786#M5598</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T20:37:55Z</dc:date>
    </item>
    <item>
      <title>Sure, I can provide the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975787#M5599</link>
      <description>Sure, I can provide the hardware configuration. Here is the CPU info:
          description: CPU
          product: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
          vendor: Intel Corp.
          physical id: 400
          bus info: cpu@0
          version: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
          slot: CPU1
          size: 2200MHz
          capacity: 3600MHz
          width: 64 bits
          clock: 2905MHz

So each CPU has 8 cores and there are 4 CPUs in the machine.

Memory info:
             description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
             product: M393B2G70BH0-YH9
             vendor: 00CE00B300CE
             physical id: 0
             serial: 342F3D9C
             slot: DIMM_A1
             size: 16GiB
             width: 64 bits
             clock: 1333MHz (0.8ns)

What other hardware configuration info should I provide to help you diagnose the problem?</description>
      <pubDate>Tue, 18 Dec 2012 20:50:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975787#M5599</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:50:09Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;So each CPU has 8 cores</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975788#M5600</link>
      <description>&amp;gt;&amp;gt;&amp;gt;So each CPU has 8 cores and there are 4 CPUs in the machine.&amp;gt;&amp;gt;&amp;gt;

Do you mean 8 threads/4 cores per CPU?
Have you experienced earlier memory speed degradation?</description>
      <pubDate>Tue, 18 Dec 2012 21:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975788#M5600</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T21:04:00Z</dc:date>
    </item>
    <item>
      <title>As I show above, gcc compiled</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975789#M5601</link>
      <description>As I show above, gcc compiled memcpy into "rep movsq", and Intel compiler invokes __intel_fast_memcpy.
I used perf to profile my program; it seems memcpy eventually invokes __intel_ssse3_rep_memcpy, and the most time-consuming instructions are:
    0.06 :          405657:       movaps -0x10(%rsi),%xmm1
   38.23 :          40565b:       movaps %xmm1,-0x10(%rdi)
    1.00 :          40565f:       movaps -0x20(%rsi),%xmm2
    0.06 :          405663:       movaps %xmm2,-0x20(%rdi)
    0.18 :          405667:       movaps -0x30(%rsi),%xmm3
    0.06 :          40566b:       movaps %xmm3,-0x30(%rdi)
    0.05 :          40566f:       movaps -0x40(%rsi),%xmm4
    0.01 :          405673:       movaps %xmm4,-0x40(%rdi)
    0.10 :          405677:       movaps -0x50(%rsi),%xmm5
   41.82 :          40567b:       movaps %xmm5,-0x50(%rdi)
    0.47 :          40567f:       movaps -0x60(%rsi),%xmm5
    0.03 :          405683:       movaps %xmm5,-0x60(%rdi)
    0.06 :          405687:       movaps -0x70(%rsi),%xmm5
    0.01 :          40568b:       movaps %xmm5,-0x70(%rdi)
    0.04 :          40568f:       movaps -0x80(%rsi),%xmm5
    0.01 :          405693:       movaps %xmm5,-0x80(%rdi)
It seems one data copy triggers moving 64 bytes at a time to the remote node, so only the first instruction touching each 64-byte block consumes most of the CPU time. I had thought one data copy would trigger moving 128 bytes to a remote node (since I assumed the cache line is 128 bytes).</description>
      <pubDate>Tue, 18 Dec 2012 21:43:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975789#M5601</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T21:43:35Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975790#M5602</link>
      <description>&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;So each CPU has 8 cores and there are 4 CPUs in the machine.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Do you mean 8 threads/4 cores per CPU?&lt;BR /&gt;
Have you experienced earlier memory speed degradation?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

No, 16 threads/8 cores per CPU. 
What do you mean by earlier memory speed degradation?
The local memory copy speed is expected.</description>
      <pubDate>Tue, 18 Dec 2012 21:46:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975790#M5602</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T21:46:48Z</dc:date>
    </item>
    <item>
      <title>Hi Da,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975791#M5603</link>
      <description>Hi Da,

It is always right to follow a &lt;STRONG&gt;top-down approach&lt;/STRONG&gt; when investigating a problem. That is:

- Source codes -&amp;gt;
- Analysis -&amp;gt;
- Is there a hardware problem?
- Are there any logical errors in the codes? -&amp;gt;
- Could I reproduce a problem? -&amp;gt;
- Could I simplify the test-case? -&amp;gt;
- Could I remove some dependencies on 3rd party software components -&amp;gt;
- Why does my application crash ( if this is the case )? -&amp;gt;
- What else could be wrong with &lt;STRONG&gt;my codes&lt;/STRONG&gt;?
- Etc.

It means that if a C/C++ developer tries to investigate the opposite way, following a &lt;STRONG&gt;bottom-up approach&lt;/STRONG&gt; ( disassembling first, all the rest later ), a significant amount of project time could be wasted.

From my point of view a &lt;STRONG&gt;Summary&lt;/STRONG&gt; of the problem could look like:

- Possible logical problem with the test-case ( very high possibility )
- Possible oversubscription of the processing threads ( high possibility )
- Possible hardware issue with the NUMA system ( very low possibility )
- Possible problem with CRT memcpy function ( low possibility )

A simplified test-case is needed &lt;STRONG&gt;without changing priorities&lt;/STRONG&gt; of any threads or the process, and ideally it would be nice to have just one thread of normal priority. This is needed to verify that the NUMA system doesn't have any hardware issues.

A logic for the simplified test-case could look like:

- one thread test application
- allocate a memory block in a 'local' memory
- copy some data ( some number of times to get an average time )
- invalidate cache lines somehow
- read some data ( some number of times to get an average time )
- save performance numbers
- allocate a memory block in a 'remote' memory
- copy some data ( some number of times to get an average time )
- invalidate cache lines somehow
- read some data ( some number of times to get an average time )
- save performance numbers
- compare results
- repeat the test with more threads ( increase by 2 every time ) until it reaches 64

&lt;STRONG&gt;1.&lt;/STRONG&gt; After a very quick code review of the test-case I noticed that a priority of the executing process is changed:
...
setpriority( PRIO_PROCESS, getpid(), -20 );
...
Why do you change the priority of the process?

&lt;STRONG&gt;2.&lt;/STRONG&gt; In order to clear up &lt;STRONG&gt;any uncertainties&lt;/STRONG&gt; with the 'memcpy' function, I recommend replacing it with an external pure C function ( a couple of minutes to implement, right? )

&lt;STRONG&gt;3.&lt;/STRONG&gt; A Virtual Memory Manager ( VMM ) on any OS should have 'Above Normal' or 'High' priority. If processing thread(s) in some test have higher priorities, then the VMM will be preempted most of the time and any memory operations using 'mem'-like CRT functions will be affected. Also, there will be a performance degradation of the whole operating system. If processing thread(s) have lower priorities, like 'Below Normal' or 'Idle', then they will be preempted most of the time and the performance of the test will be affected.

&lt;STRONG&gt;4.&lt;/STRONG&gt; A brief high-level overview of the test-case would also help.

Best regards,
Sergey</description>
      <pubDate>Tue, 18 Dec 2012 23:08:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975791#M5603</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-18T23:08:00Z</dc:date>
    </item>
  </channel>
</rss>

