<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic @zhengda1936 in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</link>
    <description>@zhengda1936

Please follow this call instruction 4005f7: e8 54 00 00 00 callq 400650 &amp;lt;_intel_fast_memcpy&amp;gt;
It would be interesting to see the exact machine code implementation of that function.</description>
    <pubDate>Tue, 18 Dec 2012 06:13:53 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2012-12-18T06:13:53Z</dc:date>
    <item>
      <title>The best method for inter-processor data communication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975772#M5584</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I measure the performance of memory copy in a NUMA machine with 4&amp;nbsp;Xeon(R) CPU E5-4620 processors. When I copy data in the local memory, I can get up to almost 10GB/s. However, when I copy data from remote memory, I get much worse performance, only around 1GB/s. I use memcpy() to copy data and each copy is a page size (4KB).&lt;/P&gt;
&lt;P&gt;I wonder if Intel processors provide special instructions for inter-processor data movement. I know Intel uses QPI for inter-processor communication. Does it expose an interface to programmers? Is the performance above the best I can get?&lt;/P&gt;
&lt;P&gt;Thanks,&lt;BR /&gt;Da&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Dec 2012 21:01:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975772#M5584</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-14T21:01:22Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I wonder if Intel</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975773#M5585</link>
      <description>&amp;gt;&amp;gt;...I wonder if Intel processors provides special instructions for inter-processor data movement...

Please take a look at the Instruction Set Reference located at: www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

&amp;gt;&amp;gt;...I get much worse performance, only around 1GB/s...

Access to &lt;STRONG&gt;local&lt;/STRONG&gt; memory is always faster; however, a 10x drop in performance is significant. Is the penalty really that large when accessing &lt;STRONG&gt;remote&lt;/STRONG&gt; memory?</description>
      <pubDate>Sat, 15 Dec 2012 07:21:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975773#M5585</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-15T07:21:19Z</dc:date>
    </item>
    <item>
      <title>The best option is to</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975774#M5586</link>
<description>The best option is to disassemble the memcpy() function and look at its machine-code implementation. A rep prefix combined with the movsd instruction is typically used to copy memory in large quantities.
I think that inter-processor communication at the lowest level is managed by the hardware itself. I do not know whether there is a programming interface exposed to the programmer for managing and controlling inter-processor communication programmatically.
Regarding documentation, I recommend reading the Intel chipset documentation, which probably contains some information about inter-processor communication.</description>
      <pubDate>Sat, 15 Dec 2012 08:39:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975774#M5586</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-15T08:39:23Z</dc:date>
    </item>
    <item>
      <title>It may be worthwhile to check</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975775#M5587</link>
<description>It may be worthwhile to check your memcpy() version and your data alignments.  At one time, the __intel_fast_memcpy() substitution made by Intel compilers could be a great help.  Any up-to-date memcpy() ought to take advantage of SIMD nontemporal instructions in cases where alignment is compatible.  rep movsd should be used only where alignment requires it.  The memcpy() supplied by early 64-bit Linux distros was extremely poor.  You might want to experiment with 16-, 32-, and 64-byte alignments for both source and destination.
corei7-2 CPUs were supposed to be designed to improve performance of rep mov loops such as 32-bit gcc might create, but there would still be an advantage in setting alignment so as to use simd instructions.   Some past CPUs performed poorly with rep mov loops.
In connection with iliyapolak's remark, it would be interesting to use a profiler such as VTune or oprofile to show which instructions are actually used in your slow case.
I'm not sure what causes might be suspect for a slowdown such as you quote on that platform; more than a 2x penalty for remote memory would be disappointing.  You should check whether the RAM is compatible and properly distributed among the slots.</description>
      <pubDate>Sun, 16 Dec 2012 00:41:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975775#M5587</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-12-16T00:41:40Z</dc:date>
    </item>
    <item>
      <title>Sorry for a question not</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975776#M5588</link>
      <description>Sorry for a question not related to the subject.

&amp;gt;&amp;gt;...NUMA machine with 4 Xeon(R) CPU E5-4620 processors...

Are these NUMA computers expensive? How much would the cheapest computer that supports the NUMA architecture cost? Thanks in advance.

Note: I'm asking because I couldn't find an answer on the web.</description>
      <pubDate>Sun, 16 Dec 2012 02:18:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975776#M5588</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-16T02:18:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...NUMA machine with 4 Xeon</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975777#M5589</link>
<description>&amp;gt;&amp;gt;...NUMA machine with 4 Xeon(R) CPU E5-4620 processors...&amp;gt;&amp;gt;&amp;gt;
Although the following is not an Intel-based chipset, you can estimate the price from the motherboard and the CPUs.
Please follow this link: http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&amp;amp;pid=670&amp;amp;SKU=600000180
And this link: http://www.pcsuperstore.com/products/11113480-Tyan-S8812WGM3NR.html

For Intel-based chipset motherboards, please follow these links: http://www.supermicro.com/products/motherboard/Xeon/C600/X9QR7-TF.cfm
&lt;A href="http://www.alvio.com/xABK_PID1237628_supermicro-computer_mbd-x9qr7-tf-o_romley-quad-socket-sas2-ipmi-20-retail_amd-socket-f-1207-motherboards.html" target="_blank"&gt;http://www.alvio.com/xABK_PID1237628_supermicro-computer_mbd-x9qr7-tf-o_romley-quad-socket-sas2-ipmi-20-retail_amd-socket-f-1207-motherboards.html&lt;/A&gt;
A whole system can easily reach a price of $2500-3000.
For complete solutions, follow this link:
&lt;A href="http://www.supermicro.com/xeon_mp/" target="_blank"&gt;www.supermicro.com/xeon_mp/&lt;/A&gt;</description>
      <pubDate>Sun, 16 Dec 2012 07:28:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975777#M5589</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-16T07:28:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;You should check whether</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975778#M5590</link>
      <description>&amp;gt;&amp;gt;&amp;gt;You should check whether the RAM is compatible and properly distributed among the slots.&amp;gt;&amp;gt;&amp;gt;
The lower hardware layer, i.e. the chipset's memory controller or the on-die memory controller, should also be considered as a possible cause of the poor memory performance.
Regarding the internal implementation of memcpy(), it is possible that the compiler wrongly emits the rep movsb instruction instead of rep movsd.</description>
      <pubDate>Sun, 16 Dec 2012 08:00:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975778#M5590</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-16T08:00:30Z</dc:date>
    </item>
    <item>
      <title>Thank you for all your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975779#M5591</link>
      <description>Thank you for all your suggestions.

I started by checking the assembly code of a small test program:
#include &amp;lt;string.h&amp;gt;

char src[4096];
char dest[4096];

int main()
{
	memcpy(dest, src, sizeof(dest));
}

gcc compiles the code into the following assembly code:
0000000000000000 &amp;lt;main&amp;gt;:
   0:	bf 00 00 00 00       	mov    $0x0,%edi
   5:	be 00 00 00 00       	mov    $0x0,%esi
   a:	b9 00 02 00 00       	mov    $0x200,%ecx
   f:	f3 48 a5             	rep movsq %ds:(%rsi),%es:(%rdi)
  12:	c3                   	retq 
The code is pretty straightforward and as expected.

The Intel compiler compiles it into:
00000000004005c0 &amp;lt;main&amp;gt;:
  4005c0:       55                      push   %rbp
  4005c1:       48 89 e5                mov    %rsp,%rbp
  4005c4:       48 83 e4 80             and    $0xffffffffffffff80,%rsp
  4005c8:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
  4005cf:       bf 03 00 00 00          mov    $0x3,%edi
  4005d4:       e8 c7 00 00 00          callq  4006a0 &amp;lt;__intel_new_proc_init&amp;gt;
  4005d9:       0f ae 1c 24             stmxcsr (%rsp)
  4005dd:       bf 00 9b 60 00          mov    $0x609b00,%edi
  4005e2:       be 00 ab 60 00          mov    $0x60ab00,%esi
  4005e7:       81 0c 24 40 80 00 00    orl    $0x8040,(%rsp)
  4005ee:       ba 00 10 00 00          mov    $0x1000,%edx
  4005f3:       0f ae 14 24             ldmxcsr (%rsp)
  4005f7:       e8 54 00 00 00          callq  400650 &amp;lt;_intel_fast_memcpy&amp;gt;
  4005fc:       33 c0                   xor    %eax,%eax
  4005fe:       48 89 ec                mov    %rbp,%rsp
  400601:       5d                      pop    %rbp
  400602:       c3                      retq   
  400603:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400608:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40060f:       00
So the Intel compiler uses _intel_fast_memcpy.

From a performance perspective, the executable compiled by the Intel compiler isn't any faster than the one compiled by gcc. I tried using VTune to profile the compiled code; it shows that _intel_fast_memcpy takes most of the time, but it doesn't show me which instructions in _intel_fast_memcpy are time-consuming.</description>
      <pubDate>Mon, 17 Dec 2012 22:48:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975779#M5591</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-17T22:48:07Z</dc:date>
    </item>
    <item>
      <title>I know this question is more</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975780#M5592</link>
<description>I know this question is more related to topics in other sections: how do I profile code in an external library? As I said, I can't see the instructions in _intel_fast_memcpy. I tried to link the C library into my program statically, but then I got this error:
$ amplxe-cl -collect hotspots ./rand-memcpy 1 8
Error: Binary file of the analysis target does not contain symbols required for profiling. See documentation for more details.
Error: Valid pthread_setcancelstate symbol is not found in the static binary of the analysis target.
So how do I profile _intel_fast_memcpy?

Thanks,
Da</description>
      <pubDate>Mon, 17 Dec 2012 22:58:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975780#M5592</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-17T22:58:45Z</dc:date>
    </item>
    <item>
      <title>I guess you're looking at 32</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975781#M5593</link>
      <description>I guess you're looking at 32-bit gcc, which appears to optimize for short or non-aligned copies.  If you use icc -static-intel, with long enough copies, you should be able to collect data to view __intel_fast_memcpy in assembly view.  It might be interesting to see whether the results change with alignment.</description>
      <pubDate>Tue, 18 Dec 2012 02:32:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975781#M5593</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-12-18T02:32:57Z</dc:date>
    </item>
    <item>
      <title>@zhengda1936</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</link>
      <description>@zhengda1936

Please follow this call instruction 4005f7: e8 54 00 00 00 callq 400650 &amp;lt;_intel_fast_memcpy&amp;gt;
It would be interesting to see the exact machine code implementation of that function.</description>
      <pubDate>Tue, 18 Dec 2012 06:13:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975782#M5594</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T06:13:53Z</dc:date>
    </item>
    <item>
      <title>Guys,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975783#M5595</link>
      <description>Guys,

1. Please take a look at / read the original post again, because I really don't understand these continued pushes to disassemble / debug the &lt;STRONG&gt;memcpy&lt;/STRONG&gt; function currently used in his tests.
2. The user is &lt;STRONG&gt;on a NUMA system&lt;/STRONG&gt;, and this is a "different world" ( I don't have access to any such system at the moment ).

The user clearly described that:
...
&lt;STRONG&gt;When I copy data in the local memory&lt;/STRONG&gt;, &lt;STRONG&gt;I can get&lt;/STRONG&gt; up to almost &lt;STRONG&gt;10GB/s&lt;/STRONG&gt;. However, &lt;STRONG&gt;when I copy data !!! from !!! remote memory&lt;/STRONG&gt;, &lt;STRONG&gt;I get&lt;/STRONG&gt; much worse performance, only around &lt;STRONG&gt;1GB/s&lt;/STRONG&gt;
...

He uses the &lt;STRONG&gt;same memcpy&lt;/STRONG&gt; function &lt;STRONG&gt;in both cases&lt;/STRONG&gt; and possibly experiences some hardware issue ( I could be wrong here ), and that has to be taken into account. When he reads data from the remote memory, it looks like he simply switches the Source and Destination pointers in the &lt;STRONG&gt;same memcpy&lt;/STRONG&gt; function.

A question to &lt;STRONG&gt;zhengda1936&lt;/STRONG&gt;,

Could you post C/C++ source codes of your test-case, please?</description>
      <pubDate>Tue, 18 Dec 2012 15:01:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975783#M5595</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-18T15:01:10Z</dc:date>
    </item>
    <item>
      <title>Quote:TimP (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975784#M5596</link>
      <description>&lt;BLOCKQUOTE&gt;TimP (Intel) wrote:&lt;BR /&gt;&lt;P&gt;I guess you're looking at 32-bit gcc, which appears to optimize for short or non-aligned copies.  If you use icc -static-intel, with long enough copies, you should be able to collect data to view __intel_fast_memcpy in assembly view.  It might be interesting to see whether the results change with alignment.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
No, I run my program in a 64-bit Linux, and I have switched to the Intel compiler. 

I have tried -static-intel, but it seems it doesn't link the Intel library statically. With or without -static-intel, the compiled executable is exactly the same (I ran cmp on the two versions of the executable). If I add -static as a linker option, I get the same error when I run the executable under VTune.

BTW, I copy 1GB of memory, and the memory is aligned to the page size.

Da</description>
      <pubDate>Tue, 18 Dec 2012 20:16:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975784#M5596</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:16:16Z</dc:date>
    </item>
    <item>
      <title>If it helps, I post my code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975785#M5597</link>
<description>If it helps, I post my code here. Basically, it allocates 1GB of memory aligned to the page size on a NUMA node, and tries to copy the memory to a specified NUMA node.

#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sys/time.h&amp;gt;
#include &amp;lt;sys/types.h&amp;gt;
#include &amp;lt;sys/resource.h&amp;gt;
#include &amp;lt;numa.h&amp;gt;
#include &amp;lt;numaif.h&amp;gt;

#define NUM_THREADS 64
#define PAGE_SIZE 4096
#define ENTRY_SIZE PAGE_SIZE
#define ARRAY_SIZE 1073741824

off_t *offset;
unsigned int nentries;
int nthreads;
struct timeval global_start;
char *array;
char *dst_arr;

void permute_offset(off_t *offset, int num)
{
	int i;
	for (i = num - 1; i &amp;gt;= 1; i--) {
		int j = random() % i;
		off_t tmp = offset[j];
		offset[j] = offset[i];
		offset[i] = tmp;
	}
}

float time_diff(struct timeval time1, struct timeval time2)
{
	return time2.tv_sec - time1.tv_sec
			+ ((float)(time2.tv_usec - time1.tv_usec))/1000000;
}

void *rand_read(void *arg)
{
	int i, j, start_i, end_i;
	ssize_t read_bytes = 0;
	struct timeval start_time, end_time;

	start_i = (long) arg;
	end_i = start_i + nentries / nthreads;
	gettimeofday(&amp;amp;start_time, NULL);
	for (j = 0; j &amp;lt; 8; j++) {
		for (i = start_i; i &amp;lt; end_i; i++) {
			memcpy(dst_arr + offset[i], array + offset[i], ENTRY_SIZE);
			read_bytes += ENTRY_SIZE;
		}
	}
	gettimeofday(&amp;amp;end_time, NULL);
	printf("read %ld bytes, start at %f seconds, takes %f seconds\n",
			read_bytes, time_diff(global_start, start_time),
			time_diff(start_time, end_time));

	pthread_exit((void *) read_bytes);
}

int main(int argc, char *argv[])
{
	int ret;
	int i;
	struct timeval start_time, end_time;
	ssize_t read_bytes = 0;
	pthread_t threads[NUM_THREADS];
	/* the number of entries the array can contain. */
	int node;

	if (argc != 3) {
		fprintf(stderr, "read node_id num_threads\n");
		exit(1);
	}

	nentries = ARRAY_SIZE / ENTRY_SIZE;
	node = atoi(argv[1]);
	offset = valloc(sizeof(*offset) * nentries);
	for(i = 0; i &amp;lt; nentries; i++) {
		offset[i] = ((off_t) i) * ENTRY_SIZE;
	}
	permute_offset(offset, nentries);

#if 0
	int ncpus = numa_num_configured_cpus();
	printf("there are %d cores in the machine\n", ncpus);
	for (i = 0; i &amp;lt; ncpus; i++) {
		printf("cpu %d belongs to node %d\n",
			i, numa_node_of_cpu(i));
	}
#endif
	/* bind to node 0. */
	nodemask_t nodemask;
	nodemask_zero(&amp;amp;nodemask);
	nodemask_set_compat(&amp;amp;nodemask, 0);
	unsigned long maxnode = NUMA_NUM_NODES;
	if (set_mempolicy(MPOL_BIND,
				(unsigned long *) &amp;amp;nodemask, maxnode) &amp;lt; 0) {
		perror("set_mempolicy");
		exit(1);
	}
	printf("run on node 0\n");
	if (numa_run_on_node(0) &amp;lt; 0) {
		perror("numa_run_on_node");
		exit(1);
	}

	array = valloc(ARRAY_SIZE);
	/* we need to avoid the cost of page faults. */
	for (i = 0; i &amp;lt; ARRAY_SIZE; i += PAGE_SIZE)
		array[i] = 0;
	dst_arr = valloc(ARRAY_SIZE);
	/* we need to avoid the cost of page faults. */
	for (i = 0; i &amp;lt; ARRAY_SIZE; i += PAGE_SIZE)
		dst_arr[i] = 0;

	printf("run on node %d\n", node);
	if (numa_run_on_node(node) &amp;lt; 0) {
		perror("numa_run_on_node");
		exit(1);
	}

	nthreads = atoi(argv[2]);
	if (nthreads &amp;gt; NUM_THREADS) {
		fprintf(stderr, "too many threads\n");
		exit(1);
	}

	ret = setpriority(PRIO_PROCESS, getpid(), -20);
	if (ret &amp;lt; 0) {
		perror("setpriority");
		exit(1);
	}

	gettimeofday(&amp;amp;start_time, NULL);
	global_start = start_time;
	for (i = 0; i &amp;lt; nthreads; i++) {
		ret = pthread_create(&amp;amp;threads[i], NULL,
				rand_read, (void *) (long) (nentries / nthreads * i));
		if (ret) {
			perror("pthread_create");
			exit(1);
		}
	}

	for (i = 0; i &amp;lt; nthreads; i++) {
		ssize_t size;
		ret = pthread_join(threads[i], (void **) &amp;amp;size);
		if (ret) {
			perror("pthread_join");
			exit(1);
		}
		read_bytes += size;
	}
	gettimeofday(&amp;amp;end_time, NULL);
	printf("read %ld bytes, takes %f seconds\n",
			read_bytes, end_time.tv_sec - start_time.tv_sec
			+ ((float)(end_time.tv_usec - start_time.tv_usec))/1000000);
}</description>
      <pubDate>Tue, 18 Dec 2012 20:25:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975785#M5597</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:25:08Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;The user is on a NUMA</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975786#M5598</link>
      <description>&amp;gt;&amp;gt;&amp;gt;The user is on a NUMA system and this is a "different world" ( I don't have access to any such system at the moment )&amp;gt;&amp;gt;&amp;gt;

It would be very helpful if @zhengda1936 could post his hardware configuration. I'm sure that he has a quad-CPU motherboard, probably manufactured by TYAN or Supermicro.

&amp;gt;&amp;gt;&amp;gt; Please take a look / read the original post again because I really don't understand these continued pushes to disassemble / debug a memcpy function currently used in his tests&amp;gt;&amp;gt;&amp;gt;

I think that disassembling the memcpy() function and revealing its exact machine-code implementation could give us some insight into what is going on under the hood. As I stated earlier in my post, there is a possibility that the 'rep movsb' instruction is used by the compiler.
I do not exclude the possibility of some hardware related issue.

&amp;gt;&amp;gt;&amp;gt;When I copy data in the local memory, I can get up to almost 10GB/s. However, when I copy data !!! from !!! remote memory, I get much worse performance, only around 1GB/s&amp;gt;&amp;gt;&amp;gt;

It is obvious that in a NUMA architecture one can expect memory-transfer speed degradation when, for example, CPU 0 accesses non-local (remote) memory from its relative "point of view".</description>
      <pubDate>Tue, 18 Dec 2012 20:37:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975786#M5598</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T20:37:55Z</dc:date>
    </item>
    <item>
      <title>Sure, I can provide the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975787#M5599</link>
      <description>Sure, I can provide the hardware configuration. Here is the CPU info:
          description: CPU
          product: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
          vendor: Intel Corp.
          physical id: 400
          bus info: cpu@0
          version: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
          slot: CPU1
          size: 2200MHz
          capacity: 3600MHz
          width: 64 bits
          clock: 2905MHz

So each CPU has 8 cores and there are 4 CPUs in the machine.

Memory info:
             description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
             product: M393B2G70BH0-YH9
             vendor: 00CE00B300CE
             physical id: 0
             serial: 342F3D9C
             slot: DIMM_A1
             size: 16GiB
             width: 64 bits
             clock: 1333MHz (0.8ns)

What other hardware configuration info should I provide to help you diagnose the problem?</description>
      <pubDate>Tue, 18 Dec 2012 20:50:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975787#M5599</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T20:50:09Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;So each CPU has 8 cores</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975788#M5600</link>
      <description>&amp;gt;&amp;gt;&amp;gt;So each CPU has 8 cores and there are 4 CPUs in the machine.&amp;gt;&amp;gt;&amp;gt;

Do you mean 8 threads/4 cores per CPU?
Have you experienced earlier memory speed degradation?</description>
      <pubDate>Tue, 18 Dec 2012 21:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975788#M5600</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-12-18T21:04:00Z</dc:date>
    </item>
    <item>
      <title>As I show above, gcc compiled</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975789#M5601</link>
      <description>As I show above, gcc compiled memcpy into "rep movsq", and Intel compiler invokes __intel_fast_memcpy.
I used perf to profile my program; it seems memcpy eventually invokes __intel_ssse3_rep_memcpy, and the most time-consuming instructions are:
    0.06 :          405657:       movaps -0x10(%rsi),%xmm1
   38.23 :          40565b:       movaps %xmm1,-0x10(%rdi)
    1.00 :          40565f:       movaps -0x20(%rsi),%xmm2
    0.06 :          405663:       movaps %xmm2,-0x20(%rdi)
    0.18 :          405667:       movaps -0x30(%rsi),%xmm3
    0.06 :          40566b:       movaps %xmm3,-0x30(%rdi)
    0.05 :          40566f:       movaps -0x40(%rsi),%xmm4
    0.01 :          405673:       movaps %xmm4,-0x40(%rdi)
    0.10 :          405677:       movaps -0x50(%rsi),%xmm5
   41.82 :          40567b:       movaps %xmm5,-0x50(%rdi)
    0.47 :          40567f:       movaps -0x60(%rsi),%xmm5
    0.03 :          405683:       movaps %xmm5,-0x60(%rdi)
    0.06 :          405687:       movaps -0x70(%rsi),%xmm5
    0.01 :          40568b:       movaps %xmm5,-0x70(%rdi)
    0.04 :          40568f:       movaps -0x80(%rsi),%xmm5
    0.01 :          405693:       movaps %xmm5,-0x80(%rdi)
It seems one data copy triggers moving 64 bytes at a time to the remote node, so only the first instruction touching each 64-byte block consumes most of the CPU time. I had thought one data copy would trigger moving 128 bytes to a remote node (since I assumed the cache line is 128 bytes).</description>
      <pubDate>Tue, 18 Dec 2012 21:43:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975789#M5601</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T21:43:35Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975790#M5602</link>
      <description>&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;So each CPU has 8 cores and there are 4 CPUs in the machine.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Do you mean 8 threads/4 cores per CPU?&lt;BR /&gt;
Have you experienced earlier memory speed degradation?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

No, 16 threads/8 cores per CPU. 
What do you mean by earlier memory speed degradation?
The local memory copy speed is expected.</description>
      <pubDate>Tue, 18 Dec 2012 21:46:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975790#M5602</guid>
      <dc:creator>zhengda1936</dc:creator>
      <dc:date>2012-12-18T21:46:48Z</dc:date>
    </item>
    <item>
      <title>Hi Da,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975791#M5603</link>
      <description>Hi Da,

It is always right to follow a &lt;STRONG&gt;top-down approach&lt;/STRONG&gt; when investigating a problem. That is:

- Source codes -&amp;gt;
- Analysis -&amp;gt;
- Is there a hardware problem?
- Are there any logical errors in the codes? -&amp;gt;
- Could I reproduce a problem? -&amp;gt;
- Could I simplify the test-case? -&amp;gt;
- Could I remove some dependencies on 3rd party software components -&amp;gt;
- Why does my application crash ( if this is the case )? -&amp;gt;
- What else could be wrong with &lt;STRONG&gt;my codes&lt;/STRONG&gt;?
- Etc.

It means that if a C/C++ developer tries to investigate the opposite way, following a &lt;STRONG&gt;bottom-up approach&lt;/STRONG&gt; ( disassembling first, all the rest later ), a significant amount of project time could be wasted.

From my point of view a &lt;STRONG&gt;Summary&lt;/STRONG&gt; of the problem could look like:

- Possible logical problem with the test-case ( very high possibility )
- Possible oversubscription of the processing threads ( high possibility )
- Possible hardware issue with the NUMA system ( very low possibility )
- Possible problem with CRT memcpy function ( low possibility )

A simplified test-case is needed &lt;STRONG&gt;without changing priorities&lt;/STRONG&gt; of any threads or the process, and ideally it would be nice to have just one thread of normal priority. This is needed to verify that the NUMA system doesn't have any hardware issues.

A logic for the simplified test-case could look like:

- one thread test application
- allocate a memory block in a 'local' memory
- copy some data ( some number of times to get an average time )
- invalidate cache lines somehow
- read some data ( some number of times to get an average time )
- save performance numbers
- allocate a memory block in a 'remote' memory
- copy some data ( some number of times to get an average time )
- invalidate cache lines somehow
- read some data ( some number of times to get an average time )
- save performance numbers
- compare results
- repeat the test with more threads ( increase by 2 every time ) until it reaches 64

&lt;STRONG&gt;1.&lt;/STRONG&gt; After a very quick code review of the test-case I noticed that a priority of the executing process is changed:
...
setpriority( PRIO_PROCESS, getpid(), -20 );
...
Why do you change the priority of the process?

&lt;STRONG&gt;2.&lt;/STRONG&gt; In order to clear up &lt;STRONG&gt;any uncertainties&lt;/STRONG&gt; with the 'memcpy' function, I recommend replacing it with an external pure C function ( a couple of minutes to implement, right? )

&lt;STRONG&gt;3.&lt;/STRONG&gt; A Virtual Memory Manager ( VMM ) on any OS should have 'Above Normal' or 'High' priority. If processing thread(s) in some test have higher priorities, then the VMM will be preempted most of the time and any memory operations using 'mem'-like CRT functions will be affected. Also, there will be a performance degradation of the whole operating system. If processing thread(s) have lower priorities, like 'Below Normal' or 'Idle', then they will be preempted most of the time and the performance of the test will be affected.

&lt;STRONG&gt;4.&lt;/STRONG&gt; A brief high-level overview of the test-case would also help.

Best regards,
Sergey</description>
      <pubDate>Tue, 18 Dec 2012 23:08:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/The-best-method-for-inter-processor-data-communication/m-p/975791#M5603</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-12-18T23:08:00Z</dc:date>
    </item>
  </channel>
</rss>

