<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Regarding what you are saying in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087053#M7182</link>
    <description>&lt;P&gt;Regarding what you are saying, I did a lot of testing and research to understand&amp;nbsp;how this thread management works! I now understand why this flexibility exists. It gave me a new perspective on future optimizations.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;And now I have two small questions:&lt;/SPAN&gt;&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;When Hyper-Threading is enabled, are the L2 and L1 cache levels shared between the threads, or are they divided and each thread has a part?&lt;/LI&gt;
	&lt;LI&gt;I like the “rdtscp” solution, that is more easy to port (and a little faster). Is there a version who is ok with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Thu, 04 Feb 2016 11:04:47 GMT</pubDate>
    <dc:creator>MGRAV</dc:creator>
    <dc:date>2016-02-04T11:04:47Z</dc:date>
    <item>
      <title>omp parallel on the same CPU</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087049#M7178</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I am seeing a strange effect when using the NUMA library (libnuma) together with OpenMP.&lt;/P&gt;

&lt;P&gt;I run:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;PRE class="brush:cpp;"&gt;#pragma omp parallel
{
	int cpu = sched_getcpu();
	int node = numa_node_of_cpu(cpu);
	printf("%d ; %d\n", cpu, node);
}&lt;/PRE&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I expect have one thread one each CPU, so to have each CPU one time.&lt;BR /&gt;
	But in reality I get something like this:&lt;/P&gt;

&lt;P&gt;39 ; 1&lt;BR /&gt;
	22 ; 0&lt;BR /&gt;
	13 ; 1&lt;BR /&gt;
	12 ; 1&lt;BR /&gt;
	11 ; 1&lt;BR /&gt;
	31 ; 1&lt;BR /&gt;
	10 ; 1&lt;BR /&gt;
	21 ; 0&lt;BR /&gt;
	30 ; 1&lt;BR /&gt;
	3 ; 0&lt;BR /&gt;
	21 ; 0&lt;BR /&gt;
	10 ; 1&lt;BR /&gt;
	12 ; 1&lt;BR /&gt;
	32 ; 1&lt;BR /&gt;
	33 ; 1&lt;BR /&gt;
	21 ; 0&lt;BR /&gt;
	3 ; 0&lt;BR /&gt;
	14 ; 1&lt;BR /&gt;
	18 ; 1&lt;BR /&gt;
	16 ; 1&lt;BR /&gt;
	15 ; 1&lt;BR /&gt;
	21 ; 0&lt;BR /&gt;
	4 ; 0&lt;BR /&gt;
	33 ; 1&lt;BR /&gt;
	17 ; 1&lt;BR /&gt;
	30 ; 1&lt;BR /&gt;
	31 ; 1&lt;BR /&gt;
	37 ; 1&lt;BR /&gt;
	18 ; 1&lt;BR /&gt;
	21 ; 0&lt;BR /&gt;
	17 ; 1&lt;BR /&gt;
	14 ; 1&lt;BR /&gt;
	30 ; 1&lt;BR /&gt;
	18 ; 1&lt;BR /&gt;
	15 ; 1&lt;BR /&gt;
	38 ; 1&lt;BR /&gt;
	12 ; 1&lt;BR /&gt;
	30 ; 1&lt;BR /&gt;
	35 ; 1&lt;BR /&gt;
	30 ; 1&lt;/P&gt;

&lt;P&gt;So I do have 40 threads, but for some reason they are not in the right places.&lt;BR /&gt;
	Does anyone have an explanation?&lt;BR /&gt;
	And a solution?&lt;/P&gt;

&lt;P&gt;I am not interested in the thread number itself; what matters is allocating memory in the right place, not the thread ID.&lt;/P&gt;

&lt;P&gt;Mathieu&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 09:08:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087049#M7178</guid>
      <dc:creator>MGRAV</dc:creator>
      <dc:date>2016-01-14T09:08:43Z</dc:date>
    </item>
    <item>
      <title>I don't know if the code is</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087050#M7179</link>
      <description>&lt;P&gt;I don't know if the code is correct -- I have not used those interfaces before -- but there are a couple of other things you should consider.&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Most threaded applications should bind the threads to a core or set of cores.&amp;nbsp;
		&lt;OL&gt;
			&lt;LI&gt;With the Intel compilers the KMP_AFFINITY variable is the preferred way of controlling OpenMP thread placement, and the "verbose" option will cause the job to print out lots of useful information at the start of the job.&amp;nbsp; See example below.&lt;/LI&gt;
			&lt;LI&gt;With the GNU compilers the GOMP_CPU_AFFINITY variable is used to control OpenMP thread placement.&amp;nbsp; This provides similar low-level functionality, but requires cores to be numbered explicitly, so it is a pain to port across systems that use different core-numbering policies.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Some environments will provide an external binding when launching your job.&amp;nbsp; This is very common with MPI applications, but it can also be done by other job control infrastructures, and you can test it manually using "numactl".
		&lt;OL&gt;
			&lt;LI&gt;By default, KMP_AFFINITY will respect the processor mask that the OpenMP job inherits from its environment.&amp;nbsp; In this case the logical processor bindings shown by the "verbose" option will include only the logical processors that the external processor mask allows.&lt;/LI&gt;
			&lt;LI&gt;You can get the processor mask for each OpenMP thread using the "sched_getaffinity()" call.&amp;nbsp; For OpenMP threads you need to use a value of zero as the pid to get sched_getaffinity to return the affinity mask for the current OpenMP thread.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;If your threads are not bound to specific cores, the act of calling sched_getcpu() could easily cause the OS to move the thread.
		&lt;OL&gt;
			&lt;LI&gt;Every Linux system that I know of supports an alternative way of getting the chip and core currently being used, by including this information in the IA32_TSC_AUX register that can be read using the RDTSCP instruction.&amp;nbsp;&amp;nbsp; This is a user-mode instruction that will not give the OS any particular excuse to move the process.&amp;nbsp; (If the process is not bound the OS can move it whenever it wants to, but this is usually triggered by an OS call or by receiving an interrupt.&amp;nbsp; The RDTSCP instruction does not do either of these things.)&amp;nbsp;&amp;nbsp; A sample routine to execute the RDTSCP instruction and return the socket and core number is appended.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
&lt;/OL&gt;
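
&lt;P&gt;As a quick check of point 2, the mask that a job inherits can be inspected from the shell before any OpenMP code runs. This is only a sketch: it assumes a Linux system with /proc mounted, and "allowed_cpus" is a hypothetical helper name, not part of any tool mentioned above.&lt;/P&gt;

```shell
# Print the list of logical CPUs that the calling process is allowed to run on.
# "grep" inherits the affinity mask of the shell that starts it, so its own
# /proc entry reflects whatever external binding is in effect.
allowed_cpus() {
    grep Cpus_allowed_list /proc/self/status | cut -f2
}

echo "allowed CPUs: $(allowed_cpus)"
```

&lt;P&gt;Running the same "grep" under "numactl --physcpubind=0-3" should report only processors in that range, which is the set that KMP_AFFINITY would then have available.&lt;/P&gt;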


&lt;P&gt;Example using KMP_AFFINITY with and without external binding:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;icc -openmp binding_test.c -o binding_test
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,scatter"
echo "run with no external binding"
./binding_test
   [lots of output, including...]
   OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
   OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 
   [...]
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 0 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 1 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 2 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 3 bound to OS proc set {3}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 5 bound to OS proc set {5}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 4 bound to OS proc set {4}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 6 bound to OS proc set {6}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 7 bound to OS proc set {7}

echo "run with external binding to cores 0-3"
numactl --physcpubind=0-3 ./binding_test
   [lots of output, including...]
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 0 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 1 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 2 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 3 bound to OS proc set {3}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 5 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 4 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 6 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 7 bound to OS proc set {3}
&lt;/PRE&gt;


&lt;P&gt;Example code to use RDTSCP -- the function returns the Time Stamp Counter value and also stores, in the "chip" and "core" variables (passed by pointer), the chip number and the (global) logical processor number on which the code was running when the RDTSCP instruction was executed.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;unsigned long tacc_rdtscp(int *chip, int *core)
{
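   unsigned a, d, c;

   /* RDTSCP returns the TSC in EDX:EAX and the IA32_TSC_AUX value in ECX. */
   __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
   *chip = (c &amp;amp; 0xFFF000) &amp;gt;&amp;gt; 12;   /* bits 12-23: chip (socket) number */
   *core = c &amp;amp; 0xFFF;                 /* bits 0-11: logical processor number */

   return ((unsigned long)a) | (((unsigned long)d) &amp;lt;&amp;lt; 32);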
}
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 16:27:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087050#M7179</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-01-14T16:27:12Z</dc:date>
    </item>
    <item>
      <title>It looks like you have 2</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087051#M7180</link>
      <description>&lt;P&gt;It looks like you have 2 physical processors (that is, two chips), each with 10 cores, and with Hyper-Threading enabled. (I am not aware of any 20-core chip.)&lt;/P&gt;

&lt;P&gt;From your printout, it looks as if you do not have affinity set (see John's reply).&lt;/P&gt;

&lt;P&gt;The sched_getcpu function returns a system logical processor number, whose relationship to physical CPU, core, and hardware thread depends on the configuration settings (or lack thereof) at the time of the function call.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 18 Jan 2016 22:46:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087051#M7180</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-01-18T22:46:10Z</dc:date>
    </item>
    <item>
      <title>I agree that your problem is</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087052#M7181</link>
      <description>I agree that your problem is related to not setting affinity for these &lt;STRONG&gt;40&lt;/STRONG&gt; OpenMP threads. Next,

&amp;gt;&amp;gt;...I expect have one thread one each CPU, so to have each CPU one time...

Unfortunately, no, because the OpenMP directive

&lt;STRONG&gt;#pragma omp parallel&lt;/STRONG&gt;
{
...
some processing...
...
}

does &lt;STRONG&gt;not&lt;/STRONG&gt; guarantee at all that every OpenMP thread will be assigned to its own logical CPU ( a 1-to-1 relation ), and it does &lt;STRONG&gt;not&lt;/STRONG&gt; matter whether you are using a &lt;STRONG&gt;NUMA&lt;/STRONG&gt; system or a &lt;STRONG&gt;Non-NUMA&lt;/STRONG&gt; system.

That is why your output has at least &lt;STRONG&gt;three&lt;/STRONG&gt; assignments to the CPU &lt;STRONG&gt;12&lt;/STRONG&gt; on the NUMA node &lt;STRONG&gt;1&lt;/STRONG&gt;:
...
12 ; 1
...
12 ; 1
...
12 ; 1
...
and, in that case the relation OpenMP-thread-to-CPU is &lt;STRONG&gt;3-to-1&lt;/STRONG&gt;. Do you agree with that?

But you need to set affinity for all &lt;STRONG&gt;40&lt;/STRONG&gt; OpenMP threads before (&lt;STRONG&gt;!!!&lt;/STRONG&gt;) the &lt;STRONG&gt;#pragma omp parallel&lt;/STRONG&gt; directive, and you should not try to set it every time an OpenMP thread executes inside the &lt;STRONG&gt;#pragma omp parallel&lt;/STRONG&gt; block.

Since I've implemented my own fully portable OpenMP-Thread-to-CPU Affinity Management, I would say that the OpenMP designers did not fully study what had been done in the &lt;STRONG&gt;Multi Threaded World&lt;/STRONG&gt; in the past. That is, these designers neglected the legacy of &lt;STRONG&gt;Process and Thread Affinity Management&lt;/STRONG&gt; from the beginning and do not want to introduce it in the latest versions of the OpenMP standard. That is why &lt;STRONG&gt;Intel&lt;/STRONG&gt; introduced its own &lt;STRONG&gt;KMP-based&lt;/STRONG&gt; ( partially portable ) solution to the problem. That is why I've implemented my own fully portable solution.

The famous &lt;STRONG&gt;David Cutler&lt;/STRONG&gt;, former Lead Software Engineer and the "Father" of the Windows NT scheduler ( SMP-based ), designed the &lt;STRONG&gt;SetProcessAffinityMask&lt;/STRONG&gt; and &lt;STRONG&gt;SetThreadAffinityMask&lt;/STRONG&gt; Win32 API functions for &lt;STRONG&gt;Windows NT&lt;/STRONG&gt; from the beginning! These two functions are more than 25 years old and, even though they were designed for the &lt;STRONG&gt;Multi-Processor&lt;/STRONG&gt; systems of the 1990s ( before &lt;STRONG&gt;Multi-Core&lt;/STRONG&gt; CPUs appeared around the year 2000 ), they still do what they need to do now.</description>
      <pubDate>Mon, 01 Feb 2016 05:49:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087052#M7181</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-02-01T05:49:00Z</dc:date>
    </item>
    <item>
      <title>Regarding what you are saying</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087053#M7182</link>
      <description>&lt;P&gt;Regarding what you are saying, I did a lot of testing and research to understand&amp;nbsp;how this thread management works! I now understand why this flexibility exists. It gave me a new perspective on future optimizations.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;And now I have two small questions:&lt;/SPAN&gt;&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;When Hyper-Threading is enabled, are the L2 and L1 cache levels shared between the threads, or are they divided and each thread has a part?&lt;/LI&gt;
	&lt;LI&gt;I like the “rdtscp” solution, that is more easy to port (and a little faster). Is there a version who is ok with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Feb 2016 11:04:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087053#M7182</guid>
      <dc:creator>MGRAV</dc:creator>
      <dc:date>2016-02-04T11:04:47Z</dc:date>
    </item>
    <item>
      <title>On current processors, the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087054#M7183</link>
      <description>&lt;P&gt;On current processors, the HyperThreads (or Hardware Threads when speaking of Xeon Phi) of a single core&amp;nbsp;share the L1 and L2 caches of that core. This also applies to the instruction cache.&lt;/P&gt;
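
&lt;P&gt;On Linux, the sharing on a particular machine can usually be checked through sysfs. A small sketch, assuming the kernel exposes the per-CPU "cache" directory (some virtual machines and containers do not); "show_cache_sharing" is a hypothetical helper name chosen for this example.&lt;/P&gt;

```shell
# For one logical CPU (default: 0), print each cache level, its type, and the
# list of logical CPUs that share it. Prints nothing if sysfs has no cache info.
show_cache_sharing() {
    cpu="${1:-0}"
    for idx in /sys/devices/system/cpu/cpu"$cpu"/cache/index*; do
        [ -d "$idx" ] || continue
        echo "L$(cat "$idx/level") $(cat "$idx/type"): CPUs $(cat "$idx/shared_cpu_list")"
    done
}

show_cache_sharing 0
```

&lt;P&gt;With Hyper-Threading enabled, the L1 and L2 entries would be expected to list the two sibling hardware threads of the core, while the last-level cache lists every logical CPU in the package.&lt;/P&gt;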

&lt;P&gt;&amp;gt;&amp;gt;&lt;EM&gt;I like the “rdtscp” solution, that is more easy to port (and a little faster). Is there a version who is ok with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;Doesn't this imply it does not port?&lt;/P&gt;

&lt;P&gt;Most multi-threaded applications that are concerned about cache also perform affinity pinning, in one of the following forms:&lt;/P&gt;

&lt;P&gt;Locking a software thread to a hardware thread&lt;BR /&gt;
	Locking a software thread&amp;nbsp;to a core (i.e. permitted to run on any hardware thread of a given core)&lt;BR /&gt;
	Locking a software thread to a CPU&lt;BR /&gt;
	Locking a software thread to a NUMA node&lt;BR /&gt;
	Locking a software thread to within a NUMA distance&lt;/P&gt;

&lt;P&gt;For L1 and L2, binding of the first two kinds is used.&lt;BR /&gt;
	For L3/LLC, CPU binding is used.&lt;BR /&gt;
	For close memory, either CPU or NUMA node binding is used.&lt;BR /&gt;
	Most users will never experience more than one NUMA hop (current and adjacent).&lt;/P&gt;
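
&lt;P&gt;For the first three levels, runtimes that implement OpenMP 4.0 or later accept the standard OMP_PLACES and OMP_PROC_BIND environment variables. A minimal sketch, assuming such a runtime; the exact placement still depends on the implementation and on the machine topology.&lt;/P&gt;

```shell
# Choose ONE binding level; these are abstract place names from OpenMP 4.0.
export OMP_PLACES=threads      # lock each software thread to one hardware thread
# export OMP_PLACES=cores      # ...to any hardware thread of one core
# export OMP_PLACES=sockets    # ...to any core of one CPU (package)

# Bind threads to their places, packed close to the master thread.
export OMP_PROC_BIND=close

echo "OMP_PLACES=$OMP_PLACES OMP_PROC_BIND=$OMP_PROC_BIND"
```

&lt;P&gt;Set these before starting the program; NUMA-node and NUMA-distance binding are not covered by these abstract names and need an explicit place list or an external tool such as numactl.&lt;/P&gt;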

&lt;P&gt;On KNC, there is only one&amp;nbsp;timestamp counter within the CPU (one CPU per coprocessor), and this counter does not run at the same rate as the host timestamp counter. While there may be a way to nearly synchronize the timestamp counters of multiple coprocessors, doing so is likely not practical.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 04 Feb 2016 13:24:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087054#M7183</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-02-04T13:24:07Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;I like the “rdtscp”</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087055#M7184</link>
      <description>&amp;gt;&amp;gt;I like the “&lt;STRONG&gt;rdtscp&lt;/STRONG&gt;” solution, that is more easy to port (and a little faster). Is there a version who is ok
&amp;gt;&amp;gt;with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')

&lt;STRONG&gt;1.&lt;/STRONG&gt; Take into account that &lt;STRONG&gt;rdtscp&lt;/STRONG&gt; is a privileged instruction and can only be executed in &lt;STRONG&gt;Ring-0&lt;/STRONG&gt; ( the privileged ring ).

&lt;STRONG&gt;2.&lt;/STRONG&gt; You need to provide more details about your C++ compiler ( Note: you could also check that &lt;STRONG&gt;immintrin.h&lt;/STRONG&gt; ( support for AVX ISA ) header file exists ).</description>
      <pubDate>Mon, 08 Feb 2016 02:14:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087055#M7184</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-02-08T02:14:16Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...While there may be a way</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087056#M7185</link>
      <description>&amp;gt;&amp;gt;...While there may be a way to nearly synchronize the timestamp counters of multiple coprocessors, doing so
&amp;gt;&amp;gt;is likely not practical.

I did some research on that about 2 years ago, but it was only partially completed. I managed to get timestamp values for all working CPUs but did not try to synchronize them. There is also a thread about it on IDZ.</description>
      <pubDate>Mon, 08 Feb 2016 02:22:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087056#M7185</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-02-08T02:22:05Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...or are they divided and</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087057#M7186</link>
      <description>&amp;gt;&amp;gt;...or are they divided and each thread has a part?

Honestly, this is what I want to hear from Intel hardware or software engineers and I don't think they will respond.

The standard &lt;STRONG&gt;prefetch&lt;/STRONG&gt; instruction, which can be used with the hints &lt;STRONG&gt;T0&lt;/STRONG&gt;, &lt;STRONG&gt;T1&lt;/STRONG&gt;, &lt;STRONG&gt;T2&lt;/STRONG&gt; and &lt;STRONG&gt;NTA&lt;/STRONG&gt;, does not allow loading data into a specific portion of a cache. Am I wrong?

Here is a piece of codes from my headers just for your information:

&lt;STRONG&gt;[ HrtAL.h ]&lt;/STRONG&gt;
...
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Note 01: Descriptions of Hint Codes for _mm_prefetch intrinsic function:
//			Loads one cache line of data ( address is an input ) to a location
//			closer to the Processor Unit.
//
//	_MM_HINT_T0	- Prefetch data into all Levels of the cache hierarchy
//			  ( temporal data ).
//	_MM_HINT_T1	- Prefetch data into Level 2 cache and higher
//			  ( temporal data with respect to a Level 1 cache miss ).
//	_MM_HINT_T2	- Prefetch data into Level 3 cache and higher, or an
//			  implementation-specific choice
//			  ( temporal data with respect to a Level 2 cache miss ).
//	_MM_HINT_NTA	- Prefetch data into a non-temporal cache structure and
//			  into a location close to the processor, minimizing cache pollution
//			  ( non-temporal data with respect to all cache Levels ).
//
// Note 02: The implementation of the HrtPrefetchData&amp;lt; T0/T1/T2/NTA &amp;gt; functions corresponds to:
//
//	_mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T0 );	// 0F 18 08		prefetcht0  [eax]
//	_mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T1 );	// 0F 18 10		prefetcht1  [eax]
//	_mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T2 );	// 0F 18 18		prefetcht2  [eax]
//	_mm_prefetch( ( RTchar * )piAddress, _MM_HINT_NTA );	// 0F 18 00		prefetchnta [eax]
...</description>
      <pubDate>Mon, 08 Feb 2016 02:35:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087057#M7186</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-02-08T02:35:20Z</dc:date>
    </item>
    <item>
      <title>RDTSCP is not a privileged</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087058#M7187</link>
      <description>&lt;P&gt;RDTSCP is not a privileged instruction except in the unusual case that the CR4.TSD bit is set.&amp;nbsp;&amp;nbsp; I have never seen this bit set outside of some virtual machine implementations, but I have heard that some folks who are extremely paranoid about covert channels may have also used this bit to disable low-latency time stamp counter access.&lt;/P&gt;

&lt;P&gt;Unfortunately, the RDTSCP instruction is newer than the P54C core used in the Xeon Phi, and it is not supported there.&amp;nbsp; Xeon Phi works best with strong mandatory thread binding using the KMP_AFFINITY and KMP_PLACE_THREADS environment variables.&amp;nbsp;&amp;nbsp; It looks like the next generation Xeon Phi (Knights Landing) will support the RDTSCP instruction, because it supports the associated IA32_TSC_AUX MSR.&lt;/P&gt;

&lt;P&gt;Concerning the caches -- when HyperThreading is enabled, the L1 Instruction Cache, L1 Data Cache, and L2 unified cache are shared by the two threads.&amp;nbsp; There are lots of ways to adjust the "fairness" of the sharing using the LRU policies of the caches, but I am not aware of any Intel disclosures in this area.&amp;nbsp;&amp;nbsp; For homogeneous workloads, the behavior is pretty much what you would expect if the cache were evenly split between the two threads, but there are corner cases where behavior is less easy to understand (particularly if each thread wants to use more than 4 of the 8 ways of associativity of the cache).&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 19:52:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/omp-parallel-on-the-same-CPU/m-p/1087058#M7187</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-02-08T19:52:33Z</dc:date>
    </item>
  </channel>
</rss>

