Intel® Moderncode for Parallel Architectures

reading two cache lines issue

Aleksandr_A_1 — Tue, 02 Aug 2016 07:48:58 GMT

First, I did 2 tests:

1) I prepared a randomized linked list. "Randomized" means that each item's successor sits at a random address within an L1D-sized region, at a multiple of the cache line size; this keeps the prefetcher from helping the reader. Each item is exactly one cache line (64 bytes), and the number of items is 32K (L1D size) / 64 (cache line size) = 512. From another core I then walk the list, measure the time, and divide it by the number of elements, which gives the time needed to load one cache line from a different core. The result is about 50 cycles, which is consistent with the Intel documentation.

2) Of course, if I prepare two such lists and walk both of them from the other core, this number does not change much: the CPU can issue 2 loads per cycle, and the two lists are completely independent of each other, so there is no data dependency between them and the CPU can issue the loads of the next item of each list simultaneously. This is also consistent with the documentation.
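The two tests above can be sketched roughly as follows. This is a minimal sketch, not the poster's code: it assumes C++17 (so that `std::vector` honours the 64-byte alignment), and the names `Node`, `make_random_cycle`, `chase`, `chase_two` and `ns_per_load` are hypothetical. The cross-core part (filling the list on one core, pinning the reader to another) is platform-specific and left out, and the timing is in nanoseconds rather than cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

constexpr std::size_t kLineSize = 64;                // cache line size from the post
constexpr std::size_t kL1dSize  = 32 * 1024;         // L1D size from the post
constexpr std::size_t kNodes    = kL1dSize / kLineSize;

// One list item occupies exactly one cache line.
struct alignas(kLineSize) Node {
    Node* next;
    char pad[kLineSize - sizeof(Node*)];
};
static_assert(sizeof(Node) == kLineSize, "Node must fill one cache line");

// Link all nodes into a single cycle in shuffled order, so the successor of
// each node is at an unpredictable line-aligned address.
std::vector<Node> make_random_cycle(std::size_t n, unsigned seed) {
    std::vector<Node> nodes(n);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    for (std::size_t k = 0; k < n; ++k)
        nodes[order[k]].next = &nodes[order[(k + 1) % n]];
    return nodes;  // moved out, so the stored pointers stay valid
}

// Test 1: every load depends on the previous one.
const Node* chase(const Node* p, std::size_t steps) {
    for (std::size_t k = 0; k < steps; ++k) p = p->next;
    return p;
}

// Test 2: two independent chains advanced in the same loop; the loads of
// a and b do not depend on each other.
std::pair<const Node*, const Node*> chase_two(const Node* a, const Node* b,
                                              std::size_t steps) {
    for (std::size_t k = 0; k < steps; ++k) {
        a = a->next;
        b = b->next;
    }
    return {a, b};
}

// Average wall-clock time per dependent load, in nanoseconds.
double ns_per_load(const Node* start, std::size_t steps) {
    assert(steps > 0);
    const auto t0 = std::chrono::steady_clock::now();
    const Node* volatile sink = chase(start, steps);  // keep the result alive
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

A caller would build the list with `make_random_cycle(kNodes, seed)` and pass `&nodes[0]` to `ns_per_load`. Walked on the same core that built it, the list is already in the local L1D, so this measures the local hit latency; reproducing the cross-core number needs the thread placement described in the post.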

What I do now: I have 4 variables (8 bytes each), each aligned on a cache line boundary. Two of them mimic one pseudo queue and the other two mimic a second pseudo queue. The first thread writes the first variable ('data'), then the second variable ('counter'), and then reads the third and the fourth variables; the second thread does the opposite. So the first thread writes to the first pseudo queue and reads from the second, while the second thread reads from the first one and writes to the second.

code looks like this:

first thread:

for (size_t i = 0; i < count; ++i)
{
    data0 = i;
    // barrier();
    value1 = i;

    unsigned long long vtmp0;
    unsigned long long tmp0 = value2;
    // barrier();
    vtmp0 = data1;

    while (tmp0 != i)
    {
        cpu_pause();

        tmp0 = value2;
        // barrier();
        vtmp0 = data1;
    }

    v += tmp0 + vtmp0; // just calculating something
}

the second thread:

for (size_t i = 0; i < count; ++i)
{
    unsigned long long vtmp0;

    unsigned long long tmp0 = value1;
    // barrier();
    vtmp0 = data0;

    while (tmp0 != i)
    {
        cpu_pause();

        tmp0 = value1;
        // barrier();
        vtmp0 = data0;
    }

    v += tmp0 + vtmp0; // just calculating something

    data1 = i;
    // barrier();
    value2 = i;
}

Of course, I need those barrier() calls to force the compiler to preserve the order of the reads and writes.
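The post does not show how barrier() and cpu_pause() are defined. A minimal sketch of one common way to write them is given here, purely as an assumption: a compiler-only fence (no instruction emitted) and the x86 PAUSE instruction via `_mm_pause()`. The `Slot` type and the `publish`, `try_consume` and `smoke_check` helpers are hypothetical names used only to show where the fence sits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define SKETCH_HAVE_PAUSE 1
#endif

// Assumed definition: stops the compiler from moving memory accesses across
// this point. It emits no CPU instruction.
inline void barrier() {
#if defined(__GNUC__) || defined(__clang__)
    __asm__ __volatile__("" ::: "memory");
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Assumed definition: spin-wait hint on x86, compiler fence elsewhere.
inline void cpu_pause() {
#ifdef SKETCH_HAVE_PAUSE
    _mm_pause();
#else
    barrier();
#endif
}

// Each variable gets a cache line of its own, as described in the post.
struct alignas(64) Slot {
    std::uint64_t v = 0;
};

Slot data0;   // payload of the first pseudo queue
Slot value1;  // counter of the first pseudo queue

// Writer side: payload first, then the counter that announces it.
void publish(std::uint64_t i) {
    data0.v = i;
    barrier();  // keep the compiler from sinking the payload store below
    value1.v = i;
}

// Reader side: counter first, then the payload.
bool try_consume(std::uint64_t expected, std::uint64_t& out) {
    const std::uint64_t counter = value1.v;
    barrier();  // keep the compiler from hoisting the payload load above
    const std::uint64_t payload = data0.v;
    if (counter != expected) return false;
    out = payload;
    return true;
}

// Single-threaded smoke check of the helpers above.
void smoke_check() {
    publish(7);
    std::uint64_t out = 0;
    const bool ok = try_consume(7, out);
    assert(ok && out == 7);
    (void)ok;
}
```

Strictly speaking, plain variables shared between threads this way are a data race in ISO C++; `std::atomic` with release/acquire ordering is the portable way to get the same ordering, and on x86 it typically compiles to ordinary loads and stores.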

If the barrier() calls are not commented out, then the time is about 210.

If the barrier() calls are commented out, then the time is about 175.

I don't understand why this difference exists at all.

Any ideas?

jimdempseyatthecove — Tue, 02 Aug 2016 15:05:30 GMT

Run each (presumably optimized) configuration under VTune, then look at the disassembly code. This will tell you the instruction sequence difference.

Note, it is possible that the/a while loop that has longer instruction latencies could take fewer iterations to exit (and thus fewer cpu_pause()'s)

Jim Dempsey
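One way to check the second point is to count how many times each thread goes around its wait loop. The following is a minimal sketch under stated assumptions, not the poster's test program: it uses `std::atomic` with release/acquire ordering instead of barrier(), numbers the rounds from 1 so that the zero-initialised counters never match by accident, and omits thread pinning. The names `Line`, `Stats` and `ping_pong` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define SKETCH_HAVE_PAUSE 1
#endif

inline void cpu_pause() {
#ifdef SKETCH_HAVE_PAUSE
    _mm_pause();
#else
    std::this_thread::yield();  // fallback on non-x86 targets
#endif
}

// One 8-byte variable per cache line.
struct alignas(64) Line {
    std::atomic<std::uint64_t> v{0};
};

struct Stats {
    std::uint64_t spins_first = 0;   // wait-loop iterations, first thread
    std::uint64_t spins_second = 0;  // wait-loop iterations, second thread
    std::uint64_t sum_first = 0;     // payload sum seen by the first thread
    std::uint64_t sum_second = 0;    // payload sum seen by the second thread
};

Stats ping_pong(std::uint64_t count) {
    Line data0, value1, data1, value2;
    Stats s;

    std::thread second([&] {
        std::uint64_t spins = 0, sum = 0;  // locals avoid sharing a line with s
        for (std::uint64_t seq = 1; seq <= count; ++seq) {
            while (value1.v.load(std::memory_order_acquire) != seq) {
                ++spins;
                cpu_pause();
            }
            sum += data0.v.load(std::memory_order_relaxed);
            data1.v.store(seq, std::memory_order_relaxed);
            value2.v.store(seq, std::memory_order_release);
        }
        s.spins_second = spins;
        s.sum_second = sum;
    });

    // The first thread runs on the caller.
    std::uint64_t spins = 0, sum = 0;
    for (std::uint64_t seq = 1; seq <= count; ++seq) {
        data0.v.store(seq, std::memory_order_relaxed);
        value1.v.store(seq, std::memory_order_release);
        while (value2.v.load(std::memory_order_acquire) != seq) {
            ++spins;
            cpu_pause();
        }
        sum += data1.v.load(std::memory_order_relaxed);
    }
    second.join();  // makes the second thread's results visible

    s.spins_first = spins;
    s.sum_first = sum;
    assert(s.sum_first == s.sum_second);  // both saw every payload once
    return s;
}
```

Comparing `spins_first` and `spins_second` between two builds of a test would show whether the faster variant simply leaves its wait loops after fewer pauses. The counts vary from run to run, so they are best compared as averages over many rounds.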

SergeyKostrov — Fri, 05 Aug 2016 21:14:58 GMT

>>...It's consistent to what is said in the Intel documentation, about 50 cycles...

Where did you get that number? Source, please.

Aleksandr_A_1 — Mon, 08 Aug 2016 08:04:56 GMT

Sergey Kostrov wrote:

>>...It's consistent to what is said in the Intel documentation, about 50 cycles...

Where did you get that number? Source, please.

64-ia-32-architectures-optimization-manual.pdf

2.3.5.1 Load and Store Operation Overview

Lookup order and lookup latency:

L2 and L1 DCache in another core: 43 cycles (clean hit), 60 cycles (dirty hit)

Aleksandr_A_1 — Mon, 08 Aug 2016 08:08:18 GMT

jimdempseyatthecove wrote:

Run each (presumably optimized) configuration under VTune, then look at the disassembly code. This will tell you the instruction sequence difference.

Note, it is possible that the/a while loop that has longer instruction latencies could take fewer iterations to exit (and thus fewer cpu_pause()'s)

Jim Dempsey

Let's assume that the instructions are reordered. What is happening is that one thread changes two cache lines and another thread reads them. Of course the reader impedes the writer with its reads, but there is a pause instruction there, which takes around 50 cycles, so the writer should be able to finish its writes before the reader interrupts it again.

jimdempseyatthecove — Mon, 08 Aug 2016 12:38:42 GMT

>>let's assume that instructions are reordered.

It is the programmer's responsibility to work with the compiler to ensure that the generated instruction sequence is the one required (and to verify this by inspection).

That said, IA32 and Intel64 (not necessarily IA64/Itanium) are strongly ordered architectures that preserve write ordering under cache coherency.

Can you post your test program for others to examine?

Jim Dempsey