First, I ran two tests:
1) I prepared a randomized linked list ("randomized" meaning each next item sits at a random address within L1D, at a multiple of the cache line size, to keep the prefetcher from helping the reader). Every list item is one cache line (64 bytes), and the total number of items is 32 KB (L1D size) / 64 bytes (cache line size). From the other core I walk the list and measure the time, then divide it by the number of elements; the result is how long it takes to load one cache line from a different core. It is consistent with what the Intel documentation says: about 50 cycles.
2) If I prepare two such lists and walk both of them from the other core, the number does not change much, because the CPU can issue two loads per cycle and the two lists are completely independent of each other: there is no data dependency, so the CPU can issue the loads of the two next list items simultaneously. This is consistent with the documentation as well.
What I do now: I have four variables (8 bytes each), each aligned to a cache line boundary; two of them mimic one pseudo-queue and the other two mimic the second pseudo-queue. The first thread changes the first variable ('data'), then the second variable ('counter'), then reads the third variable and the fourth; the second thread does the opposite. So the first thread writes to the first pseudo-queue and reads from the second, while the second thread reads from the first one and writes to the second.
The code looks like this:
first thread:
for (size_t i = 0; i < count; ++i)
{
    data0 = i;
    // barrier();
    value1 = i;
    unsigned long long vtmp0;
    unsigned long long tmp0 = value2;
    // barrier();
    vtmp0 = data1;
    while (tmp0 != i)
    {
        cpu_pause();
        tmp0 = value2;
        // barrier();
        vtmp0 = data1;
    }
    v += tmp0 + vtmp0; // just calculating something
}
the second thread:
for (size_t i = 0; i < count; ++i)
{
    unsigned long long vtmp0;
    unsigned long long tmp0 = value1;
    // barrier();
    vtmp0 = data0;
    while (tmp0 != i)
    {
        cpu_pause();
        tmp0 = value1;
        // barrier();
        vtmp0 = data0;
    }
    v += tmp0 + vtmp0; // just calculating something
    data1 = i;
    // barrier();
    value2 = i;
}
Of course, I need those barrier() calls to force the compiler to preserve the order of the reads and writes.
With the barrier()s in place, the time is about 210; with them commented out, it is about 175.
I don't understand why this difference exists at all. Any ideas?
Run each (presumably optimized) configuration under VTune, then look at the disassembly. This will show you the difference in instruction sequences.
Note, it is possible that the while loop with the longer instruction latencies takes fewer iterations to exit (and thus executes fewer cpu_pause()s).
Jim Dempsey
Sergey Kostrov wrote:
>>...It's consistent to what is said in the Intel documentation, about 50 cycles...
Where did you get that number? Source, please.
64-ia-32-architectures-optimization-manual.pdf, section 2.3.5.1 "Load and Store Operation Overview", the table of lookup order and lookup latency: "L2 and L1 DCache in other core" — 43 cycles for a clean hit, 60 for a dirty hit.
jimdempseyatthecove wrote:
Run each (presumably optimized) configuration under VTune, then look at the disassembly. This will show you the difference in instruction sequences.
Note, it is possible that the while loop with the longer instruction latencies takes fewer iterations to exit (and thus executes fewer cpu_pause()s).
Jim Dempsey
Let's assume the instructions are reordered. What is happening is that one thread changes two cache lines and another thread reads them. Of course the reader impedes the writer with its reads, but there is a pause instruction there, which takes around 50 cycles, so the writer should be able to complete its writes before the reader interrupts it again.
>>let's assume that instructions are reordered.
It is the programmer's responsibility to work with the compiler to ensure that the emitted instruction sequence is the one required (and to verify this by inspecting the disassembly).
That said, IA-32 and Intel 64 (though not necessarily IA-64/Itanium) are strongly ordered systems whose cache coherency preserves write ordering.
Can you post your test program for others to examine?
Jim Dempsey