Strange slowdown on Xeon W3540

Ruben_Adamyan · 09-10-2010

Hi Everyone,

I have a simple test program which shows some strange slowdown on my desktop computer.

Desktop is running Linux and the CPU is a quad-core W3540.

The test creates two threads which iterate over an array of structures, each 128 bytes in size (double the size of a cache line). The first thread modifies the first half of each structure while the second thread modifies the second half. The array is aligned to a cache line boundary, so there should not be any false sharing.

But it shows a slowdown as if there were false sharing.

However, if I increase the structure size to 256 bytes the slowdown goes away. Likewise, if I decrease the array size from 46 to 45 the slowdown goes away.

I would be glad if someone could explain the reason for this behavior.

I tested this on many computers with different Intel processors, and it seems to happen only on the W3540.

Here is the source code of my program.

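#include <pthread.h>

// 128-byte structure (two 64-byte cache lines), aligned to a cache line.
struct S {
    int m_i[32];
} __attribute__((aligned(64)));

const int n = 46;   // array size; the slowdown goes away with n = 45

S data[n];

int m;              // number of passes over the array

// Thread 1: increments the first int of each struct (first cache line).
void* f1(void*)
{
    for (int j = 0; j < m; ++j) {
        for (int i = 0; i < n; ++i) {
            data[i].m_i[0]++;
        }
    }
    return 0;
}

// Thread 2: increments the 17th int of each struct (second cache line).
void* f2(void*)
{
    for (int j = 0; j < m; ++j) {
        for (int i = 0; i < n; ++i) {
            data[i].m_i[16]++;
        }
    }
    return 0;
}

int main(int argc, char* argv[])
{
    m = 10000000;
    pthread_t t1, t2;
    pthread_create(&t1, 0, f1, 0);
    pthread_create(&t2, 0, f2, 0);
    // Exiting only the main thread lets t1 and t2 run to completion.
    pthread_exit(0);
    return 0;
}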

Roman_D_Intel · 09-20-2010

Hi Ruben,

To verify whether you indeed have false sharing, you can use the VTune or PTU tools; please see this article. You can also look at this guide to check whether a different issue is causing the slowdown.

Roman

jimdempseyatthecove · 09-22-2010

The item of interest here is what happens when you change the value of n (the size of the array of structures). Going from 46 to 45 should reduce the amount of work by only 1/46th, yet the reported difference in performance is much larger than that. Varying n will not affect the relative cache line alignment. However, by varying the array size you are varying the size of a static object preceding the code.

If I were to make an educated guess, I would venture that the movement of the code (f1 and f2) affected the instruction cache alignment of one or both loops. This can be confirmed by looking at the disassembly of the two loops under the two different values of n. Produce the address reports from the very same builds that exhibit the difference in performance.
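As a rough first check before reading the disassembly, the start addresses of the two thread functions can be printed and compared between the n = 45 and n = 46 builds. The sketch below is only an illustration: `offset_in_block` and `report_code_alignment` are hypothetical helper names, and 16 and 64 bytes are assumed to be the boundaries of interest. It reports where each function starts, not where the inner loop starts, so it narrows the search rather than replacing the disassembly.

```cpp
#include <cstdint>
#include <cstdio>

// Offset of `addr` inside an aligned block; `block` must be a power of two.
constexpr std::uintptr_t offset_in_block(std::uintptr_t addr, std::uintptr_t block) {
    return addr & (block - 1);
}

// Hypothetical helper: prints where a thread function starts relative to
// 16-byte and 64-byte boundaries (assumed boundaries of interest).
// The pointer-to-integer mapping is implementation-defined; on Linux with
// GCC it is the function's virtual address.
inline void report_code_alignment(const char* name, void* (*fn)(void*)) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(fn);
    std::printf("%s starts at %#lx (mod 16 = %lu, mod 64 = %lu)\n",
                name,
                static_cast<unsigned long>(addr),
                static_cast<unsigned long>(offset_in_block(addr, 16)),
                static_cast<unsigned long>(offset_in_block(addr, 64)));
}
```

Calling `report_code_alignment("f1", f1);` and `report_code_alignment("f2", f2);` from main in both builds shows whether the functions moved at all; if they did, `objdump -d` on each binary gives the addresses of the loops themselves.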

Also, you might consider an alignment of 4096 (the typical VM page size); this will (may) reduce the number of TLB entries required for the data from 3 to 2.
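The 3-versus-2 figure can be checked with a little arithmetic. The sketch below assumes 4096-byte pages and uses a hypothetical helper, `pages_spanned`, to count how many pages the 46 x 128 = 5888 byte array touches for a given starting offset within a page.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Number of pages touched by `bytes` bytes starting at address `start`.
constexpr std::size_t pages_spanned(std::uintptr_t start, std::size_t bytes) {
    return bytes == 0 ? 0
                      : (start + bytes - 1) / kPageSize - start / kPageSize + 1;
}

// The array from the original post: 46 structs of 128 bytes each.
constexpr std::size_t kArrayBytes = 46 * 128;  // 5888 bytes

// Page-aligned start: the array fits in two pages.
static_assert(pages_spanned(0, kArrayBytes) == 2, "aligned case");
// 64-byte-aligned start late in a page (offset 3008): three pages.
static_assert(pages_spanned(3008, kArrayBytes) == 3, "unaligned case");
// 64-byte-aligned start early in a page (offset 64): still two pages.
static_assert(pages_spanned(64, kArrayBytes) == 2, "lucky unaligned case");
```

With only 64-byte alignment the array lands on two or three pages depending on where the linker places it, which is why the reduction is a "may". In the test program the page alignment could be requested with GCC's `__attribute__((aligned(4096)))` on the array declaration.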

Jim Dempsey

TimP · 09-22-2010

As you didn't say anything about it, I suppose you have default alternate sector prefetch enabled. If you happen to align to an even cache line boundary, each active thread will attempt to keep both cache lines up to date in cache. This acts like a milder version of false sharing. On some server platforms, there may be a BIOS setup option to disable this style of prefetch. For a desktop with no such option, you (with full administrator rights) should be able to turn it off (for everyone) via MSR, after each reboot. As you said, increasing the structure to keep the data separated by 2 cache lines would prevent the adverse effect.
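To make the pairing concrete, here is a small sketch of the address arithmetic. It assumes 64-byte cache lines and assumes that the prefetcher pairs lines into 128-byte aligned sectors; `same_line` and `same_sector` are hypothetical helper names, and the 256-byte case assumes the second thread's data then sits 128 bytes after the first thread's.

```cpp
#include <cstdint>

constexpr std::uintptr_t kLine = 64;     // cache line size
constexpr std::uintptr_t kSector = 128;  // assumed prefetch pair size

// True if both addresses fall in the same 64-byte cache line.
constexpr bool same_line(std::uintptr_t a, std::uintptr_t b) {
    return a / kLine == b / kLine;
}

// True if both addresses fall in the same 128-byte aligned pair of lines.
constexpr bool same_sector(std::uintptr_t a, std::uintptr_t b) {
    return a / kSector == b / kSector;
}

// Struct on an even cache line boundary (address 0 mod 128):
// thread 1 writes offset 0, thread 2 writes offset 64.
// Different lines, so no true false sharing, but the same sector.
static_assert(!same_line(0, 64) && same_sector(0, 64), "even boundary");

// Struct on an odd cache line boundary (address 64 mod 128):
// the two halves fall into different sectors.
static_assert(!same_sector(64, 128), "odd boundary");

// Data separated by two cache lines (assumed layout of the 256-byte struct).
static_assert(!same_sector(0, 128), "two-line separation");
```

Under these assumptions the two threads never write the same line, yet on an even boundary each thread's line is the prefetch partner of the other's.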

jimdempseyatthecove · 09-23-2010

This would be true except that the user has a struct of size 128 bytes (32 ints), and the user has an array of these structs. The performance varies greatly depending on the number of these structs in his array of structs (not the number of ints within each struct).

The relative cache line alignment is the same regardless of the size of array of structs.

This is not to say you are wrong about the prefetch, as the two tasks may be in lock-step (working in the same struct) or may be skewed (working in different structs), and the number of these structs alters the lock-step/skew situation. Running VTune or another profiler that detects cache line evictions would confirm or refute the hypothesis.

I would like to see his reply after running with the array of structs aligned on a 4096-byte boundary (i.e. to potentially reduce the number of TLB entries required to map the array of structs).

Jim Dempsey