a large array on the sum of the optimization problem

fyten1985 · ‎03-26-2010

This is a large array on the sum of the optimization problem.

There are two double type array, then the code for this problem is as follows:

#pragma omp parallel for

for (long i=0; i<5000000; i++)

{

array1 += array2;

}

My computer is "Dell PowerEdge 2900III 5U" with Xeon 5420 * 2 and 48G Memory.

And the OS is MS Windows Server 2003 R2 Enterprise x64 Edition sp2.

The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.

and then i used VC and IC compiled the program,the two result are basiclly the same.

and then i used the funtion of INTEL MKL 10.1 to compute,as follows:

cblas_daxpy(5000000, 1, array2, 1, array1, 1);

the performance of the program have no different.

and the i used other funtino of INTEL MKL 10.1:

vdAdd( n, a, b, y );

Program performance decreased significantly, and only about 80% of the original.

i would like to know what way to optimize this problem by enhancing program performance

TimP · ‎03-26-2010

Can you pick a single forum for this question?
If the performance of a vectorizing and a non-vectorizing OpenMP compiler are the same, and the extra operation in daxpy doesn't slow it down, it tends to confirm that memory performance is the main issue.
You might try 2 or 4 threads with Intel OpenMP library, and set kmp_affinity=scatter
Further, you might experiment with software prefetch to get 4KB pages of both arrays in advance.
Up to a year ago, your CPU had the best memory performance of any Intel CPU, but of course it isn't competive now.

bigknife · ‎03-27-2010

Hi, tim18,

How toprefetch 4KB pages of both arrays in advance?

Thanks!

Dmitry_Vyukov · ‎03-27-2010

Use _mm_prefetch(addr, _MM_HINT_NTA) from to do prefetch.
You may also consider storing data with non-temporal store instructions, this can reduce memory bus pressure. Check out _mm_stream_si128(addr, value) intrinsic.

TimP · ‎03-27-2010

nontemporal certainly should be considered for storing arrays which over-run last level cache. The arrays in this example aren't eligible, as both are read.

Dmitry_Vyukov · ‎03-27-2010

Tim, you are right. I missed that output array is read.
Actually, I not even sure about _mm_prefetch() with _MM_HINT_NTA. AFAIK, non-temporal prefetch done into L2 on modern Intel processors, but non-temporal data occupy at most 1 way in N-associative L2 cache to minimize pollution (info from the Optimization Reference Manual, search by PREFETCHNTA). So if it happens so that both arrays are equally aligned (which is quite likely since arrays are big), they will evict each other. Probably, _MM_HINT_T0 will be more appropriate. Humm... Tim, may you comment on this?

TimP · ‎03-27-2010

I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

Dmitry_Vyukov · ‎03-27-2010

> I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.

Thank you!

> A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active)

What is 'adjacent sector prefetch'? Is hardware prefetch current and next line on every software prefetch?

> A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

Btw, recursive C++ templates are good at unrolling loops. And note that PREFETCH does not cause segmentation/paging faults, so it's Ok to prefetch beyond the end of the arrays.

bigknife · ‎03-28-2010

Tim,

Can you provide a source-code solution of this problem to illustrate how to "unroll the loop and issue a single prefetch each time" ?

By the way, I would like to provide some details aboutIntel Xeon E5420:

Cores: 4;

Threads: 4;

L1 Data Cache: 4 x 32 K Bytes, 8-way;

L1 Instruct Cache: 4 x 32 K Bytes, 8-way;

L2 Cache: 2 x 6144 K Bytes, 24-way.

So the computer system is with 8 cores, and 8 threads are used in the computation.

Thank Tim very much!

AndDmitriy's comments and solution are quiteexpected.

Dmitry_Vyukov · ‎03-28-2010

My humble guess is:

[cpp]void sum(double* array1, double* array2, size_t size)
{
    size_t size0 = size / 8 * 8;
    for (size_t i = 0; i != size0; i += 8)
    {
        _mm_prefetch((char*)(array1 + i) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + i) + 4096, _MM_HINT_T0);

        array1[i + 0] += array2[i + 0];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
        array1[i + 4] += array2[i + 4];
        array1[i + 5] += array2[i + 5];
        array1[i + 6] += array2[i + 6];
        array1[i + 7] += array2[i + 7];
    }

    for (size_t i = size0; i != size; i += 1)
    {
        array1 += array2;
    }
}
[/cpp]

Dmitry_Vyukov · ‎03-28-2010

As for parallelization, the situation is more interesting.
Since you system is 2 processor and QPI-based, it's a NUMA system. I.e. each NUMA node (processor in your case) has dedicated memory bandwidth. So parallelization if done correctly must speed-up the program.
I believe that Win2k3 uses a so-called 'first-touch' NUMA allocation policy. So if you initialize your arrays in main thread, they will be placed on a single NUMA node, thus no speed-up from parallelization.
And what you need is as follows. Allocate first half of the arrays on first NUMA node, and second half of the arrays on second NUMA node. Then process first half on first NUMA node, and second half - on second NUMA node.
I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N). But if it's the case (likely) then you just need something along the lines of:

[cpp]// set kmp_affinity=scatter, it's important, it will distribute your threads across NUMA nodes

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size; i++)
    {
        // initialize array1 and array2
        array1 = i + 100;
        array2 = i + 200;
    }

    // ...

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size / 8; i++)
    {
        int j = i * 8;

        _mm_prefetch((char*)(array1 + j) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + j) + 4096, _MM_HINT_T0);

        array1[j + 0] += array2[j + 0];
        array1[j + 1] += array2[j + 1];
        array1[j + 2] += array2[j + 2];
        array1[j + 3] += array2[j + 3];
        array1[j + 4] += array2[j + 4];
        array1[j + 5] += array2[j + 5];
        array1[j + 6] += array2[j + 6];
        array1[j + 7] += array2[j + 7];
    }

    for (int i = size / 8 * 8; i != size; i += 1)
    {
        array1 += array2;
    }
[/cpp]

Tim, does what I am saying makes sense?

Dmitry_Vyukov · ‎03-28-2010

Ah, Ok, ignore everything I said. I've just consulted that Xeon 5420 is a FSB-based system:
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR

bigknife · ‎03-28-2010

Maybe you missed the OpenMP.

I think your code must be:

void sum(double*array1, double*array2, size_t size) 

{

 size_t size0=size/8*8; 

 #pragma omp parallel for

 for (size_t i=0;i!=size0;i+=8) 

{

 _mm_prefetch((char*)(array1+i)+4096,_MM_HINT_T0); 

 _mm_prefetch((char*)(array2+i)+4096,_MM_HINT_T0); 

 array1[i+0]+=array2[i+0]; 

 array1[i+1]+=array2[i+1]; 

 array1[i+2]+=array2[i+2]; 

 array1[i+3]+=array2[i+3]; 

 array1[i+4]+=array2[i+4]; 

 array1[i+5]+=array2[i+5]; 

 array1[i+6]+=array2[i+6]; 

 array1[i+7]+=array2[i+7]; 

}

 for (size_t i=size0;i!=size;i+=1) 

{

 array1+=array2; 

}

}

bigknife · ‎03-28-2010

Quoting Dmitriy Vyukov

Ah, Ok, ignore everything I said. I've just consulted that Xeon 5420 is a FSB-based system:
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR

Then, what should we do?

Dmitry_Vyukov · ‎03-28-2010

Then there is nothing we can do with the program as-is. You hit the memory bandwidth wall, the system is just incapable of shuffling more memory per a unit of time.

However, you can try to reduce memory bandwidth requirements. The easiest thing is to switch to floats (if it's acceptable of course). You can also try to combine some phases of your application. For example, if you initialize, process and output the arrays in a single loop, then the amount of computations per a unit of memory will be much higher.

bigknife · ‎03-28-2010

Quoting tim18

I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

I was told that the VC++ Compiler2008 and Intel C++ Compiler11.0 can do unrolling loop very well.

And I did do some experiments to test it. The code is just like

{

size_t size0 = size / 8 * 8;

# pragma omp parallel for

for (size_t i = 0; i != size0; i += 8)

{

array1[i + 0] += array2[i + 0];

array1[i + 1] += array2[i + 1];

array1[i + 2] += array2[i + 2];

array1[i + 3] += array2[i + 3];

array1[i + 4] += array2[i + 4];

array1[i + 5] += array2[i + 5];

array1[i + 6] += array2[i + 6];

array1[i + 7] += array2[i + 7];

}

for (size_t i = size0; i != size; i += 1)

{

array1 += array2;

}

I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

Dmitry_Vyukov · ‎03-28-2010

> I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

The point is that you need to insert prefetches into every 8-th iteration. You need an additional 'if' for that.
Probably the compiler will unroll the loop and then eliminate additional if's. But if you do that manually you are on the safer side.

bigknife · ‎03-28-2010

Quoting Dmitriy Vyukov

> I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

The point is that you need to insert prefetches into every 8-th iteration. You need an additional 'if' for that.
Probably the compiler will unroll the loop and then eliminate additional if's. But if you do that manually you are on the safer side.

Probably you are right. But how to add the 'if' ?

Dmitry_Vyukov · ‎03-28-2010

for (size_t i = 0; i != size; i += 1)

{
if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);

array1 += array2;

}

bigknife · ‎03-28-2010

Quoting Dmitriy Vyukov

for (size_t i = 0; i != size; i += 1)

{
if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);

array1 += array2;

}

Then, explicit "unrolling the loop" is unnecessary?

Do you believe that we would get a big perfomanceimprovement in that way?

Dmitry_Vyukov · ‎03-28-2010

It depends.
It depends on what you consider "big improvement", what compiler you are using, how much do you want to depend on implementation details of the compiler.
If performance is important for you, then if you do manual loop unrolling you are on a safer side. However, a lot of developers do not borther themselfs with such things at all (parallelization, prefetching, etc).