Community
cancel
Showing results for 
Search instead for 
Did you mean: 
fyten1985
Beginner
159 Views

a large array on the sum of the optimization problem

This is a large array on the sum of the optimization problem.

There are two double type array, then the code for this problem is as follows:

#pragma omp parallel for

for (long i=0; i<5000000; i++)

{

array1 += array2;

}

My computer is "Dell PowerEdge 2900III 5U" with Xeon 5420 * 2 and 48G Memory.

And the OS is MS Windows Server 2003 R2 Enterprise x64 Edition sp2.

The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.

and then i used VC and IC compiled the program,the two result are basiclly the same.

and then i used the funtion of INTEL MKL 10.1 to compute,as follows:

cblas_daxpy(5000000, 1, array2, 1, array1, 1);

the performance of the program have no different.

and the i used other funtino of INTEL MKL 10.1:

vdAdd( n, a, b, y );

Program performance decreased significantly, and only about 80% of the original.

i would like to know what way to optimize this problem by enhancing program performance

0 Kudos
35 Replies
TimP
Black Belt
104 Views

Can you pick a single forum for this question?
If the performance of a vectorizing and a non-vectorizing OpenMP compiler are the same, and the extra operation in daxpy doesn't slow it down, it tends to confirm that memory performance is the main issue.
You might try 2 or 4 threads with Intel OpenMP library, and set kmp_affinity=scatter
Further, you might experiment with software prefetch to get 4KB pages of both arrays in advance.
Up to a year ago, your CPU had the best memory performance of any Intel CPU, but of course it isn't competive now.
bigknife
Beginner
104 Views

Hi, tim18,
How toprefetch 4KB pages of both arrays in advance?
Thanks!
Dmitry_Vyukov
Valued Contributor I
104 Views

Use _mm_prefetch(addr, _MM_HINT_NTA) from to do prefetch.
You may also consider storing data with non-temporal store instructions, this can reduce memory bus pressure. Check out _mm_stream_si128(addr, value) intrinsic.

TimP
Black Belt
104 Views

nontemporal certainly should be considered for storing arrays which over-run last level cache. The arrays in this example aren't eligible, as both are read.
Dmitry_Vyukov
Valued Contributor I
104 Views

Tim, you are right. I missed that output array is read.
Actually, I not even sure about _mm_prefetch() with _MM_HINT_NTA. AFAIK, non-temporal prefetch done into L2 on modern Intel processors, but non-temporal data occupy at most 1 way in N-associative L2 cache to minimize pollution (info from the Optimization Reference Manual, search by PREFETCHNTA). So if it happens so that both arrays are equally aligned (which is quite likely since arrays are big), they will evict each other. Probably, _MM_HINT_T0 will be more appropriate. Humm... Tim, may you comment on this?

TimP
Black Belt
104 Views

I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.
Dmitry_Vyukov
Valued Contributor I
104 Views

> I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.

Thank you!

> A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active)

What is 'adjacent sector prefetch'? Is hardware prefetch current and next line on every software prefetch?

> A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

Btw, recursive C++ templates are good at unrolling loops. And note that PREFETCH does not cause segmentation/paging faults, so it's Ok to prefetch beyond the end of the arrays.
bigknife
Beginner
104 Views

Tim,
Can you provide a source-code solution of this problem to illustrate how to "unroll the loop and issue a single prefetch each time" ?
By the way, I would like to provide some details aboutIntel Xeon E5420:
Cores: 4;
Threads: 4;
L1 Data Cache: 4 x 32 K Bytes, 8-way;
L1 Instruct Cache: 4 x 32 K Bytes, 8-way;
L2 Cache: 2 x 6144 K Bytes, 24-way.
So the computer system is with 8 cores, and 8 threads are used in the computation.
Thank Tim very much!
AndDmitriy's comments and solution are quiteexpected.
Dmitry_Vyukov
Valued Contributor I
104 Views

My humble guess is:

[cpp]void sum(double* array1, double* array2, size_t size)
{
    size_t size0 = size / 8 * 8;
    for (size_t i = 0; i != size0; i += 8)
    {
        _mm_prefetch((char*)(array1 + i) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + i) + 4096, _MM_HINT_T0);

        array1[i + 0] += array2[i + 0];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
        array1[i + 4] += array2[i + 4];
        array1[i + 5] += array2[i + 5];
        array1[i + 6] += array2[i + 6];
        array1[i + 7] += array2[i + 7];
    }

    for (size_t i = size0; i != size; i += 1)
    {
        array1 += array2;
    }
}
[/cpp]

Dmitry_Vyukov
Valued Contributor I
104 Views

As for parallelization, the situation is more interesting.
Since you system is 2 processor and QPI-based, it's a NUMA system. I.e. each NUMA node (processor in your case) has dedicated memory bandwidth. So parallelization if done correctly must speed-up the program.
I believe that Win2k3 uses a so-called 'first-touch' NUMA allocation policy. So if you initialize your arrays in main thread, they will be placed on a single NUMA node, thus no speed-up from parallelization.
And what you need is as follows. Allocate first half of the arrays on first NUMA node, and second half of the arrays on second NUMA node. Then process first half on first NUMA node, and second half - on second NUMA node.
I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N). But if it's the case (likely) then you just need something along the lines of:

[cpp]// set kmp_affinity=scatter, it's important, it will distribute your threads across NUMA nodes

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size; i++)
    {
        // initialize array1 and array2
        array1 = i + 100;
        array2 = i + 200;
    }

    // ...

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size / 8; i++)
    {
        int j = i * 8;

        _mm_prefetch((char*)(array1 + j) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + j) + 4096, _MM_HINT_T0);

        array1[j + 0] += array2[j + 0];
        array1[j + 1] += array2[j + 1];
        array1[j + 2] += array2[j + 2];
        array1[j + 3] += array2[j + 3];
        array1[j + 4] += array2[j + 4];
        array1[j + 5] += array2[j + 5];
        array1[j + 6] += array2[j + 6];
        array1[j + 7] += array2[j + 7];
    }

    for (int i = size / 8 * 8; i != size; i += 1)
    {
        array1 += array2;
    }
[/cpp]

Tim, does what I am saying makes sense?

Dmitry_Vyukov
Valued Contributor I
104 Views

Ah, Ok, ignore everything I said. I've just consulted that Xeon 5420 is a FSB-based system:
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR
bigknife
Beginner
104 Views

Maybe you missed the OpenMP.
I think your code must be:
  • void sum(double*array1, double*array2, size_t size)
  • {
  • size_t size0=size/8*8;
  • #pragma omp parallel for
  • for (size_t i=0;i!=size0;i+=8)
  • {
  • _mm_prefetch((char*)(array1+i)+4096,_MM_HINT_T0);
  • _mm_prefetch((char*)(array2+i)+4096,_MM_HINT_T0);
  • array1[i+0]+=array2[i+0];
  • array1[i+1]+=array2[i+1];
  • array1[i+2]+=array2[i+2];
  • array1[i+3]+=array2[i+3];
  • array1[i+4]+=array2[i+4];
  • array1[i+5]+=array2[i+5];
  • array1[i+6]+=array2[i+6];
  • array1[i+7]+=array2[i+7];
  • }
  • for (size_t i=size0;i!=size;i+=1)
  • {
  • array1+=array2;
  • }
  • }
  • bigknife
    Beginner
    104 Views

    Ah, Ok, ignore everything I said. I've just consulted that Xeon 5420 is a FSB-based system:
    http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR

    Then, what should we do?

    Dmitry_Vyukov
    Valued Contributor I
    104 Views

    Then there is nothing we can do with the program as-is. You hit the memory bandwidth wall, the system is just incapable of shuffling more memory per a unit of time.

    However, you can try to reduce memory bandwidth requirements. The easiest thing is to switch to floats (if it's acceptable of course). You can also try to combine some phases of your application. For example, if you initialize, process and output the arrays in a single loop, then the amount of computations per a unit of memory will be much higher.
    bigknife
    Beginner
    104 Views

    Quoting tim18
    I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
    A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

    I was told that the VC++ Compiler2008 and Intel C++ Compiler11.0 can do unrolling loop very well.

    And I did do some experiments to test it. The code is just like

    {
    size_t size0 = size / 8 * 8;
    # pragma omp parallel for
    for (size_t i = 0; i != size0; i += 8)
    {
    array1[i + 0] += array2[i + 0];
    array1[i + 1] += array2[i + 1];
    array1[i + 2] += array2[i + 2];
    array1[i + 3] += array2[i + 3];
    array1[i + 4] += array2[i + 4];
    array1[i + 5] += array2[i + 5];
    array1[i + 6] += array2[i + 6];
    array1[i + 7] += array2[i + 7];
    }
    for (size_t i = size0; i != size; i += 1)
    {
    array1 += array2;
    }
    }
    I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

    Dmitry_Vyukov
    Valued Contributor I
    104 Views

    > I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

    The point is that you need to insert prefetches into every 8-th iteration. You need an additional 'if' for that.
    Probably the compiler will unroll the loop and then eliminate additional if's. But if you do that manually you are on the safer side.
    bigknife
    Beginner
    104 Views

    > I even changed the step from 8 to 16 and 32, but theexperimental results showed that it made a poor improvement.

    The point is that you need to insert prefetches into every 8-th iteration. You need an additional 'if' for that.
    Probably the compiler will unroll the loop and then eliminate additional if's. But if you do that manually you are on the safer side.

    Probably you are right. But how to add the 'if' ?

    Dmitry_Vyukov
    Valued Contributor I
    104 Views

    for (size_t i = 0; i != size; i += 1)
    {
    if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);
    array1 += array2;
    }
    bigknife
    Beginner
    104 Views

    for (size_t i = 0; i != size; i += 1)
    {
    if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);
    array1 += array2;
    }

    Then, explicit "unrolling the loop" is unnecessary?

    Do you believe that we would get a big perfomanceimprovement in that way?

    Dmitry_Vyukov
    Valued Contributor I
    57 Views

    It depends.
    It depends on what you consider "big improvement", what compiler you are using, how much do you want to depend on implementation details of the compiler.
    If performance is important for you, then if you do manual loop unrolling you are on a safer side. However, a lot of developers do not borther themselfs with such things at all (parallelization, prefetching, etc).
    Reply