This is an optimization problem for summing large arrays.
There are two double-type arrays, and the code for the problem is as follows:
#pragma omp parallel for
for (long i = 0; i < 5000000; i++)
{
    array1[i] += array2[i];
}
My computer is a Dell PowerEdge 2900 III 5U with two Xeon 5420 processors and 48 GB of memory.
The OS is MS Windows Server 2003 R2 Enterprise x64 Edition SP2.
The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.
I compiled the program with both VC and IC, and the two results are basically the same.
Then I used a function from Intel MKL 10.1 to do the computation, as follows:
cblas_daxpy(5000000, 1, array2, 1, array1, 1);
The performance of the program was no different.
Then I used another Intel MKL 10.1 function:
vdAdd( n, a, b, y );
Program performance decreased significantly, to only about 80% of the original.
I would like to know how to optimize this problem and improve the program's performance.
If the performance of a vectorizing and a non-vectorizing OpenMP compiler are the same, and the extra operation in daxpy doesn't slow it down, it tends to confirm that memory performance is the main issue.
You might try 2 or 4 threads with the Intel OpenMP library, and set KMP_AFFINITY=scatter.
Further, you might experiment with software prefetch to get 4KB pages of both arrays in advance.
Up to a year ago, your CPU had the best memory performance of any Intel CPU, but of course it isn't competitive now.
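On Windows, the thread count and the Intel OpenMP affinity setting suggested above can be set from the environment before launching the program. A minimal sketch (`my_program.exe` is a placeholder name; `KMP_AFFINITY` is specific to the Intel OpenMP runtime):

```shell
set OMP_NUM_THREADS=4
set KMP_AFFINITY=scatter
my_program.exe
```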
You may also consider storing data with non-temporal store instructions; this can reduce memory-bus pressure. Check out the _mm_stream_si128(addr, value) intrinsic.
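A sketch of what this could look like for the summation loop, using the double-precision counterpart _mm_stream_pd rather than _mm_stream_si128 (the function name `sum_nt` is made up for illustration; it assumes both arrays are 16-byte aligned and `size` is a multiple of 2):

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_pd, _mm_add_pd, _mm_stream_pd
#include <cstddef>

// Non-temporal variant of the summation loop: the stores to array1 bypass
// the cache, so the written lines do not compete with the loaded ones.
void sum_nt(double* array1, const double* array2, size_t size)
{
    for (size_t i = 0; i < size; i += 2) {
        __m128d a = _mm_load_pd(array1 + i);
        __m128d b = _mm_load_pd(array2 + i);
        _mm_stream_pd(array1 + i, _mm_add_pd(a, b));
    }
    _mm_sfence();  // make the streamed stores globally visible
}
```

Note that streaming stores only pay off when the destination will not be read again soon; here array1 is both read and written, so this needs measuring.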
Actually, I'm not even sure about _mm_prefetch() with _MM_HINT_NTA. AFAIK, non-temporal prefetches are done into L2 on modern Intel processors, but non-temporal data occupies at most one way of the N-way-associative L2 cache to minimize pollution (info from the Optimization Reference Manual; search for PREFETCHNTA). So if it happens that both arrays are equally aligned (which is quite likely, since the arrays are big), they will evict each other. Probably _MM_HINT_T0 will be more appropriate. Hmm... Tim, can you comment on this?
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.
Thank you!
> A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active)
What is 'adjacent sector prefetch'? Does the hardware prefetch the current and the next line on every software prefetch?
> A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.
Btw, recursive C++ templates are good at unrolling loops. And note that PREFETCH does not cause segmentation/paging faults, so it's OK to prefetch beyond the end of the arrays.
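The "recursive template" unrolling mentioned above can be sketched like this (the names `Unroll` and `sum_unrolled` are made up for illustration; any compiler with inlining will flatten the recursion into eight straight-line additions):

```cpp
#include <cstddef>

// Unroll<N>::add emits N copies of the loop body at compile time.
template<size_t N>
struct Unroll {
    static void add(double* a, const double* b, size_t i) {
        a[i] += b[i];
        Unroll<N - 1>::add(a, b, i + 1);
    }
};

template<>
struct Unroll<0> {
    static void add(double*, const double*, size_t) {}  // recursion base
};

// Process 8 elements per iteration without writing the body 8 times,
// then handle the remainder with a plain loop.
void sum_unrolled(double* a, const double* b, size_t size)
{
    size_t i = 0;
    for (; i + 8 <= size; i += 8)
        Unroll<8>::add(a, b, i);
    for (; i < size; ++i)
        a[i] += b[i];
}
```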
[cpp]void sum(double* array1, double* array2, size_t size)
{
    size_t size0 = size / 8 * 8;
    for (size_t i = 0; i != size0; i += 8)
    {
        _mm_prefetch((char*)(array1 + i) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + i) + 4096, _MM_HINT_T0);
        array1[i + 0] += array2[i + 0];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
        array1[i + 4] += array2[i + 4];
        array1[i + 5] += array2[i + 5];
        array1[i + 6] += array2[i + 6];
        array1[i + 7] += array2[i + 7];
    }
    for (size_t i = size0; i != size; i += 1)
    {
        array1[i] += array2[i];
    }
}[/cpp]
Since your system is two-processor and QPI-based, it's a NUMA system. I.e., each NUMA node (a processor, in your case) has dedicated memory bandwidth. So parallelization, if done correctly, must speed up the program.
I believe that Win2k3 uses a so-called 'first-touch' NUMA allocation policy. So if you initialize your arrays in the main thread, they will be placed on a single NUMA node, and thus there will be no speed-up from parallelization.
What you need is as follows. Allocate the first half of the arrays on the first NUMA node and the second half on the second NUMA node. Then process the first half on the first NUMA node, and the second half on the second NUMA node.
I am not sure whether OpenMP guarantees stable binding of threads to data (i.e., given a static schedule, thread 0 always processes indexes 0 to N/2 and thread 1 indexes N/2 to N). But if that's the case (likely), then you just need something along the lines of:
[cpp]// set KMP_AFFINITY=scatter -- it's important, it will distribute your threads across NUMA nodes
#pragma omp parallel for schedule(static) num_threads(2)
for (int i = 0; i < size; i++)
{
    // initialize array1 and array2
    array1[i] = i + 100;
    array2[i] = i + 200;
}
// ...
#pragma omp parallel for schedule(static) num_threads(2)
for (int i = 0; i < size / 8; i++)
{
    int j = i * 8;
    _mm_prefetch((char*)(array1 + j) + 4096, _MM_HINT_T0);
    _mm_prefetch((char*)(array2 + j) + 4096, _MM_HINT_T0);
    array1[j + 0] += array2[j + 0];
    array1[j + 1] += array2[j + 1];
    array1[j + 2] += array2[j + 2];
    array1[j + 3] += array2[j + 3];
    array1[j + 4] += array2[j + 4];
    array1[j + 5] += array2[j + 5];
    array1[j + 6] += array2[j + 6];
    array1[j + 7] += array2[j + 7];
}
for (int i = size / 8 * 8; i != size; i += 1)
{
    array1[i] += array2[i];
}[/cpp]
Tim, does what I am saying make sense?
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR
Then, what should we do?
However, you can try to reduce the memory-bandwidth requirements. The easiest thing is to switch to floats (if that's acceptable, of course). You can also try to combine some phases of your application. For example, if you initialize, process, and output the arrays in a single loop, the amount of computation per unit of memory traffic will be much higher.
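The phase-combining idea can be sketched as follows, reusing the initialization values (i + 100, i + 200) from the earlier example in this thread (the function names are made up for illustration):

```cpp
#include <cstddef>

// Separate phases: two full passes over memory.
void init_then_add(double* a, double* b, size_t n)
{
    for (size_t i = 0; i < n; ++i) { a[i] = i + 100.0; b[i] = i + 200.0; }
    for (size_t i = 0; i < n; ++i) { a[i] += b[i]; }
}

// Fused: one pass, and the second array never needs to touch memory at all.
void fused(double* a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = (i + 100.0) + (i + 200.0);
}
```

In the fused version each element is written exactly once, so the memory traffic roughly halves compared with the two-phase version.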
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.
I was told that the VC++ 2008 and Intel C++ 11.0 compilers can unroll loops very well.
And I did some experiments to test it. The code is just like
The point is that you need to issue prefetches only on every 8th iteration. You need an additional 'if' for that.
Probably the compiler will unroll the loop and then eliminate the additional ifs. But if you do it manually, you are on the safer side.
The point is that you need to issue prefetches only on every 8th iteration. You need an additional 'if' for that.
Probably the compiler will unroll the loop and then eliminate the additional ifs. But if you do it manually, you are on the safer side.
Probably you are right. But how do I add the 'if'?
if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);
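Put into a complete loop, the suggestion looks like this (function name `sum_prefetch` is made up for illustration; 8 doubles fill one 64-byte cache line, and whether the compiler keeps or strips the branch after its own unrolling is implementation-dependent):

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Non-unrolled summation loop with a conditional prefetch on every
// 8th iteration, one cache line (4 KB) ahead of the current position.
// Prefetching past the end of the arrays is harmless: PREFETCH never faults.
void sum_prefetch(double* array1, const double* array2, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        if ((i % 8) == 0) {
            _mm_prefetch((const char*)(array1 + i) + 4096, _MM_HINT_T0);
            _mm_prefetch((const char*)(array2 + i) + 4096, _MM_HINT_T0);
        }
        array1[i] += array2[i];
    }
}
```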
if ((i % 8) == 0) _mm_prefetch((char*)(array + i) + 4096, ...);
Then is explicit loop unrolling unnecessary?
Do you believe we would get a big performance improvement that way?
It depends on what you consider a "big improvement", what compiler you are using, and how much you want to depend on implementation details of the compiler.
If performance is important to you, then doing manual loop unrolling puts you on the safer side. However, a lot of developers do not bother themselves with such things at all (parallelization, prefetching, etc.).
