
When I sum elements in an array like this:

for(i...N; i+=16) {
    sumA += array[i]
    sumB += array[i+4]
    sumC += array[i+8]
    sumD += array[i+12]
}

why is it not faster when compared to:

for(i...N; i+=4) { sum += array[i]; } ?

It is actually a bit slower. I was under the impression that this instruction-level parallelism should speed it up. I checked it using the Intel Architecture Code Analyzer and the performance-critical path is about the same (while the second version sums only one float4 at a time), so I would expect the first version to be faster. Can someone explain why my assumption is wrong and what is limiting the performance?

I have been testing it on 06_17H, compiled in both 32 and 64 bits by VS2010; the memory of the array is locked to get rid of page faults, and prefetches are in place.




Jim Dempsey


So inside the loop I do this for every sum variable:

sumA = _mm_add_ps(sumA, _mm_load_ps(&elements[i]));

And I sum the temp sums after the loop finishes.


Your loop should require 4 temps for the fetch of the elements and 4 temps for sumA, ...

However, this is an AVX forum, so consider using _mm256_add_ps(...), where sumA, ... are __m256 types.

IOW, each instruction adds 8 floats (change the loop to use +8).

Jim Dempsey


Jim, I am aware of the fact that this is an AVX forum; however, I have not found one that is more suitable. Please let me know if there is one. The compiler has all optimizations enabled.

Tim, I apologize if it is not clear what I am trying to achieve. I am just trying to understand why I don't experience any speedup if I unroll the loop and use more temporary variables. Furthermore, I am adding 4 elements to each sum; since it doesn't seem to be very clear, I am including the code.

__m128 Sums[4];

for(unsigned int i=0;i<4;i++) { Sums[i] = _mm_set_ps(0.f, 0.f, 0.f, 0.f); }

for(unsigned int i=0;i<N;i+=16) {
    Sums[0] = _mm_add_ps(Sums[0], _mm_load_ps(&pBuffer[i]));
    Sums[1] = _mm_add_ps(Sums[1], _mm_load_ps(&pBuffer[i+4]));
    Sums[2] = _mm_add_ps(Sums[2], _mm_load_ps(&pBuffer[i+8]));
    Sums[3] = _mm_add_ps(Sums[3], _mm_load_ps(&pBuffer[i+12]));
}

Sums[0] = _mm_add_ps(Sums[0], Sums[2]);
Sums[1] = _mm_add_ps(Sums[1], Sums[3]);
Sums[0] = _mm_add_ps(Sums[0], Sums[1]);

float Sum = 0.f;
for(unsigned int j=0;j<4;j++){
    Sum += Sums[0].m128_f32[j];
}

And the loop compiles into this:

000D1A70 addps xmm3,xmmword ptr [eax-20h]

000D1A74 addps xmm2,xmmword ptr [eax-10h]

000D1A78 addps xmm1,xmmword ptr [eax]

000D1A7B addps xmm0,xmmword ptr [eax+10h]

000D1A7F add eax,40h

000D1A82 dec ecx

000D1A83 jne 000D1A70

Which is what I would expect, so I did not try to compile it with the Intel compiler. So if I compare this unrolled version to the one with one sum variable, I can indeed see the dependency path is shorter (as I mentioned before, they are +- the same, but the unrolled version sums four vectors in one iteration of the loop). This corresponds with Intel's presentations and recommendations about instruction-level parallelism (e.g. GDC 2011).

However, despite this fact, it is not faster (it is slightly slower). And I would expect it to be faster, since the load should take about 5 cycles, the latency of ADDPS is 3 cycles, and these dependencies should be partly avoided in the unrolled version. Finally, there is not any speedup even with only 2 parallel sums.


Also, are you sure that you initialize the array properly before the test? If there are NaN values or denormals in the data, it can completely skew the timings.

Just a detail: since you are targeting a Penryn-generation CPU, I'd advise replacing the sequence dec ecx, jne with something else allowing macro-fusion.


The array is filled with random <0,1> floats.

Thank you for the question about the array size! I should have realized that and tested it with various sizes. When the data set is small enough to fit into the L1 data cache, the unrolled version really runs about 2 times faster than the single-sum version. When the array fits into L2, it is about 20-25% faster. When the array does not fit into L2 (my CPU doesn't have an L3 cache), both versions run about the same.

Thank you very much for your help. I am just wondering: should I not be able to gain the speedup even when the array is bigger than the cache size, since I prefetch the data (I use prefetchnta, +- 400 cycles ahead)? Why does the speedup depend on the size of the array? Is it some overhead for managing data in the caches?


As your experiments show, when you don't have cache locality, you are likely to find that extra unrolling produces no performance gain, and hardware prefetch may work as well as software prefetch. If you are fetching data at the full rate of which the memory system is capable, those optimizations for cached data aren't effective.


eax is being incremented by 0x20, whereas each of the four instructions above handles a 0x10-byte chunk of the data.

Have you verified that the two loops being compared generate the correct results?


Tim, I am not sure I understand that (or I don't understand what you are trying to say). I cannot see the difference between the small data set and the large one. The data is not in the cache in either case, and it has to get there. So if I prefetch it by software/hardware prefetch, it should be there. The only difference I can see is that if the data set is large, some data must be evicted from the cache. However, since the data is not modified, it should not be copied back or anything like that (as far as I know). Thus I believe the speedup should be the same in both cases. Now that I know it is cache-related, I will run it through a profiler and see what it says.


Also, consider looking at the code generated by the Intel compiler.

The disassembly you posted for your _mm... loop looks as if it is optimized for size as opposed to speed. Use of temp registers might hide some of the fetch latencies.

Jim Dempsey


When you initialize the array, the data is brought into the cache and probably kept in the L1D for small arrays (if you are not forcing a full cache flush between the initialization and each test (*1)).

Anyway, the way you explain it makes me think that you conduct the test only once per array. The timings should be very imprecise for small arrays, I guess: not only may some data already be in some cache level, but the resolution of your timer is probably not much better than the duration of the full run.

For accurate timings I'd suggest running each test several times with the same buffer: for example, one pass to warm up the data (load it into the caches), then timing the same inner loop with RDTSC 10,000x or more. For the case where the entire array fits in the L1D you'll be able to really compare the performance of your variants; otherwise it's more or less an L2/LLC/RAM bandwidth test.

Also, for reproducible timings you should disable SpeedStep in the BIOS (and disable Turbo for Nehalem/Sandy Bridge targets).

*1: a classical issue goes as follow :

test1(array);

test2(array);

// test2 measures as a lot faster than test1

test2(array);

test1(array);

// hey, now test1 is the fastest!
