I have the code snippet below:
```cpp
BYTE *pD, *pY, *pU, *pV;   // planar Y/U/V sources, packed destination
int n;
// Interleave into packed YUYV, using every other U/V byte.
for (n = 0; n < size; n++) {
    *pD++ = *pY++;
    *pD++ = *pU++; pU++;   // skip the odd U byte
    *pD++ = *pY++;
    *pD++ = *pV++; pV++;   // skip the odd V byte
}
```
I tried to rewrite it with SSE2, but I do not know how to combine bytes with SSE. Any help?
BTW, I tested the following simple SSE code but found that the CPU usage of my program got worse, from 80% to 95%. I had thought that using SSE would lower CPU usage, and the speed improved only a little. Where are things going wrong? I am running on a Windows 8 Core i7 ultrabook.
```cpp
int n;
__m128i tmp;
// Test only: copies the Y plane into every other 16-byte block of the
// destination and merely advances the other pointers.
for (n = 0; n < size; n += 16) {
    tmp = _mm_load_si128((__m128i *)pY);
    _mm_store_si128((__m128i *)pD, tmp);
    pD += 32; pY += 16; pU += 8; pV += 8;
}
```
---
It seems to me that your code copies from one memory location to another.
There is no computation or anything else. I think I heard that the speed of copy actions can't be improved by SSE very much.
What task do you want to achieve? Is it only copying?
---
It copies from three memory sources to one destination. It is actually part of a YUV-to-RGB color-space conversion. The SSE code above is not a translation of the C code yet; it is just for testing. I hope SSE or AVX can greatly improve the copy performance.
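For reference, combining bytes like this maps naturally onto the SSE2 unpack instructions. A minimal sketch, assuming 16-byte-aligned 4:2:2 planar input (one U and one V byte per two Y bytes); the original loop instead skips every other U/V byte, which would need one extra packing step:
```cpp
#include <emmintrin.h>   // SSE2

typedef unsigned char BYTE;

// Sketch: interleave planar Y/U/V into packed YUYV with SSE2 byte unpacks.
// Assumes aligned buffers and 'width' a multiple of 16 Y samples.
void interleave_yuyv_sse2(BYTE *pD, const BYTE *pY,
                          const BYTE *pU, const BYTE *pV, int width)
{
    for (int n = 0; n < width; n += 16) {
        __m128i y  = _mm_load_si128((const __m128i *)(pY + n));       // Y0..Y15
        __m128i u  = _mm_loadl_epi64((const __m128i *)(pU + n / 2));  // U0..U7
        __m128i v  = _mm_loadl_epi64((const __m128i *)(pV + n / 2));  // V0..V7
        __m128i uv = _mm_unpacklo_epi8(u, v);                         // U0 V0 U1 V1 ...
        _mm_store_si128((__m128i *)(pD + 2 * n),      _mm_unpacklo_epi8(y, uv)); // Y0 U0 Y1 V0 ...
        _mm_store_si128((__m128i *)(pD + 2 * n + 16), _mm_unpackhi_epi8(y, uv)); // Y8 U4 Y9 V4 ...
    }
}
```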
---
Depending on the context, you should consider adding the restrict qualifier (BYTE *restrict ...) and aligning the destination. This might be done more efficiently with SSE4.1 or AVX, but might still require #pragma vector always. If the loop is long enough, or is executed by enough threads, to take advantage of nontemporal stores, you would need to specify that, e.g. #pragma vector aligned nontemporal.
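A minimal sketch of that suggestion, assuming the Intel C++ compiler (where the restrict keyword and #pragma vector are compiler extensions; restrict may need /Qrestrict); the function name and index form are illustrative:
```cpp
typedef unsigned char BYTE;

// restrict promises the four buffers do not overlap, and the pragma asks
// the Intel compiler for aligned nontemporal vector stores.
void pack_yuyv(BYTE *restrict pD, const BYTE *restrict pY,
               const BYTE *restrict pU, const BYTE *restrict pV, int size)
{
    #pragma vector aligned nontemporal
    for (int n = 0; n < size; n++) {
        pD[4*n]     = pY[2*n];
        pD[4*n + 1] = pU[2*n];     // every other U byte, as in the original
        pD[4*n + 2] = pY[2*n + 1];
        pD[4*n + 3] = pV[2*n];     // every other V byte
    }
}
```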
---
>>>I'll give you a number and it is ~9%. Is it significant or not? I think that for real-time applications even a 0.5% improvement could be considered as a very good thing ( sorry for a small deviation... ).>>>
Yes, for example in video rendering.
---
Sergey Kostrov wrote:
>>...I heard that the speed of copy actions can't be improved by SSE very much...
I'll give you a number and it is ~9%. Is it significant or not? I think that for real-time applications even a 0.5% improvement could be considered as a very good thing ( sorry for a small deviation... ).
Please take a look at some recent results which demonstrate how a correct application of _mm_prefetch improves performance of copy operations: http://software.intel.com/en-us/forums/topic/352880
Ah, this is quite interesting. I did not know about the prefetch operations and that they improve performance.
By "not very much" I was referring to SSE speedups of 2x or 4x (double or float), which most of the time can only be reached when computations are involved.
---
To TimP (Intel):
"Depending on the context, you should consider adding the restrict qualifier (BYTE *restrict ...) and aligning the destination. This might be done more efficiently with SSE4.1 or AVX, but might still require #pragma vector always. If the loop is long enough, or is executed by enough threads, to take advantage of nontemporal stores, you would need to specify that, e.g. #pragma vector aligned nontemporal."
Your suggestion of using restrict is good and I have applied it. I also tested the alignment and found that one source was not aligned; I then found a way to eliminate the unaligned pointer, which actually brought an improvement. "#pragma vector aligned nontemporal" is not accepted by my compiler.
To Sergey Kostrov:
"- If your data set is greater than 256KB a data prefetching, if properly applied (!), could improve performance ( please see Intel Software Optimization Manual )
- Verify in a Debugger that a C++ operator ( __m128i * ) is not used when passing pointers to data for intrinsic functions
- Usage of another pair of intrinsic functions _mm_stream_ps and _mm_load_ps could outperform your current SSE implementation"
I tried prefetch but saw no obvious improvement. I checked the generated assembly: there are only the two expected instructions, a load and a store, so no problem there.
By replacing _mm_store_si128 with _mm_stream_si128 I measured a 0.5% improvement.
In summary, I have tried AVX code but the result is similar to SSE2. When I removed all the copy code and did no copying at all, performance improved by 25%, so this kernel is critical. Unfortunately, SSE2/AVX so far do no better than the original C++ code.
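The _mm_store_si128 to _mm_stream_si128 swap mentioned above amounts to something like this minimal sketch; nontemporal stores bypass the cache and need an _mm_sfence() afterwards for ordering:
```cpp
#include <emmintrin.h>

// Copy with nontemporal stores: _mm_stream_si128 (MOVNTDQ) writes around
// the cache, which helps when the destination won't be read again soon.
void copy_stream(unsigned char *dst, const unsigned char *src, int size)
{
    for (int n = 0; n < size; n += 16) {
        __m128i t = _mm_load_si128((const __m128i *)(src + n));
        _mm_stream_si128((__m128i *)(dst + n), t);
    }
    _mm_sfence();   // make the streaming stores globally visible
}
```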
---
I tried the _mm_prefetch example after reading the Intel 64 and IA-32 Architectures Optimization Reference Manual, but the result was not satisfactory.
Like memcpy, my code processes non-temporal data, so the streaming computation can be pipelined. I made a new attempt and gained a 10% overall performance increase; the kernel itself may have sped up by as much as 5x. Here are the key points:
1. Fill a cache line per block (see the sketch after this list).
This requires MOVNTDQA, introduced with SSE4.1. I am not sure of the cache-line size, but four MOVNTDQA loads (64 bytes) can fill a cache line, so no prefetch is needed. The difference is that an L1 access may take only ~3 cycles versus perhaps 18 cycles for a DRAM access, so it is roughly a 6x speedup.
2. Use assembly instead of C.
Reading the assembly the C compiler produced, the problem is that it is not well optimized: in my code it funnels the whole data sequence through the xmm0 register alone. The compiler's register allocation is poor. With assembly I can use xmm0 through xmm15, so I can load a bulk of data and then store a bulk of data, which is a perfect streaming speedup.
3. Avoid software cache control.
Cache control is a very complicated problem, and in a multi-threaded, multi-core environment it is even more so. Applying software cache control may introduce uncertainty, which is not suitable for stream processing.
4. AVX may help more.
Since AVX doubles the data width, streaming non-temporal data can be moved even faster.
5. Instruction cycles.
It would be better if Intel presented developers with the cycle counts of each instruction of a processor under different conditions, so software developers could estimate which code fragments can be optimized, for example, the cycle count of MOVNTDQA hitting the L1 cache, L2 cache, L3 cache, or DDR3, etc.
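A minimal sketch of points 1 and 2 using intrinsics rather than raw assembly; note, as a caveat, that _mm_stream_load_si128 (MOVNTDQA) is only truly nontemporal on write-combining memory such as video memory, and acts as a plain load on ordinary cacheable memory:
```cpp
#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128 (MOVNTDQA)

// Four MOVNTDQA loads cover one 64-byte cache line, and spreading the data
// over several xmm registers keeps loads and stores in bulk.
// 'src' is non-const because the intrinsic takes __m128i*.
void copy_ntdqa(unsigned char *dst, unsigned char *src, int size)
{
    for (int n = 0; n < size; n += 64) {
        __m128i a = _mm_stream_load_si128((__m128i *)(src + n));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + n + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + n + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + n + 48));
        _mm_stream_si128((__m128i *)(dst + n),      a);
        _mm_stream_si128((__m128i *)(dst + n + 16), b);
        _mm_stream_si128((__m128i *)(dst + n + 32), c);
        _mm_stream_si128((__m128i *)(dst + n + 48), d);
    }
    _mm_sfence();   // order the streaming stores
}
```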
---
On Sandy Bridge or Ivy Bridge, AVX (256-bit) nontemporal stores don't necessarily accelerate an application beyond 128-bit nontemporal stores; cache lines remain 64 bytes. If you have code optimized to depend on store bandwidth on an earlier architecture, it may be sufficient for Sandy Bridge as well.
---
Sergey Kostrov wrote:
>>...It would be better if Intel presented developers with the cycle counts of each instruction of a processor
>>under different conditions, so software developers could estimate which code fragments can be optimized,
>>for example, the cycle count of MOVNTDQA hitting the L1 cache, L2 cache, L3 cache, or DDR3, etc.
Absolutely agree with that point of view, and Intel has published clock-cycle numbers for many instructions. However, some instructions are not on the list; it looks like MOVNTDQA is not on the list ( please correct me if that is not true ).
Agner Fog's CPU documentation also has plenty of information about instruction timings.
---
Finally I got the kernel optimized to use all the xmm registers; the C++ compiler needs the __m128i variables defined and used in the right way before it will allocate them all. It is amazing that a short 8-line loop runs slower than a 60-line one; the longer code can apparently issue 3 to 4 SSE4 register instructions per cycle. The baseline the compiler produces for the plain C loop is just byte-by-byte moves:
```asm
movzx ecx, BYTE PTR [esi]   ; load one source byte
mov   BYTE PTR [eax+1], cl  ; store it to the destination
```
Another advantage of using SSE4 is streaming stability: with fewer cache misses, the stream flows more smoothly.
In my case AVX does not bring much benefit because it lacks 256-bit shift operations; AVX2 may be useful.
---
I solved the problem by studying the SSE4/AVX2 instructions. SSE4/AVX2 provide a rich instruction set, and I found the one I needed; the code is now compact and fast. Thinking of SSE4 as a 128-bit processor and AVX2 as a 256-bit processor, Intel has nearly completed the definition of its instruction sets, but at present SSE4/AVX2 are hard for programmers to understand.
The C language cannot deliver true optimization in some cases. When I unrolled the loop four times, the SSE4 registers on 32-bit x86 were not enough: in 32-bit mode there are only 8 xmm registers, xmm0 to xmm7; xmm8 to xmm15 are available only in 64-bit mode. Each loop body needs 2 to 3 xmm registers on average, so a 4x unroll needs more than 8 registers. When registers run out, the compiler simply spills values to memory, which makes the optimization meaningless. With AVX2, because the registers are 256 bits wide, I only need to unroll twice.
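As a hypothetical illustration of that register arithmetic, a 2x-unrolled SSE2 version of the interleave kernel sketched earlier in the thread already keeps four to six __m128i values live per iteration; a 4x unroll would exceed the eight xmm registers of 32-bit mode and force spills:
```cpp
#include <emmintrin.h>

typedef unsigned char BYTE;

// 2x unroll: 32 Y samples per iteration, with y0/y1/uv0/uv1 plus the u/v
// temporaries all live inside the loop body.
void interleave_yuyv_unroll2(BYTE *pD, const BYTE *pY,
                             const BYTE *pU, const BYTE *pV, int width)
{
    for (int n = 0; n < width; n += 32) {
        __m128i y0  = _mm_load_si128((const __m128i *)(pY + n));
        __m128i y1  = _mm_load_si128((const __m128i *)(pY + n + 16));
        __m128i u   = _mm_load_si128((const __m128i *)(pU + n / 2));  // U0..U15
        __m128i v   = _mm_load_si128((const __m128i *)(pV + n / 2));  // V0..V15
        __m128i uv0 = _mm_unpacklo_epi8(u, v);                        // U0 V0 .. U7 V7
        __m128i uv1 = _mm_unpackhi_epi8(u, v);                        // U8 V8 .. U15 V15
        _mm_store_si128((__m128i *)(pD + 2 * n),      _mm_unpacklo_epi8(y0, uv0));
        _mm_store_si128((__m128i *)(pD + 2 * n + 16), _mm_unpackhi_epi8(y0, uv0));
        _mm_store_si128((__m128i *)(pD + 2 * n + 32), _mm_unpacklo_epi8(y1, uv1));
        _mm_store_si128((__m128i *)(pD + 2 * n + 48), _mm_unpackhi_epi8(y1, uv1));
    }
}
```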
---
>>>The C language cannot deliver true optimization in some cases>>>
Yes, that's true. I think true optimization can be achieved with the help of assembly language or inline assembly; C still lacks some features that are easily achieved with machine code.