Intel® ISA Extensions

AVX transition penalties and OS support

Christian_M_2
Beginner
6,598 Views

Hello,

I already have some experience with SSE-to-AVX transition penalties and have read the following article: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

It states that only vzeroall or vzeroupper puts the CPU into the safe state where no penalties can occur.

Isn't this a problem with multithreading or multiprocessing? Assume process A is running legacy SSE code, for example ordinary floating-point operations compiled to scalar SSE instructions, while process B uses AVX and executes vzeroupper only at the end of a function.

What if a context switch occurs in the middle of the AVX code? The OS will switch the context, including the YMM registers. But even if the upper halves are all zero, wouldn't the CPU remain in the other state? If so, context switches might cause penalties for process A without any influence from the programmer. Or is there something I misunderstood?

This scenario just came to my mind and I don't know how one could solve this. Or is there a possibility for the OS to avoid this problem?

54 Replies
SergeyKostrov
Valued Contributor II
1,808 Views

Hi Christian,

>>...But can I find any concrete information about cycles for both directions?

Please try to look at the Intel manuals, or try to find all ( or as many as possible ) Intel articles related to that subject. Sorry, I can't suggest anything else now.

SergeyKostrov
Valued Contributor II
1,808 Views

>>...Sorry, I can't suggest anything else now...

Christian, are you interested in another small project related to measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems?

Bernard
Valued Contributor I
1,808 Views

>>>I think I did not state my question clearly enough. I am talking about the AVX state transition caused by mixing AVX with legacy SSE>>>

It is OK :) Later I understood your intention. I have found this article about the transition penalty: ://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Btw., I disabled the rich-text option and removed any occurrence of the www and http protocol identifiers.

Christian_M_2
Beginner
1,808 Views

>>>It is ok:) Later I understood your intention.I have found this article about the transition penalty ://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Thanks for the article!

Christian_M_2
Beginner
1,808 Views

>>>Christian, Are you interested in another small project related to measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems?

Yes, that would be interesting! Write me a PM or let's open a new thread, or is it related to this subject?

I will look in the manuals once again. So far I only have these two articles that mention concrete numbers. The manuals only provide hints on why and how to avoid transition penalties. If I find anything else, I will post it here.

Bernard
Valued Contributor I
1,808 Views

>>>Thanks for the article!>>>

As always, you are welcome.

SergeyKostrov
Valued Contributor II
1,808 Views

>>...Yes, would be interesting! Write me a PM or let's open a new thread, or is it related to this subject?

Christian, we could proceed in the same way as we did with the SqrtTestApp project. I'll prepare some proposals and let you know as soon as they are ready. Please think about and prepare your own proposals for how you would measure these transitions. A new thread, "Measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems", could be created as soon as we have some numbers. It doesn't make sense to create it right now.

Christian_M_2
Beginner
1,808 Views

OK, this is my first idea:

Let us code a simple loop that performs, for example, an addition and a multiplication with AVX. Then we store only the lower half of the register with an SSE intrinsic. This should cause both transition penalties. By compiling this code with and without /arch:AVX, we should get one version with the penalties and one without. This can be checked with Intel SDE, which reports the exact count of transitions. If we measure the time and compare the results, the difference should be the time spent on penalties. Knowing the number of transitions, we should be able to convert this into concrete cycle numbers.

What do you think about it? Do you have another idea, or do you see a problem?

One thing concerns me: if we use only one half of the AVX result and store it with SSE, will the compiler realize that it could use SSE instead of AVX, since we need only half of the result? Then the test might fail.

TimP
Honored Contributor III
1,808 Views

The Intel compiler, when /arch:AVX is set so as to support AVX intrinsics, generates equivalent AVX-128 code from SSE intrinsics, so there should be no transition penalty.
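
To make the point above concrete, here is a minimal sketch in C of the kind of SSE-intrinsic code whose encoding depends on the compiler switch. The helper name `add_f32x4` is hypothetical and not from this thread. The comments describe the encoding one would expect from the behaviour described above (and from `-mavx` in GCC or Clang); it is worth confirming in a disassembler or debugger rather than taking on trust. The scalar branch exists only so the sketch also builds on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#define HAS_SSE_INTRINSICS 1
#else
#define HAS_SSE_INTRINSICS 0
#endif

/* Element-wise sum of two float arrays; n must be a multiple of 4.
 * add_f32x4 is a hypothetical name used only for this illustration. */
void add_f32x4(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if HAS_SSE_INTRINSICS
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        /* Built without an AVX switch: expected to be a legacy-encoded addps,
         * which can trigger a transition if upper YMM halves are dirty.
         * Built with /arch:AVX (or -mavx): expected to be a VEX-encoded
         * 128-bit vaddps, so no AVX/SSE transition is involved. */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
#else
        /* Scalar fallback so the sketch compiles on non-x86 targets. */
        for (size_t k = 0; k < 4; ++k)
            out[i + k] = a[i + k] + b[i + k];
#endif
    }
}
```

For a test that is meant to provoke transitions, this implies that the file containing the SSE intrinsics has to be built without the AVX switch, and that the generated instructions should be inspected before any timing is trusted.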

SergeyKostrov
Valued Contributor II
1,808 Views

>>...The Intel compiler, when /arch:AVX is set so as to support AVX intrinsics, generates equivalent AVX-128 code from SSE
>>intrinsics, so there should be no transition penalty.

Thanks for the comment, Tim. I think we shouldn't use the /arch:AVX option.

SergeyKostrov
Valued Contributor II
1,808 Views

Hi everybody,

>>...What do you think about it, do you have another idea or see some problem?

Christian's proposals are very good. Please take a look at my proposals ( they are similar ):

- Create a test application with support for 32-bit and 64-bit Windows platforms
- Disable all optimizations in the Release and Debug configurations of the test application
- Select two instructions to reproduce the AVX-to-SSE transition ( AVXI1 - AVX instruction 1 / SSEI1 - SSE instruction 1 )
- Select two instructions to reproduce the SSE-to-AVX transition ( SSEI2 - SSE instruction 2 / AVXI2 - AVX instruction 2 )
- Implement two test cases to reproduce the AVX-to-SSE and SSE-to-AVX transitions ( as simple as possible )
- Verify in the debugger that the compiler did not replace SSE instructions with AVX-128 instructions
- Verify with SDE that transitions are present in both cases ( AVX-to-SSE and SSE-to-AVX )
- Do all measurements with the process priority switched to Real-Time
- Use the rdtsc instruction to measure all time intervals in clock cycles ( cc )
- The number of iterations N should be around 2^20, which will provide an acceptable accuracy of +/- 5(cc)
- Measure the overhead of an empty for loop ( OEFL(cc) - Overhead of Empty For Loop )
- Measure the latency of AVXI1(cc)
- Measure the latency of SSEI1(cc)
- Measure the latency of AVXI2(cc)
- Measure the latency of SSEI2(cc)
- Measure the total time of the test for the AVX-to-SSE transition, TTT1(cc)
- Measure the total time of the test for the SSE-to-AVX transition, TTT2(cc)
- Calculate the time for the AVX-to-SSE transition: T-AVX2SSE(cc) = ( TTT1(cc) - OEFL(cc) - AVXI1(cc) - SSEI1(cc) ) / N
- Calculate the time for the SSE-to-AVX transition: T-SSE2AVX(cc) = ( TTT2(cc) - OEFL(cc) - AVXI2(cc) - SSEI2(cc) ) / N

It is actually a good idea to combine Christian's and Sergey's proposals; their results should be consistent. If the results are not consistent, then there is a problem and additional investigation will be needed.

Note: I forgot to include TTT1(cc) and TTT2(cc)

SergeyKostrov
Valued Contributor II
1,808 Views

>>- Select two instructions to reproduce AVX-to-SSE transition ( AVXI1 - AVX instruction 1 / SSEI1 - SSE instruction 1 )
>>- Select two instructions to reproduce SSE-to-AVX transition ( SSEI2 - SSE instruction 2 / AVXI2 - AVX instruction 2 )

Christian, could you select just two instructions, one for AVX and another for SSE, instead of four? I think it will simplify the tests. Thanks in advance.

Bernard
Valued Contributor I
1,808 Views

>>>Measure overhead of an empty for loop ( OEFL(cc) - Overhead of Empty For Loop )>>>

The for-loop statements will be executed in parallel with the code contained inside the loop block. This will be done on Port0 and/or Port1.

Christian_M_2
Beginner
1,808 Views

Sergey, your idea looks quite good.

What do you think of a simple logical AND for SSE and AVX? Both should have latency 1 and throughput 1 on Sandy Bridge and Ivy Bridge. I think this is quite a good basis.

There is one thing: if we execute something in a loop, we always get both transitions, because if we execute an SSE and then an AVX instruction in a loop, the next iteration will cause the opposite transition. So we might use a zeroall to avoid this.

>>> Execution of for-loop statements will be performed in parallel with the execution of the code contained inside the loop block.This will be done on Port0 and/or Port1.

This would make it hard to measure the empty for loop. But what about Intel SDE? You can let it analyse some statements regarding throughput and latency. Maybe this information might help us.

SergeyKostrov
Valued Contributor II
1,808 Views

>>...There is one thing: If we execute something in a loop we always get both transitions. Because if we do SSE and then
>>AVX command in a loop. The next iteration will cause the opposite transition. So we might use a zeroall to avoid this...

Yes, that is correct and I missed it.

>>...This would make it hard to measure empty for loop...

I did a couple of tests in the past; take a look:

    ...
    // Sub-Test 6.1 - Overhead of Empty For Statement
    {
    ///*
        CrtPrintf( RTU("Sub-Test 6.1 - [ Empty For Statement ]\n") );

        g_uiTicksStart = SysGetTickCount();
        for( RTint t = 0; t < 10000000; t++ )
        {
            ;
        }
        CrtPrintf( RTU("Sub-Test 6.1 - 10,000,000 iterations - %4ld ticks\n"),
                   ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
    //*/
    }

    // Sub-Test 6.2 - Overhead of Empty For Statement
    {
    ///*
        CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement ]\n") );

        RTclock_t ctClock1 = 0;
        RTclock_t ctClock2 = 0;

        ctClock1 = ( RTclock_t )CrtClock();
        for( RTint t = 0; t < 1000000; t++ )
        {
            ;
        }
        ctClock2 = ( RTclock_t )CrtClock();
        CrtPrintf( RTU("Sub-Test 6.2 - 1,000,000 iterations - %4ld clock cycles\n"),
                   ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) );
    //*/
    }
    ...

I'll run these tests on Ivy Bridge with a higher number of iterations.

Notes:

    CrtPrintf       = _tprintf
    RTU             = _T
    SysGetTickCount = GetTickCount
    RTclock_t       = clock_t
    CrtClock        = __rdtsc

SergeyKostrov
Valued Contributor II
1,808 Views

Christian, Let's wait for a couple of days for input from the community or Intel software engineers. Let's start in the middle of the next week.
Bernard
Valued Contributor I
1,808 Views

>>>I did a couple of tests in the past and take a look:>>>

I think that loop overhead will have some influence on the speed of execution only when floating-point values are used as loop control variables. When you use integer values as loop control variables, a modern processor will exploit instruction-level parallelism and execute both types of instructions in parallel.

SergeyKostrov
Valued Contributor II
1,808 Views

>>...There is one thing: If we execute something in a loop we always get both transitions...

Christian, in that case our generic equations need to be changed to:

>>>>...
>>>>- Calculate time for AVX-to-SSE transition: T-AVX2SSE(cc) = ( TTT1(cc) - OEFL(cc) - AVXI1(cc) - SSEI1(cc) ) / ( N * 2 )
>>>>- Calculate time for SSE-to-AVX transition: T-SSE2AVX(cc) = ( TTT2(cc) - OEFL(cc) - AVXI2(cc) - SSEI2(cc) ) / ( N * 2 )
>>>>...

However, these equations are too generic. We don't know whether the time for an AVX-to-SSE transition equals the time for an SSE-to-AVX transition for ALL possible combinations of AVX and SSE instructions that cause transitions. So the zeroall approach at the end of the for loop looks good, and I would really stick with a generic case, assuming that some variations in transition time are possible but that we're not going to prove or disprove it. It could be a different R&D project...

Have you selected the AVX and SSE instructions you're most interested in?

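
The equations above can be sketched as one small C helper. This is only an illustration under stated assumptions: the function name `transition_cost_cc` and its parameter names are hypothetical, and it assumes that every measured term (total time, empty-loop overhead, AVX cost, SSE cost) is a total in clock cycles over the same N iterations, which the thread does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one AVX/SSE transition in clock cycles (cc).
 *
 * Assumption (not stated in the thread): ttt_cc, oefl_cc, avx_cc and sse_cc
 * are all totals measured over the same n iterations.
 *
 * transitions_per_iteration:
 *   1 - a zeroall at the end of the loop body resets the state each iteration
 *   2 - no reset, so each iteration pays for both directions
 */
double transition_cost_cc(uint64_t ttt_cc, uint64_t oefl_cc,
                          uint64_t avx_cc, uint64_t sse_cc,
                          uint64_t n, unsigned transitions_per_iteration)
{
    assert(n > 0);
    assert(transitions_per_iteration > 0);

    /* Everything the loop would cost without any transition penalty. */
    double baseline = (double)oefl_cc + (double)avx_cc + (double)sse_cc;

    /* Computed in floating point so that measurement noise which pushes
     * the total below the baseline yields a negative value, not a wrap. */
    return ((double)ttt_cc - baseline)
           / ((double)n * (double)transitions_per_iteration);
}
```

With `transitions_per_iteration` set to 2 the result is an average over both directions, which is exactly the limitation noted above: it cannot tell whether AVX-to-SSE and SSE-to-AVX cost the same.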
Bernard
Valued Contributor I
1,808 Views

>>>This would make it hard to measure empty for loop>>>

IIRC Sergey demonstrated it in the "optimization of sine function" thread. Unfortunately, the results are lost because of the forum transition.

P.S.

Very interesting test case. Unfortunately, I cannot contribute because of my old CPU. I will follow the tests.

Christian_M_2
Beginner
1,783 Views

Sergey,

I know it is only an assumption, but AVX to SSE means the CPU has to store something, and SSE to AVX means loading the register information back. In my mind, both should take nearly the same time as a first approximation. Maybe storing is faster if the data is written to cache, while for loading the data is already in memory or another cache.

Do you think the exact instruction has an influence? I suppose they always store and restore the same amount of data.

To start, we should select ADD or MUL instructions. They are commonly used and provide great performance.

// EDIT: Maybe you are right and there is a dependency. The article mentions 60-80 cycles. Either this is the uncertainty of the measurements, or there is some factor that influences this.

SergeyKostrov
Valued Contributor II
1,783 Views

>>...Do you think the exact instruction has an influence?..

I don't know. Even if there is a finite number of instructions, verifying the different combinations would be a time-consuming ( wasting? ) process.

>>... The article mentions 60-80 cycles...

I was also surprised to see that range, because a difference of 20 clock cycles looks like too much.