Hello,
I already have some experience with SSE-to-AVX transition penalties and read the following article: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
It states that only vzeroall or vzeroupper puts the CPU into the clean state in which no penalties can occur.
Isn't this a problem with multithreading and multiprocessing? I mean, assume process A is running legacy SSE code, for example ordinary scalar floating-point operations, while process B is using AVX and executes a vzeroupper only at the end of a function.
What if a context switch occurs in the middle of the AVX code? The OS will switch the context, including the YMM registers. But even if the upper halves are all zero, wouldn't the CPU remain in the dirty state? So context switches might cause penalties for process A without any influence from the programmer. Or have I misunderstood something?
This scenario just came to my mind and I don't know how one could solve it. Is there a way for the OS to avoid this problem?
>>> It is OK :) Later I understood your intention. I have found this article about the transition penalty: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
Thanks for the article!
>>> Christian, are you interested in another small project related to measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems?
Yes, that would be interesting! Write me a PM or let's open a new thread, or is it related to this subject?
I looked in the manuals once again. So far I only have these two articles that mention concrete numbers. The manuals only provide hints on why and how to avoid transition penalties. If I find anything else, I will post it here.
>>> Thanks for the article! >>>
As always, you are welcome.
OK, this is my first idea:
Let us code a simple loop that does, for example, an addition and a multiplication with AVX. Then we store only the lower half of the register with an SSE intrinsic. This should create both transition penalties. Compiling this code with and without /arch:AVX, we get one version with the penalties and one without. This can be checked with Intel SDE, which reports the exact count of transitions. If we measure time and compare the results, the difference should be the time spent on penalties. Knowing the number of transitions, we can turn this into concrete cycle counts.
What do you think about it, do you have another idea or see some problem?
One thing worries me: if we only use one half of the AVX result and store it with SSE, will the compiler realize it could use SSE instead of AVX, since we only need half of the register? Then the test might fail.
The Intel compiler, when /arch:AVX is set so as to support AVX intrinsics, generates equivalent VEX-encoded 128-bit (AVX-128) code from SSE intrinsics, so there should be no transition penalty.
Sergey, your idea looks quite good.
What do you think of a simple logical AND for both SSE and AVX? Both should have a latency of 1 and a throughput of 1 on Sandy Bridge and Ivy Bridge. I think this is quite a good basis.
There is one thing: if we execute something in a loop, we always get both transitions, because if we issue an SSE and then an AVX instruction in a loop, the next iteration will cause the opposite transition. So we might use a vzeroall to avoid this.
>>> Execution of for-loop statements will be performed in parallel with the execution of the code contained inside the loop block. This will be done on Port 0 and/or Port 1.
This would make it hard to measure an empty for loop. But what about Intel SDE? You can let it analyze statements regarding throughput and latency. Maybe that information can help us.
>>> I did a couple of tests in the past; take a look: >>>
I think that loop overhead will influence the speed of execution only when floating-point values are used as loop control variables. When you use integer values as loop control variables, a modern processor will exploit instruction-level parallelism and execute both types of instructions in parallel.
>>> This would make it hard to measure an empty for loop >>>
IIRC Sergey demonstrated this in the "optimization of sine function" thread. Unfortunately, the results were lost in the forum transition.
P.S.
Very interesting test case; unfortunately I cannot contribute because of my old CPU. I will follow the tests.
Sergey,
I know it is only an assumption. But AVX-to-SSE means the CPU has to store something, and SSE-to-AVX means loading the register information back. To a first approximation, both should take nearly the same time. Maybe storing is faster if the data is written to cache, while for loading the data is already in memory or another cache.
Do you think the exact instruction has an influence? I suppose the CPU always stores and restores the same amount of data.
To start we should select ADD or MUL instructions. They are commonly used and provide good performance.
// EDIT: Maybe you are right and there is a dependency. The article mentions 60-80 cycles. Either that is measurement uncertainty, or some factor influences the value.