Hello,
I already have some experience with SSE-to-AVX transition penalties and have read the following article: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
It says that only VZEROALL or VZEROUPPER puts the CPU into the safe state in which no penalties can occur.
Isn't this a problem with multithreading or multiprocessing? Assume process A is running legacy SSE code, for example normal floating-point operations compiled to scalar SSE instructions, and process B is using AVX and only executes a VZEROUPPER at the end of a function.
What if a context switch occurs in the middle of the AVX code? The OS will switch the context, including the YMM registers. But even if the upper halves are all zero, wouldn't the CPU remain in the other state? So context switches might lead to penalties for process A without the programmer having any influence. Or is there something I misunderstood?
This scenario just came to my mind and I don't know how one could solve it. Or is there a way for the OS to avoid this problem?
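For reference, the pattern the article recommends for process B can be sketched as below. This is only an illustration of where the VZEROUPPER goes, not an answer to the context-switch question. The function name `sum_avx` and the requirement that `n` is a multiple of 8 are assumptions made for the example, and the `target("avx")` attribute is GCC/Clang syntax.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats with 256-bit AVX adds.
 * Assumes n is a multiple of 8 to keep the sketch short. */
__attribute__((target("avx")))
float sum_avx(const float *data, size_t n)
{
    assert(n % 8 == 0);

    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    /* Spill the eight partial sums to memory. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Zero the upper halves of all YMM registers so that legacy SSE
     * code executed after this point starts from the clean state. */
    _mm256_zeroupper();

    float total = 0.0f;
    for (int i = 0; i < 8; ++i)
        total += lanes[i];
    return total;
}
```

The caller should check for AVX support at run time (for example with `__builtin_cpu_supports("avx")` on GCC or Clang) before calling such a function.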
Yes, unfortunately this is something the manuals do not make precise statements about. You only read that you should avoid these transitions, and how to do so.
I think it is quite hard to create real tests. Even if we separate the measurements for both transitions, we still have the influence of the loop, as measuring for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away.
Do you have any ideas? Can one read something like a CPU cycle count? That might be a solution, if possible.
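One possible way to keep an otherwise empty loop from being optimized away is a compiler barrier in the loop body. The sketch below assumes GCC or Clang inline-asm syntax; the names `spin_loop` and `time_spin_loop_seconds` are made up for the example, and `clock()` is used only as a portable placeholder timer.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: runs an empty loop `iterations` times.
 * The empty volatile asm statement emits no instruction, but the
 * compiler must keep it, so the loop cannot be removed. */
uint64_t spin_loop(uint64_t iterations)
{
    uint64_t done = 0;
    for (uint64_t i = 0; i < iterations; ++i) {
        __asm__ volatile("" ::: "memory");  /* compiler barrier */
        ++done;
    }
    return done;
}

/* Hypothetical helper: CPU time in seconds spent in the empty loop.
 * This estimate can be subtracted from a measurement of the same
 * loop with real work inside. */
double time_spin_loop_seconds(uint64_t iterations)
{
    clock_t t0 = clock();
    spin_loop(iterations);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

The resolution of `clock()` is coarse, so a large iteration count is needed before the result is meaningful.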
>>> we still have the influence of the loop, as measuring for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away.>>>
The loop overhead will be executed in parallel with the SSE/AVX uop stream on Port0 and Port1. There could be some overhead only when an unrelated integer instruction is somehow scheduled for execution in between the for-loop uops.
Sergey Kostrov wrote:
>>...Can one read something like CPU cycle count?
Yes, the RDTSC instruction needs to be used (or the unsigned __int64 __rdtsc(void) intrinsic declared in intrin.h).
The built-in compiler intrinsic __rdtsc is available in the Microsoft and Intel C++ compilers, but not in older gcc releases, where you need to write out inline asm according to your choice of 32- or 64-bit mode. The Intel MKL library includes a dsecnd function based on the time stamp counter, with built-in translation to seconds. Both __rdtsc and dsecnd require the programmer to stay on the same thread for each use of the function, to take care of situations where the counter is not synchronized among CPUs at hardware reset (on most motherboards with Intel CPUs it is synchronized). On Intel CPUs since Woodcrest, the counter actually counts bus clock cycles and multiplies by the nominal CPU clock multiplier (independent of power settings or overclocking).
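As a rough sketch of how the intrinsic might be used: current GCC and Clang releases also provide `__rdtsc` through `x86intrin.h`, which the example relies on. The helper names `read_tsc` and `ticks_for_sum` are invented for illustration, and pinning the thread to one CPU, as advised above, is not shown.

```c
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on Microsoft compilers */
#else
#include <x86intrin.h>   /* __rdtsc on GCC, Clang, and Intel compilers */
#endif

/* Hypothetical helper: returns the raw time stamp counter value. */
uint64_t read_tsc(void)
{
    return (uint64_t)__rdtsc();
}

/* Hypothetical helper: TSC ticks elapsed while summing 0..n-1.
 * The volatile sink keeps the compiler from deleting the loop. */
uint64_t ticks_for_sum(uint32_t n)
{
    volatile uint64_t sink = 0;
    uint64_t start = read_tsc();
    for (uint32_t i = 0; i < n; ++i)
        sink += i;
    uint64_t stop = read_tsc();
    return stop - start;
}
```

The result is in counter ticks, not core clock cycles, so it does not change with power settings or overclocking.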
Sergey Kostrov wrote:
Christian, I don't know if you do any Linux programming with a GCC compiler...
>>... where you would need to write out inline asm according to your choice of 32- or 64-bit mode...
That is correct and the following code does the job:
    #include <stdint.h>

    /* 32-bit mode only: the "=A" constraint returns the result in EDX:EAX. */
    inline uint64_t GetClock( void )
    {
        uint64_t uiValue;
        __asm__ volatile
        (
            "rdtsc" : "=A" ( uiValue )
        );
        return uiValue;
    }
That code is for 32-bit mode, with gcc or icc. The method to return the 64-bit result as a normal uint64_t in 64-bit mode is different, e.g.:
    unsigned int _hi, _lo;
    asm volatile("rdtsc" : "=a"(_lo), "=d"(_hi));
    return ((unsigned long long int)_hi << 32) | _lo;
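The two variants could be combined into one function that selects the right form at compile time. This is only a sketch assuming the GCC/Clang predefined macros `__x86_64__` and `__i386__`; the name `read_tsc_asm` is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: reads the time stamp counter with inline asm,
 * choosing the constraint form that matches the target mode. */
static inline uint64_t read_tsc_asm(void)
{
#if defined(__x86_64__)
    /* 64-bit mode: RDTSC still splits the result across EAX and EDX. */
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__i386__)
    /* 32-bit mode: "=A" binds a 64-bit value to the EDX:EAX pair. */
    uint64_t value;
    __asm__ volatile("rdtsc" : "=A"(value));
    return value;
#else
#error "read_tsc_asm is only sketched for x86 targets"
#endif
}
```

Using `"=A"` in 64-bit mode would not give the combined value, which is why the two halves are read separately there.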
For Windows historians, I also have the X64 code from before the implementation of the __rdtsc built-in.
Quite interesting, I will check the code.
As to your question: no, currently I am not using gcc under Linux, at least not in combination with SSE/AVX.
What is this function, __Z8HrtClockv? Judging by its name, it looks like some RTC handler routine.