AVX transition penalties and OS support - Page 3

Christian_M_2 · ‎02-02-2013

Hello,

I already have some experience with SSE-to-AVX transition penalties and have read the following article: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

It states that only VZEROALL or VZEROUPPER puts the CPU into the safe state where no penalties can occur.

Isn't this a problem with multithreading or multiprocessing? Assume process A is running legacy SSE code, for example ordinary floating-point operations compiled to scalar SSE instructions, and process B is using AVX and only has a VZEROUPPER at the end of its function.

What if a context switch occurs in the middle of the AVX code? The OS will switch the context, including the YMM registers. But even if the upper halves are all zero, wouldn't the CPU remain in the other state? So context switches might lead to penalties for process A without any influence from the programmer. Or is there something I misunderstood?

This scenario just came to my mind, and I don't know how one could solve it. Or is there a way for the OS to avoid this problem?
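As a rough illustration of the pattern described above (an AVX routine that issues VZEROUPPER only at its end), here is a minimal sketch. It assumes GCC or Clang on an x86 target; `add_avx` and `avx_available` are hypothetical names used only for this example and do not come from the article. It does not answer the context-switch question, it only shows where the instruction sits.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] using 256-bit AVX.
   The target attribute (GCC/Clang assumption) enables AVX for this
   function only, so the rest of the program can stay legacy SSE. */
__attribute__((target("avx")))
void add_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    /* Scalar tail for the remaining elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
    /* Clear the upper YMM halves before returning to code that may
       use legacy (non-VEX) SSE instructions. */
    _mm256_zeroupper();
}

/* Hypothetical runtime check so add_avx is only called on AVX hardware. */
int avx_available(void)
{
    return __builtin_cpu_supports("avx");
}
```

A caller would test `avx_available()` first and fall back to a plain loop otherwise.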

Christian_M_2 · ‎03-10-2013

Yes, unfortunately this is something the manuals make no precise statements about. You only read that you should avoid these transitions and how to do so.

Christian_M_2 · ‎03-16-2013

I think it is quite hard to create realistic tests. Even if we separate the measurements for the two transitions, we still have the influence of the loop, as measuring for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away.

Do you have any ideas? Can one read something like a CPU cycle count? That might be a solution, if possible.
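One way to keep an empty loop from being optimized away, without turning off optimizations globally, is a `volatile` loop counter. The sketch below is only an illustration; `run_empty_loop` is a hypothetical name. Note the caveat: every access to the counter goes through memory, so the measured overhead is that of a load/store loop, not of a register-only loop.

```c
/* Hypothetical helper: executes `iterations` empty loop passes.
   Accesses to a volatile object are observable behaviour in C,
   so the compiler must keep every increment and comparison. */
unsigned long run_empty_loop(unsigned long iterations)
{
    volatile unsigned long i;
    for (i = 0; i < iterations; ++i) {
        /* intentionally empty */
    }
    return i; /* equals `iterations` once the loop has finished */
}
```

Timing `run_empty_loop(N)` and subtracting it from the timing of the real test loop gives a rough estimate of the loop's own contribution.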

SergeyKostrov · ‎03-18-2013

>>...Even if we separate the measurements for both transitions we have still the influence of the loop as measuring for-loop
>>overhead is quite hard. If you do not do anything inside it, compiler is optimizing it away...

I measured the overhead of an empty for-loop some time ago. Also, that is why the 2nd line in the specs is:
...
- Disable all optimizations in Release and Debug configurations for the test application
...

>>...Can one read something like CPU cycle count?

Yes, and the RDTSC instruction needs to be used ( or the unsigned __int64 __rdtsc(void) intrinsic declared in intrin.h ).

Bernard · ‎03-18-2013

>>> we have still the influence of the loop as measuring for-loop overhead is quite hard. If you do not do anything inside it, compiler is optimizing it away.>>>

It will be executed in parallel with the SSE/AVX uop stream on Port 0 and Port 1. Only when an unrelated integer instruction is somehow scheduled for execution in between the for-loop uops could there be some overhead.

TimP · ‎03-19-2013

Sergey Kostrov wrote:

>>...Can one read something like CPU cycle count?

Yes, and the RDTSC instruction needs to be used ( or the unsigned __int64 __rdtsc(void) intrinsic declared in intrin.h ).

The built-in compiler intrinsic __rdtsc is available in the Microsoft and Intel C++ compilers, but not in older gcc versions (newer ones declare it in x86intrin.h), where you would need to write out inline asm according to your choice of 32- or 64-bit mode. The Intel MKL library includes a dsecnd function based on the time stamp counter, with built-in conversion to seconds. Both __rdtsc and dsecnd require the programmer to call the function from the same thread each time, to handle situations where the counter is not synchronized among CPUs at hardware reset (on most motherboards with Intel CPUs it is synchronized). On Intel CPUs since Woodcrest, the counter actually counts bus clock cycles multiplied by the nominal CPU clock multiplier (independent of power settings or overclocking).
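A minimal sketch of how the intrinsic might be used to time a piece of code follows. The names `min_ticks`, `work_fn` and `do_nothing` are hypothetical, and the header choice assumes either MSVC (`intrin.h`) or a GCC/Clang release recent enough to provide `__rdtsc` in `x86intrin.h`. Keeping the calling thread on one CPU, as described above, is left to the caller.

```c
#ifdef _MSC_VER
#include <intrin.h>      /* MSVC: declares __rdtsc */
#else
#include <x86intrin.h>   /* assumption: recent GCC/Clang on x86 */
#endif

typedef void (*work_fn)(void);

/* Hypothetical placeholder for the code under test. */
void do_nothing(void) { }

/* Hypothetical helper: runs fn() `repeats` times and returns the smallest
   observed time stamp counter difference. Taking the minimum filters out
   runs disturbed by interrupts or context switches.
   Returns ~0ULL if repeats <= 0. */
unsigned long long min_ticks(work_fn fn, int repeats)
{
    unsigned long long best = ~0ULL;
    for (int r = 0; r < repeats; ++r) {
        unsigned long long t0 = __rdtsc();
        fn();
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling `min_ticks(do_nothing, 100)` first gives an estimate of the measurement overhead itself, which can then be subtracted from the result for the real routine. The values are counter ticks, not necessarily core clock cycles.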

SergeyKostrov · ‎03-19-2013

Christian, I don't know if you do any Linux programming with a GCC compiler...

>>... where you would need to write out inline asm according to your choice of 32- or 64-bit mode...

That is correct and the following code does the job:

inline uint64 GetClock( void )
{
   uint64 uiValue;
   __asm__ volatile
   (
      "rdtsc;" : "=A" ( uiValue )
   );
   return ( uint64 )uiValue;
}

TimP · ‎03-19-2013

Sergey Kostrov wrote:

Christian, I don't know if you do any Linux programming with a GCC compiler...

>>... where you would need to write out inline asm according to your choice of 32- or 64-bit mode...

That is correct and the following code does the job:

inline uint64 GetClock( void )
{
   uint64 uiValue;
   __asm__ volatile
   (
      "rdtsc;" : "=A" ( uiValue )
   );
   return ( uint64 )uiValue;
}

That code is for 32-bit mode, with gcc or icc. The method to return the 64-bit result as a normal uint64_t in 64-bit mode is different, e.g.:

   unsigned int _hi,_lo;
   asm volatile("rdtsc":"=a"(_lo),"=d"(_hi));
   return ((unsigned long long int)_hi << 32) | _lo;

For Windows historians, I also have the X64 code from before the implementation of the __rdtsc built-in.
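The 64-bit fragment above can be wrapped into a complete function. The reason the "=A" version fails in 64-bit mode is that, for a 64-bit operand, the "A" constraint there means either rax or rdx rather than the edx:eax pair, so half of the result is lost. Reading the two halves separately works in both modes. This is a sketch for gcc-compatible compilers; `get_clock` and `tsc_combine` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper: joins the two 32-bit halves returned by RDTSC. */
static inline uint64_t tsc_combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Reads the time stamp counter. RDTSC puts the low 32 bits in EAX and
   the high 32 bits in EDX, in both 32-bit and 64-bit mode, so the same
   source compiles correctly for either target. */
static inline uint64_t get_clock(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return tsc_combine(hi, lo);
}
```

Usage is the same as with the 32-bit version: call `get_clock()` before and after the code under test and subtract the two values.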

SergeyKostrov · ‎03-19-2013

Thanks, Tim! I'll need to review that piece of code ( currently used on a project... ) for 64-bit applications...

SergeyKostrov · ‎03-20-2013

>>...For Windows historians I have also the X64 code from before the implementation of __rdtsc built-in.

That would be nice to see, so please post it. Thanks in advance!

SergeyKostrov · ‎03-21-2013

I verified my 32-bit version of a function that uses the RDTSC intrinsic function ( for the MinGW compiler ), and the generated assembler code is as follows:

...
__Z8HrtClockv:
   .stabs "../../Include/DevHrtAL.h",132,0,0,Ltext35
Ltext35:
   .stabn 68,0,356,LM3141-__Z8HrtClockv
LM3141:
   pushl %ebp
   movl %esp, %ebp
   subl $8, %esp
LBB551:
LBB552:
   .stabn 68,0,365,LM3142-__Z8HrtClockv
LM3142:
/APP
   rdtsc;
/NO_APP
   movl %eax, -8(%ebp)
   movl %edx, -4(%ebp)
   .stabn 68,0,367,LM3143-__Z8HrtClockv
LM3143:
   movl -8(%ebp), %eax
   movl -4(%ebp), %edx
LBE552:
LBE551:
   .stabn 68,0,368,LM3144-__Z8HrtClockv
LM3144:
   leave
   ret
   .stabs "uiValue:(110,22)",128,0,361,-8
   .stabn 192,0,0,LBB552-__Z8HrtClockv
   .stabn 224,0,0,LBE552-__Z8HrtClockv
...

Christian_M_2 · ‎03-23-2013

Quite interesting, I will check the code.

As to your question: no, currently I am not using gcc under Linux, at least not in combination with SSE/AVX.

Bernard · ‎03-23-2013

What is this function, __Z8HrtClockv? Judging by its name, it looks like some RTC handler routine.

SergeyKostrov · ‎03-23-2013

>>Quite interesting, I will check the code.
>>
>>As to your question, no currently I am not using gcc under Linux, at least not in combination with SSE/AVX.

For a long time I have been using the MinGW C++ compiler v3.4.2 ( 32-bit version / supports SSE, SSE2 and SSE3 / no support for SSSE3, AVX, and AVX2 ), and I really like it because of its compatibility with GCC. You could consider it a 99.99% GCC-compatible compiler on the Windows platform, which means that up to some point I don't need Linux, and it reduces the development overhead related to C/C++ code compatibility. An upgrade to a newer version 4.x.x ( 32-bit & 64-bit with support for the latest Intel instruction sets ) is scheduled for some time this year. Anyway, this is a really good C/C++ compiler for free (!).