I wrote a subroutine mostly using AVX2 and AVX compiler intrinsics, plus a few SSE instructions, and I set the enhanced instruction set to AVX2 in the Visual Studio project settings.
My program has many other routines, and every time I run it, it gives the same results. But when I profile it or print out the times taken by different functions, the function I wrote with AVX2 intrinsics shows weird behavior: in certain runs it slows down by a factor of 10 to 15, whereas the other functions vary by a factor of at most 2.
Can anybody explain this? Any help will be greatly appreciated. For certain reasons I cannot paste my code here.
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Some processors (definitely Xeon E5 v3, but perhaps others) incur a delay of about 10 microseconds (about 30,000 cycles) shortly after any core begins using 256-bit registers (if it has not used 256-bit registers in the last millisecond). (This appears to be a delay required to "turn on" the upper 128-bits of the vector pipelines -- similar to a frequency change, but occurring even if the frequency does not change, so it is probably a delay due to a voltage change.) This stall only occurs when using 256-bit registers -- AVX/AVX2 encodings of scalar and 128-bit register operations do not generate this delay.
If your functions are short (e.g., 3000 cycles), then this overhead would be 10x.
If the time between calls to functions using 256-bit registers is close to 1 millisecond, the stall becomes more random -- sometimes your code will execute just before the upper 128-bit pipeline is turned off and sometimes it will execute just after the upper 128-bit pipeline is turned off.
The only way to avoid this stall is to either (1) never use 256-bit registers, or (2) make sure that the code never goes a full millisecond without using 256-bit registers.
My benchmark codes now include a "warm-up" function that uses 256-bit registers and runs for approximately 2 milliseconds immediately before the initial timer read for any code that I want to test that uses 256-bit registers. This does not eliminate the stall, but it ensures that it happens before my timed region.
The stall seems long, but since it can only happen once per millisecond, it can't eat up more than about 1% of the overall cycles.
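A warm-up function like the one described can be sketched as follows. This is not the exact code from my benchmarks; the iteration count is a placeholder you would calibrate so the loop runs roughly 2 milliseconds on your machine, and the `target` attribute is only there so the sketch compiles without global AVX flags on GCC/Clang.

```cpp
#include <immintrin.h>

#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Touch 256-bit registers in a tight loop so the upper-lane power-up
// stall is paid before the timed region. The iteration count is a
// placeholder; calibrate it so the loop runs ~2 ms on your machine.
AVX_TARGET
static float avx256_warmup(long iters) {
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 tiny = _mm256_set1_ps(1e-12f);
    for (long i = 0; i < iters; ++i)
        acc = _mm256_add_ps(acc, tiny);  // keeps the 256-bit pipes busy
    float out[8];
    _mm256_storeu_ps(out, acc);          // force the result to be kept
    return out[0];
}
```

Call `avx256_warmup()` immediately before the initial timer read of any region that uses 256-bit registers; the return value only exists so the compiler cannot eliminate the loop.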
My thread is not pinned, but when I profiled with GPUView I found that even though the function ran uninterrupted on a single core, it still took around 10 times longer in certain cases.
In my program there are basically 3 expensive functions. Let the functions be funcA , funcB, funcInQuestion, where funcInQuestion is the function I am talking about.
I repeatedly call the functions in this order
funcA -> funcInQuestion -> funcB-> funcInQuestion.
Both funcA and funcB use 256-bit registers, but not as heavily as funcInQuestion, and funcA and funcB never have such massive random slowdowns. funcInQuestion has an expensive inner loop inside another loop which runs around 8 times. Inside the innermost loop I do simple vectorized math, mainly fmadd and fmsub, plus a few permute and blend instructions.
I need to repeatedly use some 256-bit values stored in arrays, so I gather the 256-bit data into variables to keep array indexing inside the innermost loop to a minimum. When the function behaves normally, each call takes around 250 microseconds and enters the innermost loop 8 times (with minimal time in between), so one pass through the loop takes about 31 microseconds on average. Thus the overhead for turning on the 256-bit registers would be at most about 10/250, and that only if funcA and funcB took more than 1 millisecond (which they usually don't) and touched the 256-bit registers only at their respective starting points (highly unlikely).
Basically funcInQuestion is the second version of my function. The last version I wrote performed consistently. When I disassembled the previous version it had around 237 assembly instructions inside the innermost loop, my rewritten version has around 127 instructions in the innermost loop. The new one performs better in most cases until and unless I have this random slowdown, but I DID NOT notice this random behavior in the old version.
One noticeable difference between funcInQuestion and its older counterpart is the use of blend instructions, and of permute instructions that move values across the whole 256-bit register.
In both versions, at one point I need three 256-bit registers holding 6 useful floats each, with the first two slots being garbage values, in this format:
DC DC Calc1 Calc2 Calc3 Const1 Const2 Const3 (let me call it Desired1)
DC represents don't care, Calc represents values calculated inside the loop, and Const represents values that remain constant through the loop.
In my old version I end up getting
Calc1 Calc2 Calc3 Calc4 Calc5 Calc6 Calc7 (let me call it CalcS)
then I build my Desired1 256-bit variable by taking the values individually from CalcS, take Const1 Const2 Const3 from an array, and put them all together using insert operations to form Desired1. This generated quite a bit of assembly code.
In my new version (the funcInQuestion) I end up having a variable that holds
DC DC Calc1 Calc2 Calc3 Calc4 Calc5 Calc6 (CalcsNew)
I blend it with a 256 bit variable
DC DC DC DC DC Const1 Const2 Const3 (which I arranged outside the loop, say as variable ConstsRow0)
For Desired2 (similar to Desired1), I permute CalcsNew and blend it with ConstsRow1 (similar to ConstsRow0).
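As a sketch of how the blend step can look in intrinsics: the lane numbering below assumes the left-most printed value is the highest element, all names are invented, and the `target` attribute is just so the sketch compiles without global AVX flags on GCC/Clang.

```cpp
#include <immintrin.h>

#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Merge the three low lanes (Const1..Const3) of constsRow0 into calcsNew,
// keeping the other five lanes of calcsNew.
// In the blend mask, bit i set means "take element i from the second operand".
AVX_TARGET
static __m256 make_desired1(__m256 calcsNew, __m256 constsRow0) {
    return _mm256_blend_ps(calcsNew, constsRow0, 0x07); // elements 0..2 from constsRow0
}

// Demo with made-up values: Calc1..Calc6 = 1..6, Const1..Const3 = 10, 20, 30.
AVX_TARGET
static float demo_desired1_low_lane() {
    __m256 calcsNew   = _mm256_set_ps(0, 0, 1, 2, 3, 4, 5, 6);    // DC DC Calc1..Calc6
    __m256 constsRow0 = _mm256_set_ps(0, 0, 0, 0, 0, 10, 20, 30); // DC x5, Const1..Const3
    __m256 d = make_desired1(calcsNew, constsRow0);
    float out[8];
    _mm256_storeu_ps(out, d);
    return out[0]; // element 0 should now be Const3
}
```

A single `vblendps` replaces the chain of insert operations from the old version, which is where most of the instruction-count saving comes from.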
When performance is slow, the Visual Studio profiler usually indicates that these blend instructions take a lot of time, but it cannot be fully trusted in such cases where it points to single instructions as hot spots.
The Visual Studio profiler also sometimes indicates that an fmadd instruction inside the loop takes a lot of time, which is unreasonable given that the disassembly showed the instruction was mostly using registers.
What I am thinking is this: in my old version, when I access Const1, Const2, and Const3 separately from an array and put them into Desired1 using insert instructions, the three values may land in different cache lines; in my new version, when I access ConstsRow0 (DC DC DC DC DC Const1 Const2 Const3) for the blend instruction, they are all placed in one single cache line. Perhaps when that line gets evicted, or it evicts another 256-bit variable that I am using, I get this penalty, and whether the situation is triggered depends on where the program happens to be placed in memory each run.
But then again, the Core i7 processor I am using has 8-way set-associative L1 and L2 caches and a 16-way L3 cache, and I only use around 18 256-bit variables and around 8 128-bit variables inside my innermost loop.
The format of my code for the innermost loop is:
__m256 v1 = ...
__m256 v2 = ...
__m256 v3 = ...
__m128 z1 = ..
__m128 z2 = ..
__m128 z3 = ..
for (i = 0; i < large_number; i++)
Simple vectorized operations with the variables gathered above. This loop is where the time is spent.
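Concretely, that shape might look like the following compilable skeleton. All names and the arithmetic are placeholders, not my real code, and the `target` attribute only makes the sketch compile without global AVX2/FMA flags on GCC/Clang.

```cpp
#include <immintrin.h>

#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx2,fma")))
#else
#define AVX_TARGET
#endif

// Hoist array elements into named __m256 variables before the hot loop
// so the innermost loop does register-only math (placeholder arithmetic).
AVX_TARGET
static float hot_loop(const float* data, long iters) {
    __m256 v1 = _mm256_loadu_ps(data);       // gathered once, up front
    __m256 v2 = _mm256_loadu_ps(data + 8);
    __m256 v3 = _mm256_loadu_ps(data + 16);
    __m256 acc = _mm256_setzero_ps();
    for (long i = 0; i < iters; ++i) {
        acc = _mm256_fmadd_ps(v1, v2, acc);  // register-only fmadd
        acc = _mm256_mul_ps(acc, v3);
    }
    float out[8];
    _mm256_storeu_ps(out, acc);
    return out[0];
}
```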
I made a similar post on Stack Overflow: http://stackoverflow.com/questions/41972223/reason-for-random-slowdown-in-vectorized-code
Thank you very much for your help and comments.
@John: I've already read some differing statements - a delay of some 100 ns up to 1 µs if AVX2 was not used in the last 10 ns. That question interests me too - how do you avoid that situation? Is there an example somewhere?
It should be easy enough for you to try pinning the problem thread. Note, if your GPU has some type of CPU binding (e.g. first core), then avoid that core when pinning your worker thread. Without pinning, should the O/S schedule something else and suspend your worker thread, your worker thread may resume on a different core. Resuming on the same core has some probability of maintaining some values in cache (little/none in L1, possibly some in L2); migrating to a different core will definitely lose L1 and L2 (unless you are on a Core Duo, where 2 cores share L2).
You might try inserting a 256-bit instruction inside the outer loop of funcInQuestion.
I probably found the answer to the question but I am not able to verify it right now. I will post the test results as soon as I can.
What I did in my function, to reduce array indexing and maximize the use of registers, was to put some data from arrays into variables.
For example, say I want to access Darray[0], Darray[1], ..., Darray[N].
At the start of the loop I used code like
__m256 D0 = Darray[0];
__m256 D1 = Darray[1];
and so on. Most of the time the compiler generates assembly where such variables are kept in registers, but in this case the register pressure was too high, so they were not put into registers and were instead spilled to different memory locations. I printed out the address of D0 and its difference in address to the other variables D1, D2, etc.
This is the result that I got (the first number is the address of D0 and the next ones are the offsets from it):
280449315232 -704 -768 -640 -736 -608
Even though I access the variables in my code sequentially, sometimes they are quite far apart.
This is the result for another array (this one is the most surprising):
280449314144 416 512
812970176 128 192 256 224 160 1152
Thus when I access one variable, it is unlikely to bring the others into the cache. Within one iteration of the loop I might bring all the variables into the cache, but another process can evict them at any time. If I used an array instead, then even if the lines I fetched got evicted, accessing some elements would bring neighboring elements back into the cache.
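The offset dumps above were produced along these lines (a sketch; the D variables stand in for the real spilled locals, and taking an address forces the compiler to give each one a stack slot):

```cpp
#include <immintrin.h>
#include <cstdio>
#include <cstdint>

// Print where the compiler actually placed each __m256 local:
// the first address, then each later variable's offset from it.
static void dump_offsets(const __m256* v[], int n) {
    printf("%p", (const void*)v[0]);
    for (int i = 1; i < n; ++i)
        printf(" %td", (const char*)v[i] - (const char*)v[0]);
    printf("\n");
}

// Demo with three locals; also checks the 32-byte alignment the
// ABI requires for __m256 stack slots.
static bool demo_offsets() {
    __m256 d0 = {}, d1 = {}, d2 = {};
    const __m256* vs[3] = { &d0, &d1, &d2 };
    dump_offsets(vs, 3);
    return (reinterpret_cast<uintptr_t>(&d0) % 32) == 0;
}
```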
I will use arrays again for most of the data and try to fit the rest in registers. I will benchmark and report my findings in this post.
Thanks a lot.
@Anil - which compiler are you using? I've seen similar "strange" behaviour from the Microsoft VS C++ compiler (up to and including our current 2015.3). Even if only 16 variables, register hints, and a proper code sequence are used, the compiled assembly code is quite far from optimal (useless register spilling, data recaching, a totally wrong memory access sequence, etc.) - often I rewrote the code completely in assembler and got a huge performance improvement. See my other question here. The MS VC++ compiler generates shamelessly underperforming code and thinks it can do things better than the programmer can.
@Alexander - I was using the Visual Studio 2013 compiler. I have another issue with the 2013 compiler: even if I specifically align a __m128 variable and use an aligned access like _mm_load_ps, it is still accessed using unaligned memory accesses.
Can you tell me how you wrote your code in assembly, since Visual Studio does not support inline assembly for x64? Did you write your code in a separate IDE?
Thanks a lot.
@Anil - the same with VS 2015. Alignment of variables seems not to be in the scope of the compiler's interest. You are required to use either intrinsics or assembler instructions with an aligned signature.
On the question of how to write assembler code:
1. Create a document with extension ".asm".
2. Add to your solution (if not already created from IDE, in this case it's already added).
3. Select the .asm file in Solution Explorer and go to its property page (i.e. right mouse click, select Properties).
4. Set it to custom build.
5. After selecting the custom build tool, the left side will be refreshed and you can enter the build parameters.
6. For a release build (you can use the same for debug, or add a debug-info parameter), for x64 the command line looks like:
ml64.exe /DWIN_X64 /c /Zi /Fl /Fo $(IntDir)%(Filename).obj %(Fullpath)
and output line:
7. In most cases you will additionally declare your assembler method like:
#pragma pack(push, 1) // exact fit - no padding
struct CCollect8bpp_x64 {
    int  aStride;   // Length = 4 Offset = 0  <--- currently unused, but must be defined to preserve offsets
    int* aSomePtr;  // Length = 8 Offset = 4  <--- currently unused, but must be defined to preserve offsets
    int* aSomePtr2; // Length = 8 Offset = 12 <--- currently unused, but must be defined to preserve offsets
};
#pragma pack(pop) // back to whatever the previous packing mode was
extern "C" void __CollectObjectsInfo8bppAVX(void *CCollect8bpp_x64);
And call it like:
col.aStride = aStride;
__CollectObjectsInfo8bppAVX(&col);
Details about MASM for x32 vs. x64 are documented by Microsoft, and the same goes for the calling convention.
Be prepared for different pointer lengths between x32 and x64.
For all who read this: today I found that sometimes (but really surprisingly not always!) it is required to write
ml64.exe /DWIN_X64 /c /Zi /Fl /Fo $(IntDir)%(Filename).obj "%(Fullpath)"
with quotation marks, if the path or name contains spaces; otherwise MASM generates a file-not-found error.
What I don't understand is why it worked before and stopped working today after moving to a subdirectory (the directory with the space is and was on the root path), and only for two files and not for all files. Really strange...
I changed the thread affinity mask and tried the values 1 and 3, and Resource Monitor confirmed my specified CPU cores were used while I ran the program.
I arranged all the __m256 and __m128 values I need into linear memory and used _mm_malloc with an alignment parameter of 64 to allocate it. I checked the assembly; even though it did not exactly follow the sequence of the C code, I was more or less accessing linear memory.
The code runs fast, but I STILL have this random factor-of-5-to-20 slowdown. I checked with CPU-Z: the frequency of the CPU cores does not vary by more than 50 MHz or so. I turned off SpeedStep in the BIOS. Resource Monitor showed that my program was not incurring hard faults either.
I do not understand the random slowdown of my function, and what is weirder is that other functions in the same program do not behave this way. The only difference is that this function depends heavily on AVX code. I go through an array of structures, read a new structure, and operate on its values with a specific set of __m256 variables, which I now keep in linear memory. And as I have said before, every time I run the program I get the same results.
I've programmed a permanent timing indicator into our software, and I often see the same behaviour - the same method consumes more or less time from run to run. I think it may depend on cache pollution from other software tasks or system processes, or on time-slice sharing with other processes.
The next question - do you use high-precision timers? I can send you the code for C++ and .NET.
My function takes around 200-600 microseconds normally and around 10000 microseconds when slowed down, so I think timer precision should not be an issue, and I am using Boost timers, which are more accurate than std timers. I have noticed that all the functions I have written slow down randomly by factors of 2-3, but what is weird about this specific function is that it slows down by factors of 10, or sometimes even 20. Also, when I profile those slowed-down runs, the profiler shows this particular function taking much more time than the other functions, whereas in most profiles this is not the case: the function takes time in a more or less constant ratio to the other functions in the program.
Maybe the reason is that in the other functions I do not have a particular set of memory that I use again and again, so they are less affected by cache pollution. I prefetched most of the data inside every iteration of the innermost loop; I still see this behavior, but it manifests less often.
It is nearly impossible to say what the definitive reason is, but I would check/think about the following conditions:
1. The amount of memory read/written by this method vs. the other methods.
2. Whether the memory used is distributed in some way - cache pollution. See the answer from Jim again.
3. Page segmentation. Also check the page-fault counters if you can (i.e. in Task Manager, with vs. without this method executing).
4. Whether the system is at its memory limit and swapping is occurring.
5. Whether it is possible that another thread starts (check priorities) just at the time the slowdown happens.
6. Imprecise timestamps are only about 15 ms (15,000 microseconds) accurate, if I remember that correctly.
7. Do more exact timing yourself - for the whole method and for parts of it - that is nearly the ultimate way to identify the parts where time is really being consumed.
8. Profile per instruction, not per block.
For two weeks I had to improve one bigger algorithm. I was sure the CCL part was the main time consumer (80-90% of the whole time) and very hard to improve or parallelize. But then I did exact time profiling... and was very surprised: some other parts consumed not much less time, and those parts were not too hard to improve - so I won around 40% of the time consumption without spending weeks of development on the wrong site.
Here is a simple example:
LARGE_INTEGER freq, start, stop;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
// ... code to measure ...
QueryPerformanceCounter(&stop);
LONGLONG ticks = stop.QuadPart - start.QuadPart;
If you need it in ms:
double ms = 1000.0 * ticks / (double)freq.QuadPart;
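On toolchains without QueryPerformanceCounter, std::chrono::steady_clock gives comparable microsecond-level timing; a minimal portable sketch (the function names here are made up):

```cpp
#include <chrono>
#include <thread>
#include <cstdint>

// Time an arbitrary piece of work in microseconds with a monotonic clock.
static int64_t time_us(void (*work)()) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Example workload: sleep 10 ms, so the measured time is at least 10000 us.
static void demo_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
```

steady_clock is guaranteed monotonic, so it is safe against system-clock adjustments during a run.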
Presumably you are on Windows (QueryPerformanceCounter is a Windows function).
When making fine-grained timings you want to exclude interference from external sources as much as possible. Some guidelines:
When possible, test on a system with more than 1 core (cores, not logical processors).
If you can set your BIOS to disable HT, do so for the period you perform your timing runs.
If the CPU is such that 2 cores share L2, then use a CPU that has more than 4 cores.
Affinity pin your application to the first hardware thread on the core that does not share L2 with the first core.
If HT is enabled, start additional threads to consume the remaining HT siblings of that core (and optionally the HT threads of the core sharing L2). These additional threads run for the duration of the test in a loop issuing _mm_pause();
Set the thread priority to above normal (do not set to highest priority).
IOW - avoid the first core (and second core if second core shares L2 with first core), and run the test program at higher priority than "normal" programs.
The scheme is to attempt to keep the O/S from scheduling threads that run on the core(s) that share resources with your test thread.
*** This does not guarantee the O/S won't run anything in a manner that interferes with your test program, but this procedure will reduce the probability that it will.
Note, VTune will (may) interfere with your benchmark each time it writes the performance counters to buffer or disk.
If you instrument your code (you have) then you can count the number of fast iterations and number of slow iterations (perhaps make a histogram).
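A quick way to build the histogram suggested above could look like this sketch; the bucket boundaries are made up to match the 200-600 µs fast case and ~10000 µs slow case mentioned earlier, so substitute your own.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// Bucket a per-call duration (microseconds) into a coarse histogram:
// [0] < 500us (fast), [1] 500-2000us, [2] 2000-8000us, [3] >= 8000us (slow).
static size_t bucket(int64_t us) {
    if (us < 500)  return 0;
    if (us < 2000) return 1;
    if (us < 8000) return 2;
    return 3;
}

// Accumulate one count per call; inspect counts[] after the run to see
// how often the function was fast vs. slow.
struct CallHistogram {
    std::array<int64_t, 4> counts{};
    void record(int64_t us) { counts[bucket(us)]++; }
};
```

Feeding it the per-call times from your existing timers shows immediately whether the slowdown is bimodal (a separate slow population) or a smooth tail.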
Hello Jim & Anil,
As you correctly mentioned, "Note, VTune will (may) interfere with your benchmark each time it writes the performance counters to buffer or disk." That is the reason to do the timing yourself, without any instruments.
We can do more: shut down/disable all system services, programs, drivers, etc. that are not strictly required for system execution - this will also reduce the possibility of scheduling interrupts. By the way, some hardware can also issue hard interrupts.
One thing more - if an exception happens and is propagated, we can get a really huge performance gap. Sometimes we can find such things in the system log.
If you can visualise some performance counters (HDD, etc.), this can also help - just to see whether they show some strange behaviour during the slowdown phase. Maybe you also need to visualise the CPU core temperatures. We had this problem 2 weeks ago, just after changing the processor on a test system.
Hello Jim and Alex,
I tested with a profiler and also with Boost timers. Both showed significant random slowdowns.
I checked the temperature and watched the CPU frequency. There was no thermal throttling.
I put the thread in high-priority mode and pinned it to core 2, and still had random slowdowns.
I have been able to significantly reduce the slowdown with one simple change, but I do not fully understand why it worked.
Initially I wanted to keep the set of variables used during the full execution of the function in __m256 variables, but the Visual Studio compiler, instead of putting them into registers, put them in memory at very scattered addresses. I had random slowdowns.
Later I put all those variables in a dynamically allocated array of __m256, aligned to 64 bytes with _mm_malloc, and still had slowdowns. By the way, I allocate this memory once and reuse it throughout the duration of the program.
Now I put the set of variables in a statically allocated __m256 array with a __declspec(align(64)) specification, and I see the slowdown only about once in 10 runs, with a factor of at most 8 instead of 10-20. I have run the code many times and have never seen the time exceed 4.5 ms, whereas with the dynamically allocated array it was up to 12 ms.
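In portable C++11 the same static 64-byte alignment can be expressed with alignas; a sketch (the array name and element count are illustrative, not from the original code):

```cpp
#include <cstdint>

// Statically allocated, 64-byte-aligned storage for 18 8-float vectors,
// the portable equivalent of __declspec(align(64)) __m256 consts[18].
// (The name and the count of 18 are illustrative placeholders.)
alignas(64) static float consts[18][8];

// Check that a pointer sits on a 64-byte cache-line boundary.
static bool is_cacheline_aligned(const void* p) {
    return (reinterpret_cast<uintptr_t>(p) & 63u) == 0;
}
```

Unlike __declspec(align(64)), alignas is understood by MSVC, GCC, and Clang alike, so the layout stays the same if the code ever leaves Visual Studio.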
As far as the assembly goes, not much changed, just that some instructions that looked like this:
vfmadd231ps ymm9, ymm8, YMMWORD PTR [r9+rdx+128]
changed into instructions like this:
vfmadd231ps ymm9, ymm8, YMMWORD PTR [rdx-32]
and I assume they both take the same amount of time to execute.
One observation I have made: during the execution of my program, this function and the others are called repeatedly. If the function does not hit the slowdown, it runs fast for the whole execution, even if I keep the program running for 5 minutes; if it does hit the slowdown, it executes slowly throughout the whole run.
Could the address where the memory is allocated be the reason for the slowdown (page faults, maybe, but then again Task Manager did not report any page faults during the slowdowns)? As far as I know, that is not supposed to affect performance as long as I allocate aligned memory, and access to statically allocated memory is not supposed to be faster than to dynamically allocated memory.
All that does not explain the random slowdowns.
By the way - the Visual Studio compiler is really bad - see my posts from this year; others have mentioned the same problems. E.g. you use exactly 16 __m128i variables, but the compiler spills those variables and generates a lot of totally unnecessary load/store instructions. Next step - declaring them with the "register" hint does not work either; worse, the compiled code gets even more inefficient! Just to illustrate (my post explains this), you wrote in code:
OP ymm0, ...
OP ymm3, ..
OP ymm6, ..
and what does the compiler do? It reorders the STORE instructions in such a way:
Do you see?
That is why I wrote nearly everything in assembler (using MASM) and got a performance boost of 1.7x - 3.0x.
The next thing - the compiler does not know anything about your alignment declaration. Unless you declare data with proper alignment, it uses unaligned load/store instructions. But the difference between those is not that huge; the real difference comes from the actual alignment of the data. The difference is made by WC (write combining): using that technique you write 4 XMM registers together to consecutive addresses. Please note - the real maximum speedup (as I have seen it in my cases) is achieved if the first write goes to a 64-byte-aligned address. What happens? The processor sees all four 16-byte streaming writes (MOVNTDQ), puts them together in one write-combining buffer (there are multiple per core), and sends them directly to memory without cache pollution. Why does it speed things up? Simplified: only whole 64-byte lines can be transferred between the cache system and memory, so if only 16 bytes are to be updated, the other 48 bytes must first be fetched from another source (i.e. DRAM - which is really slow).
But if you write the 4 _mm_stream_si128 stores one by one (and check the generated assembler code), the chance is really big that the VS compiler reorders them and puts something in between - so WC would not work.
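A minimal sketch of that 4-store write-combining pattern (the buffer name and values are made up; SSE2 only, so no special compiler flags are needed on x86-64):

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence
#include <cstdint>

// One 64-byte-aligned cache line, filled with four consecutive 16-byte
// non-temporal stores so they can merge in a write-combining buffer.
alignas(64) static int32_t line[16];

static void stream_line(int32_t v) {
    __m128i x = _mm_set1_epi32(v);
    __m128i* p = reinterpret_cast<__m128i*>(line);
    _mm_stream_si128(p + 0, x);  // bytes  0-15
    _mm_stream_si128(p + 1, x);  // bytes 16-31
    _mm_stream_si128(p + 2, x);  // bytes 32-47
    _mm_stream_si128(p + 3, x);  // bytes 48-63
    _mm_sfence();                // make the NT stores globally visible
}
```

If the compiler interleaves unrelated work between these four stores, the WC buffer may be flushed partially, which is exactly the reordering hazard described above.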
My recommendation is not to use something like:
vfmadd231ps ymm9, ymm8, YMMWORD PTR [r9+rdx+128]
but rather:
mov(nt)... ymm7, YMMWORD PTR [r9+rdx+128]
... something else consuming CPU cycles
vfmadd231ps ymm9, ymm8, ymm7
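In intrinsics, the split-load form of that recommendation looks roughly like this sketch (function and array names are invented; the `target` attribute is only so it compiles without global AVX2/FMA flags on GCC/Clang):

```cpp
#include <immintrin.h>

#if defined(__GNUC__)
#define FMA_TARGET __attribute__((target("avx2,fma")))
#else
#define FMA_TARGET
#endif

// Load the memory operand into a register first, then feed the FMA
// from registers only, so the load can be scheduled earlier.
FMA_TARGET
static float fma_from_register(const float* src, float a, float b) {
    __m256 va  = _mm256_set1_ps(a);
    __m256 vb  = _mm256_set1_ps(b);
    __m256 mem = _mm256_loadu_ps(src);         // explicit load (vmovups)
    __m256 r   = _mm256_fmadd_ps(va, mem, vb); // r = va * mem + vb, registers only
    float out[8];
    _mm256_storeu_ps(out, r);
    return out[0];
}
```

Whether this actually beats the memory-operand form depends on the surrounding code; out-of-order cores often hide the load latency either way, so it is worth measuring both variants.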
The next recommendation is to use Intel VTune Amplifier and go down to the instruction level to identify which instructions consume the most time.
This also works with the evaluation version, which integrates into Visual Studio really well - this helped me a lot.