Can Intel VTune Amplifier be used to optimise an operating system kernel? I'm adding SSE, SSE2 and SSE3 support to my OS kernel. It would be nice if it worked with it so I can optimise SIMD operations.
In general, yes! VTune Amplifier XE will take samples within the OS. If you have symbols in a supported format, it should be able to give you performance metrics for your functions, etc. You will need to run an app or something to cause OS code to be executed, obviously. But, for example, you can use the VTune Amplifier XE to optimize the Linux* kernel (and many people do ;).
Without more details, it is difficult to say more. :\
MOV EDI, [EBP + 0Ch]
MOV ESI, [EBP + 08h]
MOVNTDQ [EDI], XMM0
LEA EDI, [EDI+16] ;Add 16 to Destination Address
LEA ECX, [ECX-16] ;Sub 16 from ECX
I'm trying to create a memory clear function using the above code, but it seems I might be doing something wrong, because rep stosb is much faster when I use it to clear memory.
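For comparison, here is a minimal sketch in C of what a complete streaming-store clear loop might look like. The excerpt above does not show XMM0 being zeroed, the loop branch, or a fence after the non-temporal stores, so those parts are assumptions about the intended routine rather than the poster's actual code. The name clear_nt is hypothetical, and the function falls back to memset when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: zero `bytes` bytes starting at `dst`.
   Assumes dst is 16-byte aligned and bytes is a multiple of 16. */
static void clear_nt(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((bytes & 15u) == 0);
#ifdef HAVE_SSE2
    __m128i zero = _mm_setzero_si128();   /* PXOR xmm0, xmm0 */
    __m128i *p = (__m128i *)dst;
    size_t i;
    for (i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(p + i, zero);    /* MOVNTDQ [p + i], xmm0 */
    _mm_sfence();                         /* order the non-temporal stores */
#else
    memset(dst, 0, bytes);                /* portable fallback, no SSE2 */
#endif
}
```

Non-temporal stores bypass the cache, so a sketch like this is more likely to pay off on large buffers that will not be read back soon than on a 64-byte block.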
That, my friend, is a totally different question! Let's see if anyone has any suggestions.
Have you profiled this code using the VTune Amplifier XE? Did you take a look at the bandwidth (assuming it is a processor that has bandwidth analysis support)?
If you have the latest release, it supports Windows* 8. Please see Release Notes and documentation for details. If you are running in Metro mode, you will need to switch to desktop mode to run VTune Amplifier XE.
In general you should get more store memory transfer bandwidth by using movntdq when the source data is cached and accessed consecutively. Did you try to compare the results of rep stosb by looking at front-end and back-end stalls in VTune?
I've got the latest VTune, but it only supports .exe files; my OS kernel is a *.bin. To time the code I use the RTC and check how many seconds it takes for rep stosb and the SIMD memory operations to complete 10000 memory clears. I usually get 1 second for rep stosb and 2 seconds for SIMD using the above code to clear 64 bytes of memory. The test was done only with a 16-byte-aligned memory address.
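An RTC that ticks once per second is a coarse clock for a loop this short, so a finer tick source may give more trustworthy numbers. Below is a minimal sketch in C of such a timing harness. It assumes the time-stamp counter (RDTSC) is usable on x86 and falls back to clock() elsewhere; the names ticks_now, time_clear and clear_memset are hypothetical, and a kernel build would issue RDTSC directly rather than include these headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
#define HAVE_RDTSC 1
#endif

/* Hypothetical tick source: TSC on x86, clock() ticks elsewhere. */
static uint64_t ticks_now(void)
{
#ifdef HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

typedef void (*clear_fn)(void *dst, size_t bytes);

/* Total ticks spent in `iterations` back-to-back calls of fn. */
static uint64_t time_clear(clear_fn fn, void *dst, size_t bytes,
                           unsigned iterations)
{
    uint64_t start;
    unsigned i;
    assert(fn != NULL && iterations > 0);
    start = ticks_now();
    for (i = 0; i < iterations; ++i)
        fn(dst, bytes);
    return ticks_now() - start;
}

/* Baseline routine to compare a SIMD clear against. */
static void clear_memset(void *dst, size_t bytes)
{
    memset(dst, 0, bytes);
}
```

Calling time_clear once per routine with the same buffer and iteration count, then dividing by the iteration count, gives a per-call tick figure that can be compared between rep stosb and the SIMD version.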
I temporarily solved my problem by writing a console app with SIMD instructions in VC++ to test in VTune. Is Movdqa faster than Movntdq? Because it seems Movdqa is 9x faster than Movntdq under VTune.
Movntdq clocked at 0.065 seconds for moving 4KB of data
Movdqa clocked at 0.007 seconds for moving 4KB of data
By consulting Agner Fog's instruction latency tables, it seems that movntdq has a large latency of ~400 cycles, compared to one cycle for the stosb instruction. Throughput is the same for both instructions.
@5600, regarding the .bin issue, two comments:
1. There is a JIT API that would allow you to inform the VTune Amplifier XE where your code is loaded and what functions and statements exist in your code. You can find all the information in the product help files (see "JIT Profiling APIs").
2. Then, whatever loads your .bin and begins execution of the code is what you would configure VTune Amplifier XE to launch and profile?