Can Intel VTune Amplifier be used to optimise an operating system kernel? I'm adding SSE, SSE2, and SSE3 support to my OS kernel. It would be nice if VTune worked with it so I can optimise the SIMD operations.
Hi 5600!
In general, yes! VTune Amplifier XE will take samples within the OS. If you have symbols in a supported format, it should be able to give you performance metrics for your functions, etc. You will need to run an app or something to cause OS code to be executed, obviously. But, for example, you can use the VTune Amplifier XE to optimize the Linux* kernel (and many people do ;).
Without more details, it is difficult to say more. :\
MOV EDI, [EBP + 0Ch]
MOV ESI, [EBP + 08h]
XORPS XMM0, XMM0
.xLoop:
MOVNTDQ [EDI], XMM0
LEA EDI, [EDI + 16] ; Add 16 to destination address
LEA ECX, [ECX - 16] ; Subtract 16 from ECX
CMP ECX, 0
JNZ .xLoop
I'm trying to create a memory clear function using the above code, but it seems I might be doing something wrong, because rep stosb is much faster when I use it to clear memory.
Hi 5600:
That, my friend, is a totally different question! Let's see if anyone has any suggestions.
Have you profiled this code using the VTune Amplifier XE? Did you take a look at the bandwidth (assuming it is a processor that has bandwidth analysis support)?
I have an Intel Core 2 Duo E6600 2.4GHz. I tried Intel VTune Amplifier XE 2011, but it doesn't like Windows 8, so I get errors from VTune.
If you have the latest release, it supports Windows* 8. Please see Release Notes and documentation for details. If you are running in Metro mode, you will need to switch to desktop mode to run VTune Amplifier XE.
Those look like non-temporal streaming stores. How do you measure the execution time of that code? The rep stosb writes could be cached because of the predictable behavior of the loop.
In general you should get more store bandwidth by using movntdq when the source data is cached and accessed consecutively. Did you try comparing it against rep stosb by looking at front-end and back-end stalls in VTune?
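For anyone following along in C rather than assembly, the streaming-store clear under discussion might look roughly like this with SSE2 intrinsics. This is only a sketch, not the original poster's code: `clear_nt` is a made-up name, and it assumes the destination is 16-byte aligned and the length is a multiple of 16. Note that the block count is derived from the length before the loop starts, and that an `sfence` follows the streamed stores.

```c
#include <emmintrin.h> /* SSE2: _mm_setzero_si128, _mm_stream_si128; also pulls in _mm_sfence */
#include <stddef.h>

/* clear_nt is a hypothetical helper, not code from this thread.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16. */
static void clear_nt(void *dst, size_t len)
{
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    size_t blocks = len / 16;              /* number of 16-byte blocks to write */

    for (size_t i = 0; i < blocks; ++i)
        _mm_stream_si128(p + i, zero);     /* non-temporal store (MOVNTDQ) */

    _mm_sfence();  /* order the streamed stores ahead of any later stores */
}
```

Calling `clear_nt(buf, 64)` on an aligned 64-byte buffer issues four streaming stores. Whether that beats `rep stosb` for such a small region is exactly what needs measuring.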
I've got the latest VTune, but it supports only .exe files. My OS kernel is a *.bin. For time measurement I use the RTC time and check how many seconds it takes for rep stosb and for the SIMD memory operations to complete 10000 memory clears. I usually get 1 second for rep stosb and 2 seconds for SIMD, using the above code to clear 64 bytes of memory. The test was done only with a 16-byte-aligned memory address.
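A one-second RTC tick is very coarse for a loop of 10000 small clears, so the 1 s versus 2 s readings above may mostly reflect timer granularity. A finer-grained option is the time-stamp counter. Below is a minimal sketch of that idea in C; `time_clears`, `clear_fn`, and `clear_memset` are hypothetical names used for illustration, and `__rdtsc()` is the compiler intrinsic for RDTSC (from `<x86intrin.h>` on GCC/Clang, `<intrin.h>` on MSVC). Inside a kernel the same idea applies by executing RDTSC directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical signature shared by the clear routines being compared. */
typedef void (*clear_fn)(void *dst, size_t len);

/* Baseline routine for comparison; the library decides how memset is implemented. */
static void clear_memset(void *dst, size_t len)
{
    memset(dst, 0, len);
}

/* Returns the total TSC ticks spent calling fn(dst, len) iters times.
 * Divide by iters for a rough per-call figure. */
static uint64_t time_clears(clear_fn fn, void *dst, size_t len, unsigned iters)
{
    uint64_t start = __rdtsc();
    for (unsigned i = 0; i < iters; ++i)
        fn(dst, len);
    uint64_t end = __rdtsc();
    return end - start;
}
```

Treat the numbers as approximate: RDTSC is not a serializing instruction, interrupts can land inside the timed region, and an optimizing compiler may merge repeated clears of the same buffer, so it is worth checking the generated code and repeating the run several times.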
I temporarily solved my problem by writing a console app with SIMD instructions in VC++ to test in VTune. Is movdqa faster than movntdq? It seems movdqa is 9x faster than movntdq under VTune.
Movntdq clocked at 0.065 seconds for moving 4KB of data
Movdqa clocked at 0.007 seconds for moving 4KB of data
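The two variants being compared can be written side by side with SSE2 intrinsics, as in the sketch below. It is not the poster's test app: `copy_cached` and `copy_streaming` are hypothetical names, and both assume 16-byte-aligned pointers and a length that is a multiple of 16. One plausible reason for the gap is that a 4 KB buffer fits easily in the L1 data cache, so movdqa stores can complete into cache, while movntdq deliberately bypasses the cache and writes out to memory.

```c
#include <emmintrin.h> /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>

/* Both helpers are hypothetical and assume 16-byte-aligned dst/src
 * and a len that is a multiple of 16. */

/* Ordinary aligned stores (MOVDQA): the written lines end up in the cache. */
static void copy_cached(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Non-temporal stores (MOVNTDQ): the written data bypasses the cache. */
static void copy_streaming(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  /* order the streamed stores ahead of any later stores */
}
```

Streaming stores tend to pay off for large buffers that will not be read again soon. For a 4 KB block that is reused immediately, the cached version is usually the better fit.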
Hi
According to Agner Fog's instruction latency tables, it seems that movntdq has a large latency of ~400 cycles compared to one cycle for the stosb instruction. Throughput is the same for both instructions.
Yes, movdqa is faster: it takes three clocks, and on Haswell its throughput is two instructions per cycle (Agner Fog's tables). But that comes at the cost of cache pollution.
@5600, regarding the .bin issue, two comments:
1. There is a JIT API that would allow you to inform the VTune Amplifier XE where your code is loaded and what functions and statements exist in your code. You can find all the information in the product help files (see "JIT Profiling APIs").
2. Then, whatever loads your .bin and begins execution of the code is what you would configure VTune Amplifier XE to launch and profile.
Just FYI.
