Intel VTune Amplifier and OS development

5600
Beginner
538 Views

Can Intel VTune Amplifier be used to optimise an operating system kernel? I'm adding SSE, SSE2 and SSE3 support to my OS kernel. It would be nice if it worked with it so I can optimise SIMD operations.

12 Replies
David_A_Intel1
Employee

Hi 5600!

In general, yes! VTune Amplifier XE will take samples within the OS. If you have symbols in a supported format, it should be able to give you performance metrics for your functions, etc. You will need to run an app or something to cause OS code to be executed, obviously. But, for example, you can use VTune Amplifier XE to optimize the Linux* kernel (and many people do ;).

Without more details, it is difficult to say more. :\

5600
Beginner

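MOV EDI, [EBP + 0Ch]    ; EDI = destination address (16-byte aligned)
MOV ESI, [EBP + 08h]    ; ESI is loaded but not used below
XORPS XMM0, XMM0        ; XMM0 = 0
                        ; Note: ECX (the byte count) is never loaded in this snippet

.xLoop:

MOVNTDQ [EDI], XMM0     ; Non-temporal 16-byte store of zeros
LEA EDI, [EDI+16]       ; Add 16 to destination address
LEA ECX, [ECX-16]       ; Subtract 16 from ECX (LEA does not set flags, hence the CMP)
CMP ECX, 0
JNZ .xLoop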

I'm trying to create a memory clear function using the above code. But it seems I might be doing something wrong, because rep stosb is much faster when I use it to clear memory.
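
For comparison, here is a minimal C sketch of the same idea using SSE2 intrinsics, where `_mm_stream_si128` corresponds to MOVNTDQ. It is not the kernel code from this thread: the helper name `clear_streaming` is hypothetical, it assumes a 16-byte-aligned destination and a byte count that is a multiple of 16, and it takes the count as an explicit parameter (the loop above never loads ECX). The trailing `_mm_sfence()` is an addition that orders the weakly ordered streaming stores before any later stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: zero `bytes` bytes at `dst` using non-temporal stores.
   Assumption: `dst` is 16-byte aligned and `bytes` is a multiple of 16. */
static void clear_streaming(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((bytes & 15u) == 0);
#ifdef HAVE_SSE2
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(p + i, zero);   /* corresponds to MOVNTDQ */
    _mm_sfence();                        /* order the streaming stores */
#else
    memset(dst, 0, bytes);               /* fallback so the sketch builds on non-x86 targets */
#endif
}
```

A caller would pass an aligned buffer and its size, for example `clear_streaming(buf, 4096)`.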

David_A_Intel1
Employee

Hi 5600:

That, my friend, is a totally different question!  Let's see if anyone has any suggestions.

Have you profiled this code using the VTune Amplifier XE?  Did you take a look at the bandwidth (assuming it is a processor that has bandwidth analysis support)?

5600
Beginner

I have an Intel Core 2 Duo E6600 at 2.4 GHz. I tried it with Intel VTune Amplifier XE 2011, but it doesn't like Windows 8, so I get errors from VTune.

David_A_Intel1
Employee

If you have the latest release, it supports Windows* 8.  Please see Release Notes and documentation for details.  If you are running in Metro mode, you will need to switch to desktop mode to run VTune Amplifier XE.

Bernard
Valued Contributor I

These look like non-temporal streaming stores. How do you measure the execution time of that code? The rep stosb writes could be cached because of the predictable behavior of the loop.

Bernard
Valued Contributor I

In general, you should get more store bandwidth by using movntdq when the source data is cached and accessed consecutively. Did you try comparing the results against rep stosb by looking at front-end and back-end stalls in VTune?

5600
Beginner

I've got the latest VTune, but it supports only .exe files. My OS kernel is a *.bin. For time measurement of the code I use the RTC time and check how many seconds it takes for rep stosb and the SIMD memory operations to complete 10000 memory clears. I usually get 1 second for rep stosb and 2 seconds for SIMD, using the above code to clear 64 bytes of memory. The test was done only with a 16-byte-aligned memory address.
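
Reading whole seconds from the RTC leaves a lot of room for rounding when the results are 1 s versus 2 s. One possible way to get finer-grained numbers in a hosted test program is sketched below, assuming a C11 library that provides `timespec_get`. The names `clear_fn`, `clear_memset`, and `time_clears_ns` are made up for illustration, and `clear_memset` is only a stand-in for whichever clear routine is being measured. Inside a kernel without a C library, a cycle counter such as RDTSC would be the likely substitute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the clear routines under test. */
typedef void (*clear_fn)(void *dst, size_t bytes);

/* Stand-in routine; swap in the rep stosb or SIMD version to compare. */
static void clear_memset(void *dst, size_t bytes)
{
    memset(dst, 0, bytes);
}

/* Reading the buffer through a volatile sink keeps the compiler
   from discarding the repeated clears as dead stores. */
static volatile unsigned char sink;

/* Returns the total elapsed time in nanoseconds for `reps` calls of `fn`.
   Note: TIME_UTC is wall-clock time, so this is a rough measurement only. */
static double time_clears_ns(clear_fn fn, void *dst, size_t bytes, long reps)
{
    struct timespec t0, t1;
    assert(reps > 0);
    timespec_get(&t0, TIME_UTC);
    for (long i = 0; i < reps; ++i) {
        fn(dst, bytes);
        sink = *(volatile unsigned char *)dst;
    }
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e9
         + (double)(t1.tv_nsec - t0.tv_nsec);
}
```

Dividing the result by `reps` gives an average cost per clear, for example `time_clears_ns(clear_memset, buf, 64, 10000) / 10000`.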

5600
Beginner

I temporarily solved my problem by writing a console app with SIMD instructions in VC++ to test in VTune. Is movdqa faster than movntdq? It seems movdqa is 9x faster than movntdq under VTune.

Movntdq clocked at 0.065 seconds for moving 4KB of data

Movdqa clocked at 0.007 seconds for moving 4KB of data

Bernard
Valued Contributor I

Hi 

According to Agner Fog's instruction latency tables, it seems that movntdq has a large latency of ~400 cycles, compared to one cycle for the stosb instruction. Throughput is the same for both instructions.

Bernard
Valued Contributor I

Yes, movdqa is faster: it consumes three clocks, and on Haswell its throughput is two instructions per cycle (Agner Fog's tables). But that comes at the cost of cache pollution.
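
One common way to act on that trade-off is to use ordinary cached stores for small buffers and switch to non-temporal stores only for large ones. The sketch below illustrates the idea and is not something proposed in this thread: `clear_auto` is a hypothetical helper, and `nt_threshold` is an assumed tuning parameter that would have to be chosen by measurement on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: zero `bytes` bytes at `dst`.
   Returns 1 if the non-temporal path was chosen, 0 otherwise.
   Assumption: `dst` is 16-byte aligned and `bytes` is a multiple of 16. */
static int clear_auto(void *dst, size_t bytes, size_t nt_threshold)
{
    int use_nt = bytes >= nt_threshold;
    assert(((uintptr_t)dst & 15u) == 0);
    assert((bytes & 15u) == 0);
#ifdef HAVE_SSE2
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    size_t n = bytes / 16;
    if (use_nt) {
        for (size_t i = 0; i < n; ++i)
            _mm_stream_si128(p + i, zero);  /* MOVNTDQ: limits cache pollution */
        _mm_sfence();                       /* order the streaming stores */
    } else {
        for (size_t i = 0; i < n; ++i)
            _mm_store_si128(p + i, zero);   /* MOVDQA: ordinary cached store */
    }
#else
    memset(dst, 0, bytes);                  /* fallback so the sketch builds on non-x86 targets */
#endif
    return use_nt;
}
```

With a threshold near the size of the last-level cache, a 4 KB clear like the one measured above would take the cached path.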

David_A_Intel1
Employee

@5600, regarding the .bin issue, two comments:

1. There is a JIT API that would allow you to inform the VTune Amplifier XE where your code is loaded and what functions and statements exist in your code.  You can find all the information in the product help files (see "JIT Profiling APIs").

2. Then, whatever loads your .bin and begins executing the code is what you would configure VTune Amplifier XE to launch and profile.

Just FYI.
