Re: application runs quite slowly with vtune

softarts · ‎12-11-2009

The application captures packets from the NIC (it uses the PACKET_MMAP mechanism and copies data from kernel memory, so there are few system calls).

Even when I slow the network traffic to 2-3 Mbps, the application still performs very badly under VTune and captures only 200 packets/second.

Without VTune it can capture 800+ Mbps (200K packets/s).

Why is performance impacted so much?
The code looks like this:
while (1)
{
    buffer = new Buffer();
    memcpy(buffer, kernel_buffer, size);
    ...
}

Added, one more question:
I also found that the time spent in pthread_spin_lock is very high. Does VTune run 'real' threads on a multi-core PC,
or is it just 'fake' multithreading, simulated in software?
My VTune is version 9.1 for Linux, downloaded on 2008/11/28.

Vladimir_T_Intel · ‎12-11-2009

This is just my best guess. Depending on the Sample After Value (SAV) set in VTune and how often your application copies data between kernel space and user space, there could be mutual interference between the interrupt handler of the VTune sampling driver and the application. Try increasing the SAV for the events, and keep the number of collected events small.
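
To put rough numbers on the SAV suggestion: in event-based sampling, one sample (and one interrupt) is taken each time the event counter reaches the Sample After Value, so the interrupt rate is roughly the event rate divided by the SAV. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the figures in the note after the code are illustrative assumptions, not values from this thread.

```cpp
#include <cstdint>

// Rough estimate of sampling interrupts per second on one core for one event.
// events_per_second:  how often the hardware event occurs (an assumed figure).
// sample_after_value: number of events counted between two samples (the SAV).
// Returns 0 for a zero SAV instead of dividing by zero.
inline std::uint64_t estimated_interrupts_per_second(std::uint64_t events_per_second,
                                                     std::uint64_t sample_after_value)
{
    if (sample_after_value == 0) {
        return 0;
    }
    return events_per_second / sample_after_value;
}
```

For example, a clock-tick event on a hypothetical 2 GHz core with a SAV of 2,000,000 gives about 1,000 interrupts per second; raising the SAV to 20,000,000 cuts that to about 100.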

softarts · ‎12-11-2009

It's not sampling, but call graph.
Would too much profiling work in the kernel affect this?

TimP · ‎12-12-2009

VTune call graph certainly may be expected to kill performance of an application with real time aspects. Yes, it's single threaded. gprof could perform call graphing with much less distortion of performance.
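
To illustrate why the distortion can be so large: a call graph collector that instruments functions does extra bookkeeping on every function entry and exit, and for very small functions that bookkeeping can dwarf the function body. The sketch below imitates this with a hand-written wrapper. `CallRecorder`, `instrumented_call`, `tiny_work`, `run_plain`, and `run_instrumented` are hypothetical names, and this is only an analogy, not how VTune itself is implemented.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a profiler's per-call bookkeeping.
struct CallRecorder {
    std::uint64_t calls = 0;
    std::chrono::steady_clock::duration total{};
};

// A tiny function: the body is a single addition.
inline int tiny_work(int y) { return y + 3; }

// Calls f(arg) and records entry and exit timestamps, as an instrumenting
// profiler might. The two clock reads cost far more than tiny_work itself.
template <typename F>
int instrumented_call(CallRecorder& rec, F f, int arg)
{
    const auto entry = std::chrono::steady_clock::now();
    const int result = f(arg);
    const auto exit_time = std::chrono::steady_clock::now();
    rec.total += exit_time - entry;
    ++rec.calls;
    return result;
}

// Sums tiny_work(0..n-1) with no instrumentation.
inline long long run_plain(int n)
{
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += tiny_work(i);
    }
    return sum;
}

// Same computation, but every call goes through the recording wrapper.
inline long long run_instrumented(CallRecorder& rec, int n)
{
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += instrumented_call(rec, tiny_work, i);
    }
    return sum;
}
```

Timing `run_plain` against `run_instrumented` with the same `n` shows the slowdown. The exact ratio depends on the machine and compiler, so no figure is given here.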

Vladimir_T_Intel · ‎12-12-2009

Quoting - softarts

It's not sampling, but call graph.

It's better to specify the type of analysis in the original question. That would help us make fewer guesses.

softarts · ‎12-15-2009

For this program, VTune call graph also gives a different result.

The purpose is to compare 'if-else' with a function pointer array
and to measure the impact of branch prediction.
The program output:
test caseA
branchA time=2510484704

test caseB
branchB time=1934092704

But VTune shows a different result:
caseB spent much more time than caseA (about 5:1).
Why does call graph show such a difference?

----code----------------------

inline void HandleFuncA0(int y)
{ count=y+3;}
inline void HandleFuncA1(int y)
{ count=y+7;}
inline void HandleFuncA2(int y)
{ count=y-2;}
...

inline void BranchA2(int x)
{
    if (x > RANGE*9/10)
        HandleFuncA1(x);
    else if (x > RANGE*8/10)
        HandleFuncA2(x);
    ...
}
inline void BranchB(int x)
{
    array(x); // array[0]=HandleFuncA0, array[1]=HandleFuncA1, ... etc
}
void test_caseA()
{
    printf("test caseA\n");
    timespec l_startTime, l_endTime, l_interval;
    int ret;
    long long i_period;
    clock_gettime(CLOCK_REALTIME, &l_startTime);
    for (int i = 0; i < ...; i++)
    {
        BranchA2(inp[i%3000]); // inp[] has been initialized with rand()
    }
    clock_gettime(CLOCK_REALTIME, &l_endTime);
    ret = delta_t(&l_interval, &l_startTime, &l_endTime);
    // 64-bit arithmetic so the nanosecond count cannot overflow
    i_period = l_interval.tv_sec*1000000000LL + l_interval.tv_nsec;

    // %lld matches long long; %u does not and may print a truncated value
    printf("branchA time=%lld\n", i_period);
}
void test_case4B()
{
    ... // same as test_caseA
    for (xxx)
        BranchB(inp[i%3000]); // inp[] has been initialized with rand()
    ...
}
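
When computing an interval in nanoseconds like this, it helps to do the arithmetic in 64 bits and to print with a matching format specifier, since a mismatched specifier such as `%u` for a `long long` can show a truncated value. A minimal sketch follows; `elapsed_ns` and `print_elapsed` are hypothetical helpers, and the sketch assumes that `delta_t` (not shown in the post) simply subtracts the two timestamps.

```cpp
#include <cstdio>
#include <time.h>

// Hypothetical helper: nanoseconds from start to end, computed in 64 bits.
// The tv_nsec difference may be negative; the sum is still correct.
inline long long elapsed_ns(const timespec& start, const timespec& end)
{
    return static_cast<long long>(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

// Prints the interval with %lld, which matches long long.
inline void print_elapsed(const char* label, const timespec& start, const timespec& end)
{
    std::printf("%s time=%lld\n", label, elapsed_ns(start, end));
}
```

With the two `clock_gettime` results from the test function, a call such as `print_elapsed("branchA", l_startTime, l_endTime)` would replace the manual conversion and the `printf`.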

TimP · ‎12-15-2009

At best, call graph is designed to find the hot path, not to measure performance accurately. If you were using VTune for performance analysis, you would be using event sampling, so why care about the details of call graph performance distortion?