Hi,
my software is very latency sensitive and I'm trying to find what causes the latency spikes and how to resolve them.
This is pseudocode of my software:
timer_func() {
    do some calc
    register new timer
}

main() {
    while (1) {
        if (timer is ready)
            timer_func()
    }
}
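For readers who want something compilable, here is a minimal runnable rendering of the loop above. It is only a sketch under stated assumptions: timing via __rdtsc() from <x86intrin.h>, a fixed 1 ms timer period, and a hypothetical TSC_PER_MS constant that would have to be calibrated on the actual machine.

#include <x86intrin.h>   // __rdtsc()
#include <cstdint>

const uint64_t TSC_PER_MS = 3000000;   // hypothetical: ~3 GHz, calibrate on real HW
uint64_t next_deadline;

void timer_func() {
    // do some calc ...
    next_deadline += TSC_PER_MS;       // register a new timer
}

int main() {
    next_deadline = __rdtsc() + TSC_PER_MS;
    while (1) {
        if (__rdtsc() >= next_deadline)   // check if timer is ready
            timer_func();
    }
}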
I'm measuring the time it takes the timer function to run, over 10,000 iterations.
I found that almost all of the time it takes less than 200 ns to run the timer code, but sometimes (around 10 of the iterations) it takes 6.5 microseconds!
I used libpfm to measure CPU events and found that when an iteration takes 6.5 microseconds I'm seeing RAT_STALLS:FLAGS.
How can I solve this issue?
These are the events I'm going to collect, taken from libpfm running on Sandy Bridge:
CPU_CLK_UNHALTED
INT_MISC:RAT_STALL_CYCLES:c=1
HW_INTERRUPTS
PERF_COUNT_SW_CONTEXT_SWITCHES
Are any other events needed?
I'll post the results later today.
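For reference, software events such as PERF_COUNT_SW_CONTEXT_SWITCHES can also be counted directly with the raw perf_event_open syscall. Below is a minimal sketch (assumptions: Linux, per-thread counting, error handling trimmed). Hardware events like the RAT stall counter would be programmed the same way, using PERF_TYPE_RAW with an encoding obtained from libpfm.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    attr.disabled = 1;

    // count for the calling thread (pid = 0), on any CPU (cpu = -1)
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run the timer loop under test here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("context switches: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}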
You can use VTune to collect Context Switches by enabling Advanced Hotspot Analysis.
amir k. wrote:
It is interesting what really happens at the machine-code level around ~100 nanoseconds. There is a big spike of RAT_STALLS events, but there is no way to map those events to the executed assembly code.
@amir
Can you tell me how this screenshot correlates to the executed code? I mean, does the time frame with 1e+3 ns step resolution correspond to the main() function or to the prologue of your timer function?
@amir
In my previous post I quoted your response #21, but the screenshot was not posted.
I'll post VTune screenshots later today.
I see some interrupts during the run:
eth0-TxRx-N - on all cores
NMI - on some cores
LOC - on all cores
I'll try to disable the interrupts and run the test again.
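One way to confirm which interrupts actually fire while the test runs is to snapshot /proc/interrupts before and after the run and compare. A minimal sketch, assuming Linux and that a line-by-line comparison is enough to spot the IRQ counters that changed:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

static std::string snapshot_interrupts() {
    std::ifstream f("/proc/interrupts");
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    std::string before = snapshot_interrupts();
    // ... run the latency test here ...
    std::string after = snapshot_interrupts();

    std::istringstream b(before), a(after);
    std::string lb, la;
    while (std::getline(b, lb) && std::getline(a, la))
        if (lb != la)                     // per-CPU count changed on this IRQ line
            std::cout << "changed: " << la << '\n';
    return 0;
}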
>>>eth0-TxRx-N - on all cores>>>
Do you have heavy network traffic when you are testing your code?
No, I don't have heavy network traffic.
I set the affinity of all interrupts to core0, except for NMI and LOC, and the application is running on all the other cores (not core0).
I still get slow iterations.
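For completeness, here is one way to pin the test thread itself to a chosen core so it cannot migrate onto core0. A minimal sketch, assuming Linux and glibc's sched_setaffinity(), with core 1 as an arbitrary example:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                                    // example: pin to core 1
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = calling thread
        perror("sched_setaffinity");
        return 1;
    }
    // ... run the timer loop under test here ...
    return 0;
}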
1. I am not quite sure what this means:
"If I call the timer function more frequently (with an interval of less than 1 millisecond) I don't see slow iterations, but if it's slower than 1 millisecond I see the problem."
Does this mean that if you decrease the number of iterations so that the total test takes less than 1 millisecond, you get no slow iterations? If this is what you mean then it seems extremely likely that the slow iterations are due to the local timer interrupt. You might be able to build a kernel with a 100 Hz (10 millisecond) local timer interrupt, but the local timer interrupts will always be there -- they are fundamental to the way Linux does task management and scheduling.
2. Is the number of slow iterations approximately equal to the total execution time in milliseconds? This would be another easy "hint" that the slow iterations are due to local timer interrupts.
3. Did you compute the interval between slow iterations? Even if you are not storing all the timer values, it should only take a few extra cycles to check whether an iteration is "slow" and save the absolute start and/or end times for those iterations (see the sketch after this list).
4. The posted figure shows that "RAT_STALL" cycles dominate in the "normal" iterations, and are quite small in the long-duration iterations. So this counter is not helpful if you are trying to understand the cause of the long iterations.
5. The source of the RAT_STALL cycles in the "normal" iterations depends on what you are using for a timer. What are you using for a timer? An inline RDTSC or RDTSCP instruction should be much faster than the 200 ns you are seeing -- they take about 30-40 core cycles on most platforms (10-15 ns).
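A minimal sketch of suggestion 3, assuming __rdtsc() from <x86intrin.h>, a stand-in do_work() for the real timer function, and a SLOW_CYCLES threshold that would have to be tuned to separate the ~6.5 microsecond outliers from the ~200 ns common case:

#include <x86intrin.h>
#include <cstdint>
#include <cstdio>
#include <vector>

volatile uint64_t sink;

void do_work() {                        // stand-in for the real timer_func
    for (int i = 0; i < 50; ++i)
        sink += i;
}

int main() {
    const uint64_t SLOW_CYCLES = 2000;  // assumed threshold: tune to the outliers
    std::vector<uint64_t> slow_starts;  // absolute TSC of slow iterations only

    for (int i = 0; i < 10000; ++i) {
        uint64_t t0 = __rdtsc();
        do_work();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 > SLOW_CYCLES)      // cheap check; store only the outliers
            slow_starts.push_back(t0);
    }

    for (size_t i = 1; i < slow_starts.size(); ++i)
        printf("gap between slow iterations: %llu cycles\n",
               (unsigned long long)(slow_starts[i] - slow_starts[i - 1]));
    return 0;
}

If the ~1 kHz local timer tick is the culprit, the printed gaps should cluster around the TSC equivalent of 1 millisecond.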
1) "Does this mean that if you decrease the number of iterations so that the total test takes less than 1 millisecond, you get no slow iterations?" - yes no sloe iteration.
2) "Is the number of slow iterations approximately equal to the total execution time in milliseconds?" - no
3) "Did you compute the interval between slow iterations?" - i'll do it.
5) "What are you using for a timer?" - i'm not sure what you mean with this question so i'm going to answer on two things 1) to measure time i'm using rdtsc 2) my timer code uses stl vector to store all pending timers and called rdtsc to find out if it should run the timer.
One more finding from today:
When I set the irq_affinity of all the IRQs to core0 (I don't run my test on core0) there were fewer slow iterations.
I've added tables showing how many slow iterations occurred on each core. Please disregard the "package" column; it is always set to 1, but in reality affinity 1-7 is package 1 and affinity 8-15 is package 2.
[Table: slow iterations per core, before setting eth0 IRQ affinity]
[Table: slow iterations per core, after setting eth0 IRQ affinity]
>>>when I set the irq_affinity of all the IRQs to core0 (I don't run my test on core0) there were fewer slow iterations>>>
So probably timer interrupts are causing the slowdown of your code, and not RAT_STALL events.
I'll compile a kernel with a 100 Hz (10 millisecond) local timer interrupt and run the test again.
Thanks everyone.
I'll post the results in a few days.
You are welcome.
