Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Measuring ILP in the presence of busy waiting

Caesar
Beginner

Hello there.

I have a parallel application that uses busy waiting for synchronization. The parallel region is a loop, and I have unrolled this loop a few times in the hope of reducing synchronization overhead, increasing ILP, and exposing more optimization opportunities to the compiler (this is in fact my experiment). Now I need to measure several properties of this parallel program, among them ILP and thread synchronization overhead. My question is: what are the appropriate events for measuring these properties?

I believe it would be wrong to calculate IPC as (INST_RETIRED.ANY / CPU_CLK_UNHALTED.REF_TSC) in this case, because busy waiting can skew the number of executed instructions. Currently I'm considering using UOPS_EXECUTED.CYCLES_GE_*_UOPS_EXEC as an approximation of how much ILP is "present" in the loop. Do you think this is a reasonable approach?
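Concretely, this is the kind of derived metric I have in mind. A rough sketch, where read_event() is a hypothetical counter-reading helper (perf_event_open, PAPI or similar would sit behind it) and the exact GE_1/GE_2/GE_4 event-name variants are from memory:

    #include <stdio.h>

    /* Hypothetical helper: stands in for whatever PMU interface is used. */
    extern unsigned long long read_event(const char *event_name);

    int main(void)
    {
        /* Ivy Bridge events: cycles in which at least N uops executed. */
        unsigned long long ge1 = read_event("UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC");
        unsigned long long ge2 = read_event("UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC");
        unsigned long long ge3 = read_event("UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC");
        unsigned long long ge4 = read_event("UOPS_EXECUTED.CYCLES_GE_4_UOPS_EXEC");

        /* Cycles in which exactly 1, 2, 3 and at least 4 uops executed. */
        unsigned long long c1 = ge1 - ge2, c2 = ge2 - ge3, c3 = ge3 - ge4, c4 = ge4;

        /* Average uops per *active* cycle; a lower bound, since the GE_4
         * bucket can hold more than 4 uops. Idle cycles are excluded, but
         * cycles spent spinning still count, so this is only an approximation. */
        double ilp = (double)(1*c1 + 2*c2 + 3*c3 + 4*c4) / (double)ge1;
        printf("approx. uops per active cycle: %.2f\n", ilp);
        return 0;
    }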

As for the cache-to-cache transfer overhead due to synchronization, I'm considering using MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT*_PS and OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HIT*_OTHER_CORE_0 as target counters. Do you think this is a valid approximation?

I'm running these experiments on an Ivy Bridge machine running Ubuntu 13.10.

Any advice on this will be appreciated. Thanks in advance!

TimP
Honored Contributor III

You make a valid point: busy-wait spin loops issue a lot of instructions that accomplish nothing, and at a higher than normal rate. So you would have to exclude those if trying to use something like clocks per instruction as a measure of useful ILP.

If you have multiple threads modifying the same or neighboring cache lines (false sharing or prefetch sharing), I doubt whether unrolling is an effective way to get satisfactory performance, but maybe I don't understand your point.

Caesar
Beginner

Thank you for your answer.

I got rid of false sharing in another experiment. Currently I'm not worried about performance; I just need to measure these two things in several situations:

1) Did unrolling increase the number of uops executed in parallel? By how much?

2) How many cache lines were forwarded from core to core?

Regarding (2), the behavior of the application is to busy wait on a shared variable until some data is ready. I need an approximation of how many times the program did that (i.e., how many times a modified cache line present on core B received a read request from core A). The application uses a pipeline model of parallelization; only one pair of threads reads/writes a given shared variable at any moment (producer -> consumer).
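To make the pattern concrete, here is a minimal sketch of the hand-off (the names are made up for illustration; the real code is more involved):

    #include <stdatomic.h>

    /* Each hand-off moves the cache line holding 'ready' from the
     * producer's core (where it is modified) to the consumer's core;
     * these are the cache-to-cache transfers I want to count. */
    atomic_int ready = ATOMIC_VAR_INIT(0);
    int shared_data;

    void producer_step(int value)
    {
        shared_data = value;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer_step(void)
    {
        /* Busy wait until the producer publishes. */
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;  /* spin */
        int value = shared_data;
        atomic_store_explicit(&ready, 0, memory_order_relaxed);
        return value;
    }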

Thanks!

Bernard
Valued Contributor I

I suppose that unrolling, in the case of non-interdependent data, can make better use of the execution units. I am not sure whether you would see it as an increase in the total number of retired floating-point SIMD uops per program run.

Bernard
Valued Contributor I

As far as I was able to understand your question, I think you should put more effort into exploiting ILP rather than thread-level parallelism, because, as you mentioned, your code uses a producer-consumer model.

Divino_C_
New Contributor I

I'll try to simplify the question. =)

I have two programs and I need to check which of the two has more ILP.

Considering that these are integer programs (i.e., assuming no use of SIMD instructions), I would just run the programs under VTune and check which one has the smaller CPI. However, these programs are parallel and use busy waiting for synchronization, so I'm afraid that just taking total_cycles / total_instructions would produce a *very* wrong result. How can I measure CPI while leaving the instructions executed by busy-wait loops out of the equation?

Thanks.

Bernard
Valued Contributor I

>>>I have two programs and I need to check which of the two has more ILP>>>

I am afraid that there are no automated methods, beside compiler logic, to do this. By automated methods I mean software.

For a simple, short piece of code you can do it manually. Pay attention to data and instruction interdependencies. The Haswell architecture has a dedicated port and execution unit for branch calculation, so for a taken branch you may see memory load/store instructions scheduled between the cmp and jmp instructions.

Divino_C_
New Contributor I

Hi iliyapolak,

thanks for your answer.

Do you think that analyzing back-end port utilization (e.g. UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) is a valid approach? I mean, if the number of cycles in which 3+ uops were executed increases and the number of cycles in which only 1 or 2 uops were executed decreases, it seems valid to argue that ILP increased. Do you agree?
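Something like this is what I mean: turn the GE_* counts into a per-width distribution and compare the two runs side by side (a sketch; the counter plumbing is omitted and the event-name variants are from memory):

    #include <stdio.h>

    /* geN = cycles in which at least N uops executed (CYCLES_GE_N_UOPS_EXEC). */
    void print_width_distribution(unsigned long long ge1, unsigned long long ge2,
                                  unsigned long long ge3, unsigned long long ge4)
    {
        unsigned long long exact[4] = { ge1 - ge2, ge2 - ge3, ge3 - ge4, ge4 };
        const char *label[4] = { "1", "2", "3", ">=4" };

        for (int w = 0; w < 4; w++)
            printf("cycles with %s uops executing: %5.1f%%\n",
                   label[w], 100.0 * (double)exact[w] / (double)ge1);
    }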

 

Bernard
Valued Contributor I

>>>How can I measure CPI >>>

For accurate CPI measurement please read the following link:

http://en.wikipedia.org/wiki/Cycles_per_instruction
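
From memory, the formula given there weights each instruction class by its frequency (in LaTeX notation):

    \mathrm{CPI} = \frac{\sum_i IC_i \cdot CPI_i}{\sum_i IC_i}

where IC_i is the number of executed instructions of class i and CPI_i is the average cycles per instruction for that class.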

Btw, IIRC this formula is taken from the Patterson/Hennessy book. For an advanced treatment of the ILP topic, please refer to the "Optimizing Compilers for Modern Architectures" book:

http://www.amazon.com/Optimizing-Compilers-Modern-Architectures-Dependence-based/dp/1558602860

McCalpinJohn
Honored Contributor III

My approach to this problem is to manually split the loop so that I have an explicit loop over threads with each thread computing its own start and stop indices for the original loop variable.  Then each thread reads its performance counters at the beginning of loop execution and at the end of its execution (i.e., before it enters the barrier).  This allows me to directly compute various metrics for the "real" code while excluding counts accrued during spin-waiting.

An example of how this is implemented may be helpful.   For the STREAM Triad benchmark, the original code is:

    t_start = mysecond();
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)  a[j] = b[j]+scalar*c[j];  // TRIAD kernel
    t_end = mysecond();

The transformed code is:

        jblock = STREAM_ARRAY_SIZE / numthreads;
        t_start = mysecond();
#pragma omp parallel for private(j,jstart,jstop)
        for (i=0; i<numthreads; i++) {
            jstart = i*(jblock);
            jstop = (i+1)*(jblock)-1;
            perf_read_core_counters(iteration,i,6);        // PERF: read counters at thread start
            for (j=jstart; j<jstop+1; j++) a[j] = b[j]+scalar*c[j]; // TRIAD kernel
            perf_read_core_counters(iteration,i,7);      // PERF: read counters at thread finish (before barrier)
        }
        t_end = mysecond();

Please note that the code above does not compute the loop indices correctly if STREAM_ARRAY_SIZE is not evenly divisible by numthreads.  This is not hard to fix, but I was lazy and this does not change the quality of the measurements in this particular case.  The "6" and "7" in the calls to the perf_read_core_counters() routine are just extra array indices to tell the routine where to store these particular samples in a big multidimensional array (with indices for "iteration", "thread",  "sampling point", and "performance counter number").
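For completeness, a remainder-safe variant of the index computation would look something like this (a sketch, not the code I actually ran; 'jrem' is a new variable):

        jblock = STREAM_ARRAY_SIZE / numthreads;
        jrem   = STREAM_ARRAY_SIZE % numthreads;
        /* The first 'jrem' threads take jblock+1 iterations, the rest jblock,
         * so all STREAM_ARRAY_SIZE elements are covered exactly once. */
        jstart = i*jblock + (i < jrem ? i : jrem);
        jstop  = jstart + jblock + (i < jrem ? 1 : 0) - 1;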

It is easy to compute IPC or CPI metrics with inline measurements like those used above, while it is extremely difficult to remove the effects of the spin loops from any sampling-based measurement strategy -- especially if you don't know exactly what code is being executed by the spin-waiting loop.

Bernard
Valued Contributor I

>>>I mean, if the number of cycles in which 3+ uops were executed increases and the number of cycles in which only 1 or 2 uops were executed decreases, it seems valid to argue that ILP increased. Do you agree?>>>

Yes.

Divino_C_
New Contributor I

Thank you John and iliyapolak for helping me on this.

Bernard
Valued Contributor I

You are welcome.
