Farhad_P_
Beginner

Examining the serialized memory access effect in multi-threaded software


Hello everyone,

 

I am working on a multi-threaded video encoder application (x265).

I need to prove that, while increasing the number of threads can improve the total run-time, beyond a certain number of threads (cores) memory resources become insufficient. That is to say, concurrent memory accesses from different cores lead to a queue of requests at the DRAM, so the delay from memory can hurt performance.

1- What do you think is the best method to demonstrate this?

2- I have performed tests with 2, 4, and 8 threads (cores) on my machine (Intel Ivy Bridge i7) in memory-access analysis mode. While the "Memory Latency" factor in VTune increases (2 threads: 0.048, 4 threads: 0.533, 8 threads: 0.735), the "Average Latency (cycles)" remains almost constant (around 11 or 12). Why do you think that happens? I would have expected the average latency to increase due to longer DRAM access times. Can anyone please tell me what "Average Latency (cycles)" and "Memory Latency" exactly are? Does the average latency take the memory latency into account?

 

 

thanks in advance,

Farhad


Accepted Solutions
McCalpinJohn
Black Belt

It looks like there is an uncore counter in the Core i3/i5/i7 (Sandy Bridge/Ivy Bridge) processors that may count exactly what you want.

The event is called UNC_ARB_TRK_OCCUPANCY.ALL, and it is described in Table 19-16 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 056). Dividing that count by the count of UNC_ARB_TRK_REQUESTS.ALL should give the average latency of each request to DRAM.

Note that the performance counters described in Table 19-16 are not the ordinary core performance counters. These are special uncore counters whose use is described in Section 18.8.6. The descriptions are for Sandy Bridge Core i3/i5/i7 processors, but it is likely that the Ivy Bridge versions are the same -- at least I can't find any references saying they are different.
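To illustrate that division: UNC_ARB_TRK_OCCUPANCY.ALL accumulates, each uncore cycle, the number of requests currently tracked, so by Little's law dividing it by the request count gives the mean cycles per request. A minimal sketch with made-up counter values (not real measurements):

```python
# Sketch: average DRAM request latency from the two uncore counters,
# per Little's law. The counter values below are hypothetical.

def avg_request_latency(occupancy_cycles, num_requests):
    """UNC_ARB_TRK_OCCUPANCY.ALL summed over the run, divided by
    UNC_ARB_TRK_REQUESTS.ALL, gives the mean number of cycles each
    request spent in the arbiter's tracker (queue + service)."""
    if num_requests == 0:
        return 0.0
    return occupancy_cycles / num_requests

# Hypothetical readings from a lightly and a heavily loaded run:
print(avg_request_latency(12_000_000, 100_000))   # -> 120.0 cycles
print(avg_request_latency(96_000_000, 400_000))   # -> 240.0 cycles (longer queue)
```

With more threads keeping more requests in flight, occupancy grows faster than the request count, so this ratio rises even when the unloaded DRAM access time is unchanged.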


5 Replies
Peter_W_Intel
Employee

> I need to prove that, while increasing the number of threads can improve the total run-time, beyond a certain number of threads (cores) memory resources become insufficient. That is to say, concurrent memory accesses from different cores lead to a queue of requests at the DRAM, so the delay from memory can hurt performance.

Increasing the number of threads only helps up to the number of cores you have. That is, if you run a large number of threads on N cores, only N threads run at any given time; the other threads wait for context switches.

The best approach is to reduce concurrent memory accesses from different cores: you can reduce the amount of memory shared between cores, or reduce wait counts and wait time.

> 1- What do you think is the best method to get these results?

Advanced-hotspots analysis is a good way to get performance data quickly, including effective CPU time, wait time, and overhead time caused by threads. General-exploration analysis helps identify front-end and back-end issues, and memory-access analysis identifies memory access issues.

> But while the "Memory Latency" factor in VTune starts to increase (2 threads: 0.048, 4 threads: 0.533, 8 threads: 0.735), the "Average Latency (cycles)" remains almost constant (around 11 or 12).

"Memory Latency" data is accumulated from all threads; that is why we need "Average Latency (cycles)" to evaluate performance per memory load. (Another consideration is to first ensure that the threads are actually running in parallel; use concurrency analysis for that.)


Farhad_P_
Beginner

 

Peter Wang and John McCalpin, thank you very much for your answers and helpful suggestions.

UNC_ARB_TRK_OCCUPANCY.ALL seems like a very good metric for average latency; it helps a lot.

But since my application is meant to work under real-time conditions, I also need to know, in the worst case, where several cores probably reach for DRAM simultaneously, how much delay will be added to the memory latency, and how many of the requests will face such a delay. Any ideas about that?
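The hardware arbiter is not a textbook queue, but a simple M/M/1 model illustrates why queueing delay grows superlinearly as concurrent cores push DRAM utilization toward saturation. A sketch with made-up service and arrival rates (purely illustrative, not calibrated to any real memory controller):

```python
# Illustrative M/M/1 queueing model (NOT the real DRAM controller):
# mean queueing delay W_q = rho / (mu - lambda), where mu is the
# service rate, lambda the aggregate arrival rate, rho = lambda / mu.

def queueing_delay_cycles(arrival_rate, service_rate):
    """Mean extra cycles a request waits in queue before service."""
    if arrival_rate >= service_rate:
        return float("inf")  # saturated: the queue grows without bound
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)

service = 1.0 / 50.0           # assume one request served per 50 cycles
for cores in (2, 4, 8):
    lam = cores * 0.002        # assumed per-core request rate (made up)
    print(cores, queueing_delay_cycles(lam, service))
# 2 cores -> 12.5 extra cycles, 4 -> ~33.3, 8 -> 200.0
```

In this model the probability that an arriving request finds the server busy, and therefore queues at all, is simply rho, which speaks to the "how many requests face such delay" part, under these strong simplifying assumptions.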

Peter_W_Intel
Employee
You can find the other clues that Dr. McCalpin mentioned in the doc, and VTune Amplifier lists the supported events. Use:

$ amplxe-cl -collect-with runsa -knob event-config=? | grep UNC_ARB

UNC_ARB_TRK_OCCUPANCY.ALL - Counts cycles weighted by total number of ...
UNC_ARB_TRK_REQUESTS.ALL - Number of coherent and non-coherent requests ...
UNC_ARB_TRK_REQUESTS.WRITES - Number of Writes allocated - any write
UNC_ARB_TRK_REQUESTS.EVICTIONS - Counts the number of LLC evictions allocated
UNC_ARB_COH_TRK_OCCUPANCY.ALL - Counts cycles weighted by the number of ...
UNC_ARB_COH_TRK_REQUESTS.ALL - Counts the number of core-outgoing entries in the ...
UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST - Cycles with at least one request ...
UNC_ARB_TRK_OCCUPANCY.CYCLES_OVER_HALF_FULL - Cycles with at least half of the ...
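To actually collect the two counters needed for the occupancy/requests ratio, the event names can be passed to a runsa collection. A sketch of the command line, assuming this version accepts a comma-separated event list and with `./x265_app` as a placeholder binary name:

```shell
# Hypothetical invocation; event names taken from the grep output above,
# the application binary name is a placeholder.
amplxe-cl -collect-with runsa \
    -knob event-config=UNC_ARB_TRK_OCCUPANCY.ALL,UNC_ARB_TRK_REQUESTS.ALL \
    -- ./x265_app
```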
Dmitry_P_Intel1
Employee

Hello Farhad,

You might be interested in trying the new HPC Performance Characterization analysis that we added as a tech-preview feature in VTune Amplifier XE 2016 Update 2. It lets you look at CPU utilization metrics alongside memory efficiency metrics such as Memory Bound, Cache Bound, and DRAM Bound, where we count execution pipeline slots stalled on fetching data, as well as FPU utilization metrics (for processors that support them).

So in your case I would expect that, at some point, adding more threads will still show better CPU utilization, but the percentage of stalls due to memory operations will start to increase.

Thanks & Regards, Dmitry