Solved: Advisor Roofline for Ideal Hardware

JNorw · ‎12-09-2020

The Advisor Roofline could be a useful tool for evaluating AI hardware efficiency against some ideal hardware. By "ideal", I mean: analyze the number of operations in the neural net model and evaluate the minimum latency through the model if all operations could be executed asynchronously and with zero wait time for memory accesses. That minimum would then serve as the roofline, rather than the limits of the existing hardware.

Is there some configuration of the tool that could accomplish this?
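To make the "ideal" bound described above concrete, here is a minimal sketch of the arithmetic, not anything Advisor itself provides. It assumes the model can be summarized as a multiply-accumulate (MAC) count per layer, that one MAC counts as two floating-point operations, and that the hardware has a single peak throughput figure. The `Layer` class, the layer sizes, and the peak value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    macs: int  # multiply-accumulate operations in this layer (hypothetical counts)


def ideal_latency_seconds(layers, peak_flops_per_s):
    """Lower bound on latency: every operation runs at peak throughput
    and no time is spent waiting on memory."""
    total_flops = sum(2 * layer.macs for layer in layers)  # assume 1 MAC = 2 FLOPs
    return total_flops / peak_flops_per_s


def efficiency(ideal_s, measured_s):
    """Fraction of the ideal bound that a measured run achieves."""
    return ideal_s / measured_s


# Hypothetical two-layer model on hypothetical 1 TFLOP/s hardware.
layers = [Layer("conv1", 1_000_000), Layer("fc", 500_000)]
ideal = ideal_latency_seconds(layers, peak_flops_per_s=1.0e12)
```

Dividing this bound by a measured latency gives a single efficiency number per model, which is the kind of comparison the question is after.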

Zakhar_M_Intel1 · ‎12-10-2020

Hello,

What you are asking for can be seen as an "ideal throughput model", under the assumption that there are no compute latency effects and no cache/memory latency effects at all, right?

So, for example, for the "memory subsystem" this would correspond to the peak memory bandwidth assuming all data fits into the registers, while for compute it would correspond to, e.g., an FMA (or VNNI, etc.) benchmark with no data-flow dependencies. Is this a correct understanding?

How would you use this kind of roofline in practice? Understanding your usage model may help us prioritize and define the feature better.

I should mention that, de facto, the current implementation is not that far from what you are asking for:

FMA compute benchmarks are highly optimized and implemented so that there is almost no latency in the system.
Registers-only benchmarks are not provided, but an L1 benchmark is available, and it is already much closer to the ideal than to a practical peak, because it is rare for a workload to fit into L1 even partially. (That is why the L1 benchmark works so well for the CARM Roofline, which focuses on algorithmic and fundamental limits, whereas the "MLR" or Classic Roofline in Advisor is oriented more towards highlighting current bottlenecks.)
Some elements of "Offload Advisor", released in oneAPI Gold, may also fit what you are asking for at some point in the future, but that again depends on your usage model, as I asked above.
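As a rough illustration of why an L1 roof sits so much closer to the ideal than a DRAM roof, the standard roofline bound can be sketched as below. This is only the textbook formula (attainable performance is the smaller of the compute peak and arithmetic intensity times bandwidth), not Advisor's implementation, and the peak and bandwidth figures are hypothetical placeholders rather than measurements of any real machine.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound: performance is capped either by the compute peak
    or by (FLOPs per byte) * (bytes per second) for the chosen memory level."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 1000.0      # e.g. from an FMA benchmark with no dependencies
L1_BANDWIDTH_GB_S = 2000.0
DRAM_BANDWIDTH_GB_S = 100.0

# A kernel performing 0.25 FLOP per byte moved.
ai = 0.25
l1_bound = attainable_gflops(ai, PEAK_GFLOPS, L1_BANDWIDTH_GB_S)
dram_bound = attainable_gflops(ai, PEAK_GFLOPS, DRAM_BANDWIDTH_GB_S)
```

With these placeholder numbers the L1 roof allows 500 GFLOP/s while the DRAM roof allows only 25 GFLOP/s for the same kernel, which shows how a roof based on L1 can act as a near-ideal limit.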

GouthamK_Intel · ‎12-09-2020

Hi,

As your issue is related to the Intel Advisor tool and we have a dedicated forum for Analyzers, we are moving this thread to the Analyzers forum for a faster response.

Have a good day!

Thanks & Regards

Goutham

Gopika_Intel · ‎12-10-2020

Hi,

Has your query been clarified? If so, shall we stop monitoring this thread?

Regards

Gopika

Gopika_Intel · ‎12-17-2020

Hi,

We haven't heard back from you, so we will no longer monitor this thread, since the answer provided by Zakhar has been accepted as the solution. For further assistance, please post a new thread.

Regards

Gopika