Solved: Memory Bandwidth Monitoring in Atom Processor

CHEEHOO_K_Intel · ‎06-30-2020

I have two platforms, a Coffee Lake Core i7-8700 and an Apollo Lake Atom E3950, both running Ubuntu Linux.

I need to monitor the memory read and write bandwidth while running an application. Coffee Lake has an IMC, so I can easily observe the memory bandwidth using perf with the events uncore_imc/data_reads/ and uncore_imc/data_writes/. I validated this with a benchmark program and confirmed that the values are correct.

However, AFAIK, Atom-class processors do not come with an IMC, and there is no uncore_imc event in perf. I tried looking at cache-misses, L1-dcache/icache-load-misses and LLC-store/load-misses, but I still cannot see how they relate to the memory bandwidth when I am running the benchmark program. I know that cache-line-size * cache-misses / time should give the memory bandwidth, but the values calculated from the cache events are nowhere near the value given by the benchmark program.

What is the best way to monitor the memory bandwidth of an Atom processor?

Any suggestion or advice is welcome.

Thank you.

HadiBrais · ‎08-04-2020

Right, none of the events you mentioned can be used to measure the full memory bandwidth on Goldmont.

Use instead the OFFCORE_RESPONSE event with the request type ANY_REQUEST and the response type L2_MISS.SNOOP_MISS_OR_NO_SNOOP_NEEDED. This would include the following requests:

Any type of read or write request to any type of memory, including partial reads and writes. Since Goldmont works with DDR3 or DDR4 modules, even partial transactions effectively consume 64 bytes of the bandwidth, just like full line transactions.
Only those requests that miss the L2 cache of the originating core, the L1 cache of the other core that shares the same L2, and the L2 caches of all other modules. This means that the request goes to memory. All other requests don't go to memory.
Writebacks caused by accesses from the core are also included. However, according to the spec update documents, all Goldmont processors have a bug where L2 writebacks caused by accesses from the other core sharing the same L2 may also be counted. To avoid this possible overcounting, either don't run anything on the sibling core or only run threads that don't cause any L2 writebacks.
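To follow the advice about the sibling core, you first need to know which cores share an L2. A minimal sketch, assuming a Linux system that exposes cache topology in sysfs and assuming that cache index2 is the L2 (check the level file in the same directory); the helper names are made up for illustration:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-1' or '0,2-3' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l2_siblings(cpu):
    """Return the other CPUs that share an L2 cache with `cpu`."""
    # Assumption: index2 is the L2 cache on this machine.
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache/index2/shared_cpu_list")
    return parse_cpu_list(path.read_text()) - {cpu}
```

With the sibling known, the workload can be pinned to one core (for example with taskset or os.sched_setaffinity) while the sibling is left idle.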

The memory bandwidth can then be measured by multiplying the event count by 64 and dividing by the elapsed time.
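As a worked example of that calculation, here is a small sketch. The function name and the numbers are made up for illustration; only the 64-byte factor comes from the discussion above.

```python
LINE_BYTES = 64  # each counted request is treated as one 64-byte transfer


def bandwidth_bytes_per_s(event_count, elapsed_s, line_bytes=LINE_BYTES):
    """Estimate bandwidth as event_count * line_bytes / elapsed_s."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return event_count * line_bytes / elapsed_s


# Hypothetical measurement: 150 million events counted over 10 seconds.
bw = bandwidth_bytes_per_s(150_000_000, 10.0)
print(f"{bw / 1e6:.1f} MB/s")  # prints "960.0 MB/s"
```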

Linux kernel v4.10-rc1 and later has built-in support for the event with the name offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed.
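One possible way to collect the count is to run perf stat system-wide with CSV output and read the first field of the matching line. This is a rough sketch, not a tested tool: the helper names are hypothetical, and it assumes that perf prints the event name in the third CSV field exactly as it was requested.

```python
EVENT = "offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed"


def perf_stat_command(seconds):
    """Build a system-wide perf stat command that counts EVENT for `seconds`."""
    # -a: all CPUs, -x ,: comma-separated output, sleep bounds the interval.
    return ["perf", "stat", "-a", "-x", ",", "-e", EVENT,
            "--", "sleep", str(seconds)]


def parse_count(csv_text, event=EVENT):
    """Extract the counter value for `event` from perf stat -x, output."""
    for line in csv_text.splitlines():
        fields = line.split(",")
        # Assumed layout: value, unit, event name, run time, percentage, ...
        if len(fields) >= 3 and fields[2].strip() == event:
            # int() raises ValueError for "<not counted>" or "<not supported>".
            return int(fields[0])
    raise ValueError(f"event {event!r} not found in perf output")
```

Note that perf stat writes its results to stderr by default, so something like subprocess.run(perf_stat_command(10), stderr=subprocess.PIPE, text=True).stderr would be the text to pass to parse_count.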

CHEEHOO_K_Intel · ‎08-05-2020

May I know what is the difference between the value measured by cache-misses in perf and offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed ?

HadiBrais · ‎08-05-2020

The cache-misses event in perf is mapped to the LLC Misses architectural event (Event = 0x2E, Umask = 0x41). Note that the LLC in Goldmont is the L2 cache. My understanding from the documentation is that this event doesn't include the following types of requests:

Requests to uncacheable memory types (WC and UC).
Writebacks from the L2 cache (and writebacks to the L2).

It's not clear to me how this event works with non-inclusive LLCs. If a request missed in the LLC, there are three possible sources to get the line from on Goldmont: the private L1 of the other core, the caches of one of the other modules, or memory. On the other hand, it's clear that the event offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed only accounts for misses that are sourced from memory.
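For reference, the architectural LLC Misses encoding mentioned above can also be given to perf as a raw event, which is written as r followed by the hexadecimal umask and event select. A small sketch (the function name is made up):

```python
def perf_raw_event(event_select, umask):
    """Format an event select / umask pair as a perf raw event string."""
    code = (umask << 8) | event_select
    return f"r{code:x}"


# LLC Misses: Event = 0x2E, Umask = 0x41
print(perf_raw_event(0x2E, 0x41))  # prints "r412e"
```

So perf stat -e r412e should count the same thing as cache-misses on this platform, assuming the mapping described above.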

A couple of points I forgot to mention earlier about offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed:

Usually requests to uncacheable memory (especially UC memory) are mapped to I/O devices rather than actual memory. It may be better to exclude them. If the program being profiled has very little I/O traffic, then it doesn't matter much anyway.
I/O devices can also send memory requests. There is no way to account for these requests on Goldmont as far as I know. Again, this is not an issue if the program being profiled has very little I/O traffic and there is no network traffic.

CHEEHOO_K_Intel · ‎08-05-2020

Thanks for the reply

I just experimented a bit using offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed event in perf.

The values it measured are consistently 10-20% lower than the values measured by the cache-misses event, and I think this agrees with your explanation.

I also found that offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed is not a process-specific event (it cannot be used with the -p option), but it is still useful for me.

Is there a way to profile read and write accesses separately? In the perf list output, I found "offcore_response.any_read.l2_miss.snoop_miss_or_no_snoop_needed" but I did not find one for writes. Will subtracting the two values (any_request - any_read) give me a correct value for write accesses?

Regards

HadiBrais · ‎08-06-2020

The values it measured are constantly 10-20% lower than the values measured by cache-misses event and I think it agrees to your explanation.

Just to be clear, you're saying that cache-misses seems to count all L2 misses that are satisfied from anywhere, including memory and other caches, right? Are you profiling a multi-threaded app where there is cross-core sharing?

Is there a way to profile read and write access separately? from the perf list, I found "offcore_response.any_read.l2_miss.snoop_miss_or_no_snoop_needed" but I did not find any for write. Will the value from the subtraction of (any_request-any_read) give me a correct value for write access?

The "any_read" request type includes code reads, data reads, and RFOs. If by "write" you mean cacheable writes, these writes are executed by sending RFO-type requests, which are already accounted for in the "any_read" request type. Do you want to measure code and data reads separately from RFOs or did you mean something else? Note that the "any_request" request type includes "any_read" and other request types as discussed in my earlier comments.
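To make the subtraction concrete, here is a sketch of what (any_request - any_read) would give. The function and key names are hypothetical. The remainder covers the non-read request types (writebacks, for example), not "writes" as such, because RFOs triggered by stores are already on the any_read side.

```python
def split_traffic(any_request, any_read, elapsed_s, line_bytes=64):
    """Split an any_request count into read-type and remaining traffic.

    Returns bytes per second for each part. "other" is everything that
    any_request counts beyond any_read; it is not a pure write count.
    """
    other = any_request - any_read
    return {
        "read_type_Bps": any_read * line_bytes / elapsed_s,
        "other_Bps": other * line_bytes / elapsed_s,
    }


# Hypothetical counts taken over one second.
print(split_traffic(1000, 800, 1.0))
# prints {'read_type_Bps': 51200.0, 'other_Bps': 12800.0}
```

Both counts should come from the same run, otherwise the difference is not meaningful.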

CHEEHOO_K_Intel · ‎08-06-2020

Just to be clear, you're saying that cache-misses seems to count all L2 misses that are satisfied from anywhere, including memory and other caches, right?

Ya, seems like it is to me.

Do you want to measure code and data reads separately from RFOs or did you mean something else?

Are there any metrics or events on the Atom processor that are comparable to the values measured by uncore_imc/data_reads/ and uncore_imc/data_writes/ on the Core processor?

HadiBrais · ‎08-07-2020

Among all current Atom microarchitectures, only Tremont supports uncore PMUs. There is no way on Goldmont to directly measure memory bandwidth (as far as I can tell from the public documentation).

The event offcore_response.any_request.l2_miss.snoop_miss_or_no_snoop_needed provides only an approximation of the memory bandwidth. Consider the block diagram of a quad-core Goldmont processor:

[Figure: block diagram of a quad-core Goldmont processor]

Source: https://www.anandtech.com/show/10635/intel-quietly-launches-apollo-lake-soc (I don't know if this image/slide is available directly from Intel.)

The offcore events occur for requests between the L2 caches and the memory unit (MU), so this is the bandwidth being measured. On the other hand, memory bandwidth usually refers to the bandwidth between the memory unit and the DRAM modules (not shown in the figure). The memory bandwidth cannot be measured directly on Goldmont. Instead, the L2-MU bandwidth is the best approximation that can be measured on Goldmont.
