I am running my application on a Xeon Phi processor configured in SNC-4 + Flat mode. My application tries to measure local and far memory latency. I run my C program as "numactl --membind 7 --cpubind 0 ./myperf". I expected this to change the numa_miss numbers reported by the numastat utility, but numa_miss does not change. I am accessing memory that is not on the same node, so why am I not getting any numa_miss?
The statistics from numastat don't mean what you might think they mean.
In particular, "numa_miss" means that the operating system was unable to allocate a page in the domain where it was requested. You requested data placement in NUMA domain 7 and the operating system was able to allocate the page there, so the "numa_hit" statistic was incremented, not "numa_miss".
One way to see "numa_miss" on your system is to attempt to place more than 4 GiB on one of the four MCDRAM domains, using the "--preferred" option to numactl. If you try to allocate 5 GiB, for example, you should get 4 GiB of "numa_hit" (1,048,576 increments with 4 KiB pages or 2048 increments with 2 MiB pages), plus 1 GiB of "numa_miss" (262,144 increments with 4 KiB pages or 512 increments with 2 MiB pages).
If you want to measure cross-domain bandwidth, then you need a tool that measures bandwidth, not page allocations. Intel's VTune and APS should be able to do this. It may also be possible with "perf stat", but it is a lot of work to figure out how to set up all the counters with this tool.