As far as I know, if a core tries to fetch an address that is not in its local cache (by "local" I mean L1D and L2 here), the core feeds the physical address into a pseudorandom hash function that yields an integer identifying the "assigned CHA" for that address. That CHA is responsible for managing the coherency of that address. After computing the CHA, the thread running on that core consults it to learn the whereabouts of the data at that address: which tile currently holds the line in Forward state, or whether the line is on the CPU die at all.
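Intel does not document the actual hash function, but the idea can be sketched. The following toy XOR-fold over the physical-address bits above the cache-line offset is purely illustrative (the slice count and the hash itself are assumptions, not Intel's real mapping); it only shows the *kind* of pseudorandom, evenly-spreading address-to-slice function involved:

```python
# Toy illustration of physical address -> "assigned CHA" selection.
# Intel's real hash is undocumented; this XOR-fold is only a sketch.

NUM_CHAS = 28          # e.g. a 28-core Skylake-SP die (assumption)
LINE_BITS = 6          # 64-byte cache lines

def assigned_cha(phys_addr: int) -> int:
    """Fold the line-address bits together, then reduce mod #CHAs."""
    line_addr = phys_addr >> LINE_BITS
    h = 0
    while line_addr:
        h ^= line_addr & 0xFFFF   # XOR-fold 16 bits at a time
        line_addr >>= 16
    return h % NUM_CHAS

# Consecutive cache lines scatter across slices rather than clustering:
for addr in range(0x1000, 0x1000 + 4 * 64, 64):
    print(hex(addr), "-> CHA", assigned_cha(addr))
```

The point of such a hash is load balancing: successive cache lines are distributed across all CHA/LLC slices so no single slice becomes a hot spot.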
What I am not 100% sure about is what happens if an address is not cached in the L1D or L2 of the core the thread is bound to, and the "assigned CHA" of that address happens to be in the same tile as that core. Does anything change in the flow above? I am wondering whether Intel optimizes this edge case somehow.
Many folks have found that the lowest latencies are for physical addresses managed by the co-located CHA. I don't recall if I have seen any evidence of a difference in the way the transactions are processed. I would not expect this to be a special case worth extra effort, since it is only intended to occur for a relatively small fraction of the address space (approximately 1/#CHAs probability).
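To make that fraction concrete (using the 28-CHA die of a 28-core Skylake-SP part as an example configuration):

```python
# If the hash spreads lines uniformly over the slices, the chance that a
# given line is homed by the requesting core's own co-located CHA is
# ~1/#CHAs. Example: a 28-CHA die.
num_chas = 28
local_fraction = 1 / num_chas
print(f"{local_fraction:.1%} of lines homed locally")  # roughly 3.6%
```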
There might be a slight benefit in the mesh latency since the transaction never needs to get on the external part of the mesh. The figure at the beginning of Section 2.2 in the Xeon Scalable Processor Family Uncore Performance Monitoring Reference Manual (document 336274 for SKX/CLX and document 639778 for ICX) shows that each mesh stop can be associated with two "Agents". For the tiles that include cores and CHAs, one of the agents is the core and the other is the CHA/SF/LLC slice. The traffic from one agent to the other may not need to get on the vertical or horizontal meshes, but it must still arbitrate for access to (at least) the ingress port from the CMS to the destination agent.
It is challenging to get latency measurements that are reliable enough to identify details of the latency components in these transactions, but I have found that varying the core frequency and uncore frequency independently can provide some insight into how much of the latency occurs in each frequency domain. This works best with cache-to-cache intervention latency -- memory access is significantly more complex (and brings in another frequency domain).
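One way to exploit independent core/uncore frequency control is to model the measured latency as a sum of cycle counts in each domain and fit across frequency settings. This is a sketch under a deliberately simplified model (latency_ns = n_core/f_core + n_uncore/f_uncore), and the measurements below are made-up illustrative numbers, not real data:

```python
# Separating latency into core- and uncore-frequency-domain components
# by least-squares fit. Model (an assumption, ignoring fixed-latency
# terms such as DRAM): latency_ns = n_core/f_core + n_uncore/f_uncore.
import numpy as np

# (core_GHz, uncore_GHz, measured_latency_ns) -- hypothetical data
# synthesized from n_core = 60 cycles, n_uncore = 50 cycles:
samples = [
    (2.0, 2.4, 60.0 / 2.0 + 50.0 / 2.4),
    (3.0, 2.4, 60.0 / 3.0 + 50.0 / 2.4),
    (2.0, 1.8, 60.0 / 2.0 + 50.0 / 1.8),
    (3.0, 1.8, 60.0 / 3.0 + 50.0 / 1.8),
]
A = np.array([[1.0 / fc, 1.0 / fu] for fc, fu, _ in samples])
b = np.array([lat for _, _, lat in samples])
n_core, n_uncore = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"core-domain cycles ~{n_core:.0f}, uncore-domain cycles ~{n_uncore:.0f}")
```

With real measurements the fit will be noisier, and as noted above this works far better for cache-to-cache transfers than for memory accesses, where a third (DRAM) frequency domain and fixed-latency terms enter the model.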