I've been using VTune Amplifier XE to monitor the performance of some software for which the CPI goes up as I increase the number of threads, while the LLC misses are always zero. In addition, I have noticed that the CPU time around the part of the software that handles floating-point operations also increases in line with the number of threads. A tooltip in the VTune GUI hints that the CPI might be going up due to port saturation. This is where my question comes from, and it relates more to the Sandy Bridge architecture than to VTune per se. I'm trying to get a handle on what is meant by "ports". Is VTune Amplifier alluding to scheduler ports? There might be something in this, because it could be saturation of the ports associated with processing floating-point instructions. Can someone please point me in the right direction as to which "ports" VTune is referring to?
Have a look at Section 2.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual (document 248966, revision 29, March 2014, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html).
Figure 2-3 shows how the execution ports are connected to the different types of functional units. Ports 0, 1, and 5 can all perform ordinary integer arithmetic operations, but only Port 0 can execute floating-point vector multiply instructions, while only Port 1 can execute floating-point vector add instructions.
So "port saturation" is generally a good thing -- it means that at least some of the functional units on the processor are busy most of the time.
When using HyperThreading, for example, if both threads are trying to execute a fairly high rate of floating-point addition instructions, then running them at the same time could result in less speedup than if one thread was doing additions and the other was doing multiplications.
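To make the HyperThreading point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes each port accepts one uop per cycle, uses only the Port 0 / Port 1 bindings mentioned above, and ignores everything else (loads, stores, latencies, front-end limits). The names PORT_OF, min_cycles and smt_speedup are made up for this illustration.

```python
from collections import Counter

# Port bindings as described in the post above (Sandy Bridge, vector FP only).
PORT_OF = {"fp_mul": 0, "fp_add": 1}

def min_cycles(*thread_mixes):
    """Lower bound on cycles when the given threads share one core.

    Each mix maps an operation name to a uop count. Assumes every port
    accepts at most one uop per cycle and nothing else limits throughput.
    """
    per_port = Counter()
    for mix in thread_mixes:
        for op, count in mix.items():
            per_port[PORT_OF[op]] += count
    return max(per_port.values(), default=0)

def smt_speedup(mix_a, mix_b):
    """Speedup of running both threads together versus one after the other."""
    serial = min_cycles(mix_a) + min_cycles(mix_b)
    shared = min_cycles(mix_a, mix_b)
    return serial / shared
```

In this model two add-heavy threads fight over Port 1 and gain nothing from sharing a core, while an add-heavy thread paired with a multiply-heavy thread can overlap completely. Real hardware will land somewhere in between.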
Many thanks for such a rapid response. Perhaps you should change your nickname to Dr Low Latency ;-)
If I'm getting saturation on the port associated with floating-point addition, presumably one way around this would be to leverage Intel AVX / AVX2. A theoretical question: suppose you could deliver data to the L2/L3 cache at tremendous bandwidth. I've heard people mention that you can achieve greater bandwidth with PCIe straight into the cache via Intel's Data Direct I/O technology than by using the conventional memory bus / root complex. Would it still be possible to achieve port saturation if you were leveraging, say, AVX2 to its full potential?
>>>If I'm getting saturation on the port associated with floating point addition, presumably one way around this would be to leverage Intel AVX / AVX >>>
You will still have the same number of execution ports (Port 0 and Port 1) to operate on the AVX stream. Haswell should be more helpful in your case if the compiled code can be expressed in terms of FMA instructions, thus freeing the second port (Port 1).
If you are actually limited by arithmetic performance, you will certainly benefit from using packed arithmetic (either SSE or AVX). Unfortunately the details may be hard to interpret from performance counter results.
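As a rough illustration of why packed arithmetic (and FMA, as mentioned in the reply above) reduces pressure on the FP ports, the sketch below counts the FP arithmetic instructions needed for n multiply-add pairs on double-precision data. It assumes 2 doubles per 128-bit SSE register and 4 per 256-bit AVX register, and that a fused instruction replaces one multiply and one add. The function name and the table are made up for this example; loads, stores and loop overhead are ignored.

```python
import math

# Doubles per instruction: scalar, 128-bit SSE, 256-bit AVX.
LANES = {"scalar": 1, "sse": 2, "avx": 4}

def fp_instructions(n_elements, isa, fused=False):
    """FP arithmetic instructions for one multiply and one add per element.

    fused=True models a fused multiply-add replacing the separate
    multiply and add (an assumption that the compiler can use FMA).
    """
    vectors = math.ceil(n_elements / LANES[isa])
    return vectors * (1 if fused else 2)
```

For 1024 elements this goes from 2048 scalar instructions down to 512 with AVX, and to 256 if every multiply-add pair can be fused. Whether that shows up as a speedup depends on whether the code is really limited by arithmetic.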
Intel's Sandy Bridge/Ivy Bridge implementations use an "eager" dispatch mechanism that dispatches instructions to the execution ports even if the input arguments are not "ready" (usually due to cache misses). In these cases the uop is dispatched to the execution port, the hardware sees that one or more input arguments are not "ready", the uop is "rejected" (sent back to the reservation station), and the process starts over. In cases with many cache misses and high contention for memory bandwidth I have seen "dispatch" (or "execution") counts that are 6x to 10x higher than the actual number of operations completed.
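The reject/retry behaviour can be pictured with a toy model like the one below. This is not how the hardware is documented to work: retry_interval is a purely hypothetical parameter (cycles between a rejection and the next dispatch attempt), introduced only to show how dispatch counts can grow with load latency while the number of completed operations stays at one.

```python
def dispatch_attempts(load_latency, retry_interval):
    """Count dispatches of one uop whose operand arrives after load_latency cycles.

    retry_interval is a hypothetical model parameter, not a documented
    hardware value.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    attempts = 1      # the eager first dispatch at cycle 0
    cycle = 0
    while cycle < load_latency:   # operand not ready: rejected, retried later
        cycle += retry_interval
        attempts += 1
    return attempts
```

With a cache hit (latency near zero) the uop is dispatched once; with a long memory latency the same uop is counted many times by a "dispatched" or "executed" event.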
The retry rate increases with increasing load latency, so if a multi-threaded (e.g., OpenMP) job starts running into bandwidth limits as the number of cores is increased, you will see an increase in the number of dispatches to the ports used by instructions that have input operands coming from memory. I like to start by setting up a code with a known number of arithmetic operations (e.g., STREAM), and looking at the counts (events 0x10 and 0x11, but also the corresponding dispatches on ports 0 and 1) as a function of the number of threads used. Events described with the phrases "uops executed" and "uops dispatched" are the ones that typically show increases with retries. Events described with "uops issued" or "uops retired" should not show these increases, but will show increases if a core in a multi-threaded job "spins" at a barrier waiting for the other threads to finish. (That is why I usually rewrite my OpenMP loops to let me read the performance counters before and after the barriers, as I discuss at https://software.intel.com/en-us/forums/topic/517767#comment-1793984).
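A small helper along these lines can turn raw counter readings into an "inflation" factor per thread count. It is a sketch under stated assumptions: the kernel is taken to be STREAM Triad (a[i] = b[i] + q*c[i], so one multiply and one add per element), the array length is a multiple of the vector width, and no FMA is used. The function names are invented for this example, and the counter values themselves have to come from your own measurements.

```python
def expected_triad_fp_instructions(n_elements, ntimes, lanes=1):
    """Expected FP arithmetic instructions for STREAM Triad.

    One multiply and one add per element; lanes is the number of elements
    per instruction (1 scalar, 2 for SSE doubles, 4 for AVX doubles).
    Assumes n_elements is a multiple of lanes and no FMA.
    """
    return 2 * (n_elements // lanes) * ntimes

def inflation(counted, expected):
    """Ratio of a measured counter value to the known operation count."""
    return counted / expected

def inflation_by_threads(counts_by_threads, expected):
    """Map thread count -> inflation factor, ordered by thread count."""
    return {t: inflation(c, expected)
            for t, c in sorted(counts_by_threads.items())}
```

A factor close to 1.0 means the event is counting completed work; a factor that climbs with the thread count points at retries (for dispatched/executed events) or spinning at a barrier (for issued/retired events).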
If you are running one thread per core, these extra uop dispatches don't typically cause performance degradation because the processor is going to have to stall anyway until the data arrives. If you are running two threads per core, with one thread getting lots of cache hits and the other thread getting lots of cache misses, then the reject/retry activity on the second thread may cut into the dispatch slots that the first thread could have used effectively. Diagnosing this will be tricky because the threads share a lot of resources -- so understanding the details of throughput with HyperThreading requires a lot of work. If I wanted to experiment in this area, I would probably start by running STREAM on one logical processor (i.e., HyperThread) of a physical core (generating lots of cache misses and lots of FP instruction retries) and LINPACK on the other logical processor of the same physical core (getting lots of cache hits and very few FP instruction retries).
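For the experiment described above, something like the following could be used on Linux to find the two logical processors of one physical core and pin a job to each. It is a minimal sketch: it relies on the sysfs file thread_siblings_list and on os.sched_setaffinity, both Linux-specific, and the helper names are made up. The binary paths in the commented usage are placeholders for wherever your STREAM and LINPACK executables live.

```python
import os
import subprocess

def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def hyperthread_siblings(cpu=0):
    """Logical CPUs sharing a physical core with `cpu` (Linux sysfs)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())

def launch_pinned(cmd, cpu):
    """Start cmd restricted to a single logical processor (Linux only)."""
    return subprocess.Popen(
        cmd, preexec_fn=lambda: os.sched_setaffinity(0, {cpu})
    )

# Usage sketch (placeholder paths, not run here):
# siblings = hyperthread_siblings(0)
# stream = launch_pinned(["./stream"], siblings[0])
# linpack = launch_pinned(["./linpack"], siblings[1])
```

Running each job alone on its logical processor first, and then both together, gives the baseline needed to see how much the cache-missing thread costs the cache-friendly one.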