Q. We have a performance critical application and we want to use one Hyperthread CPU to run an OS + applications, and use the other Hyperthread CPU to run in the background polling for various events. Our desire is to minimize any performance degradation on the OS Hyperthread caused by the polling Hyperthread.
The polling hyperthread is in a small loop, and we've experimented using a variety of PAUSE (REP NOP) instructions in the loop. When the OS HT is running benchmarks that run outside the L1 cache, we see minimal performance degradation.
However, if we run a benchmark (on the OS HT) that fits in the L1 cache, the polling HT degrades performance on that benchmark by about 70%. We tried putting a large number of PAUSEs (both in a loop and in-line) in the polling loop, and it seemed to have no effect on the degradation.
This indicates to us that, even when using the PAUSE instruction, there is still some serious contention for resources within the processor between the two HTs when each is running out of L1 cache.
As part of our testing we removed all actual polling, just so we could understand the PAUSE instruction. This is one example of the polling HT code that causes the performance issue:
while(1) {
    __asm __volatile("pause");
}
Is this the behavior that we should expect? We assumed that the PAUSE instruction would quiesce the HT CPU for 70 cycles or so, but this doesn't appear to be the case. Does the PAUSE instruction cause some level of resource contention within the chip? Is there a way to avoid that contention?
Our fallback is to halt (HLT) the polling HT and use an interrupt mechanism, but it results in significantly higher latencies in handling events. So we are very interested in any mechanism that preserves the polling model.
This work was done on a 2.8GHz Intel Xeon Northwood CPU.
A. It would appear that you are experiencing resource contention for the L1 cache. When HyperThreading is active, the L1 cache (and several other resources on the processor) is evenly divided between the two threads. These divided resources are dedicated to each logical processor. A logical processor may only use that half of the L1 cache that is allocated to it. It cannot use any part of the cache that has been allocated to the other logical processor. This means that a data set which fit into L1 cache earlier (in non-HT mode) no longer fits in the reduced area of the L1 that is allocated to a single logical processor. This should be the root cause of the performance degradation that you are seeing. The PAUSE instruction does not free up these dedicated resources, because the thread is still active. The HALT instruction, however, causes the logical processor which executes it to stop processing altogether, and frees up the resources which had been allocated to it. When that logical processor is restarted, it again takes half the cache and half of the other resources as well. If you size your workloads to use half of the L1 cache when running with HT enabled, you should again see the expected performance on the benchmark applications. Unfortunately, there is no way to allocate less than half the cache to one of the logical processors, even if you know that it will not require all the resources that it normally receives.
Q. Even when the benchmark uses only 128 bytes for its data, I get 33-44% degradation while the other hyperthread is executing {while(1) pause}. This is quite a bit better than the 60% degradation when the benchmark was using 4K, which I think is half the L1 cache size. 33-44% seems high to me - does it to you?
Eventually, what I want to be able to do in a polling hyperthread is to read a series of memory locations which contain device state. There may be 1-20 such locations. I was planning on doing this:
read location A
pause
read location B
pause
...
The 50-100 cycles that pause seems to take would be acceptable. I would like to affect the other hyperthread by less than 10%, if possible.
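For illustration, the read/pause sequence described above could be sketched in C roughly as follows. This is only a sketch under assumptions: `cpu_pause` and `poll_locations` are hypothetical names, the device state is assumed to be 32-bit words, and the inline assembly assumes a GCC- or Clang-style compiler. The `volatile` qualifier only forces the compiler to re-read each location on every pass; it says nothing about how device writes reach memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Issue one PAUSE on x86; expands to nothing on other targets so the sketch still builds. */
static inline void cpu_pause(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause");
#endif
}

/*
 * Hypothetical helper: make one pass over n polled locations, issuing a
 * PAUSE after each read. last_seen[i] holds the value observed on the
 * previous pass and is updated when a change is detected.
 * Returns the index of the first location that changed, or -1 if none did.
 */
static int poll_locations(volatile uint32_t *const locs[],
                          uint32_t last_seen[], size_t n)
{
    int first_changed = -1;

    for (size_t i = 0; i < n; i++) {
        uint32_t value = *locs[i];      /* read location i */

        if (value != last_seen[i]) {
            last_seen[i] = value;
            if (first_changed < 0)
                first_changed = (int)i;
        }
        cpu_pause();                    /* pause before the next read */
    }
    return first_changed;
}
```

The polling thread would call `poll_locations` in an endless loop and dispatch on the returned index.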
I don't have control over the distance between location A and B, but I do have control over the number of locations. Potentially, I could even make the number of locations one, such that monitor/mwait could be used. There might be other advantages to having more locations, so I want to investigate possibilities besides mwait.
Can I use mwait when the location is being written via DMA from a device?
While I am using a Xeon/Northwood chip now, I will soon be experimenting with Xeon Prescott. Any insights that you have about my problem and the use of a Prescott would be quite welcome.
A. I believe that a 30% reduction in performance is far too high. However, keeping in mind that the resources available to a logical processor under HT are considerably less than the resources available to an ordinary processor, there are a few additional possibilities that come to mind.
It looks like your polling loop is too tight. The PAUSE instruction causes a short delay in the execution of the instruction stream for the logical CPU that executes it, but this delay is considerably shorter than 70 clock cycles. Even a small loop such as the one you originally submitted will cause the loop control instructions to execute many times in sequence. The primary purpose of the PAUSE instruction is not to cause long delays, but rather to introduce a short delay (such as 10 clock ticks). The actual delay will vary from one processor to the next. I would suggest inserting multiple PAUSE instructions if you want a longer delay during the spin-wait loop. Another function of the PAUSE instruction is to give the processor a hint that this is a spin-wait loop, and there is no need to search widely for the next instruction (unless the loop ends).
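As a rough sketch of the "multiple PAUSE instructions" suggestion, the number of PAUSEs per poll can be made a tuning parameter. Everything named here (`cpu_pause`, `spin_wait`, `pauses_per_poll`, `max_polls`) is hypothetical and not taken from the original code, and the inline assembly assumes a GCC- or Clang-style compiler. The inner loop adds its own loop-control instructions, so unrolling the PAUSEs in-line is an alternative worth measuring.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Issue one PAUSE on x86; expands to nothing on other targets so the sketch still builds. */
static inline void cpu_pause(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause");
#endif
}

/*
 * Hypothetical helper: check *flag up to max_polls times, issuing
 * pauses_per_poll PAUSE instructions between consecutive checks.
 * Returns true as soon as the flag is observed nonzero, false if the
 * poll budget runs out first.
 */
static bool spin_wait(atomic_int *flag, unsigned pauses_per_poll,
                      unsigned long max_polls)
{
    for (unsigned long i = 0; i < max_polls; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return true;

        /* Stretch the delay by repeating PAUSE; tune this count by measurement. */
        for (unsigned p = 0; p < pauses_per_poll; p++)
            cpu_pause();
    }
    return false;
}
```

Running the benchmark on the other logical processor while varying `pauses_per_poll` (for example 1, 10, 100, several hundred) would show whether a longer delay actually reduces the degradation.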
If you are inserting widely spaced memory references inside the spin-wait loop, another factor could also be affecting performance. In addition to the cache, the TLBs are divided evenly between the logical processors, so widely separated memory locations can cause TLB misses and time-consuming page table loads and flushes. Reducing the number of widely spaced memory references might also help improve efficiency.
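One rough way to gauge the TLB pressure of a polling loop is to count how many distinct pages the polled addresses fall on. The sketch below assumes 4 KiB pages, which may not match your configuration, and `count_distinct_pages` is a hypothetical helper name. It is only a proxy: it does not model the actual TLB organization of any particular processor.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Adjust if the polled memory is mapped with larger pages. */
#define ASSUMED_PAGE_SIZE 4096u

/*
 * Hypothetical helper: count how many distinct pages the n addresses touch.
 * Quadratic in n, which is fine for the 1-20 locations discussed here.
 */
static size_t count_distinct_pages(const uintptr_t addrs[], size_t n)
{
    size_t distinct = 0;

    for (size_t i = 0; i < n; i++) {
        uintptr_t page = addrs[i] / ASSUMED_PAGE_SIZE;
        int already_counted = 0;

        /* Has an earlier address already landed on this page? */
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / ASSUMED_PAGE_SIZE == page) {
                already_counted = 1;
                break;
            }
        }
        if (!already_counted)
            distinct++;
    }
    return distinct;
}
```

A lower count means fewer translations for the polling thread to keep resident.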
We expect that an efficiently threaded application would gain a maximum of 40% through the use of a hyperthreaded processor. This assumes that both threads do the same amount of work and the increased efficiency of the utilization of the execution units would offset the performance penalty that each individual thread experiences. For example, if two threads both work at 70% of the performance level of a single threaded equivalent, then the combined efficiency of the two threads results in an overall performance level of 140%. Thus, if you are expecting one thread to do all the work, then it may not be unusual for that thread to perform at a lower level. As mentioned before, the HALT instruction will free up the resources allocated to the other logical processor, and adding more PAUSE instructions into the spin-wait loop will allow more efficient use of the available resources.
In conclusion, I would recommend two experiments. First, add more PAUSE instructions into the spin-wait loop and see how that affects performance. Don't be afraid to try several hundred such instructions inside the loop. Second, reduce the number of memory locations referenced to no more than 8 and try to ensure that their memory locations are not 2**n bytes apart (small performance issue with associative caches if this occurs). Good luck!
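For the second experiment, a small check like the following could flag pairs of polled addresses that are exactly 2**n bytes apart. This is a sketch only: `is_power_of_two` and `count_pow2_spaced_pairs` are hypothetical helper names, and the check takes "2**n bytes apart" literally, so a distance of 1 byte (2**0) also counts.

```c
#include <stddef.h>
#include <stdint.h>

/* True when x is an exact power of two (1, 2, 4, 8, ...). Zero is not. */
static int is_power_of_two(uintptr_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical helper: count the pairs of addresses whose distance is
 * exactly a power of two. A result of 0 means no pair is 2**n bytes apart.
 */
static size_t count_pow2_spaced_pairs(const uintptr_t addrs[], size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            uintptr_t distance = addrs[i] > addrs[j]
                                     ? addrs[i] - addrs[j]
                                     : addrs[j] - addrs[i];
            if (is_power_of_two(distance))
                count++;
        }
    }
    return count;
}
```

If you do not control the addresses themselves, the result can at least guide which locations to drop when trimming the list to 8 or fewer.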
Message Edited by intel.software.network.support on 12-07-2005 05:02 PM