Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.
Beginner
412 Views

Can anybody help me understand the following situation?

I am trying to understand processor performance for a parallel application. I use an Intel quad-core with Hyper-Threading (4 physical cores plus 4 additional logical cores, 8 hardware threads in total). I observe three behaviors. First, I get the maximum performance gain while I use only the physical cores (this is normal). Second, performance hits a minimum when the thread count crosses from the physical cores into the logical cores (for example, going from the fourth core to the fifth). Third, once I am running on the logical cores with more than six threads, performance decreases by around 10% to 30% (this is normal).

I don’t understand the second behavior (see attachment); usually performance stays flat or increases a little at that point, it does not drop. What happens inside the processor in this situation? I see the same behavior on different Intel processors.

I use the time command and the GNU compiler on a Linux OS to measure the execution time.

12 Replies
Employee
Black Belt
An additional consideration when benchmarking is the impact of "Turbo Boost", which in my viewpoint should be called thermal throttling. The plateau between 4 and 5 threads could be a result of falling off Turbo Boost (the start of thermal throttling).

A copy of your test program might aid us in answering your questions. Cache issues can often cause non-linear charting. And with HT, each core (a pair of hardware threads) shares one floating-point unit (be it x87 FPU or SSE/AVX), so you may experience a resource bottleneck.

Jim Dempsey
Beginner
Hi Roman and Jim! Thanks for your replies. I use OpenMP with Fortran to parallelize a set of large matrix operations, where every core computes a different matrix for a different time interval (t1, t2, t3, ... tn). The result computed by each core is written to a file. For example:

----------
Core 1:             | Core 2:             | Core 3:
Z(t1)=A(t1)xB(t1)   | Z(t2)=A(t2)xB(t2)   | Z(t3)=A(t3)xB(t3)
Write(Z(t1))        | Write(Z(t2))        | Write(Z(t3))
----------

With VTune I verified that the serial sections of my code are very small compared with the parallel sections, and I get linear scaling when the number of threads is less than or equal to the number of physical cores. Regarding what Jim said, it is possible that Turbo Boost decreases the processor frequency when all physical cores are running. I don’t know if there are other processor features that decrease the performance. Can I avoid that from the command line, without BIOS settings? Thanks!
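For reference, on a Linux kernel using the intel_pstate driver, Turbo Boost can usually be toggled from userspace without touching the BIOS. The sysfs path below is driver-dependent, so check that it exists on your system before relying on it:

```shell
# Disable Turbo Boost (intel_pstate driver; path is system-dependent):
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Re-enable it afterwards:
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```

With turbo disabled, the per-core frequency no longer drops as more cores become active, which removes one confound from the scaling measurements.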
Black Belt

abdule m. wrote:
Write(Z(t1)) | Write(Z(t2)) | Write(Z(t3))
Is the output file required to be written in order (t1, t2, t3, ...)? If so, and if additional speed-up is desired, then consider a parallel pipeline approach. This will also help to some extent when the output file is unordered.

Create a list of buffers, larger than the number of cores, as a shared resource (with a thread-safe queue); or, for each thread, create N buffers (N > 1), which need not be thread-safe. Then oversubscribe by one thread. To learn something new about OpenMP, use an OpenMP task inside a loop to enqueue either a write task (when results are present) or a compute task (when a buffer is available).

I will let you experiment with your first attempt(s). When you get stuck, come back. Note: this may benefit from using !$OMP ATOMIC for the counter.

Jim Dempsey

Employee
Abdule, it would also definitely be interesting to see the hot-function profile (using VTune Amplifier XE, for example) for your application with 4 and 5 threads. Do you see any new hotspots in the 5-thread case? I remember that with an odd number of active HT threads, some OpenMP implementations spent lots of CPU cycles in barrier synchronization functions. The reason was that the OpenMP implementation was not aware of HT and simply split the work equally and statically between the hyper-threads. The threads mapped to the same physical core were obviously slower than the ones that had a physical core to themselves, so the faster threads spun in OpenMP barrier functions until the slower threads were done. A dynamic, work-stealing distribution works better in such situations. Roman
Beginner
Thanks for your reply, Roman and Jim. I have a doubt about the shared FPU that Jim mentioned in his first comment. When multiple cores are executing large numerical computations, for example, is the CPU usage around 50% on a dual-core processor, never reaching 100%? Can both cores not execute numerical computations at the same time? What was the idea of Intel's designers in sharing the FPU? Thanks
Beginner
YES, I am a huge fan of hyper-threaded CPUs, as Linux seems to love them. If you would like a visual representation, there is a video Tom's Hardware did comparing a 3.0 GHz hyper-threaded P4 with a 3.6 GHz non-hyper-threaded one.
Black Belt
>>When multiple cores are executing large numerical computations, is the CPU usage around 50% on a dual-core processor, never reaching 100%? Can both cores not execute numerical computations at the same time? What was the idea of Intel's designers in sharing the FPU?

On a single core with HT (2 hardware threads, one floating-point unit), assuming a code section where both threads execute only FPU instructions, the performance monitor will show both threads at 100%. A processor stalled on the FPU does not appear idle.

>>50%

This would indicate that you are in a serial portion of your code that extends past the KMP_BLOCKTIME or similar spin-wait duration; in other words, longer than the spin-wait time (the default is ~200 ms, but it can be set anywhere from 0 upward). Or it can mean that each thread is waiting for I/O or an event.

Jim Dempsey
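The spin-wait duration Jim mentions is controlled through the environment in the Intel OpenMP runtime; the values below are only examples:

```shell
# Intel OpenMP runtime spin-wait tuning (values are examples):
export KMP_BLOCKTIME=200   # milliseconds a thread spins before sleeping (the default)
# export KMP_BLOCKTIME=0   # park threads immediately, freeing the sibling hardware thread
```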
Black Belt
As to whether Linux or Windows works better with hyper-threading, the only definite answer I know is that the Windows "hyperthreading-aware" scheduler was introduced in Windows 7, long after the value was established in Linux. Also, Linux may have better affinity tools, but you would normally use the ones that come with the compiler (KMP_AFFINITY for Intel software products, GOMP_CPU_AFFINITY for GNU). As you will see if you study the documents about MKL, it is sometimes possible to get higher CPU utilization when only one thread per core is accessing the FPU. If your only goal is to see 100%-busy threads, you may not like the extra performance. On the other hand, when a thread is stalled waiting for a floating-point divide to complete, you have a typical opportunity for hyper-threading to show gains. The Core i7-3 "Ivy Bridge" CPUs greatly reduce those latencies of divide and sqrt.
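The affinity variables mentioned above are set in the environment before launching the program. The CPU IDs below are examples and are machine-specific (check /proc/cpuinfo or hwloc's lstopo for your core-to-thread mapping):

```shell
# GNU OpenMP (gcc/gfortran): pin threads to logical CPUs 0-3
export GOMP_CPU_AFFINITY="0 1 2 3"
# Intel OpenMP: spread threads one per physical core
export KMP_AFFINITY="granularity=fine,compact,1,0"
```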
Beginner
I believe that TSX suppresses all relevant exceptions. By "bad instruction" I meant not an illegal instruction, but an instruction that cannot be executed inside a transaction.
Black Belt
>>>On the other hand, when a thread is stalled waiting for a floating point divide to complete, you have a typical opportunity for hyperthreading to show gains.

What gains could be achieved in the purely hypothetical case where, for example, two threads executing floating-point code are competing for Port 0 and Port 1, besides exploiting parallelism and, for example, prefetching data?