I'm working on a complicated project that requires parallel calculations in order to achieve good time performance.
For this purpose our company bought an Intel Xeon Platinum 8168 processor (96 cores; I'll call this machine "96"). We also have a computer with an Intel Core i9-7960X processor (16 cores; I'll call it "16").
I'm using the `#pragma omp parallel for` directive, since all the calculations happen in for loops. And at this point I got strange results.
When I run my code on the 16 machine with fewer than 16 iterations in the for loop, the number of threads (and therefore the number of cores used) is less than 16. In that case I get almost the same timings (I mean, 5 iterations, 10 iterations, and 15 iterations all complete in almost the same time). That makes sense, since not all of the CPU's power is being used.
Then I ran the SAME code on the 96 machine and saw strange timing results. With 40 iterations (i.e. 40 threads), the time is almost twice that of 1 iteration. And with 90 iterations (still NOT full power!) the time increases almost 4x.
My question is: is there some issue specific to this processor (the Intel Xeon Platinum 8168) when working with the IPP libraries?
What could be the reason for such an increase in run time? I'm aware of dynamic memory allocations (we have some) and of the time needed to create a large number of threads, but those still don't seem to be the real cause.
The Core i9-7960x processor has 16 physical cores in one chip. It is capable of supporting HyperThreading, which if enabled would give the appearance of 32 cores.
The Xeon Platinum 8168 processor has 24 physical cores in one chip. It is capable of supporting HyperThreading, which if enabled would give the appearance of 48 cores. The Xeon Platinum 8168 supports systems with up to 8 chips, so a system that reports 96 Logical Processors is either a 2-socket system with HyperThreading enabled, or a 4-socket system with HyperThreading disabled.
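On a Linux system (assuming the "96" machine runs Linux), you can check which of these two configurations you actually have with `lscpu`:

```shell
# Sockets x cores-per-socket x threads-per-core = logical processors.
# e.g. "Socket(s): 2" with "Thread(s) per core: 2" and
# "Core(s) per socket: 24" would explain a report of 96 CPUs.
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
```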
The easiest way to end up with ~1/2 the performance is to run one thread per physical core on the Core i9-7960X, but two threads per physical core on the Xeon Platinum 8168.
The binding of threads to cores is difficult to control in a completely system-independent fashion. If you want to compare single-socket performance, you will need to limit the code to running on a single socket of the Xeon Platinum 8168. In Linux systems, this is typically done with a command like:
numactl --membind=0 --cpunodebind=0 a.out
This forces all memory and processors to be allocated from socket "0". You should then set OMP_NUM_THREADS to the number of threads you want to test, and (MOST IMPORTANT) set OMP_PROC_BIND=spread. The "spread" option will cause the OpenMP runtime to distribute the threads as far apart as possible. Assuming that the Xeon Platinum 8168 system has HyperThreading enabled, the "spread" option will cause one thread to be bound to each *physical* core until all physical cores have been used. For this single socket case you will be able to use up to 24 threads without "doubling up" on any physical core in socket 0.
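Putting those steps together on Linux might look like the following (the binary name `a.out` is just a placeholder for your executable, and the thread count assumes one 24-core socket):

```shell
# Use at most one thread per physical core on the socket.
export OMP_NUM_THREADS=24
# Spread threads as far apart as possible: one per physical core
# before any HyperThread sibling is used.
export OMP_PROC_BIND=spread
# Confine both memory allocation and execution to socket 0.
numactl --membind=0 --cpunodebind=0 ./a.out
```

To study the scaling, rerun with OMP_NUM_THREADS set to each thread count you want to test.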
Some OpenMP jobs will run better if the threads are spread uniformly across the sockets. Again assuming that the Xeon Platinum 8168 system has two sockets and HyperThreading enabled, you simply set OMP_PROC_BIND=spread, and then set OMP_NUM_THREADS to the desired number of threads. Since there are two sockets, you probably want to set the number of threads to an even value. The "spread" option will place 1/2 of the threads on separate physical cores in the first socket and will place the other half of the threads on separate physical cores in the second socket.
Thanks for your response.
I checked the HyperThreading option in the BIOS and it was disabled. But the strange thing is that after enabling it, the time performance got even worse.