I have a finite-difference (FD) application for wave propagation. I ran it on KNC and got linear speedup up to 16 threads, then sub-linear speedup up to 180 threads, but from 180 to 240 threads the speedup dropped. Can someone explain the reason behind this behaviour?
Linear speedup up to 16 threads (scatter) indicates you may have hit a memory bandwidth limit. The trail-off after 180 threads usually indicates that you are hitting the throughput capacity of the vector unit (FPU) pipeline. I suggest you try to optimize L1 and L2 cache utilization.
It is not uncommon for high-bandwidth codes to experience performance losses when using "too many" threads. This is primarily due to DRAM bank conflicts.
On KNC there are 16 32-bit GDDR5 channels, with a pair of 16-bit-wide GDDR5 chips on each channel. The GDDR5 DRAM chips each have 16 banks, so there are a total of 256 DRAM banks available. (The two chips on each channel operate in lockstep, so their banks combine into a single set of wider banks instead of acting as independent banks.)
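As a rough sketch of that bank arithmetic (the variable names are just for illustration, and the numbers are the ones quoted above):

```python
# Independent DRAM banks on KNC, using the figures quoted in this post.
channels = 16            # 32-bit GDDR5 channels
chips_per_channel = 2    # 16-bit-wide GDDR5 chips, operating in lockstep
banks_per_chip = 16

# Because the two chips on a channel run in lockstep, their banks merge
# into wider banks, so the chip count does not multiply the bank count.
independent_banks = channels * banks_per_chip

# What the count would be if the chips were independent (they are not).
naive_count = channels * chips_per_channel * banks_per_chip

print(independent_banks)  # 256
print(naive_count)        # 512
```

The point of the second number is only to show why the lockstep detail matters: counting chips instead of channels would overstate the available banks by a factor of two.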
Using the STREAM benchmark, I get the best performance on KNC when using one thread per core. For the STREAM Triad kernel, there are two read streams and one write stream for each thread, so a 61-thread job generates 183 memory access streams on the Xeon Phi SE10P. These 183 memory access streams fit reasonably well into the 256 available banks. When I go to 2 threads per core, the code is now generating 366 memory access streams, and these do not fit into the 256 available banks, so performance drops as the banks have to be repeatedly opened and closed to satisfy the competing/conflicting accesses.
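The stream counts above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up, and comparing the stream count against the bank count is a rough heuristic, not a model of how the memory controller actually schedules accesses.

```python
BANKS = 256            # independent DRAM banks on KNC (from the post above)
CORES = 61             # cores on the Xeon Phi SE10P

def access_streams(threads, reads_per_thread=2, writes_per_thread=1):
    """Total memory access streams; defaults match STREAM Triad."""
    return threads * (reads_per_thread + writes_per_thread)

for threads_per_core in (1, 2):
    threads = CORES * threads_per_core
    streams = access_streams(threads)
    # Heuristic only: more streams than banks suggests frequent bank conflicts.
    print(threads, streams, streams <= BANKS)
```

With one thread per core this gives 183 streams, which is under 256. With two threads per core it gives 366, which is not.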
The same sort of behavior occurs on mainstream Xeon processors (e.g., Xeon E5-2690 v3), and in these processors there are enough hardware performance counter events in the memory controllers to verify that too many memory access streams will cause large increases in the DRAM bank conflict rates.
Most finite difference codes generate a lot more memory access streams per thread than the STREAM benchmark. If the number of memory access streams is too high, you may run into performance problems before you even reach one thread per core. If I recall correctly, my preferred version of the "SWIM" 2D shallow water model reached its maximum performance on KNC with somewhere between 32 and 48 threads (spread across 32 to 48 cores). Part of this was due to increased DRAM bank conflicts as the thread count increased and part was due to the increasing overhead of parallel synchronization for the short parallel loops handling the boundaries of the 2D grid.
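To get a feel for how quickly a stencil code can run out of banks, the same heuristic can be turned around to estimate how many threads fit before the stream count exceeds the bank count. This is a crude rule of thumb, not a prediction. The value of 10 streams per thread is a hypothetical figure chosen for illustration, not a measurement of SWIM or any other real code.

```python
BANKS = 256  # independent DRAM banks on KNC (from the post above)

def max_threads_within_banks(streams_per_thread, banks=BANKS):
    """Largest thread count whose total streams do not exceed the bank count."""
    return banks // streams_per_thread

# STREAM Triad: 2 reads + 1 write per thread.
print(max_threads_within_banks(3))   # 85

# Hypothetical finite-difference kernel touching 10 arrays per thread
# (an assumed number, for illustration only).
print(max_threads_within_banks(10))  # 25
```

Under this assumption the hypothetical kernel would exceed the bank count well before reaching one thread per core on a 61-core part, while STREAM Triad would not.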
This reduction in performance with increasing thread counts is not a problem if the bandwidth obtained in the best cases is close to the expected maximum sustainable values. For KNC, well-behaved codes using transparent huge pages could typically sustain bandwidths of over 140 GB/s using the best configurations. STREAM did a bit better (about 175 GB/s), and some sparse matrix-vector codes (with more reads and few writes) delivered over 180 GB/s sustained bandwidth.