I have C++ code that detects cars and pedestrians in a video stream using optimized models (xml + bin). When I run this code on a 2-core i3, I get 14 fps. On a Xeon Gold 6132 I get 17 fps. Why is the difference between these CPUs so small? Second case: multiple copies of this code running on Xeon Gold 6132s. With 4 threads: 1 copy gives 14 fps, 2 copies give 11 fps, 3 copies give 8 fps. With 20 threads: 1 copy gives 17 fps, 2 copies give 11 fps, 3 copies give 7 fps. The average load on every thread in use stays below 100% (~60-80%). Why is the difference so small even when I use 5x more threads? What causes these problems, and how can I solve them?
Please read the following blog post. I think it will help you. It covers many performance topics, including "Throughput Mode". Long story short: try to match the number of "Infer requests in flight" to the number of available physical CPU cores.
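To illustrate the "requests in flight" idea, here is a minimal sketch in plain standard C++. It does not use the inference library's API: `fake_infer` and `run_pipeline` are hypothetical names, and `fake_infer` is only a stand-in for submitting one frame to an asynchronous infer request. The point is the pattern: keep up to N requests running at once, and wait for the oldest one only when the limit is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <future>
#include <vector>

// Hypothetical stand-in for one inference call. A real application would
// start an asynchronous infer request on the loaded network here.
int fake_infer(int frame_id) {
    return frame_id * 2;  // placeholder "result"
}

// Process all frames while keeping at most `max_in_flight` requests running.
// Results are returned in the same order as the input frames.
std::vector<int> run_pipeline(const std::vector<int>& frames,
                              std::size_t max_in_flight) {
    assert(max_in_flight > 0);
    std::deque<std::future<int>> in_flight;
    std::vector<int> results;
    results.reserve(frames.size());

    for (int frame : frames) {
        // Limit reached: wait for the oldest request before submitting more.
        if (in_flight.size() == max_in_flight) {
            results.push_back(in_flight.front().get());
            in_flight.pop_front();
        }
        in_flight.push_back(std::async(std::launch::async, fake_infer, frame));
    }

    // Drain the requests that are still running.
    while (!in_flight.empty()) {
        results.push_back(in_flight.front().get());
        in_flight.pop_front();
    }
    return results;
}
```

You would call it as `run_pipeline(frames, n)`, with `n` set to the number of physical cores available to the process. Note that `std::thread::hardware_concurrency()` reports logical threads, which is usually more than the physical core count when Hyper-Threading is enabled, so the value may need to be set by hand.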
Also kindly take a look at the following document: