Obviously I'm not using the same CPU, so I'm not expecting identical results. However, I'm seeing negative scaling when multi-threading.
I built Caffe2 with MKL BLAS and OpenMP enabled. I'm using the same benchmark mentioned in the blog post: convnet_benchmarks.py (https://github.com/pytorch/pytorch/blob/master/caffe2/python/convnet_benchmarks.py)
From what I've read, it's often best to set OMP_NUM_THREADS to 1 and MKL_NUM_THREADS to no more than the number of physical cores. So I run the benchmark like so:
export MKL_NUM_THREADS="8"
export OMP_NUM_THREADS="1"
python convnet_benchmarks.py --batch_size 8 --model AlexNet --iterations 10 --warmup_iterations 1 --cpu
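For anyone who wants to check the "number of physical cores" limit before picking a value for MKL_NUM_THREADS, here is a minimal sketch, assuming a Linux machine whose /proc/cpuinfo reports "physical id" and "core id" fields (some VMs and ARM systems do not, in which case it reports the count as unknown). The helper name count_physical_cores is made up for illustration; it is not part of Caffe2 or MKL.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Hypothetical helper: returns 0 when the fields are missing.
    """
    cores = set()
    physical_id = core_id = None
    # Append an empty line so the last processor block is flushed too.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-processor block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            n = count_physical_cores(f.read())
    except OSError:
        n = 0  # not Linux, or the file is not readable
    print("physical cores:", n or "unknown", "| logical CPUs:", os.cpu_count())
```

If the physical count comes out lower than the logical count, hyper-threading is on, and the logical number is not the one to use as the thread limit.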
I use mpstat to monitor core usage and confirm that it is in fact running on multiple cores, and yet performance drops, even when I run the benchmark on only 2 threads. It seems to me that there is a lot of overhead with using MKL_NUM_THREADS. Has anyone else run into similar issues? I've noticed the topic of overhead come up here and there on the forums, but it doesn't seem to be the same issue.
If possible, could you please run export MKL_VERBOSE=1 before the two performance runs and post the output here?
Second, could you unset MKL_NUM_THREADS, try just OMP_NUM_THREADS=2 or 8 as in the article, and post the results?
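The two requests above could be scripted roughly as follows. This is only a sketch: it assumes convnet_benchmarks.py is in the current directory, it reuses the benchmark flags from the original post, and the log file names and the extra 1-thread baseline are my own additions. Each run removes MKL_NUM_THREADS from the environment, sets OMP_NUM_THREADS, and enables MKL_VERBOSE=1 so the MKL call log ends up in the file.

```python
import os
import subprocess
import sys

# Benchmark arguments copied from the original post.
BENCHMARK = [
    "convnet_benchmarks.py", "--batch_size", "8", "--model", "AlexNet",
    "--iterations", "10", "--warmup_iterations", "1", "--cpu",
]


def make_env(omp_threads, base_env=None, mkl_verbose=True):
    """Return a copy of the environment with MKL_NUM_THREADS unset."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("MKL_NUM_THREADS", None)          # let OpenMP alone decide
    env["OMP_NUM_THREADS"] = str(omp_threads)
    if mkl_verbose:
        env["MKL_VERBOSE"] = "1"              # MKL logs each call it makes
    return env


def run_sweep(thread_counts=(1, 2, 8)):
    """Run the benchmark once per thread count, one log file per run."""
    if not os.path.exists(BENCHMARK[0]):
        print("benchmark script not found, skipping:", BENCHMARK[0])
        return
    for n in thread_counts:
        log_path = "bench_omp{}.log".format(n)  # hypothetical file name
        with open(log_path, "w") as log:
            subprocess.run(
                [sys.executable] + BENCHMARK,
                env=make_env(n),
                stdout=log,
                stderr=subprocess.STDOUT,
                check=False,
            )
        print("OMP_NUM_THREADS={} -> {}".format(n, log_path))


if __name__ == "__main__":
    run_sweep()
```

Verbose logging adds some overhead of its own, so it is probably worth repeating the timing runs with mkl_verbose=False before comparing numbers.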