Solved: Vanilla tensorflow vs Intel optimized tensorflow

Girin · ‎10-27-2020

I am training a model on the MNIST dataset with TF and Keras on the DevCloud, and I am noticing that vanilla TF is much faster than Intel optimized TF.

1. Installed vanilla TF using 'pip install tensorflow'. This installed TF 2.3.1. Selected Python 3.7 kernel. Training takes ~2s/epoch.

2. Installed Intel optimized TF with 'pip install intel-tensorflow-avx512'. Installed TF 2.3.0. Selected Python 3.7 kernel. Training takes ~10-15s/epoch.

Tried this multiple times. Similar results. Any idea why?

ArunJ_Intel · ‎10-29-2020

Hi Girin,

The out-of-the-box performance of Intel TensorFlow is not always better than that of stock TensorFlow. To get the performance improvement from Intel TensorFlow, you need to set a few environment variables and make a few minor code changes.

Environment variables

Set KMP_BLOCKTIME=1 and OMP_NUM_THREADS to the number of physical CPU cores in your compute node (you can get this with the lscpu command).

e.g.:

export KMP_BLOCKTIME=1

export OMP_NUM_THREADS=<number of physical cores>

export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
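For anyone unsure how to read the core count from lscpu, here is a minimal Python sketch. It assumes the English field names "Socket(s)" and "Core(s) per socket" in the lscpu output, and it takes physical cores to be sockets multiplied by cores per socket. The helper names (physical_cores_from_lscpu, export_lines, read_lscpu) are hypothetical and only for illustration.

```python
import re
import subprocess


def physical_cores_from_lscpu(lscpu_text):
    """Return sockets * cores-per-socket parsed from lscpu output.

    Assumes the English field names printed by lscpu; hyperthreads
    ("Thread(s) per core") are deliberately not counted.
    """
    sockets = re.search(r"^\s*Socket\(s\):\s*(\d+)", lscpu_text, re.MULTILINE)
    cores = re.search(r"^\s*Core\(s\) per socket:\s*(\d+)", lscpu_text, re.MULTILINE)
    if sockets is None or cores is None:
        raise ValueError("could not find socket/core counts in lscpu output")
    return int(sockets.group(1)) * int(cores.group(1))


def export_lines(physical_cores):
    """Build the export lines from the post for a given core count."""
    return [
        "export KMP_BLOCKTIME=1",
        f"export OMP_NUM_THREADS={physical_cores}",
        "export KMP_AFFINITY=granularity=fine,verbose,compact,1,0",
    ]


def read_lscpu():
    """Run lscpu on the current node and return its text output (Linux only)."""
    result = subprocess.run(["lscpu"], capture_output=True, text=True, check=True)
    return result.stdout
```

On a compute node, printing each line of export_lines(physical_cores_from_lscpu(read_lscpu())) gives the three export commands with the core count filled in.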

Code

Also set inter_op_parallelism_threads to 2. You can do this by adding the line below to your code, before any TensorFlow ops run.

tf.config.threading.set_inter_op_parallelism_threads(2)
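If you are working in a notebook kernel where shell exports are awkward, the same settings can be sketched from Python. This is only a sketch under some assumptions: the variables are set before TensorFlow is imported so the OpenMP runtime is sure to see them, the inter-op value of 2 is the one mentioned above, and setting the intra-op thread count to the number of physical cores is an assumption that is not stated in this thread. The function names are hypothetical.

```python
import os


def set_openmp_env(physical_cores):
    """Set the OpenMP variables from the post in the current process."""
    os.environ["KMP_BLOCKTIME"] = "1"
    os.environ["OMP_NUM_THREADS"] = str(physical_cores)
    os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"


def configure_tensorflow(physical_cores, inter_op=2):
    """Apply the environment settings, then import and configure TensorFlow.

    TensorFlow is imported here, after the environment is set. The threading
    calls must run before TensorFlow executes any op, otherwise TensorFlow
    raises a RuntimeError.
    """
    set_openmp_env(physical_cores)
    import tensorflow as tf  # imported late on purpose

    tf.config.threading.set_inter_op_parallelism_threads(inter_op)
    # Assumption: intra-op threads = physical cores (not stated in the thread).
    tf.config.threading.set_intra_op_parallelism_threads(physical_cores)
    return tf
```

Calling tf = configure_tensorflow(physical_cores=N) as the first cell of the notebook, with N replaced by your own core count, keeps the ordering right. A kernel that has already imported TensorFlow needs a restart first.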

For more BKMs (best-known methods) and recommended settings for Intel TensorFlow, please refer to the link below:

https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

Please try out the above steps and let us know if you are still having issues.

Thanks

Arun Jose

Girin · ‎10-29-2020

Great, thanks for this, Arun. With these settings, I now get a training speed of ~1s/epoch with the MNIST dataset and Intel optimized TensorFlow on the DevCloud.

ArunJ_Intel · ‎10-29-2020

Hey Girin,

Glad to know the solution helped. We will no longer be monitoring this thread. Please feel free to open a new thread if you run into further issues.

Thanks

Arun