How to optimize tensorflow2/keras on a machine with two XEON Gold 6230 CPUs?

davideps · ‎04-25-2021

I'm running a Windows 10 Enterprise 64-bit machine with two Xeon Gold 6230 CPUs (20 physical cores each) and Anaconda Python 3.8.8 64-bit. I installed the packages with:

conda install tensorflow-mkl keras -c anaconda

I'm using mnist_convnet.py to experiment with configurations with the goal of maximizing usage of both CPUs.

By default, the code uses all cores on a single CPU. I then added "config" to the imports and these lines to the code:

config.threading.set_inter_op_parallelism_threads(0)
config.threading.set_intra_op_parallelism_threads(0)
config.set_soft_device_placement(True)

This had no impact. Changing the argument of "set_inter_op_parallelism_threads" to 2 (the value I expected to trigger usage of both CPUs) had no effect either. All other settings I tried greatly reduced performance. I have several interrelated questions:

1. How can I get tensorflow/keras to use both CPUs?
2. Did I choose a poor example for multi-CPU execution? If so, what is a better example?
3. Despite specifying `tensorflow-mkl` on install, the sanity check fails (result is False). Does that explain this problem? If so, how can I fix it?
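For readers who want to run the sanity check from question 3 themselves, here is a minimal sketch. It assumes the check is the `IsMklEnabled()` flag exposed by TensorFlow's private `_pywrap_util_port` module, whose location differs between releases. The helper name `mkl_enabled` is hypothetical, and because these are private modules they may change without notice.

```python
import importlib


def mkl_enabled():
    """Return True/False from TensorFlow's MKL flag, or None if it cannot be read."""
    # The private module moved between releases, so try both known locations.
    candidates = (
        "tensorflow.python.util._pywrap_util_port",  # assumed: TF 2.5 and later
        "tensorflow.python._pywrap_util_port",       # assumed: TF 2.0 through 2.4
    )
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # TensorFlow missing, or module not at this path
        check = getattr(module, "IsMklEnabled", None)
        if check is not None:
            return bool(check())
    return None


if __name__ == "__main__":
    print("MKL enabled:", mkl_enabled())
```

A result of `False` means the installed binary was built without the MKL/oneDNN optimizations. `None` means the flag could not be read at all.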

JoseH_Intel · ‎04-25-2021

Hello davideps,

Thank you for joining the Intel community.

Please allow us some time to research your question. We will get back to you as soon as we have updates.

Regards

Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios

davideps · ‎04-27-2021

Hi Jose, thank you for your response. Can you tell me if anyone else has reported this issue on machines with two chips (any model) and whether you can recreate the problem based on the code I supplied?

AthiraM_Intel · ‎04-27-2021

Hi,

To maximize TensorFlow performance on CPU, you can tune settings such as intra_op_parallelism_threads, inter_op_parallelism_threads, data layout, KMP_AFFINITY, KMP_BLOCKTIME, and OMP_NUM_THREADS. The recommended settings are available at the link below:

https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

Please follow this article for the OpenMP settings.
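As a rough illustration of the OpenMP settings named above, the sketch below sets them from Python before TensorFlow is imported. The values are assumptions to treat as starting points rather than tuned recommendations: `OMP_NUM_THREADS` is set to the 40 physical cores described in the question (20 per socket may be worth comparing), and the `KMP_BLOCKTIME` and `KMP_AFFINITY` values are commonly cited defaults. The helper names `intel_openmp_env` and `apply_env` are hypothetical.

```python
import os


def intel_openmp_env(physical_cores, blocktime_ms=1):
    """Build the OpenMP-related variables as a dict of strings."""
    return {
        "OMP_NUM_THREADS": str(physical_cores),
        "KMP_BLOCKTIME": str(blocktime_ms),
        # Assumed starting value; adjust or remove after measuring.
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    }


def apply_env(settings, environ=os.environ):
    """Write the settings into the process environment and return a copy."""
    environ.update(settings)
    return dict(settings)


# Assumed topology from the question: 2 sockets x 20 physical cores.
settings = apply_env(intel_openmp_env(physical_cores=40))
# Import TensorFlow only after this point so the OpenMP runtime sees the values.
```

These variables only affect builds that use Intel's OpenMP runtime, so they are unlikely to change anything if the MKL/oneDNN optimizations are not enabled in the installed binary.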

Regarding the installation, you can use one of the installation options from the link below:

https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html

For Windows, you can use either of the commands below, or you can build TensorFlow from source:

conda install tensorflow-mkl

conda install tensorflow-mkl -c anaconda

The steps to build TensorFlow from source are available in the documentation linked above.

We are checking on your other queries internally and will get back to you soon with an update.

Thanks.

AthiraM_Intel · ‎04-29-2021

Hi,

Regarding your second query, "Did I choose a poor example for multi-CPU execution? If so, what is a better example?":

You can use the same sample (mnist_convnet.py); it works fine with multi-threading.

Regarding the sanity check, we are investigating on our end and will let you know soon.

Could you please let us know which version of TensorFlow you are using?

Thanks.

davideps · ‎05-02-2021

Hi Athira,

"conda list" shows:

tensorflow 2.3.0 mkl_py38h37f7ee5_0
tensorflow-base 2.3.0 eigen_py38h75a453f_0
tensorflow-estimator 2.3.0 pyheb71bc4_0 anaconda
tensorflow-mkl 2.3.0 h93d2e19_0

AthiraM_Intel · ‎05-03-2021

Hi,

Regarding the sanity check, we are able to reproduce the issue. We are investigating it internally and will update you soon.

Thanks.

davideps · ‎05-03-2021

Thanks, Athira. Good to know that it wasn't me failing the sanity check.

Would the problem cause config settings like this (below) to misbehave?

config.threading.set_inter_op_parallelism_threads(2)
config.threading.set_intra_op_parallelism_threads(0)

One of my initial questions was whether I need distributed workers to get both XEON CPUs on a single machine to share the load or whether the XEON platform should do this without distributed workers. Of course, I'm hoping distributed workers aren't necessary on a single machine since I believe that approach is designed for multiple machines in a network and will be slower than two CPUs that already share memory.
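One way to confirm that the threading calls quoted above actually take effect is to read the values back after setting them. This is a minimal sketch, assuming TensorFlow 2.x and that the calls run before TensorFlow executes any op (the setters raise `RuntimeError` once the runtime is initialized). The helper name `configure_threads` and the values 40 and 2 are illustrative assumptions based on the machine described in this thread. A value of 0 leaves the choice to TensorFlow.

```python
def configure_threads(intra_op, inter_op):
    """Set TensorFlow's thread pools and return the values TensorFlow reports back."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment
    # These setters must run before TensorFlow executes any op;
    # afterwards they raise RuntimeError.
    tf.config.threading.set_intra_op_parallelism_threads(intra_op)
    tf.config.threading.set_inter_op_parallelism_threads(inter_op)
    return (
        tf.config.threading.get_intra_op_parallelism_threads(),
        tf.config.threading.get_inter_op_parallelism_threads(),
    )


# Assumed values: 40 physical cores in total, 2 sockets.
applied = configure_threads(intra_op=40, inter_op=2)
print("intra/inter threads:", applied)
```

If the values read back as expected but only one CPU is busy, the thread settings themselves are probably not the limiting factor.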

davideps · ‎05-11-2021

Hi Athira. Is there any update on this issue or an estimate of when it might be resolved?

Ying_H_Intel · ‎05-18-2021

Hi David,

Sorry for the delay. Could you please check your Python version and TensorFlow version?

>python

exit()
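The exact commands for the version check are not shown in the post above. As one possible way to gather the same information, the sketch below reports the Python version and looks up installed package versions without importing TensorFlow. It assumes Python 3.8 or later (for `importlib.metadata`) and that the package is registered under the distribution name `tensorflow` or `intel-tensorflow`. The helper name `installed_versions` is hypothetical.

```python
import sys
from importlib import metadata  # available from Python 3.8


def installed_versions(names=("tensorflow", "intel-tensorflow")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


python_version = "{}.{}.{}".format(*sys.version_info[:3])
versions = installed_versions()
print("Python:", python_version)
print("Packages:", versions)
```

In a conda environment, `conda list` gives the same information and also shows the build string (for example `mkl` versus `eigen`).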

We investigated the problem and found the following:

On Windows with Python 3.8, we got TF v2.3, but oneDNN is not enabled in this binary.

On Windows with Python 3.7, we got TF v2.1, and oneDNN is enabled in this binary.

Therefore, to get the Intel-optimized TF, you can install Python 3.7 and TF 2.1.

Thanks

Ying

davideps · ‎05-19-2021

I'm using Python 3.8 and TF 2.3. I'll downgrade both. Thank you!

Ying_H_Intel · ‎06-01-2021

Hi David,

Does the new version work? Please feel free to let us know if you have any further problems.

Also, for your reference: Intel TF 2.4 is now available for Python 3.8 and can be installed with

pip install intel-tensorflow==2.4.0

Installation guide: https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html

Performance considerations:

https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

Best Regards,

Ying

Ying_H_Intel · ‎09-23-2021

Hi David,

I hope everything is going well.

I am pleased to let you know that Intel® Optimization for TensorFlow v2.6.0 is now available for Linux and Windows. You are welcome to try the latest version and let us know if you run into any issues.

As this issue has been open for a few months, I will go ahead and close it. Please feel free to update us if there is any news.

Thanks

Ying
