Intel® Optimized AI Frameworks
Receive community support for questions related to PyTorch* and TensorFlow* frameworks.

Running multiple Docker containers with Intel-optimized TensorFlow on one CPU with 8 physical cores (16 logical cores)

OosakiKaNa
Beginner
5,154 Views
Hello, I find that Intel-optimized TensorFlow greatly speeds up the training phase. I want to run 3 Docker containers on a CPU with 8 physical cores (16 logical cores), and I give each container 4 logical cores. How should I set intra_/inter_op_parallelism_threads and OMP_NUM_THREADS? When one container runs, training takes 17 s per epoch, but when I run 3 containers, training takes 50 s per epoch in every container. For reference, I set intra_/inter_op_parallelism_threads=2, OMP_NUM_THREADS=2, and KMP_BLOCKTIME=1 in the container. Can you tell me why this happens?
0 Kudos
1 Solution
18 Replies
AdrianM_Intel
Employee
5,136 Views

Hello OosakiKaNa,

Thank you for posting on the Intel® communities.

To better assist you, we have moved your question to another forum.

Regards,
Adrian M.
Intel Customer Support Technician

0 Kudos
AthiraM_Intel
Moderator
5,116 Views

Hi,


Could you please share the following details:

1) Docker images you used
2) Complete steps to reproduce the issue, including the commands you used
3) Intel TensorFlow version used
4) OS details



Thanks




0 Kudos
OosakiKaNa
Beginner
5,113 Views

Docker image: intel/intel-optimized-tensorflow:2.2.0-centos-8-mpich-horovod

OS: CentOS 8

docker run -itd --cpuset-cpus=1,2,3,4 -v /home/liangliang/nfscontent/:/tf/tft/output tft:v1

tft:v1 is the image for my program.

 

thanks

0 Kudos
AthiraM_Intel
Moderator
5,074 Views

Hi,

 

Thanks for sharing the details.

Could you please share the log file produced with KMP_AFFINITY verbose output enabled, i.e. KMP_AFFINITY=verbose?

Please see the link below for more information:

https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

You can also try increasing OMP_NUM_THREADS: set OMP_NUM_THREADS=8 and check whether there is any improvement.
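For reference, the settings discussed in this thread can also be applied from Python. The following is only a minimal sketch under stated assumptions: the helper names `apply_thread_env` and `configure_tensorflow` are hypothetical, the values are the ones quoted in this thread rather than recommendations, and the environment variables are set before TensorFlow is imported so that the OpenMP runtime can pick them up. The `tf.config.threading` calls are the TF 2.x API; code using `tensorflow.compat.v1` sessions would normally pass the same two numbers through `ConfigProto` instead.

```python
import os


def apply_thread_env(omp_threads, blocktime=1,
                     affinity="granularity=fine,verbose,compact,1,0"):
    """Export the OpenMP/KMP variables; call this before importing TensorFlow."""
    env = {
        "OMP_NUM_THREADS": str(omp_threads),
        "KMP_BLOCKTIME": str(blocktime),
        "KMP_AFFINITY": affinity,  # "verbose" makes the runtime log thread bindings
    }
    os.environ.update(env)
    return env


def configure_tensorflow(intra_op, inter_op):
    """Set the TensorFlow thread-pool sizes (must run before any op executes)."""
    import tensorflow as tf  # imported late so the variables above are already set

    tf.config.threading.set_intra_op_parallelism_threads(intra_op)
    tf.config.threading.set_inter_op_parallelism_threads(inter_op)


# Example values taken from this thread (illustrative only).
settings = apply_thread_env(omp_threads=4)
print(settings)
# In a training script, follow with: configure_tensorflow(intra_op=4, inter_op=2)
```

Keeping the values in one place makes it easier to repeat the same run with OMP_NUM_THREADS set to 4 and then 8 and compare the epoch times.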

 

 

Thanks.

 

 

 

0 Kudos
OosakiKaNa
Beginner
5,062 Views

Hi!

Thanks for your advice.

I should share more details.

The environment variables in my Intel-optimized TensorFlow container are:
ENV OMP_NUM_THREADS='4'
ENV KMP_BLOCKTIME='1'
ENV KMP_AFFINITY=granularity=fine,verbose,compact,1,0

I start the container with: docker run -itd --cpuset-cpus=7,8,9,10

I also set tf.config intra_/inter_op_parallelism_threads = 4, 2.

This is the verbose output when I run one container:

image.png

The training phase takes 23 s, which is very fast!

When I set OMP_NUM_THREADS='8' and keep the other parameters fixed, training is very slow. With 4, training is fast.

But when I run two containers (the other one runs on CPUs 1,2,3,4):

OosakiKaNa_0-1628232167130.png

You can see that the training time increases, and I don't know why.

This is the htop status of my host:

OosakiKaNa_1-1628232241797.png

 

thanks.

0 Kudos
Louie_T_Intel
Moderator
4,896 Views

Hi

 

From the KMP verbose log, you can see 8 threads bound to CPUs 7-10 when you set OMP_NUM_THREADS='4'.

If you have hyperthreading on, each thread can use one hyperthread, because the number of hyperthreads is 8 in this case.

However, when you set OMP_NUM_THREADS='8', you will have 16 threads competing for 8 hyperthreads, and performance will suffer.
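One way to check for this kind of oversubscription from inside a container is to compare the planned thread count with the logical CPUs the process is allowed to use. The following is a minimal sketch, assuming Linux (where `os.sched_getaffinity` reflects the CPU set given to the container); `available_cpus` and `excess_threads` are illustrative helper names, and the thread counts in the example are the ones quoted above.

```python
import os


def available_cpus():
    """Logical CPUs this process may run on (reflects --cpuset-cpus on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # fallback for platforms without affinity masks


def excess_threads(total_threads, cpus):
    """Number of threads that have no logical CPU of their own (0 if none)."""
    return max(0, total_threads - cpus)


print("logical CPUs visible to this process:", available_cpus())
print("excess threads for 16 threads on 8 CPUs:", excess_threads(16, 8))
print("excess threads for 8 threads on 8 CPUs:", excess_threads(8, 8))
```

A non-zero result means some threads have to time-share a logical CPU, which is the situation described above for OMP_NUM_THREADS='8'.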

 

 

For the two-container case, are you running your workloads on a system with 2 sockets?

If so, you might need to use numactl to keep all threads within a container on one socket instead of two, which reduces remote NUMA accesses.

 

regards

 

 

0 Kudos
OosakiKaNa
Beginner
4,886 Views

Hi~

Thanks for your reply.

I don't run my workloads on a system with 2 sockets.

This is my computer's CPU information:

OosakiKaNa_0-1629939364467.png

But tomorrow my company is buying 10 computers with Gold 6248R, 2 sockets, 24C/48T.

I actually use k8s to manage my model on 29 computers. Do you know how I can make all threads within a container run on one socket instead of two with k8s settings?

My English is poor, sorry.

Regards

 

0 Kudos
AthiraM_Intel
Moderator
5,005 Views

Hi,


We are looking into your issue. Could you please share a sample reproducer and the complete steps so that we can try it on our end?


Thanks


0 Kudos
OosakiKaNa
Beginner
4,994 Views

Hi!

What should I do? Should I send you my program and dataset?
I don't know how to do this, please tell me.
Thanks

0 Kudos
AthiraM_Intel
Moderator
4,961 Views

Hi,


Yes, you can share your sample reproducer and the commands used. We will contact you about this through private message shortly.


Thanks


0 Kudos
OosakiKaNa
Beginner
4,914 Views
Hi
I am sorry for taking so long to reply.
My company doesn't let me share the program and data.

Actually, I have given up on this issue. I think it may be a hardware limitation, so it can't be solved with software settings.
The model is not very complex: it has just 220K parameters, and the data is just an Excel file with 10K rows and 13 columns.
But this code was not written to run in a Docker container.
I run the model with Intel-optimized TensorFlow 2.2.0, but I don't use the TensorFlow 2 features.
I import tensorflow.compat.v1 as tf, so I think using TF 2.0 might bring some improvement.
But I can't run the experiment with this setting right now. If I have time I will try, and I will contact you.
So the issue may be closed.
My English is poor, sorry.
Thanks for your help!
0 Kudos
Jianyu_Z_Intel
Employee
4,828 Views

Hi,

To simplify the description, we refer to physical cores in this topic.

I think that in your case each container is given the same number of cores, but the containers share some of those cores at the same time. As a result, performance drops to about 1/3 of that of a single container.

To resolve this issue, please assign different cores to different containers, for example:

docker run -it --cpuset-cpus="1,2" ubuntu /bin/bash
docker run -it --cpuset-cpus="3,4" ubuntu /bin/bash
docker run -it --cpuset-cpus="5,6" ubuntu /bin/bash

Refer to: https://docs.docker.com/config/containers/resource_constraints/
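The same idea can be scripted. The following minimal Python sketch splits the physical cores into non-overlapping groups and builds one `docker run` command line per container. It assumes that logical CPU i and logical CPU i + (number of physical cores) are hyperthreads of the same core, which should be verified on the host (for example with lscpu); `plan_cpusets` and `docker_command` are hypothetical helper names, not part of Docker, and `tft:v1` is the image name mentioned earlier in this thread.

```python
def plan_cpusets(num_physical_cores, num_containers, cores_per_container,
                 use_hyperthreads=True):
    """Return one --cpuset-cpus string per container, with no core shared."""
    if num_containers * cores_per_container > num_physical_cores:
        raise ValueError("not enough physical cores for disjoint cpusets")
    plans = []
    for i in range(num_containers):
        first = i * cores_per_container
        cores = list(range(first, first + cores_per_container))
        cpus = list(cores)
        if use_hyperthreads:
            # ASSUMPTION: the sibling hyperthread of core c is c + num_physical_cores.
            cpus += [c + num_physical_cores for c in cores]
        plans.append(",".join(str(c) for c in cpus))
    return plans


def docker_command(cpuset, image):
    """Build the argument list for a detached container pinned to `cpuset`."""
    return ["docker", "run", "-itd", f"--cpuset-cpus={cpuset}", image]


# Three containers with two physical cores each on an 8-core CPU.
for cpuset in plan_cpusets(8, 3, 2):
    print(" ".join(docker_command(cpuset, "tft:v1")))
```

With 8 physical cores and 3 containers the split cannot be even, so this example uses 2 cores per container and leaves 2 cores unassigned.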

 

Thank you!

0 Kudos
OosakiKaNa
Beginner
4,820 Views

Hi!

Thanks for your reply.

Please take a look at my reply from 08-05-2021 11:46 PM.

I run the two Docker containers on CPUs 7,8,9,10 and 1,2,3,4.

My computer has 32 GB of RAM. I set the containers to run on different CPUs, but the issue still exists.

My English is poor, sorry.

Regards

0 Kudos
Jianyu_Z_Intel
Employee
4,817 Views

Hi,

Don't worry! I fully understand you.

Your CPU has 8 physical cores, indexed 0-7.

Index 8 and index 0 are in fact the same core.

In your case, with CPUs 7,8,9,10 and 1,2,3,4:

    1 and 9 are the same core, and so are 2 and 10.

That means the two containers share 2 cores (1(9) and 2(10)), which hurts performance.

If you want to use 4 cores per container, please use 0-3 and 4-7.

Avoid assigning one core to more than one container.
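The overlap described above can be checked with a short script. This is a minimal sketch, assuming the numbering stated in this reply (logical CPU i and i + 8 are the same physical core); the helper names are illustrative. On a real Linux host, `core_key_from_sysfs` reads the sibling list from sysfs instead of relying on that assumption.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux CPU list such as '0-3,8' into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def core_key_assumed(cpu, num_physical_cores=8):
    """ASSUMPTION: logical CPUs i and i + num_physical_cores share one core."""
    return cpu % num_physical_cores


def core_key_from_sysfs(cpu):
    """Identify a core by its smallest sibling, as reported by the Linux kernel."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return min(parse_cpulist(path.read_text()))


def shared_cores(cpuset_a, cpuset_b, core_key=core_key_assumed):
    """Return the physical cores used by both cpusets, sorted."""
    cores_a = {core_key(c) for c in parse_cpulist(cpuset_a)}
    cores_b = {core_key(c) for c in parse_cpulist(cpuset_b)}
    return sorted(cores_a & cores_b)


print(shared_cores("7,8,9,10", "1,2,3,4"))    # [1, 2]
print(shared_cores("0-3,8-11", "4-7,12-15"))  # []
```

The first call reproduces the case from this thread: the two cpusets look disjoint as logical CPUs but still share physical cores 1 and 2.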

 

  Thank you!

  In my example:

  

  

  

0 Kudos
OosakiKaNa
Beginner
4,809 Views

Hi!

Thanks for your fast reply.

I will run the experiment with this setting.

But I actually manage my model on 34 computers with k8s. k8s controls the Docker containers with cgroups, and it can't assign physical cores (or maybe it can, I don't know yet).

So if this issue comes from CPU sharing (a hardware issue), it can't be solved by software settings (I guess).

I just want to know what causes this problem.

Thanks

0 Kudos
Jianyu_Z_Intel
Employee
4,799 Views

Hi,

For the K8S case, Intel provides a solution for CPU pinning: CPU Manager for Kubernetes* (also called CMK).

Here is the guide for it:

https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf

If you have more questions about CMK, please create a new issue for CMK in the Intel Community.

Good luck!

Thank you!

   

 

0 Kudos
OosakiKaNa
Beginner
4,792 Views

Hi!

From your reply I now know the cause of the issue and the tools to solve it.

So the issue is closed!

Thank you, and thanks to everyone in the community!

Thanks, Intel!

0 Kudos
Jianyu_Z_Intel
Employee
4,784 Views

Hi,

It's our pleasure!

Thank you for your support!

  

  

0 Kudos