Solved: Run more Docker containers with Intel-optimized TensorFlow on one CPU with 8 physical cores (16 logical cores)

OosakiKaNa · ‎08-03-2021

Hello, I find that Intel-optimized TensorFlow gives a great speedup in the training phase. However, I want to run 3 Docker containers on one CPU with 8 physical cores (16 logical cores), and I give every container 4 logical cores. How should I set intra_/inter_op_parallelism_threads and OMP_NUM_THREADS? When one container runs, training takes 17s per epoch, but when I run 3 containers, training takes 50s per epoch in every container. By the way, I set intra_/inter_op_parallelism_threads=2, OMP_NUM_THREADS=2, and KMP_BLOCKTIME=1 in the container. Please tell me why this happens.

AdrianM_Intel · ‎08-03-2021

Hello OosakiKaNa,

Thank you for posting on the Intel® communities.

To better assist you, we have moved your question to another forum.

Regards,

Adrian M.

Intel Customer Support Technician

AthiraM_Intel · ‎08-03-2021

Hi,

Could you please share the following details:

1) Docker images you used?

2) Complete steps to reproduce the issue including the commands you used

3) Intel tensorflow version used

4) OS details

Thanks

OosakiKaNa · ‎08-04-2021

docker images: intel/intel-optimized-tensorflow:2.2.0-centos-8-mpich-horovod

my os: centos8

docker run -itd --cpuset-cpus=1,2,3,4 -v /home/liangliang/nfscontent/:/tf/tft/output tft:v1

tft:v1 is my program image

thanks

AthiraM_Intel · ‎08-05-2021

Hi,

Thanks for sharing the details.

Could you please enable verbose output for KMP_AFFINITY and share the resulting log?

i.e., KMP_AFFINITY=verbose

Please see the link below for more information:

https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

Also, you can try increasing OMP_NUM_THREADS: set OMP_NUM_THREADS=8 and check whether there is any improvement.

Thanks.
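
As a rough illustration of the suggestion above, the verbose modifier can also be added from Python before TensorFlow is imported. This is only a sketch: `add_verbose` is a hypothetical helper, and the fallback value is the KMP_AFFINITY string quoted elsewhere in this thread, not a recommendation.

```python
import os


def add_verbose(kmp_affinity):
    """Return the KMP_AFFINITY string with the 'verbose' modifier present once."""
    parts = [p.strip() for p in kmp_affinity.split(",") if p.strip()]
    if "verbose" in parts:
        return ",".join(parts)
    # Drop an explicit 'noverbose' so the two modifiers do not conflict.
    parts = [p for p in parts if p != "noverbose"]
    return ",".join(["verbose"] + parts)


# The OpenMP runtime reads this variable when it is loaded, so it has to be
# set before TensorFlow is imported. The fallback value is an assumption.
current = os.environ.get("KMP_AFFINITY", "granularity=fine,compact,1,0")
os.environ["KMP_AFFINITY"] = add_verbose(current)
```

With this in place, the affinity report should appear on stderr when the first OpenMP region runs, and that output is the log being asked for here.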

OosakiKaNa · ‎08-05-2021

Hi!

Thanks for your advice.

I should share more details.

The environment variables in my Intel-optimized TensorFlow container are:
ENV OMP_NUM_THREADS='4'
ENV KMP_BLOCKTIME='1'
ENV KMP_AFFINITY=granularity=fine,verbose,compact,1,0

I run the command docker run -itd --cpuset-cpus=7,8,9,10

I also set intra_/inter_op_parallelism_threads = 4 and 2 through tf.config.
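
For readers who want to reproduce these settings from Python, a minimal sketch follows. It assumes the TF1-style API (`tensorflow.compat.v1`) mentioned later in this thread. `make_session_config` is a hypothetical helper name, and the values are simply the ones quoted above, not tuned recommendations.

```python
import os

# Values quoted in this post. The OpenMP runtime reads them when it is loaded,
# so they are set before TensorFlow is imported. setdefault keeps any value
# already provided by the container (for example through ENV in the Dockerfile).
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("KMP_BLOCKTIME", "1")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,verbose,compact,1,0")


def make_session_config(intra_op=4, inter_op=2):
    """Build a TF1-style session config with explicit thread-pool sizes."""
    # Imported lazily so the environment variables above are already in place.
    import tensorflow.compat.v1 as tf

    return tf.ConfigProto(
        intra_op_parallelism_threads=intra_op,
        inter_op_parallelism_threads=inter_op,
    )
```

A session would then be created with `tf.Session(config=make_session_config())`.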

This is the verbose output when I run one container:

The training phase takes 23s, which is very fast!

When I set OMP_NUM_THREADS='8' and keep the other parameters fixed, training is very slow. With it set to 4, training is fast.

But when I run two containers (the other one runs on CPUs 1,2,3,4):

You can see that the training time increases. I don't know why.

And this is the htop status on my host:

thanks.

Louie_T_Intel · ‎08-25-2021

Hi

From the KMP verbose log, you can see 8 threads bound to CPUs 7-10 when you set OMP_NUM_THREADS='4'.

If hyper-threading is on, each thread can use one hyper-thread, because there are 8 hyper-threads in this case.

However, when you set OMP_NUM_THREADS='8', you will have 16 threads competing for 8 hyper-threads, and performance will suffer.
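
The arithmetic behind this can be sketched as below. It rests on an assumption that is not confirmed in this thread: that each of the 2 inter-op threads can start its own OpenMP team of OMP_NUM_THREADS threads, which would explain the 8 and 16 thread counts quoted above. `available_cpus` and `oversubscription` are hypothetical helper names.

```python
import os


def available_cpus():
    """Number of logical CPUs this process may run on (honours a Docker cpuset)."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def oversubscription(omp_num_threads, inter_op_threads, cpus):
    """Return (total worker threads, threads per logical CPU).

    Assumes one OpenMP team per inter-op thread, which is a worst-case guess.
    """
    total_threads = omp_num_threads * inter_op_threads
    return total_threads, total_threads / cpus


for omp in (4, 8):
    total, ratio = oversubscription(omp, inter_op_threads=2, cpus=8)
    print(f"OMP_NUM_THREADS={omp}: {total} threads, {ratio:.1f} per hyper-thread")
```

Inside a container, `available_cpus()` can be passed as `cpus` to check the ratio against the CPUs the container is actually allowed to use. A ratio above 1 means threads have to take turns.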

For the two-container case, do you run your workloads on a system with 2 sockets?

If yes, you might need to use numactl to keep all threads within a container on one socket instead of two, to reduce remote NUMA accesses.
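
As a sketch of the same idea for Docker, the snippet below reads the CPUs of one NUMA node from Linux sysfs and builds the matching `--cpuset-cpus` and `--cpuset-mems` flags. It swaps numactl for Docker's own cpuset flags, which is my substitution rather than something proposed in this thread, and the helper names are hypothetical.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a sorted list."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            low, high = chunk.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(chunk))
    return sorted(cpus)


def node_cpus(node, root="/sys/devices/system/node"):
    """Logical CPUs of one NUMA node, or [] if the node is not present."""
    path = Path(root) / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text()) if path.exists() else []


def docker_numa_flags(node, cpus):
    """Flags that keep a container's threads and memory on one NUMA node."""
    cpu_arg = ",".join(str(c) for c in cpus)
    return [f"--cpuset-cpus={cpu_arg}", f"--cpuset-mems={node}"]


print(docker_numa_flags(0, parse_cpulist("0-3,8-11")))
```

On a real two-socket host, `node_cpus(0)` and `node_cpus(1)` would supply the CPU lists instead of the hard-coded example string.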

regards

OosakiKaNa · ‎08-25-2021

Hi~

Thanks for your reply.

I don't run my workloads on a system with 2 sockets.

This is my computer's CPU information:

But tomorrow my company will buy 10 computers with Gold 6248R, 2 sockets, 24C/48T.

Actually, I use k8s to manage my model on 29 computers, so do you know how I can make all threads within a container run on one socket instead of two with a k8s setting?

My English is poor, sorry.

Regards

AthiraM_Intel · ‎08-12-2021

Hi,

We are checking on your issue. Could you please share a sample reproducer and the complete steps so that we can try it out on our end?

Thanks

OosakiKaNa · ‎08-12-2021

Hi!

What should I do? Send you my program and dataset?
I don't know how to do that, please tell me.
Thanks

AthiraM_Intel · ‎08-17-2021

Hi,

Yes, you can share your sample reproducer and the commands used. We will contact you through private message shortly regarding this.

Thanks

OosakiKaNa · ‎08-24-2021

Hi

I am sorry for taking so long to reply.

My company doesn't let me share the program and data.

Actually, I have given up on this issue. I think it may be a hardware limitation, so it can't be solved with software settings.

The model is not very complex; it has just 220K parameters. The data is just an Excel file with 10K rows and 13 columns.

The model source code is https://github.com/google-research/google-research/tree/master/tft

But this code is not written for running in a Docker container.

I run the model with Intel-optimized TensorFlow 2.2.0, but I don't use the TensorFlow 2 features.

I import tensorflow.compat.v1 as tf, so I think using TF 2.0 might bring some improvement.

But I can't run the experiment with this setting right now. If I have time I will try, and I will contact you.

So the issue may be over.

My English is poor, sorry.

Thanks for your help !

Jianyu_Z_Intel · ‎09-01-2021

Hi,

To simplify the description, I refer to physical cores in this topic.

I think that in your case each container is given the same number of cores, but the containers share some cores at the same time. So the performance of each drops to 1/3 of a single container.

To resolve this issue, please assign different cores to different containers, like:

docker run -it --cpuset-cpus="1,2" ubuntu /bin/bash
docker run -it --cpuset-cpus="3,4" ubuntu /bin/bash
docker run -it --cpuset-cpus="5,6" ubuntu /bin/bash

Refer to: https://docs.docker.com/config/containers/resource_constraints/
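
A small sketch of how such disjoint assignments could be generated for any number of containers is shown below. It uses Docker's `--cpuset-cpus` flag, which pins a container to specific CPUs. `partition_cpus` and `docker_commands` are hypothetical helper names, and the image and command are just the placeholders from the example above.

```python
def partition_cpus(cpu_ids, containers):
    """Split cpu_ids into `containers` disjoint groups of equal size."""
    per_container = len(cpu_ids) // containers
    if per_container == 0:
        raise ValueError("not enough CPUs for the requested number of containers")
    return [
        cpu_ids[i * per_container:(i + 1) * per_container]
        for i in range(containers)
    ]


def docker_commands(groups, image="ubuntu", command="/bin/bash"):
    """Build one docker run command line per CPU group."""
    lines = []
    for group in groups:
        cpu_arg = ",".join(str(cpu) for cpu in group)
        lines.append(f'docker run -it --cpuset-cpus="{cpu_arg}" {image} {command}')
    return lines


for line in docker_commands(partition_cpus([1, 2, 3, 4, 5, 6], 3)):
    print(line)
```

Because the groups never overlap, no CPU in the list is handed to two containers. Any CPUs left over after the even split are simply not assigned.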

Thank you!

OosakiKaNa · ‎09-01-2021

Hi!

Thanks for your reply

Please take a look at my reply from 08-05-2021 11:46 PM.

I run the two Docker containers on CPUs 7,8,9,10 and 1,2,3,4.

My computer has 32GB of RAM. I set them to run on different CPUs, but the issue still exists.

My English is poor, sorry

Regards

Jianyu_Z_Intel · ‎09-01-2021

Hi,

Don't worry! I fully understand your words.

Your CPU has 8 cores, indexed 0-7.

Index 8 and index 0 are in fact the same core.

In your case: CPUs 7,8,9,10 and 1,2,3,4

1 and 9, and 2 and 10, are in fact the same cores.

That means the containers share 2 cores (1(9) and 2(10)). That will impact the performance.

If you want to use 4 cores per container, please use 0-3 and 4-7.

Avoid assigning one core to more than one container.
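
The overlap described above can be checked with a short sketch. `modulo_core_of` encodes the numbering assumed in this reply (CPU n and CPU n + 8 sit on the same physical core). `sysfs_core_of` is an alternative that reads the real topology from Linux sysfs, since the numbering can differ between machines. All helper names are hypothetical.

```python
from pathlib import Path


def modulo_core_of(cpu, physical_cores=8):
    """Numbering assumed in this reply: CPU n and CPU n + 8 share a core."""
    return cpu % physical_cores


def sysfs_core_of(cpu, root="/sys/devices/system/cpu"):
    """Read (package id, core id) of a logical CPU from Linux sysfs."""
    topology = Path(root) / f"cpu{cpu}" / "topology"
    package = int((topology / "physical_package_id").read_text())
    core = int((topology / "core_id").read_text())
    return (package, core)


def shared_cores(cpuset_a, cpuset_b, core_of=modulo_core_of):
    """Physical cores that appear in both cpusets."""
    cores_a = {core_of(cpu) for cpu in cpuset_a}
    cores_b = {core_of(cpu) for cpu in cpuset_b}
    return sorted(cores_a & cores_b)


print(shared_cores([7, 8, 9, 10], [1, 2, 3, 4]))  # [1, 2]
print(shared_cores([0, 1, 2, 3], [4, 5, 6, 7]))   # []
```

On the host itself, passing `core_of=sysfs_core_of` checks the same two cpusets against the actual core layout instead of the assumed one.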

Thank you!

OosakiKaNa · ‎09-01-2021

Hi!

Thanks for your very fast reply.

I will run the experiment with this setting.

But actually I manage my model on 34 computers with k8s, and k8s controls the Docker containers with cgroups. It can't assign physical cores (maybe; I don't know this for now).

So if this issue is about CPU sharing (meaning a hardware issue), it can't be solved by software settings (I guess).

I just want to know what causes this problem.

Thanks

Jianyu_Z_Intel · ‎09-02-2021

Hi,

For the K8s case, Intel provides a solution for CPU pinning: CPU Manager for Kubernetes* (also called CMK).

Here is the guide for it:

https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf

If you have more questions about CMK, please create a new issue for CMK in the Intel Community.

Good luck!

Thank you!

OosakiKaNa · ‎09-02-2021

Hi!

From your reply I now know the cause of the issue and the tool to solve it.

Thus, the issue is over!

Thank you and everyone in the community!

Thanks, Intel!

Jianyu_Z_Intel · ‎09-02-2021

Hi,

It's our pleasure!

Thank you for your support!