Student Ambassador

Queries regarding walltime, multithreading, performance

Hi,

I have a few questions.

1. Whenever I use a DataLoader in PyTorch (with multiple workers), I see this error multiple times:

 

=================================================================
Traceback (most recent call last):
  File "/glob/intel-python/python3/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
    finalizer()
  File "/glob/intel-python/python3/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 480, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 438, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 436, in _rmtree_safe_fd
    os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000038015dac06000002a7'
==================================================================

What can I do to fix this?

2. How can I reduce training time? Training is slow.

For training I request resources with this command:

=====================================================================

qsub -I -l nodes=4:ppn=2,walltime=24:00:00,mem=196gb

=====================================================================

But this doesn't help much. What else can I do to improve performance?

3. I also want to increase the walltime. Training takes weeks.

4. I used to train inside a Docker container, which let me close my laptop while training continued. When training runs for multiple days, how do you suggest keeping track of the interactive session (the counterpart of re-entering a Docker session)?

 

Moderator

Hi,

Thanks for reaching out to us.

1. We tried one code sample and did not hit the error you mentioned.
Could you please share the workload you are running, if possible, along with the steps you followed, so that we can try to reproduce it on our end?

2. You can improve performance by setting OMP_NUM_THREADS, using numactl, and setting GOMP_CPU_AFFINITY/KMP_AFFINITY.
Please refer to the following article (https://software.intel.com/en-us/articles/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration) for more insight into improving PyTorch performance.

3. The walltime is capped at 24 hours to ensure fair use of resources by all DevCloud users, so we cannot increase it beyond 24 hours. However, you can try the optimization methods mentioned above to reduce training time.

4. For training runs longer than the walltime, we recommend restoring the last saved checkpoint and continuing training from that point.
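
To make the checkpoint-and-resume suggestion concrete, here is a minimal sketch, not DevCloud-specific code. It uses only the Python standard library so that it runs anywhere; in a PyTorch script you would typically store `model.state_dict()` and `optimizer.state_dict()` with `torch.save` and restore them with `torch.load` instead of using `pickle`. The file layout, the helper names (`save_checkpoint`, `load_checkpoint`, `train`) and the `save_every` interval are assumptions for illustration.

```python
import os
import pickle
import tempfile
import time


def save_checkpoint(path, state):
    """Write the state atomically so a job killed mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or the default when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def train(path, total_steps, budget_seconds, step_fn, save_every=100):
    """Resume from the checkpoint, run until done or out of time, then save."""
    state = load_checkpoint(path, {"step": 0, "model": None})
    deadline = time.monotonic() + budget_seconds
    while state["step"] < total_steps and time.monotonic() < deadline:
        state["model"] = step_fn(state["model"])  # one training step (placeholder)
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Each job would call `train` with a time budget somewhat below the 24-hour walltime, and the next job would pick up from the saved step.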
 

Moderator

Hi,

Could you please confirm whether the suggestions above were helpful?

Student Ambassador

Hi,

 

Not very helpful, actually. I already knew what you mentioned.

Moderator

Hi,

Could you please share your workload, if possible, so that we can help you fix the issue you are facing?

Student Ambassador

I'm using the following commands:

export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="0-55"
export KMP_AFFINITY=granularity=fine,proclist=[0-55],explicit
export OMP_NUM_THREADS=24
numactl --cpunodebind=0 --membind=0 python train.py

 

It's taking a lot of time to train on a minibatch (165 iterations/second). I'm using a transformer btw.

Moderator

Hi,

Could you please clarify the following questions (marked with "Q:"):

1. Whenever I use dataloader in Pytorch (multithreading), I see this error multiple times:

What can I do to solve this ?

Q: Are you still facing the DataLoader issue in PyTorch?

2. How can I reduce training time ? Training is slow

During training I use this command

=====================================================================

qsub -I -l nodes=4:ppn=2,walltime=24:00:00,mem=196gb

=====================================================================

But this doesn't help much. What more can I do to get improved performance ?

Q: Is it possible to share your Python scripts, so that we can reproduce this on our end and see whether the training time can be improved?

3. I also want to increase the walltime. Training take weeks.

Q: As mentioned earlier, the walltime limit is 24 hours. We hope you are aware of it.

4. Earlier I used to use a docker and training would happen in my container. This allowed me to close my laptop. But if training takes place for multiple days, then what do you suggest for tracking the interactive session (similar to counterpart of entering into docker session) Can you suggest a way for this problem ?

Q: Did you try restoring from a checkpoint?

Student Ambassador

1. It's more of a flag than an issue. It would be good if this message were suppressed by default.

2. Sure, it's based on this repo: https://github.com/airsplay/lxmert

4. I know about checkpointing; there should be a better way to handle extended training.

Moderator

Hi,

Thanks for sharing the GitHub repo. Unfortunately, we couldn't find the train.py file you mentioned in your previous post in this repo.

Could you please let us know which file from this repo you are running?

 

Thanks.

 

Student Ambassador

bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny

I've switched all the code to CPU. It doesn't have to be this repo to reproduce the issue.

 

PS: It can be a transformer architecture in general.

Moderator

Hi,

We are trying out the steps from the GitHub repo you shared on our end and will let you know once that is complete.

Student Ambassador

As I said earlier, it doesn't have to be this repo. It can be any transformer, or any model in general. My question is about performance and speeding things up in general rather than about one specific case. The rest is up to you.

Moderator

Hi Prajjwal Bhargava,

 

Regarding the training time we will contact an SME and get back to you soon.

 

Moderator

Hi Prajjwal Bhargava,

We have forwarded this case to the SME. Meanwhile, could you please let us know what improvement in training time you are targeting?

 

Arun Jose

Employee

Hi,

The engineering team is working on this issue.

I'll get back to you once I have updates.

Thank you.

Employee

We ran this test case on Intel® Xeon® Platinum 8180 Processor: 

bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny

1. This model has many LayerNorm layers, and the backward pass of LayerNorm is not optimized yet. That work is in progress; once it is done, performance should improve.

2. Use NUMA to bind the instance to one socket.

Before: 49/19753 [01:31<10:04:44,  1.84s/it]

After applying the following patch:

diff --git a/run/vqa_finetune.bash b/run/vqa_finetune.bash
index 09c47e8..7f3a6c8 100644
--- a/run/vqa_finetune.bash
+++ b/run/vqa_finetune.bash
@@ -1,4 +1,18 @@
 # The name of this experiment.
+
+CORES=`lscpu | grep Core | awk '{print $4}'`
+SOCKETS=`lscpu | grep Socket | awk '{print $2}'`
+TOTAL_CORES=`expr $CORES`
+
+KMP_SETTING="KMP_AFFINITY=granularity=fine,compact,1,0"
+
+export OMP_NUM_THREADS=$TOTAL_CORES
+export $KMP_SETTING
+
+echo -e "### using OMP_NUM_THREADS=$TOTAL_CORES"
+echo -e "### using $KMP_SETTING"
+sleep 3
+
 name=$2

 # Save logs and models under snap/vqa; make backup.
@@ -7,9 +21,11 @@ mkdir -p $output/src
 cp -r src/* $output/src/
 cp $0 $output/run.bash

+let CORES-=1
+
 # See Readme.md for option details.
 CUDA_VISIBLE_DEVICES=$1 PYTHONPATH=$PYTHONPATH:./src \
- python src/tasks/vqa.py \
+ numactl -C0-$CORES -m0 python src/tasks/vqa.py \
 --train train,nominival --valid minival \
 --llayers 9 --xlayers 5 --rlayers 5 \
 --loadLXMERTQA snap/pretrained/model \

With the patch applied, we get:

42/19753 [00:57<7:20:27,  1.34s/it]

3. You can use Jemalloc to get good performance.

  Install Jemalloc (see https://github.com/jemalloc/jemalloc/wiki/Getting-Started):

     a) Download a release from https://github.com/jemalloc/jemalloc/releases
     b) tar -jxvf jemalloc-5.2.0.tar.bz2
     c) ./configure && make
     d) cd $install path$/jemalloc-5.2.0/bin && chmod 777 jemalloc-config

Then set the following environment variables:

export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000";
export LD_PRELOAD=$install path$/jemalloc-5.2.0/lib/libjemalloc.so

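Please note that jemalloc is not NUMA-aware, which means it is not fully effective on a dual-socket machine. On such a machine you may launch 2 training processes, one per socket, and combine them with distributed training (torch.nn.parallel.DistributedDataParallel; note that torch.nn.DataParallel itself runs in a single process). In other words, treat the 2 sockets of one machine as 2 nodes in distributed learning.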
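
The "one process per socket" idea can be sketched as follows. This is a minimal illustration, assuming the NUMA node numbering matches the socket numbering (node 0 is socket 0, and so on) and that each process is pinned with `numactl --cpunodebind` and `--membind`, the same options used earlier in this thread. The helper name `per_socket_commands` and the script name `train.py` are hypothetical, and the script itself would still need to initialize distributed training so that the two processes cooperate. `RANK` and `WORLD_SIZE` are the variables that `torch.distributed` reads with `env://` initialization.

```python
import shlex


def per_socket_commands(sockets, cores_per_socket, script_args):
    """Build one numactl-wrapped launch per socket, each with its own rank."""
    launches = []
    for rank in range(sockets):
        env = {
            "OMP_NUM_THREADS": str(cores_per_socket),  # threads stay on one socket
            "RANK": str(rank),
            "WORLD_SIZE": str(sockets),
        }
        argv = [
            "numactl",
            f"--cpunodebind={rank}",  # run only on this socket's cores
            f"--membind={rank}",      # allocate only from this socket's memory
            "python",
            *script_args,
        ]
        launches.append((env, argv))
    return launches


if __name__ == "__main__":
    # 2 sockets x 28 cores matches a dual-socket Xeon Platinum 8180 machine.
    for env, argv in per_socket_commands(2, 28, ["train.py"]):
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

The sketch only prints the command lines; starting the processes (for example from a job script) is left out on purpose.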

Student Ambassador

Thanks for replying. I am not familiar with NUMA or Jemalloc. I think the performance problem is not specific to this transformer architecture but general; I have also tried architectures that don't have LayerNorm. Anyway, could you please provide documentation or a README, for now, on what I need to run to get the best possible performance?

Currently, I only do

qsub -I -l nodes=4:ppn=2,walltime=24:00:00,mem=196gb

 

Employee

Hi,

Please check the following steps.

1. Follow step 3 of my previous post (the Jemalloc instructions) to download jemalloc and compile it on DevCloud.

2. Apply the patch mentioned in my previous post.

3. Run the test code

export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
export LD_PRELOAD=$install path$/jemalloc-5.2.0/lib/libjemalloc.so
bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny

Please check the Advanced Queue section of the DevCloud manual to run this on DevCloud.

Student Ambassador

Hi, thanks for your reply.

I built jemalloc. How do I use NUMA, and can you provide a PyTorch example of how to use all the cores to speed things up? I use this:

qsub -I -l nodes=4:ppn=2,walltime=24:00:00,mem=196gb

How can I make sure that my program is using all the cores?

Employee

Hi,

That patch already does everything for you. You only need to apply it to your code and run. (For Jemalloc, you need to set it up yourself by exporting the 2 environment variables mentioned in previous posts.)

To describe the patch in detail:

The following detects the hardware configuration:
+CORES=`lscpu | grep Core | awk '{print $4}'`
+SOCKETS=`lscpu | grep Socket | awk '{print $2}'`
+TOTAL_CORES=`expr $CORES`

Set CPU AFFINITY configuration
+KMP_SETTING="KMP_AFFINITY=granularity=fine,compact,1,0"

Use all cores for OpenMP threads
+export OMP_NUM_THREADS=$TOTAL_CORES
+export $KMP_SETTING

Use numactl to run the application on all physical cores on node #0.
+ numactl -C0-$CORES -m0 python src/tasks/vqa.py \
--train train,nominival --valid minival \
--llayers 9 --xlayers 5 --rlayers 5 \
--loadLXMERTQA snap/pretrained/model \
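
For readers who prefer to see the same detection logic outside the shell script, here is a rough Python equivalent. It parses `lscpu`-style text instead of calling `lscpu`, so the sample output below is an assumption (it resembles a dual-socket machine with 28 cores per socket) and the function names are hypothetical. Like the patch, it sizes everything for one socket and assumes CPUs 0 to N-1 belong to NUMA node 0.

```python
SAMPLE_LSCPU = """\
CPU(s):                112
Thread(s) per core:    2
Core(s) per socket:    28
Socket(s):             2
"""


def parse_lscpu(text):
    """Extract (cores per socket, socket count) from lscpu-style output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return int(fields["Core(s) per socket"]), int(fields["Socket(s)"])


def single_socket_settings(text):
    """Mirror the patch: one OpenMP thread per core of socket 0, pinned by numactl."""
    cores, _sockets = parse_lscpu(text)
    env = {
        "OMP_NUM_THREADS": str(cores),
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    # -C selects the CPU range, -m0 restricts memory allocation to node 0.
    prefix = ["numactl", f"-C0-{cores - 1}", "-m0"]
    return env, prefix


if __name__ == "__main__":
    env, prefix = single_socket_settings(SAMPLE_LSCPU)
    print(env)
    print(" ".join(prefix), "python src/tasks/vqa.py ...")
```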

Please note that even when OpenMP threads run on all physical cores, CPU usage on an individual core may not reach 100%. Depending on the task, this can be normal.
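
One way to check which cores a running Python process is allowed to use is to inspect its CPU affinity from inside the script. The sketch below assumes Linux, where `os.sched_getaffinity` is available; the function name `describe_cpu_budget` is hypothetical. Under `numactl -C0-27`, the reported set should contain exactly those 28 CPUs. In a PyTorch script, `torch.get_num_threads()` could be printed alongside to see the size of the intra-op thread pool.

```python
import os


def describe_cpu_budget():
    """Report the CPUs this process may run on and the OpenMP thread setting."""
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))  # Linux only
    else:
        allowed = list(range(os.cpu_count() or 1))  # fallback: assume no pinning
    return {
        "allowed_cpus": allowed,
        "allowed_count": len(allowed),
        "omp_num_threads": os.environ.get("OMP_NUM_THREADS"),
    }


if __name__ == "__main__":
    info = describe_cpu_budget()
    print(f"process may run on {info['allowed_count']} CPUs: {info['allowed_cpus']}")
    print(f"OMP_NUM_THREADS={info['omp_num_threads']}")
```

This shows what the process is permitted to use, not how busy each core is; a tool such as `top` is still needed for actual utilization.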

 

I see you requested 4 nodes with nodes=4:ppn=2, so you could probably try distributed computation.
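
A request such as `nodes=4:ppn=2` reserves four machines, but a single `python` process still runs on only one of them. As a rough sketch of the bookkeeping that distributed training needs, the code below turns a list of allocated hosts into one environment per process. It assumes a PBS/Torque-style node file (one hostname per allocated slot, usually exposed through `PBS_NODEFILE`) and the `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` variables that `torch.distributed` reads with `env://` initialization. The hostnames, the port number, and the function name are made up for illustration.

```python
def distributed_envs(nodefile_text, port=29500):
    """Map each unique host to the environment its training process would need."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:  # keep first-seen order, drop repeats
            hosts.append(host)
    if not hosts:
        raise ValueError("node file lists no hosts")
    return [
        (
            host,
            {
                "MASTER_ADDR": hosts[0],  # the first host coordinates the others
                "MASTER_PORT": str(port),  # arbitrary free port (assumption)
                "RANK": str(rank),
                "WORLD_SIZE": str(len(hosts)),
            },
        )
        for rank, host in enumerate(hosts)
    ]


if __name__ == "__main__":
    # Hypothetical node file for nodes=4:ppn=2 (each host listed twice).
    sample = "node01\nnode01\nnode02\nnode02\nnode03\nnode03\nnode04\nnode04\n"
    for host, env in distributed_envs(sample):
        print(host, env)
```

How the per-host processes are actually started depends on the cluster, so the launch step is not shown here.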

Please check the Advanced Queue section of the DevCloud manual to run this on DevCloud.
