I am currently working on a Keras reimplementation of the Jasper speech-to-text network from NYU and NVIDIA labs. I am working from the information available in their arXiv paper in order to reconstruct the network as faithfully as possible. I am currently using an Intel distribution of TensorFlow 1.14 on DevCloud with one GPU node to train the model and an Intel NUC for inference.
However, I am running into quite a large hurdle. When I try to train a version of the smallest model (19 layers containing several residual connections) on the smallest training set (~100 hours of speech), I get an estimated time per epoch of 40-45 hours. Given the maximum wall time of 24 hours on the DevCloud, training this network as-is does not appear to be feasible. At this point I am unsure which areas I could optimize in order to bring that training time down to something more manageable. Is this simply a situation where I should throw more GPUs at it? If I were to upgrade to a more current version of TensorFlow, how much reduction in training time should I realistically expect?
Thanks for your time,
Nick
___
Training details (a rough sketch of this layout in Keras follows the list):
- 4 1D conv stacks (1D conv, batch norm, ReLU activation)
- 5 residual blocks, each 3 deep (containing the described conv stack)
- 19 total conv stacks
- CTC loss
- SGD optimizer
- batches of 16 from generator
- 20 epochs
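For reference, here is a bare-bones sketch of that layout in tf.keras. It is only illustrative: the filter counts, kernel widths, feature dimension, and class count are placeholders rather than the values I am actually using, and the CTC loss is wired in through the usual Lambda/ctc_batch_cost pattern.

# Illustrative only: a Jasper-style layout in tf.keras 1.14.  Filter
# counts, kernel sizes, and feature/class dimensions are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models, backend as K

def conv_stack(x, filters, kernel_size):
    # 1D conv -> batch norm -> ReLU, as listed above.
    x = layers.Conv1D(filters, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters, kernel_size, depth=3):
    # 'depth' conv stacks with a skip connection around them.
    shortcut = layers.Conv1D(filters, 1, padding='same')(x)
    for _ in range(depth):
        x = conv_stack(x, filters, kernel_size)
    return layers.ReLU()(layers.Add()([x, shortcut]))

def ctc_lambda(args):
    y_pred, y_true, input_len, label_len = args
    return K.ctc_batch_cost(y_true, y_pred, input_len, label_len)

n_features, n_classes = 64, 29  # placeholder mel bins / characters

feats     = layers.Input((None, n_features), name='features')
labels    = layers.Input((None,), name='labels')
input_len = layers.Input((1,), name='input_length')
label_len = layers.Input((1,), name='label_length')

x = conv_stack(feats, 256, 11)            # prologue stack
for _ in range(5):                        # 5 residual blocks, 3 stacks each
    x = residual_block(x, 256, 11)
x = conv_stack(x, 512, 29)                # epilogue stacks
x = conv_stack(x, 512, 1)
y_pred = layers.Conv1D(n_classes, 1, activation='softmax')(x)  # output projection

loss = layers.Lambda(ctc_lambda, output_shape=(1,), name='ctc')(
    [y_pred, labels, input_len, label_len])
model = models.Model([feats, labels, input_len, label_len], loss)
model.compile(optimizer='sgd', loss={'ctc': lambda y_true, y_pred: y_pred})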
Edit: It appears that the Intel-optimized 1.14 TensorFlow package is not GPU-enabled. Is there an optimized version of TensorFlow that is GPU-enabled and available for use on the DevCloud?
___
Hi,
Thanks for contacting us.
DevCloud does not have a dedicated GPU, but it does have an iGPU installed. The iGPU is different from a dedicated GPU and is not currently supported by Intel Optimized TensorFlow.
However, you can try the optimizations on the CPU itself to get improved performance.
Please follow the URLs below for more details on optimizing TensorFlow workloads on CPU.
Distributed training using Horovod can also help to distribute the workload across multiple cores on the same CPU node and to extend to multiple nodes; https://github.com/horovod/horovod has more details.
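As a rough illustration only (the toy model, random data, and learning-rate values below are placeholders, not your actual train.py), hooking Horovod into a tf.keras script generally looks like the following; the script is then launched with horovodrun or mpirun, one process per socket or node.

# Illustrative Horovod + tf.keras sketch for CPU-only distributed training.
# The tiny model and random data stand in for the real network and dataset.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across processes.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

x = np.random.rand(512, 100).astype('float32')
y = np.random.randint(0, 10, size=(512,))

model.fit(x, y,
          batch_size=16,
          epochs=2,
          # Broadcast initial weights from rank 0 so all workers match.
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)

For example, "horovodrun -np 2 -H localhost:2 python script.py" would start two worker processes on a single node.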
Hope this clarifies your query. Please feel free to reach out to us if you have any further queries. Thank You.
___
Hello Lakshmi,
Thanks for your response!
I went ahead and incorporated the suggestions made in the first two Intel articles you linked. Specifically, the settings I changed were as follows (a sketch of setting these from inside the script follows the list):
- KMP_AFFINITY=granularity=fine,compact,1,0
- KMP_BLOCKTIME=0
- KMP_DUPLICATE_LIB_OK=True
- KMP_SETTINGS=True
- OMP_NUM_THREADS=6
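The articles also mention TensorFlow's own intra-/inter-op thread pools; in case it helps, the same settings can be applied from inside the script roughly like this (the thread counts are just what I have been experimenting with, not tuned values):

# Rough sketch of applying the CPU settings from inside the training
# script (TF 1.x API).  Thread counts are experimental, not tuned.
import os

# OpenMP/MKL variables should be set before TensorFlow is imported.
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_SETTINGS"] = "TRUE"
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
os.environ["OMP_NUM_THREADS"] = "6"

import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=6,   # threads used inside a single op
    inter_op_parallelism_threads=2)   # ops allowed to run in parallel
tf.keras.backend.set_session(tf.Session(config=config))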
I ran these on an interactive node in order to get a time estimate and, unfortunately, I am still getting excessively long training times (starting at 40 hours per epoch and getting as low as 32 hours after 14 batches of 32 items). The estimate appears to be steadily decreasing, but not toward a reasonable time.
Additionally, I tried packaging these directives up and submitting them as a job to the server. I have attached my error file, which says that the code was terminated after throwing an Xbyak error. I am unsure why I am able to run the code on an interactive node but not as a submitted job.
I am currently working on incorporating Horovod, but its README doesn't make it clear how to use it with distributed CPUs, so that is taking some time.
--Nick
___
Hi Nick,
Would it be possible to share the workload along with the steps you have followed to submit the job script for further debugging?
Thanks,
Lakshmi.
___
Sure thing!
Attached are the model and training files alongside the shell script that I submitted as a job.
The job was submitted via qsub:
qsub devcloud_Specs2text.sh
When this did not work, I switched to an interactive node via:
qsub -I -l nodes=1:gpu:ppn=2 -d .
and ran the training script from there after exporting the necessary environment variables.
I am not exactly sure what you mean by workload, but the network is currently being trained on 28,539 files taking up 11 GB of data, across 20 epochs in batches of 16. I am currently batching through a generator to try to save some room in onboard memory.
Hopefully this helps!
___
Hi Nick,
Thanks for sharing the model files and training files along with the shell script.
We are unable to recreate the Xbyak error that you mentioned previously as we don't have the dataset and json file referenced in the train.py file.
Could you please provide a subset of the dataset along with the json file so that we can recreate the error you are getting on our end?
Meanwhile, could you please try submitting the script as below:
#PBS -l nodes=1:gpu
#PBS -l walltime=24:00:00
#PBS -N Specs2Text_small
cd $PBS_O_WORKDIR
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=6
export KMP_SETTINGS=TRUE
cd ~/NUC_project/Specs2text
echo Starting training...
python train.py
Thanks.
___
Identifying a GPU node as a PBS directive appears to have at least allowed me to run the job outside an interactive node, so that's good!
I have attached a folder containing a small dataset (388 files), the json file, and all of the necessary scripts to run the model. You will need a couple of uncommon packages, unidecode and inflect, which can be pip installed.
Once you have the correct packages installed, you should only need to run the train.py file from the master directory.
___
Hi Nick,
Thanks for sharing the datasets along with the json file.
We created a conda environment and installed all the necessary packages. In the first epoch, after completing almost 22 iterations, we get the following error:
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    verbose=1)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
    steps_name='steps_per_epoch')
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 220, in model_iteration
    batch_data = _get_next_batch(generator, mode)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 362, in _get_next_batch
    generator_output = next(generator)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 918, in get
    six.reraise(*sys.exc_info())
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 894, in get
    inputs = self.queue.get(block=True).get()
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 828, in next_sample
    return six.next(_SHARED_SEQUENCES[uid])
  File "/home/uXXXXX/for_debug/data_gen.py", line 91, in next_batch
    self.genshuffle()
  File "/home/uXXXXX/for_debug/data_gen.py", line 102, in genshuffle
    self.wavpath, self.transcript, self.finish = shuffle(self.wavpath,
AttributeError: 'BatchGen' object has no attribute 'wavpath'

Traceback (most recent call last):
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
    finalizer()
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 486, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
    os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs000000130031880a000000cd'
As shown in the error trace, we couldn't find an attribute wavpath initialized in the BatchGen class of the data_gen.py file. Please let us know whether we are missing any other code snippets.
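From the trace alone, our guess at the kind of initialization data_gen.py may be missing is sketched below; the json field names are only assumptions on our side, so please adjust them to whatever your manifest actually contains.

# Guess based only on the traceback above: genshuffle() references
# self.wavpath / self.transcript / self.finish, so they presumably need
# to be populated (e.g. from the json manifest) before the first shuffle.
import json
from sklearn.utils import shuffle

class BatchGen:
    def __init__(self, manifest_path, batch_size=16):
        self.batch_size = batch_size
        self.wavpath, self.transcript, self.finish = [], [], []
        with open(manifest_path) as f:
            for line in f:
                entry = json.loads(line)
                # Field names here are placeholders for your manifest keys.
                self.wavpath.append(entry["audio_filepath"])
                self.transcript.append(entry["text"])
                self.finish.append(entry["duration"])

    def genshuffle(self):
        # Shuffle the three lists in unison.
        self.wavpath, self.transcript, self.finish = shuffle(
            self.wavpath, self.transcript, self.finish)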
Thanks.
___
Hi Nick,
We were able to run the Python files without the Xbyak error that you mentioned earlier. Also, we were able to complete training in almost 9 hours.
However, we are getting an OS error at the end, after the model is created. The output model generated is 414 MB.
Please follow the steps given below to create a conda environment and install the necessary packages.
conda create -n env_speech -c intel python=3.6
source activate env_speech
pip install numpy
pip install matplotlib
pip install scipy
pip install sklearn
pip install tensorflow==1.14.0
Please find the attached script along with the output and error file generated.
You can try the optimizations on the same code after installing Intel Optimized TensorFlow instead of stock TensorFlow, and using the other OMP/KMP settings as well.
Please feel free to get back to us if you are facing any errors. Thank you.
___
Hey Lakshmi,
Thanks for putting in time on this. The Xbyak error might have arisen because I had not created a fresh environment; I made a new one for my Horovod development and have yet to see that error come up.
It doesn't appear that the files are attached. Would you mind trying to attach them again?
Thanks,
Nick
___
Hi Nick,
I have attached the zip file again to the previous post itself. Please let us know if you still face any issues.
Thanks.
___
Hi Nick,
Could you please confirm whether the solution provided was helpful?