I have some queries:
1. Whenever I use a DataLoader in PyTorch (with multiprocessing workers), I see this error multiple times:
Traceback (most recent call last):
File "/glob/intel-python/python3/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
File "/glob/intel-python/python3/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 480, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 438, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/glob/intel-python/python3/lib/python3.6/shutil.py", line 436, in _rmtree_safe_fd
OSError: [Errno 16] Device or resource busy: '.nfs00000038015dac06000002a7'
What can I do to solve this?
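For context on what I have tried so far: the `.nfsXXXX` file in the traceback is a stale handle that NFS leaves behind while another process still holds the file open, so the error appears when multiprocessing's shutdown finalizers try to `rmtree` worker temp directories on an NFS-mounted home. A minimal stdlib sketch of how such `EBUSY` errors can be tolerated during cleanup (the handler name is my own; in the DataLoader case the `rmtree` call happens inside multiprocessing itself, so pointing `TMPDIR` at local, non-NFS disk may be the more practical workaround):

```python
import errno
import os
import shutil
import tempfile

def ignore_nfs_busy(func, path, exc_info):
    """onerror handler for shutil.rmtree that skips files NFS reports
    as busy (errno 16, EBUSY) instead of crashing the cleanup."""
    exc = exc_info[1]
    if isinstance(exc, OSError) and exc.errno == errno.EBUSY:
        return  # leave the busy '.nfs...' file; NFS reclaims it later
    raise exc

# Example: remove a temp directory, tolerating EBUSY on NFS mounts.
tmp = tempfile.mkdtemp()
shutil.rmtree(tmp, onerror=ignore_nfs_busy)
```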
2. How can I reduce training time? Training is slow. During training I use this command:
qsub -I -l nodes=4:ppn=2,walltime=24:00:00,mem=196gb
but this doesn't help much. What more can I do to get improved performance?
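One thing I have started doing to narrow this down is checking whether data loading or compute dominates each epoch before requesting more hardware. A minimal stdlib timing sketch (the loop structure and `step` callback are stand-ins for my actual training loop, not PyTorch API):

```python
import time

def timed_epoch(batches, step):
    """Accumulate time spent fetching batches vs. running train steps."""
    load_t = compute_t = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # data-loading cost
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # forward/backward cost
        load_t += t1 - t0
        compute_t += time.perf_counter() - t1
    return load_t, compute_t
```

If `load_t` dominates, the bottleneck is I/O rather than the node count.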
3. I also want to increase the walltime, since training takes weeks.
4. Earlier I used Docker, and training ran inside my container; this allowed me to close my laptop. But if training runs for multiple days, what do you suggest for keeping track of the interactive session (something like re-entering a Docker session)? Can you suggest a way to handle this?
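For reference, the closest thing I know of to re-entering a Docker session is a detachable terminal multiplexer: `screen -S train` creates a named session, Ctrl-A D detaches it, and `screen -r train` reattaches later (tmux works the same way). A non-interactive alternative I have considered, where `train.py` is just a placeholder for the actual training script:

```shell
# Launch training so it survives logout; all output goes to train.log.
# train.py is a placeholder for the real training entry point.
nohup python train.py > train.log 2>&1 &
echo $! > train.pid   # remember the PID so the job can be checked later
```

Is something like this the recommended approach on the cluster, or is there a preferred mechanism?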
- General Support