Hi,
I was trying to run the aforementioned notebook on the IDC, but I hit the following error when running the "Step 7: Finetuning the Model" cell. I've added my Hugging Face API key and set the "WANDB_PROJECT" environment variable to my wandb project.
Finetuning for max number of steps: 1480
max_steps is given, it will override any value given in num_train_epochs
XPU Name: Intel(R) Data Center GPU Max 1100
XPU Memory: Reserved=9.486 GB, Allocated=9.482 GB, Max Reserved=9.486 GB, Max Allocated=9.482 GB
[2024-05-21 07:56:25,577] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to xpu (auto detect)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 63
     61 print_memory_usage()
     62 torch.xpu.empty_cache()
---> 63 results = trainer.train()
     64 print_training_summary(results)
     65 wandb.finish()

File ~/.local/lib/python3.9/site-packages/trl/trainer/sft_trainer.py:361, in SFTTrainer.train(self, *args, **kwargs)
    358 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    359     self.model = self._trl_activate_neftune(self.model)
--> 361 output = super().train(*args, **kwargs)
    363 # After training we make sure to retrieve back the original forward pass method
    364 # for the embedding layer by removing the forward post hook.
    365 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1876, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1873 try:
   1874     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   1875     hf_hub_utils.disable_progress_bars()
-> 1876     return inner_training_loop(
   1877         args=args,
   1878         resume_from_checkpoint=resume_from_checkpoint,
   1879         trial=trial,
   1880         ignore_keys_for_eval=ignore_keys_for_eval,
   1881     )
   1882 finally:
   1883     hf_hub_utils.enable_progress_bars()

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:2022, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2018     gradient_checkpointing_kwargs = args.gradient_checkpointing_kwargs
   2020     self.model.gradient_checkpointing_enable(gradient_checkpointing_kwargs=gradient_checkpointing_kwargs)
-> 2022 model = self._wrap_model(self.model_wrapped)
   2024 # as the model is wrapped, don't use `accelerator.prepare`
   2025 # this is for unhandled cases such as
   2026 # FSDP-XLA, SageMaker MP/DP, DataParallel, IPEX
   2027 use_accelerator_prepare = True if model is self.model else False

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1640, in Trainer._wrap_model(self, model, training, dataloader)
   1637     return smp.DistributedModel(model, backward_passes_per_step=self.args.gradient_accumulation_steps)
   1639 # train/eval could be run multiple-times - if already wrapped, don't re-wrap it again
-> 1640 if self.accelerator.unwrap_model(model) is not model:
   1641     return model
   1643 # Mixed precision training with apex (torch < 1.6)

File ~/.local/lib/python3.9/site-packages/accelerate/accelerator.py:2506, in Accelerator.unwrap_model(self, model, keep_fp32_wrapper)
   2475 def unwrap_model(self, model, keep_fp32_wrapper: bool = True):
   2476     """
   2477     Unwraps the `model` from the additional layer possible added by [`~Accelerator.prepare`]. Useful before saving
   2478     the model.
   (...)
   2504     ```
   2505     """
-> 2506     return extract_model_from_parallel(model, keep_fp32_wrapper)

File ~/.local/lib/python3.9/site-packages/accelerate/utils/other.py:80, in extract_model_from_parallel(model, keep_fp32_wrapper, recursive)
     77     model = model._orig_mod
     79 if is_deepspeed_available():
---> 80     from deepspeed import DeepSpeedEngine
     82     options += (DeepSpeedEngine,)
     84 if is_torch_version(">=", FSDP_PYTORCH_VERSION) and is_torch_distributed_available():

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/__init__.py:22
     19     HAS_TRITON = False
     21 from . import ops
---> 22 from . import module_inject
     24 from .accelerator import get_accelerator
     25 from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      4 # DeepSpeed Team
----> 6 from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
      7 from .module_quantize import quantize_transformer_layer
      8 from .replace_policy import HFBertLayerPolicy

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py:607
    603     replaced_module, _ = _replace_module(model, policy, state_dict=sd)
    604     return replaced_module
--> 607 from ..pipe import PipelineModule
    609 import re
    612 def skip_level_0_prefix(model, state_dict):

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/pipe/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      4 # DeepSpeed Team
----> 6 from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      4 # DeepSpeed Team
----> 6 from .module import PipelineModule, LayerSpec, TiedLayerSpec
      7 from .topology import ProcessTopology

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py:19
     17 from deepspeed.utils import logger
     18 from .. import utils as ds_utils
---> 19 from ..activation_checkpointing import checkpointing
     20 from .topology import PipeDataParallelTopology, PipelineParallelGrid
     21 from deepspeed.runtime.state_dict_factory import SDLoaderFactory

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py:26
     23 import mmap
     24 from torch import _C
---> 26 from deepspeed.runtime.config import DeepSpeedConfig
     27 from deepspeed.utils import logger
     28 from deepspeed.runtime.utils import copy_to_device, move_to_device, see_memory_usage, bwc_tensor_model_parallel_rank

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/config.py:61
     58 from ..autotuning.config import DeepSpeedAutotuningConfig
     59 from ..nebula.config import DeepSpeedNebulaConfig
---> 61 from ..compression.config import get_compression_config, get_quantize_enabled
     62 from ..compression.constants import *
     63 from .swap_tensor.aio_config import get_aio_config

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      4 # DeepSpeed Team
----> 6 from .compress import init_compression, redundancy_clean
      7 from .scheduler import compression_scheduler
      8 from .helper import convert_conv1d_to_linear

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/compress.py:7
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      4 # DeepSpeed Team
      6 import re
----> 7 from .helper import compression_preparation, fix_compression, recursive_getattr, is_module_compressible
      8 from .config import get_compression_config
      9 from ..runtime.config_utils import dict_raise_error_on_duplicate_keys

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/helper.py:12
      9 from deepspeed.utils import logger
     11 try:
---> 12     from neural_compressor.compression import pruner as nc_pruner
     13 except ImportError as e:
     14     nc_pruner = None

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/__init__.py:28
     20 # we need to set a global 'NA' backend, or Model can't be used
     21 from .config import (
     22     DistillationConfig,
     23     PostTrainingQuantConfig,
   (...)
     26     MixedPrecisionConfig,
     27 )
---> 28 from .contrib import *
     29 from .model import *
     30 from .metric import *

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/__init__.py:18
      1 #!/usr/bin/env python
      2 # -*- coding: utf-8 -*-
   (...)
     15 # See the License for the specific language governing permissions and
     16 # limitations under the License.
     17 """Built-in strategy for multiple framework backends."""
---> 18 from .strategy import *

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/strategy/__init__.py:25
     23 for f in modules:
     24     if isfile(f) and not f.startswith("__") and not f.endswith("__init__.py"):
---> 25         __import__(basename(f)[:-3], globals(), locals(), level=1)

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/strategy/sigopt.py:21
     18 import copy
     19 from collections import OrderedDict
---> 21 from neural_compressor.strategy.strategy import TuneStrategy, strategy_registry
     22 from neural_compressor.strategy.utils.tuning_sampler import OpWiseTuningSampler
     23 from neural_compressor.strategy.utils.tuning_structs import OpTuningConfig

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/strategy/__init__.py:19
      1 #!/usr/bin/env python
      2 # -*- coding: utf-8 -*-
   (...)
     15 # See the License for the specific language governing permissions and
     16 # limitations under the License.
     17 """Intel Neural Compressor Strategy."""
---> 19 from .strategy import STRATEGIES
     20 from os.path import dirname, basename, isfile, join
     21 import glob

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/strategy/strategy.py:35
     32 import numpy as np
     33 import yaml
---> 35 from neural_compressor.adaptor.tensorflow import TensorFlowAdaptor
     37 from ..adaptor import FRAMEWORKS
     38 from ..algorithm import ALGORITHMS, AlgorithmScheduler

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/adaptor/__init__.py:26
     24 for f in modules:
     25     if isfile(f) and not f.startswith("__") and not f.endswith("__init__.py"):
---> 26         __import__(basename(f)[:-3], globals(), locals(), level=1)
     28 __all__ = ["FRAMEWORKS"]

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/adaptor/pytorch.py:43
     40 torch_utils = LazyImport("neural_compressor.adaptor.torch_utils")
     41 ipex = LazyImport("intel_extension_for_pytorch")
---> 43 REDUCE_RANGE = False if CpuInfo().vnni else True
     44 logger.debug("Reduce range is {}".format(str(REDUCE_RANGE)))
     47 def get_torch_version():

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:129, in singleton.<locals>._singleton(*args, **kw)
    127 """Create a singleton object."""
    128 if cls not in instances:
--> 129     instances[cls] = cls(*args, **kw)
    130 return instances[cls]

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:255, in CpuInfo.__init__(self)
    253     self._sockets = 1
    254 else:
--> 255     self._sockets = self.get_number_of_sockets()
    256 self._cores = psutil.cpu_count(logical=False)
    257 self._cores_per_socket = int(self._cores / self._sockets)

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:290, in CpuInfo.get_number_of_sockets(self)
    288 if proc.stdout:
    289     for line in proc.stdout:
--> 290         return int(line.decode("utf-8", errors="ignore").strip())
    291 return 0

ValueError: invalid literal for int() with base 10: "ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored."
Hi FRYoussef,
Thank you for reaching out to us.
To ensure we can assist you effectively, please provide the following details. If not applicable, please use 'N/A'. We appreciate your cooperation, and we look forward to resolving this matter promptly.
1. Severity level of the issue: Low/Medium/High/Critical
2. Intel® Developer Cloud account ID:
3. Intel® Developer Cloud account tier: Standard/Premium/Enterprise
4. JupyterLab ID: uxxxxxxxxxxxxxxxxxxxxxxxxxxx
5. Full screenshot of the error on gemma_xpu_finetuning notebook.
6. Kernel used on gemma_xpu_finetuning notebook.
Additionally, you mentioned that you "edited the 'WANDB_PROJECT' environment variable with the wandb project". Could you please provide detailed steps on how you did that? This information will help us replicate the issue on our end.
Regards,
Faiz
Hi Faiz,
- Severity level of the issue: High
- Intel® Developer Cloud account ID: 574310785319
- Intel® Developer Cloud account tier: Standard
- JupyterLab ID: Is it this one: "u11be638b4d6b42d06a88486ae0006d7"? I'm not sure how to find it.
- Full screenshot of the error on gemma_xpu_finetuning notebook (the full error is in my previous comment above).
- Kernel used on gemma_xpu_finetuning notebook.
For the "WANDB_PROJECT" var, I've changed it to the name I used when creating a new project in https://wandb.ai/
The other thing I changed is the var "finetuned_model_id", as I followed in Eduardo Alvarez's workshop, I set the var to my Hugging Face user name.
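(A sketch from memory; the project and repo names here are placeholders, not the notebook's original values.)

import os

# wandb project I created on https://wandb.ai/
os.environ["WANDB_PROJECT"] = "my-wandb-project"

# Repo id the finetuned model is pushed to; I set it under my
# Hugging Face username (placeholder value shown here)
finetuned_model_id = "my-hf-username/gemma-finetuned"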
Hi FRYoussef,
Thank you for your response. We've informed the appropriate team for further investigation of this matter and will provide you with an update soon. We greatly appreciate your patience.
Regards,
Faiz
This thread will no longer be monitored since we have provided a solution via email. If you need any additional information from Intel, please submit a new question.
Same issue with unmodified notebook.
I got some ideas from https://github.com/block-hczhai/block2-preview/issues/8, so I commented out these lines:
# Set the LD_PRELOAD environment variable
#ld_preload = os.environ.get("LD_PRELOAD", "")
conda_prefix = os.environ.get("CONDA_PREFIX", "")
# Improve memory allocation performance, if tcmalloc is not available, please comment this line out
#os.environ["LD_PRELOAD"] = f"{ld_preload}:{conda_prefix}/lib/libtcmalloc.so"
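
If you'd rather keep the tcmalloc speedup when the library really is present, a guarded variant (a sketch, assuming the notebook sets LD_PRELOAD via os.environ as in the commented-out lines above) skips the preload when the file is missing:

import os

conda_prefix = os.environ.get("CONDA_PREFIX", "")
tcmalloc = os.path.join(conda_prefix, "lib", "libtcmalloc.so")
if conda_prefix and os.path.exists(tcmalloc):
    # Preload tcmalloc only when the shared object actually exists, so ld.so
    # never emits the "cannot be preloaded" warning that corrupts the output
    # neural_compressor parses for the socket count.
    ld_preload = os.environ.get("LD_PRELOAD", "")
    os.environ["LD_PRELOAD"] = f"{ld_preload}:{tcmalloc}" if ld_preload else tcmalloc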
