
Issue running gemma_xpu_finetuning notebook

FRYoussef
Beginner

Hi,

I was trying to run the aforementioned notebook in the IDC, but I got the following error when running the "Step 7: Finetuning the Model" cell. I've added the Hugging Face API key and edited the "WANDB_PROJECT" environment variable with the name of my wandb project.
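For reference, this is roughly how I set those two values (a minimal sketch; the token and project name below are placeholders, and the notebook's own cell may look slightly different):

import os
from huggingface_hub import login

# Hugging Face token, needed to download the gated Gemma weights
login(token="hf_xxx")  # placeholder; I pasted my own token here

# Weights & Biases project that the finetuning run logs to
os.environ["WANDB_PROJECT"] = "my-gemma-project"  # placeholder for the project I created on wandb.ai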

 

Finetuning for max number of steps: 1480
Generating train split: 976/0 [00:01<00:00, 1069.52 examples/s]
Generating train split: 264/0 [00:00<00:00, 3.80 examples/s]
max_steps is given, it will override any value given in num_train_epochs
XPU Name: Intel(R) Data Center GPU Max 1100
XPU Memory: Reserved=9.486 GB, Allocated=9.482 GB, Max Reserved=9.486 GB, Max Allocated=9.482 GB
[2024-05-21 07:56:25,577] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to xpu (auto detect)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 63
     61     print_memory_usage()
     62     torch.xpu.empty_cache()
---> 63 results = trainer.train()
     64 print_training_summary(results)
     65 wandb.finish()

File ~/.local/lib/python3.9/site-packages/trl/trainer/sft_trainer.py:361, in SFTTrainer.train(self, *args, **kwargs)
    358 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    359     self.model = self._trl_activate_neftune(self.model)
--> 361 output = super().train(*args, **kwargs)
    363 # After training we make sure to retrieve back the original forward pass method
    364 # for the embedding layer by removing the forward post hook.
    365 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1876, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1873 try:
   1874     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   1875     hf_hub_utils.disable_progress_bars()
-> 1876     return inner_training_loop(
   1877         args=args,
   1878         resume_from_checkpoint=resume_from_checkpoint,
   1879         trial=trial,
   1880         ignore_keys_for_eval=ignore_keys_for_eval,
   1881     )
   1882 finally:
   1883     hf_hub_utils.enable_progress_bars()

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:2022, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2018         gradient_checkpointing_kwargs = args.gradient_checkpointing_kwargs
   2020     self.model.gradient_checkpointing_enable(gradient_checkpointing_kwargs=gradient_checkpointing_kwargs)
-> 2022 model = self._wrap_model(self.model_wrapped)
   2024 # as the model is wrapped, don't use `accelerator.prepare`
   2025 # this is for unhandled cases such as
   2026 # FSDP-XLA, SageMaker MP/DP, DataParallel, IPEX
   2027 use_accelerator_prepare = True if model is self.model else False

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1640, in Trainer._wrap_model(self, model, training, dataloader)
   1637     return smp.DistributedModel(model, backward_passes_per_step=self.args.gradient_accumulation_steps)
   1639 # train/eval could be run multiple-times - if already wrapped, don't re-wrap it again
-> 1640 if self.accelerator.unwrap_model(model) is not model:
   1641     return model
   1643 # Mixed precision training with apex (torch < 1.6)

File ~/.local/lib/python3.9/site-packages/accelerate/accelerator.py:2506, in Accelerator.unwrap_model(self, model, keep_fp32_wrapper)
   2475 def unwrap_model(self, model, keep_fp32_wrapper: bool = True):
   2476     """
   2477     Unwraps the `model` from the additional layer possible added by [`~Accelerator.prepare`]. Useful before saving
   2478     the model.
   (...)
   2504     ```
   2505     """
-> 2506     return extract_model_from_parallel(model, keep_fp32_wrapper)

File ~/.local/lib/python3.9/site-packages/accelerate/utils/other.py:80, in extract_model_from_parallel(model, keep_fp32_wrapper, recursive)
     77     model = model._orig_mod
     79 if is_deepspeed_available():
---> 80     from deepspeed import DeepSpeedEngine
     82     options += (DeepSpeedEngine,)
     84 if is_torch_version(">=", FSDP_PYTORCH_VERSION) and is_torch_distributed_available():

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/__init__.py:22
     19     HAS_TRITON = False
     21 from . import ops
---> 22 from . import module_inject
     24 from .accelerator import get_accelerator
     25 from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      3 
      4 # DeepSpeed Team
----> 6 from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
      7 from .module_quantize import quantize_transformer_layer
      8 from .replace_policy import HFBertLayerPolicy

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py:607
    603     replaced_module, _ = _replace_module(model, policy, state_dict=sd)
    604     return replaced_module
--> 607 from ..pipe import PipelineModule
    609 import re
    612 def skip_level_0_prefix(model, state_dict):

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/pipe/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      3 
      4 # DeepSpeed Team
----> 6 from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      3 
      4 # DeepSpeed Team
----> 6 from .module import PipelineModule, LayerSpec, TiedLayerSpec
      7 from .topology import ProcessTopology

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py:19
     17 from deepspeed.utils import logger
     18 from .. import utils as ds_utils
---> 19 from ..activation_checkpointing import checkpointing
     20 from .topology import PipeDataParallelTopology, PipelineParallelGrid
     21 from deepspeed.runtime.state_dict_factory import SDLoaderFactory

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py:26
     23 import mmap
     24 from torch import _C
---> 26 from deepspeed.runtime.config import DeepSpeedConfig
     27 from deepspeed.utils import logger
     28 from deepspeed.runtime.utils import copy_to_device, move_to_device, see_memory_usage, bwc_tensor_model_parallel_rank

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/runtime/config.py:61
     58 from ..autotuning.config import DeepSpeedAutotuningConfig
     59 from ..nebula.config import DeepSpeedNebulaConfig
---> 61 from ..compression.config import get_compression_config, get_quantize_enabled
     62 from ..compression.constants import *
     63 from .swap_tensor.aio_config import get_aio_config

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/__init__.py:6
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      3 
      4 # DeepSpeed Team
----> 6 from .compress import init_compression, redundancy_clean
      7 from .scheduler import compression_scheduler
      8 from .helper import convert_conv1d_to_linear

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/compress.py:7
      1 # Copyright (c) Microsoft Corporation.
      2 # SPDX-License-Identifier: Apache-2.0
      3 
      4 # DeepSpeed Team
      6 import re
----> 7 from .helper import compression_preparation, fix_compression, recursive_getattr, is_module_compressible
      8 from .config import get_compression_config
      9 from ..runtime.config_utils import dict_raise_error_on_duplicate_keys

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/deepspeed/compression/helper.py:12
      9 from deepspeed.utils import logger
     11 try:
---> 12     from neural_compressor.compression import pruner as nc_pruner
     13 except ImportError as e:
     14     nc_pruner = None

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/__init__.py:28
     20 # we need to set a global 'NA' backend, or Model can't be used
     21 from .config import (
     22     DistillationConfig,
     23     PostTrainingQuantConfig,
   (...)
     26     MixedPrecisionConfig,
     27 )
---> 28 from .contrib import *
     29 from .model import *
     30 from .metric import *

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/__init__.py:18
      1 #!/usr/bin/env python
      2 # -*- coding: utf-8 -*-
      3 #
   (...)
     15 # See the License for the specific language governing permissions and
     16 # limitations under the License.
     17 """Built-in strategy for multiple framework backends."""
---> 18 from .strategy import *

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/strategy/__init__.py:25
     23 for f in modules:
     24     if isfile(f) and not f.startswith("__") and not f.endswith("__init__.py"):
---> 25         __import__(basename(f)[:-3], globals(), locals(), level=1)

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/contrib/strategy/sigopt.py:21
     18 import copy
     19 from collections import OrderedDict
---> 21 from neural_compressor.strategy.strategy import TuneStrategy, strategy_registry
     22 from neural_compressor.strategy.utils.tuning_sampler import OpWiseTuningSampler
     23 from neural_compressor.strategy.utils.tuning_structs import OpTuningConfig

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/strategy/__init__.py:19
      1 #!/usr/bin/env python
      2 # -*- coding: utf-8 -*-
      3 #
   (...)
     15 # See the License for the specific language governing permissions and
     16 # limitations under the License.
     17 """Intel Neural Compressor Strategy."""
---> 19 from .strategy import STRATEGIES
     20 from os.path import dirname, basename, isfile, join
     21 import glob

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/strategy/strategy.py:35
     32 import numpy as np
     33 import yaml
---> 35 from neural_compressor.adaptor.tensorflow import TensorFlowAdaptor
     37 from ..adaptor import FRAMEWORKS
     38 from ..algorithm import ALGORITHMS, AlgorithmScheduler

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/adaptor/__init__.py:26
     24 for f in modules:
     25     if isfile(f) and not f.startswith("__") and not f.endswith("__init__.py"):
---> 26         __import__(basename(f)[:-3], globals(), locals(), level=1)
     28 __all__ = ["FRAMEWORKS"]

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/adaptor/pytorch.py:43
     40 torch_utils = LazyImport("neural_compressor.adaptor.torch_utils")
     41 ipex = LazyImport("intel_extension_for_pytorch")
---> 43 REDUCE_RANGE = False if CpuInfo().vnni else True
     44 logger.debug("Reduce range is {}".format(str(REDUCE_RANGE)))
     47 def get_torch_version():

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:129, in singleton.<locals>._singleton(*args, **kw)
    127 """Create a singleton object."""
    128 if cls not in instances:
--> 129     instances[cls] = cls(*args, **kw)
    130 return instances[cls]

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:255, in CpuInfo.__init__(self)
    253     self._sockets = 1
    254 else:
--> 255     self._sockets = self.get_number_of_sockets()
    256 self._cores = psutil.cpu_count(logical=False)
    257 self._cores_per_socket = int(self._cores / self._sockets)

File /opt/intel/oneapi/intelpython/envs/pytorch-gpu/lib/python3.9/site-packages/neural_compressor/utils/utility.py:290, in CpuInfo.get_number_of_sockets(self)
    288     if proc.stdout:
    289         for line in proc.stdout:
--> 290             return int(line.decode("utf-8", errors="ignore").strip())
    291 return 0

ValueError: invalid literal for int() with base 10: "ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored."
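
From the last few frames, it looks like neural_compressor's CpuInfo parses the CPU socket count out of a shell command's output and calls int() on the first line it reads; here the ld.so preload warning lands in that output instead of a number. A minimal sketch of the failing parse (the message is copied from the ValueError above, the surrounding code is paraphrased from the traceback):

# Paraphrase of CpuInfo.get_number_of_sockets(): int() the first line of output.
line = b"ERROR: ld.so: object '/opt/intel/oneapi/intelpython/lib/libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored."
int(line.decode("utf-8", errors="ignore").strip())  # raises the ValueError above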
6 Replies
Faiz_Intel
Moderator

Hi FRYoussef,

Thank you for reaching out to us.

 

To ensure we can assist you effectively, please provide the following details. If not applicable, please use 'N/A'. We appreciate your cooperation, and we look forward to resolving this matter promptly.

 

1. Severity level of the issue: Low/Medium/High/Critical

2. Intel® Developer Cloud account ID: 

3. Intel® Developer Cloud account tier: Standard/Premium/Enterprise

4. JupyterLab ID: uxxxxxxxxxxxxxxxxxxxxxxxxxxx

5. Full screenshot of the error on gemma_xpu_finetuning notebook.

6. Kernel used on gemma_xpu_finetuning notebook.

 

Additionally, you mentioned that you "edited the 'WANDB_PROJECT' environment variable with the wandb project". Could you please provide detailed steps on how you did that? This information will help us replicate the issue on our end.

 

Regards,

Faiz

 

FRYoussef
Beginner

Hi Faiz,

  1. Severity level of the issue: High
  2. Intel® Developer Cloud account ID: 574310785319
  3. Intel® Developer Cloud account tier: Standard
  4. JupyterLab ID: Is it this one, "u11be638b4d6b42d06a88486ae0006d7"? I'm not sure how to get it.
  5. Full screenshot of the error on gemma_xpu_finetuning notebook (the full error is in my previous comment):

[Screenshot attached: Captura de pantalla_22-5-2024_152435_idcbetabatch.eglb.intel.com.jpeg]

 

  6. Kernel used on gemma_xpu_finetuning notebook:

[Screenshot attached: Captura de pantalla_22-5-2024_15247_idcbetabatch.eglb.intel.com.jpeg]

 

For the "WANDB_PROJECT" var, I've changed it to the name I used when creating a new project in https://wandb.ai/

The other thing I changed is the var "finetuned_model_id", as I followed in Eduardo Alvarez's workshop, I set the var to my Hugging Face user name.
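
Concretely, the two assignments look roughly like this (a sketch with my values redacted; the variable names come from the notebook, the values are placeholders):

import os

os.environ["WANDB_PROJECT"] = "my-wandb-project"          # the project I created on wandb.ai
finetuned_model_id = "<my_hf_username>/gemma-finetuned"   # placeholder: my HF user name plus a model name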

 

Faiz_Intel
Moderator

Hi FRYoussef,


Thank you for your response. We've informed the appropriate team for further investigation of this matter and will provide you with an update soon. We greatly appreciate your patience.


Regards,

Faiz


Faiz_Intel
Moderator

This thread will no longer be monitored since we have provided a solution via email. If you need any additional information from Intel, please submit a new question.


TFXGAME
Beginner

I'm seeing the same issue with the unmodified notebook.

TFXGAME
Beginner

I got some ideas from https://github.com/block-hczhai/block2-preview/issues/8,

so I commented out some lines:

import os

# Set the LD_PRELOAD environment variable
# ld_preload = os.environ.get("LD_PRELOAD", "")
conda_prefix = os.environ.get("CONDA_PREFIX", "")
# Improve memory allocation performance; if tcmalloc is not available, comment this line out
# os.environ["LD_PRELOAD"] = f"{ld_preload}:{conda_prefix}/lib/libtcmalloc.so"
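
A slightly safer variant (an untested sketch of the same idea) would be to preload tcmalloc only when the library file actually exists, so that child processes, like the socket-count command neural_compressor runs, never inherit a broken LD_PRELOAD entry:

import os

conda_prefix = os.environ.get("CONDA_PREFIX", "")
tcmalloc = os.path.join(conda_prefix, "lib", "libtcmalloc.so")
if conda_prefix and os.path.isfile(tcmalloc):
    # Preload tcmalloc only when it is really present; a missing file is what
    # produced the ld.so "cannot be preloaded" warning that broke the int() parse.
    ld_preload = os.environ.get("LD_PRELOAD", "")
    os.environ["LD_PRELOAD"] = f"{ld_preload}:{tcmalloc}" if ld_preload else tcmalloc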
