Dear Author,
Thanks for your great work! Pseudo-labelling worked fine with the previous implementation (about a month ago), but after updating the codebase to the latest main branch and following the README, I ran into the issue below:
04/01/2024 09:12:23 - INFO - __main__ - ***** Running Labelling *****
04/01/2024 09:12:23 - INFO - __main__ - Instantaneous batch size per device = 8
04/01/2024 09:12:23 - INFO - __main__ - Total eval batch size (w. parallel & distributed) = 16
04/01/2024 09:12:23 - INFO - __main__ - Predict labels with timestamps = True
Evaluating train...: 0%| | 0/52 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
main()
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
eval_step_with_save(split=split)
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
for step, batch in enumerate(batches):
File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1169, in __iter__
for obj in iterable:
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
current_batch = send_to_device(current_batch, self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
return tensor.to(device)
^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
main()
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
eval_step_with_save(split=split)
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
for step, batch in enumerate(batches):
File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
current_batch = send_to_device(current_batch, self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
return tensor.to(device)
^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/myenv/distil_whisper/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
do_one_step()
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
fd = df.detach()
^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /alghome/craig.hsin/framework/distil-whisper/training/wandb/offline-run-20240401_091200-0oe1zyh2
wandb: Find logs at: ./wandb/offline-run-20240401_091200-0oe1zyh2/logs
[2024-04-01 09:12:31,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2703427) of binary: /myenv/distil_whisper/bin/python
Traceback (most recent call last):
File "/myenv/distil_whisper/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
multi_gpu_launcher(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
distrib_run.run(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pseudo_labelling.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-01_09:12:31
host : alg4
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2703428)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-01_09:12:31
host : alg4
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2703427)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Some of my environment info:
Name Version Build Channel
python 3.11.8 h955ad1f_0
torch 2.1.1+cu118 pypi_0 pypi
transformers 4.39.1 pypi_0 pypi
Could you provide some suggestions on how I should proceed with the investigation? Thanks.
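In case it helps narrow things down, here is a small debugging sketch I was thinking of adding where the collated batch is produced, just before it reaches send_to_device (the helper name and hook point are my own guesses, not part of run_pseudo_labelling.py), to see which batch entries are still plain Python lists instead of tensors:

import torch

# Hypothetical debugging helper (assumption: not part of the repo's code).
# Call it on the batch returned by the data collator, right before the batch
# is moved to the device, to list entries that are not tensors.
def report_non_tensor_entries(batch):
    for key, value in batch.items():
        if not isinstance(value, torch.Tensor):
            # torch.is_floating_point() only accepts tensors, so any plain
            # list here would trigger the same TypeError raised in
            # transformers feature_extraction_utils.py during BatchFeature.to()
            print(f"key '{key}' is {type(value).__name__}, not a Tensor")

My guess from the traceback is that one of these entries (perhaps something related to the timestamp labels) is being left as a list by the collator, but I have not confirmed which one.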