The relevant code that caused the error is in the Controllable Text Generation section: after the model trained for 6 epochs and started evaluating, it raised a KeyError: 'eval_loss' #65

Open
Markkk111 opened this issue Apr 19, 2023 · 2 comments

Comments

@Markkk111

Hello! I would really appreciate your help with this question!

  1. The error message is as follows:
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    {'loss': 0.0185, 'learning_rate': 7.973102785782901e-06, 'epoch': 5.04}
    {'loss': 0.0185, 'learning_rate': 6.9724623759205894e-06, 'epoch': 5.16}
    {'loss': 0.0188, 'learning_rate': 5.971821966058277e-06, 'epoch': 5.28}
    {'loss': 0.0178, 'learning_rate': 4.971181556195966e-06, 'epoch': 5.4}
    wandb: Network error (ReadTimeout), entering retry loop.
    {'loss': 0.0183, 'learning_rate': 3.970541146333654e-06, 'epoch': 5.52}
    {'loss': 0.018, 'learning_rate': 2.9699007364713415e-06, 'epoch': 5.64}
    {'loss': 0.0179, 'learning_rate': 1.96926032660903e-06, 'epoch': 5.76}
    {'loss': 0.0174, 'learning_rate': 9.68619916746718e-07, 'epoch': 5.88}
    [INFO|trainer.py:1901] 2023-04-19 17:43:28,689 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1749.401, 'train_samples_per_second': 142.815, 'train_steps_per_second': 14.281, 'train_loss': 0.023130248267032992, 'epoch': 6.0}
[INFO|trainer.py:2709] 2023-04-19 17:43:28,693 >> Saving model checkpoint to classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None
[INFO|configuration_utils.py:453] 2023-04-19 17:43:28,694 >> Configuration saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/config.json
[INFO|modeling_utils.py:1704] 2023-04-19 17:43:29,841 >> Model weights saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/pytorch_model.bin
***** train metrics *****
epoch = 6.0
train_loss = 0.0231
train_runtime = 0:29:09.40
train_samples = 41640
train_samples_per_second = 142.815
train_steps_per_second = 14.281
04/19/2023 17:43:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:710] 2023-04-19 17:43:29,848 >> The following columns in the evaluation set don't have a corresponding argument in Classifier_Tree.forward and have been ignored: chart_lst. If chart_lst are not expected by Classifier_Tree.forward, you can safely ignore this message.
[INFO|trainer.py:2964] 2023-04-19 17:43:29,850 >> ***** Running Evaluation *****
[INFO|trainer.py:2966] 2023-04-19 17:43:29,850 >> Num examples = 421
[INFO|trainer.py:2969] 2023-04-19 17:43:29,851 >> Batch size = 10
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
{'eval_runtime': 1.4868, 'eval_samples_per_second': 283.16, 'eval_steps_per_second': 28.921, 'epoch': 6.0}
Traceback (most recent call last):
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1704, in
main()
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1675, in main
perplexity = math.exp(metrics["eval_loss"])
KeyError: 'eval_loss'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Exception ignored in atexit callback: <function _Manager._atexit_setup.<locals>.<lambda> at 0x7f2f280f1fc0>
Traceback (most recent call last):
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 166, in
self._atexit_lambda = lambda: self._atexit_teardown()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 175, in _atexit_teardown
self._teardown(exit_code)
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 186, in _teardown
result = self._service.join()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 216, in join
ret = self._internal_proc.wait()
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1204, in wait
return self._wait(timeout=timeout)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1938, in _wait
(pid, sts) = self._try_wait(0)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1896, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt:
(diffusion-LM) name@taizun-SYS-4029GP-TRT:/diffusion-LM$ wandb: 0.010 MB of 0.010 MB uploaded (0.000 MB deduped)

  2. The relevant code that caused the error is as follows:

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        metrics = train_result.metrics

        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
    if data_args.dataset_name is not None:
        kwargs["dataset_tags"] = data_args.dataset_name
        if data_args.dataset_config_name is not None:
            kwargs["dataset_args"] = data_args.dataset_config_name
            kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
        else:
            kwargs["dataset"] = data_args.dataset_name

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)
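
For what it's worth, the repeated huggingface/tokenizers warnings in the log are unrelated to the crash and can be silenced by setting TOKENIZERS_PARALLELISM=false before the tokenizer is loaded, exactly as the warning itself suggests. The crash happens because trainer.evaluate() returned timing metrics but no 'eval_loss', so metrics["eval_loss"] raises a KeyError. Below is a minimal defensive sketch of the evaluation block (my own workaround, not the repository's official fix; trainer and logger are the objects already defined in run_clm.py) that only computes perplexity when an eval loss is actually reported. The likely underlying cause is that the evaluation batches carry no labels from which Classifier_Tree.forward can compute a loss.

    import math

    metrics = trainer.evaluate()

    # 'eval_loss' is only present when the model returned a loss during
    # evaluation, which requires the eval batches to contain usable labels;
    # guard the perplexity computation instead of indexing the key directly.
    if "eval_loss" in metrics:
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity
    else:
        logger.warning(
            "trainer.evaluate() returned no 'eval_loss'; check that the "
            "evaluation dataset keeps a label column that "
            "Classifier_Tree.forward can use to compute a loss."
        )

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

This only prevents the crash and reports the metrics that were produced; if 'eval_loss' is missing, the real fix is to make sure the evaluation set passes labels through to the model so a loss (and perplexity) can be computed.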

@25018528927

Hello! I see that you have the same problem as me. Did you manage to solve it? If so, how? Looking forward to your answer, I'm quite anxious about this! @Markkk111

@heychhavi

Hi @Markkk111 @25018528927, I am facing the same issue. Were either of you able to solve it?
