The relevant code that caused the error is in the Controllable Text Generation section: after the model trained for 6 epochs and started evaluating, it raised a KeyError: 'eval_loss' #65

Open
Markkk111 opened this issue Apr 19, 2023 · 2 comments

Comments

@Markkk111

Hello! I would really appreciate your help with this question!

  1. The error message is as follows:
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    {'loss': 0.0185, 'learning_rate': 7.973102785782901e-06, 'epoch': 5.04}
    {'loss': 0.0185, 'learning_rate': 6.9724623759205894e-06, 'epoch': 5.16}
    {'loss': 0.0188, 'learning_rate': 5.971821966058277e-06, 'epoch': 5.28}
    {'loss': 0.0178, 'learning_rate': 4.971181556195966e-06, 'epoch': 5.4}
    wandb: Network error (ReadTimeout), entering retry loop.
    {'loss': 0.0183, 'learning_rate': 3.970541146333654e-06, 'epoch': 5.52}
    {'loss': 0.018, 'learning_rate': 2.9699007364713415e-06, 'epoch': 5.64}
    {'loss': 0.0179, 'learning_rate': 1.96926032660903e-06, 'epoch': 5.76}
    {'loss': 0.0174, 'learning_rate': 9.68619916746718e-07, 'epoch': 5.88}
    [INFO|trainer.py:1901] 2023-04-19 17:43:28,689 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1749.401, 'train_samples_per_second': 142.815, 'train_steps_per_second': 14.281, 'train_loss': 0.023130248267032992, 'epoch': 6.0}
[INFO|trainer.py:2709] 2023-04-19 17:43:28,693 >> Saving model checkpoint to classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None
[INFO|configuration_utils.py:453] 2023-04-19 17:43:28,694 >> Configuration saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/config.json
[INFO|modeling_utils.py:1704] 2023-04-19 17:43:29,841 >> Model weights saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/pytorch_model.bin
***** train metrics *****
epoch = 6.0
train_loss = 0.0231
train_runtime = 0:29:09.40
train_samples = 41640
train_samples_per_second = 142.815
train_steps_per_second = 14.281
04/19/2023 17:43:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:710] 2023-04-19 17:43:29,848 >> The following columns in the evaluation set don't have a corresponding argument in Classifier_Tree.forward and have been ignored: chart_lst. If chart_lst are not expected by Classifier_Tree.forward, you can safely ignore this message.
[INFO|trainer.py:2964] 2023-04-19 17:43:29,850 >> ***** Running Evaluation *****
[INFO|trainer.py:2966] 2023-04-19 17:43:29,850 >> Num examples = 421
[INFO|trainer.py:2969] 2023-04-19 17:43:29,851 >> Batch size = 10
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
{'eval_runtime': 1.4868, 'eval_samples_per_second': 283.16, 'eval_steps_per_second': 28.921, 'epoch': 6.0}
Traceback (most recent call last):
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1704, in
main()
File "/home/name/diffusion-LM/transformers/examples/pytorch/language-modeling/run_clm.py", line 1675, in main
perplexity = math.exp(metrics["eval_loss"])
KeyError: 'eval_loss'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Exception ignored in atexit callback: <function _Manager._atexit_setup.<locals>.<lambda> at 0x7f2f280f1fc0>
Traceback (most recent call last):
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 166, in
self._atexit_lambda = lambda: self._atexit_teardown()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 175, in _atexit_teardown
self._teardown(exit_code)
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 186, in _teardown
result = self._service.join()
File "/home/name/anaconda3/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 216, in join
ret = self._internal_proc.wait()
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1204, in wait
return self._wait(timeout=timeout)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1938, in _wait
(pid, sts) = self._try_wait(0)
File "/home/name/anaconda3/lib/python3.10/subprocess.py", line 1896, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt:
(diffusion-LM) name@taizun-SYS-4029GP-TRT:/diffusion-LM$ wandb: 0.010 MB of 0.010 MB uploaded (0.000 MB deduped)

  2. The relevant code that caused the error is as follows:

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        metrics = train_result.metrics

        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
    if data_args.dataset_name is not None:
        kwargs["dataset_tags"] = data_args.dataset_name
        if data_args.dataset_config_name is not None:
            kwargs["dataset_args"] = data_args.dataset_config_name
            kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
        else:
            kwargs["dataset"] = data_args.dataset_name

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)
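
For what it's worth, the repeated huggingface/tokenizers warnings in the log are unrelated to the crash and can be silenced by setting TOKENIZERS_PARALLELISM=false before the tokenizer is loaded, exactly as the warning itself suggests. The crash happens because trainer.evaluate() returned timing metrics but no 'eval_loss', so metrics["eval_loss"] raises a KeyError. Below is a minimal defensive sketch of the evaluation block (my own workaround, not the repository's official fix; trainer and logger are the objects already defined in run_clm.py) that only computes perplexity when an eval loss is actually reported. The likely underlying cause is that the evaluation batches carry no labels from which Classifier_Tree.forward can compute a loss.

    import math

    metrics = trainer.evaluate()

    # 'eval_loss' is only present when the model returned a loss during
    # evaluation, which requires the eval batches to contain usable labels;
    # guard the perplexity computation instead of indexing the key directly.
    if "eval_loss" in metrics:
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity
    else:
        logger.warning(
            "trainer.evaluate() returned no 'eval_loss'; check that the "
            "evaluation dataset keeps a label column that "
            "Classifier_Tree.forward can use to compute a loss."
        )

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

This only prevents the crash and reports the metrics that were produced; if 'eval_loss' is missing, the real fix is to make sure the evaluation set passes labels through to the model so a loss (and perplexity) can be computed.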

@25018528927

Hello! I see that you have the same problem as me. Did you manage to solve it? If so, how? Looking forward to your answer, I'm quite anxious about this! @Markkk111

@heychhavi

Hi @Markkk111 @25018528927, I am facing the same issue. Were either of you able to solve it?
