
How to reproduce the performance on the ReDial dataset? #8

Open
dandyxxxx opened this issue Nov 4, 2023 · 4 comments

Comments

@dandyxxxx

dandyxxxx commented Nov 4, 2023

I trained with the code provided on GitHub, but since the dataset link you provided cannot be opened, I used the mappingbased-objects_lang=en_202112.ttl dump instead. The final results of my training are as follows:

conv:
'test/dist@2': 0.310709750246931, 'test/dist@3': 0.49851841399746016, 'test/dist@4': 0.6383519119514605
rec:
'test/recall@1': 0.029324894514767934, 'test/recall@10': 0.16729957805907172, 'test/recall@50': 0.37953586497890296

(1) These results differ greatly from those presented in the paper. Could you give me some guidance? I hope to reproduce results similar to yours. Thank you very much.
(2) According to your paper, do I need to set --n_prefix_conv 50 in train_conv.py and --use_resp in train_rec.py?

@linshan-79

I ran into the same problem as you. After the author provided the missing files, I trained by following the guidance, but my metric results are also similar to yours. Here are the details of the ReDial dataset metrics:

conv:

'test/dist@2': 0.26710879074361504, 'test/dist@3': 0.4199238041484408, 'test/dist@4': 0.5233526174686045,

rec:

'test/recall@1': 0.035443037974683546, 'test/recall@10': 0.1729957805907173, 'test/recall@50': 0.3744725738396624, 
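For reference on what these numbers measure: recall@k here is the fraction of test examples whose ground-truth item appears in the model's top-k recommendations. A minimal sketch in Python (hypothetical helper names, not the repo's evaluation code):

    def recall_at_k(ranked_items, gold_item, k):
        # 1.0 if the ground-truth item is in the top-k ranked list, else 0.0
        return float(gold_item in ranked_items[:k])

    def mean_recall_at_k(all_rankings, all_gold, k):
        # average the per-example hit over the whole test set
        hits = [recall_at_k(r, g, k) for r, g in zip(all_rankings, all_gold)]
        return sum(hits) / len(hits)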

Here is my config for the conversational task:

accelerate launch train_conv.py \
           --dataset redial \
           --tokenizer ~/model/DialoGPT-small \
           --model ~/model/DialoGPT-small \
           --text_tokenizer ~/model/roberta-base \
           --text_encoder ~/model/roberta-base \
           --n_prefix_conv 50 \
           --prompt_encoder ${prompt_encoder_dir}/final \
           --num_train_epochs 10 \
           --gradient_accumulation_steps 1 \
           --ignore_pad_token_for_loss \
           --per_device_train_batch_size 8 \
           --per_device_eval_batch_size 16 \
           --num_warmup_steps 6345 \
           --context_max_length 200 \
           --resp_max_length 183 \
           --prompt_max_length 200 \
           --entity_max_length 32 \
           --learning_rate 1e-4 \
           --output_dir ${output_dir} \
           --log_all
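One thing worth sanity-checking in this config is how --num_warmup_steps 6345 compares to the total number of optimizer steps. A minimal sketch in Python, assuming single-process training and a placeholder training-set size (replace n_train with your actual count):

    # n_train is a placeholder, NOT the verified size of the ReDial conv training set
    n_train = 65_000
    batch_size = 8        # --per_device_train_batch_size (single GPU assumed)
    grad_accum = 1        # --gradient_accumulation_steps
    epochs = 10           # --num_train_epochs

    steps_per_epoch = n_train // (batch_size * grad_accum)
    total_steps = steps_per_epoch * epochs
    warmup_steps = 6345   # --num_warmup_steps

    print(f"warmup covers {warmup_steps / total_steps:.1%} of {total_steps} steps")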

(1) @dandyxxxx, you can see that I set --n_prefix_conv 50, but the results still don't match the paper. Could you share your configuration details? Maybe we can work together to solve the problem. Thank you very much!
(2) @wxl1999 Thanks for your work! As a beginner, I learned a lot from your paper and code. I suspect the issue might be that the KG module is not correctly capturing relations from the dataset. Could you provide some guidance? Thank you very much!

@wxl1999
Owner

wxl1999 commented Nov 27, 2023

Sorry for the late reply!

  • The pre-training stage is very important for the final performance. You should observe very good Recall@50 during pre-training, since the answer is actually provided in the response.
    [screenshot: Recall@50 curve]
    • If your pre-training is not going well, you will not observe a continuous drop in the loss curve.
      [screenshot: loss curve]
  • Once your pre-training is well conducted, you will observe similar performance on the recommendation task after fine-tuning.
  • As for the conversation task, distinct is not a very reliable metric (you can observe continuous gains if you simply keep training), so I suggest not paying too much attention to it and focusing more on human evaluation instead; this is also the practice for large language models. A sketch of how this metric is typically computed follows this list.
  • About the evaluation of conversational recommendation, you can also refer to this paper: Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models.
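A minimal sketch in Python of one common corpus-level variant of dist@n (unique n-grams divided by total n-grams across all generated responses); the repo's exact implementation may differ:

    def dist_n(responses, n):
        # ratio of unique n-grams to total n-grams over all generated responses
        total, unique = 0, set()
        for resp in responses:
            tokens = resp.split()
            ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            total += len(ngrams)
            unique.update(ngrams)
        return len(unique) / total if total else 0.0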

Hope this can help you!

@linshan-79

Thanks for your reply! This helps me a lot.

@careerists

@linshan-79 I have the same problem. Did you finally solve it? Thank you so much.
