Don't shuffle the dataset when num_epochs=1 #118

Open · JohnGiorgi opened this issue Jul 9, 2020 · 8 comments
Labels: enhancement (New feature or request)
Milestone: v0.1.0

JohnGiorgi (Owner) commented Jul 9, 2020
Currently, the dataset reader shuffles the dataset every epoch. To do this, it reads the entire dataset into memory, shuffles it, then yields instances one by one. This was the only way I could figure out how to shuffle a lazy AllenNLP dataset reader.

Unfortunately, for large datasets this means we need a lot of memory. Fortunately, for large datasets, really good performance can be achieved in only 1 epoch (as we found in the paper). Therefore, I think the DatasetReader should be updated such that shuffling only happens when num_epochs > 1. I am not sure how the DatasetReader could get access to num_epochs, so the user may just have to provide a shuffle argument.
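
Something like the following untested sketch is what I have in mind (`ContrastiveDatasetReader` and `_yield_instances` are placeholder names here, not the actual reader in this repo):

```python
import random
from typing import Iterable

from allennlp.data import DatasetReader, Instance


class ContrastiveDatasetReader(DatasetReader):
    """Hypothetical reader that shuffles in memory only when asked to."""

    def __init__(self, shuffle: bool = False, **kwargs) -> None:
        super().__init__(**kwargs)
        # For num_epochs == 1, the user can pass shuffle=False and the
        # dataset is streamed lazily without ever being held in memory.
        self._shuffle = shuffle

    def _read(self, file_path: str) -> Iterable[Instance]:
        if self._shuffle:
            # Materialize the whole dataset so it can be shuffled each
            # epoch. This is the current (memory-hungry) behaviour.
            instances = list(self._yield_instances(file_path))
            random.shuffle(instances)
            yield from instances
        else:
            # Stream instances one by one: no shuffling, small memory footprint.
            yield from self._yield_instances(file_path)

    def _yield_instances(self, file_path: str) -> Iterable[Instance]:
        # Placeholder for the reader's actual parsing logic.
        raise NotImplementedError
```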

JohnGiorgi added this to the v0.1.0 milestone Jul 13, 2020
JohnGiorgi added the enhancement (New feature or request) label Jul 13, 2020
JohnGiorgi pinned this issue Jan 27, 2021
JohnGiorgi (Owner) commented

This problem is solved by migrating to AllenNLP>=2.0.0. I will close this once I have merged the migration.
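
For reference, my understanding of why 2.x fixes this (an untested sketch of the 2.x data loader API, not code from this repo): shuffling moves out of the reader and into the data loader, which can shuffle within a bounded buffer.

```python
from allennlp.data.data_loaders import MultiProcessDataLoader

# my_reader and "train.txt" are placeholders for the project's actual
# dataset reader and training data path.
loader = MultiProcessDataLoader(
    reader=my_reader,
    data_path="train.txt",
    batch_size=16,
    shuffle=True,
    # Instances are read lazily in chunks of this size and shuffled
    # within each chunk, so the full dataset never sits in memory.
    max_instances_in_memory=10_000,
)
```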

repodiac commented Sep 27, 2021

Still a question: so you DO shuffle between epochs - great. But why does the loss drop significantly when I restart training (from_archive) with an existing model? The first epoch after restarting seems to lead to a much bigger drop in the resulting loss - that's why I assumed that maybe you only shuffle at the BEGINNING of epoch 1 and then, for the following epochs (in the same run!), no shuffling takes place at all!?

EDIT: I just checked that you still use AllenNLP 1.1.0... does that mean that my assumption is - still - correct?

JohnGiorgi (Owner) commented

@repodiac

RE dataset shuffling:

With the current code, shuffling should happen at the beginning of every epoch.

RE big drop in loss after first epoch:

I think the big drop you are seeing has to do with how AllenNLP presents the loss. I believe it is a running average during that epoch. Loss starts off very high, naturally, so during the first epoch, the average loss is correspondingly high. When a new epoch begins, the loss is much lower because a new running average (for that epoch only) is being computed.
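
A toy illustration of what I mean (plain Python, not AllenNLP code; the numbers are made up):

```python
def running_average(losses):
    """What a per-epoch progress bar would display after each batch."""
    total = 0.0
    for i, loss in enumerate(losses, start=1):
        total += loss
        yield total / i

# Epoch 1: early, very high losses dominate the average all epoch long.
print(list(running_average([10.0, 9.0, 8.0, 7.0, 6.0, 5.0])))
# -> [10.0, 9.5, 9.0, 8.5, 8.0, 7.5]

# Epoch 2: the average restarts, so it immediately looks much lower,
# even though the underlying loss curve is smooth.
print(list(running_average([5.0, 4.5, 4.0])))
# -> [5.0, 4.75, 4.5]
```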

Just a note that I am going off memory here and could be wrong.

repodiac commented

Hi @JohnGiorgi, thanks for your quick response and help!

Regarding the loss: I am not quite sure I understand correctly. What I mean is, for example:

1. I start with loss = 10 in epoch 0; it reaches 7 at the end of epoch 9 (10 epochs in total). The model is saved.
2. I restart/resume from the epoch-9 model in a new run, also with 10 epochs in total. The loss after epoch 0 (i.e. "epoch 10"...) is not close to 7 (see the previous run), but, say, 5.
3. ... and so forth ...

I have now run for 30 epochs in total, but divided into 3 runs of 10 epochs each. Every time, I saw this behaviour (as described above). This cannot have happened by accident.

To my understanding, it should not make a difference that I run 3 x 10 epochs instead of 30 epochs in one go. OK, I agree that due to stochastic behaviour (gradient descent etc.) we can get different results, but not to this extent.

Where am I wrong?

JohnGiorgi (Owner) commented

Did you confirm that the loss is significantly different when you train for 30 straight epochs vs. 3 x 10-epoch runs?

I never broke training up into multiple runs and continued with from_archive, so I am not sure if there are any gotchas with that approach.

repodiac commented Sep 27, 2021

No, I have not tried a full 30-epoch run yet. But due to overfitting or local minima, for instance, the gap between run 1, epoch 9 and run 2, epoch 0 is way too big to be down to chance. The same applies to run 2 -> run 3.

The "advantage" is/might be that I have a more fine grained control over training runtime (cloud GPU is not for free :) and can decide to continue or not between intermediate runs.

Btw., why is there no validation set, or am I using the allennlp train command in a wrong manner? Every epoch is considered "best" because it seems it does not validate against a holdout dataset (though the running loss average and the current loss are displayed and stored).

JohnGiorgi (Owner) commented Sep 27, 2021

Ah, I just remembered that there is a learning rate scheduler that is likely being reset every time training restarts. This could explain the big drop in loss across restarts.

Kind of annoying, but you may need to manually set some of the learning rate scheduler's parameters (like num_epochs and num_steps_per_epoch) so that it works as expected across restarted runs.
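
A toy illustration of the restart effect (plain Python, not this project's actual scheduler; assume some schedule that decays with the global step count):

```python
def decayed_lr(step, base_lr=5e-5, total_steps=1000):
    # Hypothetical linear decay from base_lr down to 0 over total_steps.
    return base_lr * max(0.0, 1.0 - step / total_steps)

# End of the first 10-epoch run: the LR has decayed almost to zero.
print(decayed_lr(999))  # ~5e-08

# After restarting from_archive, the scheduler's step counter is reset,
# so training resumes at (nearly) the full base learning rate, which can
# produce a sudden extra drop in loss at the start of the new run.
print(decayed_lr(0))    # 5e-05
```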

> Btw., why is there no validation set, or am I using the allennlp train command in a wrong manner?

There is no validation set. We validate on the development sets of SentEval after training is complete. See #190.

repodiac commented

Ok, thanks a lot - I'll look into it.
