Don't shuffle the dataset when num_epochs=1 #118

Open · JohnGiorgi opened this issue Jul 9, 2020 · 8 comments
Labels: enhancement (New feature or request)
Milestone: v0.1.0

JohnGiorgi (Owner) commented Jul 9, 2020
Currently, the dataset reader shuffles the dataset every epoch. To do this, it reads the entire dataset into memory, shuffles it, then yields instances one by one. This was the only way I could figure out how to shuffle a lazy AllenNLP dataset reader.

Unfortunately, for large datasets this means we need a lot of memory. Fortunately, for large datasets, really good performance can be achieved in only 1 epoch (as we found in the paper). Therefore, I think the DatasetReader should be updated such that shuffling only happens when num_epochs > 1. I am not sure how the DatasetReader could get access to num_epochs, so the user may just have to provide a shuffle argument.
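
Something like the following untested sketch is what I have in mind (`ContrastiveDatasetReader` and `_yield_instances` are placeholder names here, not the actual reader in this repo):

```python
import random
from typing import Iterable

from allennlp.data import DatasetReader, Instance


class ContrastiveDatasetReader(DatasetReader):
    """Hypothetical reader that shuffles in memory only when asked to."""

    def __init__(self, shuffle: bool = False, **kwargs) -> None:
        super().__init__(**kwargs)
        # For num_epochs == 1, the user can pass shuffle=False and the
        # dataset is streamed lazily without ever being held in memory.
        self._shuffle = shuffle

    def _read(self, file_path: str) -> Iterable[Instance]:
        if self._shuffle:
            # Materialize the whole dataset so it can be shuffled each
            # epoch. This is the current (memory-hungry) behaviour.
            instances = list(self._yield_instances(file_path))
            random.shuffle(instances)
            yield from instances
        else:
            # Stream instances one by one: no shuffling, small memory footprint.
            yield from self._yield_instances(file_path)

    def _yield_instances(self, file_path: str) -> Iterable[Instance]:
        # Placeholder for the reader's actual parsing logic.
        raise NotImplementedError
```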

JohnGiorgi added this to the v0.1.0 milestone Jul 13, 2020
JohnGiorgi added the enhancement (New feature or request) label Jul 13, 2020
JohnGiorgi pinned this issue Jan 27, 2021
JohnGiorgi (Owner) commented

This problem is solved by migrating to AllenNLP>=2.0.0. I will close this once I have merged the migration.
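
For reference, my understanding of why 2.x fixes this (an untested sketch of the 2.x data loader API, not code from this repo): shuffling moves out of the reader and into the data loader, which can shuffle within a bounded buffer.

```python
from allennlp.data.data_loaders import MultiProcessDataLoader

# my_reader and "train.txt" are placeholders for the project's actual
# dataset reader and training data path.
loader = MultiProcessDataLoader(
    reader=my_reader,
    data_path="train.txt",
    batch_size=16,
    shuffle=True,
    # Instances are read lazily in chunks of this size and shuffled
    # within each chunk, so the full dataset never sits in memory.
    max_instances_in_memory=10_000,
)
```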

repodiac commented Sep 27, 2021

Still a question: so you DO shuffle between epochs - great. But why does the loss drop significantly when I restart training (from_archive) with an existing model? The first epoch after restarting seems to lead to a much bigger drop in the resulting loss - that's why I assumed that maybe you only shuffle at the BEGINNING of epoch 1 and then, for the following epochs (in the same run!), no shuffling takes place at all!?

EDIT: I just checked that you still use AllenNLP 1.1.0... does that mean that my assumption is - still - correct?

JohnGiorgi (Owner) commented

@repodiac

RE dataset shuffling:

With the current code, shuffling should happen at the beginning of every epoch.

RE big drop in loss after first epoch:

I think the big drop you are seeing has to do with how AllenNLP presents the loss. I believe it is a running average during that epoch. Loss starts off very high, naturally, so during the first epoch, the average loss is correspondingly high. When a new epoch begins, the loss is much lower because a new running average (for that epoch only) is being computed.
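
A toy illustration of what I mean (plain Python, not AllenNLP code; the numbers are made up):

```python
def running_average(losses):
    """What a per-epoch progress bar would display after each batch."""
    total = 0.0
    for i, loss in enumerate(losses, start=1):
        total += loss
        yield total / i

# Epoch 1: early, very high losses dominate the average all epoch long.
print(list(running_average([10.0, 9.0, 8.0, 7.0, 6.0, 5.0])))
# -> [10.0, 9.5, 9.0, 8.5, 8.0, 7.5]

# Epoch 2: the average restarts, so it immediately looks much lower,
# even though the underlying loss curve is smooth.
print(list(running_average([5.0, 4.5, 4.0])))
# -> [5.0, 4.75, 4.5]
```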

Just a note that I am going off memory here and could be wrong.

repodiac commented

Hi @JohnGiorgi, thanks for your quick response and help!

Regarding the loss: I am not quite sure I understand correctly. What I mean is, for example:

1. I start with loss = 10 in epoch 0; it reaches 7 at the end of epoch 9 (10 epochs in total). The model is saved.
2. I restart/resume from the epoch-9 model in a new run, also with 10 epochs in total. The loss after epoch 0 (i.e. "epoch 10"...) is not close to 7 (see the previous run), but, say, 5.
3. ... and so forth ...

I have now run for 30 epochs in total, but divided into 3 runs of 10 epochs each. Every time, I saw this behaviour (as described above). This cannot have happened by accident.

To my understanding, it should not make a difference that I run 3 x 10 epochs instead of 30 epochs in one go. OK, I agree that due to stochastic behaviour (gradient descent etc.) we can get different results, but not to this extent.

Where am I wrong?

JohnGiorgi (Owner) commented

Did you confirm that the loss is significantly different when you train for 30 straight epochs vs. 3 x 10-epoch runs?

I never broke training up into multiple runs and continued with from_archive, so I am not sure if there are any gotchas with that approach.

repodiac commented Sep 27, 2021

No, I have not tried a full 30-epoch run yet. But due to overfitting or local minima, for instance, the gap between run 1, epoch 9 and run 2, epoch 0 is way too big to be down to chance. The same applies to run 2 -> run 3.

The "advantage" is/might be that I have a more fine grained control over training runtime (cloud GPU is not for free :) and can decide to continue or not between intermediate runs.

Btw., why is there no validation set, or am I using the allennlp train command in a wrong manner? Every epoch is considered "best" because it seems it does not validate against a holdout dataset (though the running loss average and the current loss are displayed and stored).

JohnGiorgi (Owner) commented Sep 27, 2021

Ah, I just remembered that there is a learning rate scheduler that is likely being reset every time training restarts. This could explain the big drop in loss across restarts.

Kind of annoying, but you may need to manually set some of the learning rate scheduler's parameters (like num_epochs and num_steps_per_epoch) so that it works as expected across restarted runs.
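
A toy illustration of the restart effect (plain Python, not this project's actual scheduler; assume some schedule that decays with the global step count):

```python
def decayed_lr(step, base_lr=5e-5, total_steps=1000):
    # Hypothetical linear decay from base_lr down to 0 over total_steps.
    return base_lr * max(0.0, 1.0 - step / total_steps)

# End of the first 10-epoch run: the LR has decayed almost to zero.
print(decayed_lr(999))  # ~5e-08

# After restarting from_archive, the scheduler's step counter is reset,
# so training resumes at (nearly) the full base learning rate, which can
# produce a sudden extra drop in loss at the start of the new run.
print(decayed_lr(0))    # 5e-05
```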

> Btw., why is there no validation set, or am I using the allennlp train command in a wrong manner?

There is no validation set. We validate on the development sets of SentEval after training is complete. See #190.

repodiac commented

Ok, thanks a lot - I'll look into it.
