Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 & Windows 10
- TensorFlow version and how it was installed (source or binary): 2.3.0 from source
- TensorFlow-Addons version and how it was installed (source or binary): 0.11.1 from source
- Python version: 3.7
- Is GPU used? (yes/no): yes
Describe the bug
Resuming a training process requires restoring the optimizer state so training continues RIGHT from the previous state without any loss of accuracy. Currently, the Keras model-saving interface keras.Model.save_weights
checkpoints both the network parameters and the optimizer weights. However, when an optimizer is wrapped inside another, the inner optimizer's weights cannot be saved by this means.
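For context, the save-and-resume flow in question looks roughly like this (a minimal sketch with a toy model and a plain Adam optimizer standing in for the real setup; written against the TF 2.x tf.keras API reported above):

```python
import tempfile

import numpy as np
import tensorflow as tf

# Toy model standing in for the real network.
model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# One training step so the optimizer creates its slot variables ("m", "v").
x = np.random.rand(16, 8).astype("float32")
y = np.random.rand(16, 4).astype("float32")
model.fit(x, y, epochs=1, verbose=0)

# TF-format checkpoint: stores the network parameters AND the optimizer weights.
ckpt_prefix = tempfile.mkdtemp() + "/ckpt"
model.save_weights(ckpt_prefix)

# On resume, load_weights should restore both -- but with a wrapped
# optimizer the inner optimizer's weights never made it into the file.
model.load_weights(ckpt_prefix)
```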
For example, when I was trying to use the Ranger optimizer, which is constructed by wrapping RAdam with Lookahead:
```python
optimizer = tfa.optimizers.Lookahead(
    tfa.optimizers.RectifiedAdam()
)
```
I noticed a performance drop when resuming training and found that the weights of the inner RAdam were not saved into the checkpoint. (I checked the .index
file in the checkpoint folder: there are no variable names like "m" and "v", only "slow", which belongs to Lookahead.) Therefore, after loading the weights from file and restarting fitting, the weights of RAdam are randomly reinitialized. This is likely because the weights of the inner optimizer are not automatically tracked.
Experiments
I trained two LeNets on the FashionMNIST dataset. All configurations are the same except for the optimizers. Both training runs were interrupted in the middle and then resumed.
Fig.: TensorBoard curves. Blue: Ranger (Lookahead + RAdam); orange: RAdam.
Note the "bump" in the Ranger curve caused by the reinitialization of the RAdam weights. Apparently, the weights of the inner optimizer are not correctly saved.
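Until the wrapper tracks its inner optimizer automatically, a possible workaround (my assumption, sketched with a plain Adam standing in for the tfa pair so the snippet runs without tensorflow-addons) is to add the inner optimizer to a tf.train.Checkpoint explicitly, so its variables are traversed and saved:

```python
import tempfile

import numpy as np
import tensorflow as tf

# Stand-ins: with tensorflow-addons installed this would be
#   inner = tfa.optimizers.RectifiedAdam()
#   optimizer = tfa.optimizers.Lookahead(inner)
inner = tf.keras.optimizers.Adam()
optimizer = inner

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])
model.compile(optimizer=optimizer, loss="mse")
model.fit(np.zeros((2, 4), "float32"), np.zeros((2, 2), "float32"), verbose=0)

# Track the inner optimizer explicitly, since it is apparently not
# picked up through the wrapper.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, inner_optimizer=inner)
manager = tf.train.CheckpointManager(ckpt, tempfile.mkdtemp(), max_to_keep=1)

save_path = manager.save()               # after some training
ckpt.restore(manager.latest_checkpoint)  # on resume
```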