
L2 Decay for Optimizers #269

Open
cangumeli wants to merge 2 commits into denizyuret:master from cangumeli:wdecay

Conversation

@cangumeli
Collaborator

I'm adding L2 weight decay to Knet optimizers. To discuss the interface, I started with Adam.
L2 decay is used in the model I'm currently replicating, so I'll be using it.

@denizyuret
Owner

denizyuret commented Feb 13, 2018 via email

@cangumeli
Collaborator Author

L2 decay is part of the optimizers in many frameworks, like MXNet and PyTorch (they simply call it weight decay).

Adding an L2 penalty to the objective function means incrementing each gradient by decay_rate * w. Doing this addition inside the optimizers will save us the overhead of squaring and reducing the weights to compute the penalty term.

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@Evizero

Evizero commented Feb 14, 2018

One side note worth mentioning: this looks like it applies weight decay to all learned parameters equally, including the bias terms.

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@cangumeli
Collaborator Author

Technically, there is an optimizer for each parameter, so the user will be able to modify the weight decay for each parameter. On the other hand, when optimizers is called with a decay option, we will be decaying all parameters, including biases. There are studies that decay all parameters, but making this the default behaviour might be misleading.

Also, we will be reporting losses with the weight decay penalty excluded. I'm not sure whether this is a feature or a bug.

I think we may consider weight decay in optimizers as a performance trick to be used by people who know what they are doing.
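The one-optimizer-per-parameter pattern above could look roughly like this. A minimal sketch, assuming a hypothetical `Sgd` type with a `decay` field (the names are illustrative, not Knet's actual interface): because each parameter carries its own optimizer, the bias can simply get `decay=0.0`.

```python
import numpy as np

class Sgd:
    """Hypothetical per-parameter optimizer with an L2 decay field."""
    def __init__(self, lr=0.1, decay=0.0):
        self.lr, self.decay = lr, decay

    def update(self, w, g):
        # decay*w is folded into the gradient, as discussed above
        return w - self.lr * (g + self.decay * w)

rng = np.random.default_rng(1)
params = {"weight": rng.standard_normal((3, 2)), "bias": np.zeros(2)}
opts = {"weight": Sgd(decay=0.01),   # decay the weight matrix
        "bias":   Sgd(decay=0.0)}    # leave the bias undecayed

grads = {k: np.ones_like(v) for k, v in params.items()}
params = {k: opts[k].update(v, grads[k]) for k, v in params.items()}
```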

@Evizero

Evizero commented Jul 14, 2018

A very relevant paper on this topic: https://arxiv.org/abs/1711.05101

In particular, it argues that L2 regularization and weight decay are not identical for Adam, and furthermore that L2 regularization is not effective in Adam.
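The distinction the paper draws can be illustrated with a simplified Adam step (a sketch with illustrative names, not Knet code or the paper's exact schedule): when decay*w is added to the gradient, Adam's adaptive denominator rescales it along with everything else, whereas decoupled weight decay (the paper's "AdamW") subtracts the decay from the weights outside the adaptive update, so the two rules generally disagree.

```python
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam update on fresh moment buffers."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

w0 = np.ones(3)
g = np.full(3, 0.5)
decay, lr = 0.1, 1e-3
m = v = np.zeros(3)

# L2 regularization: decay*w enters the gradient and gets divided by
# Adam's adaptive denominator along with the data-loss gradient.
w_l2, _, _ = adam_step(w0, g + decay * w0, m, v, lr=lr)

# Decoupled weight decay: the decay acts on the weights directly,
# outside the adaptive rescaling.
w_adam, _, _ = adam_step(w0, g, m, v, lr=lr)
w_decoupled = w_adam - lr * decay * w0

# The two updates differ, even from identical starting points.
assert not np.allclose(w_l2, w_decoupled)
```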

@denizyuret denizyuret self-assigned this Jan 7, 2019
