
L2 Decay for Optimizers #269

Open
cangumeli wants to merge 2 commits into denizyuret:master from cangumeli:wdecay

Conversation

@cangumeli
Collaborator

I'm adding L2 weight decay to Knet optimizers. To discuss the interface, I started with Adam.
L2 decay is used in the model I'm currently replicating, so I'll be using it.

@denizyuret
Owner

denizyuret commented Feb 13, 2018 via email

@cangumeli
Collaborator Author

L2 decay is part of the optimizers in many frameworks, like MXNet and PyTorch (they simply call it weight decay).

Adding an L2 penalty to the objective function means incrementing each gradient by decay_rate * w. Doing this addition inside the optimizers will save us the overhead of squaring and reducing the weights to compute the penalty term.

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@Evizero

Evizero commented Feb 14, 2018

One side note worth mentioning: this looks like it applies weight decay to all learned parameters equally, including the bias terms.

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@cangumeli
Collaborator Author

Technically, there is an optimizer for each parameter, so the user will be able to modify the weight decay for each parameter. On the other hand, when optimizers is called with a decay option, we will be decaying all parameters, including biases. There are studies that decay all parameters, but making this the default behaviour might be misleading.

Also, we will be reporting losses with the weight decay penalty excluded. I'm not sure whether this is a feature or a bug.

I think we may consider weight decay in optimizers as a performance trick to be used by people who know what they are doing.
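The one-optimizer-per-parameter pattern above could look roughly like this. A minimal sketch, assuming a hypothetical `Sgd` type with a `decay` field (the names are illustrative, not Knet's actual interface): because each parameter carries its own optimizer, the bias can simply get `decay=0.0`.

```python
import numpy as np

class Sgd:
    """Hypothetical per-parameter optimizer with an L2 decay field."""
    def __init__(self, lr=0.1, decay=0.0):
        self.lr, self.decay = lr, decay

    def update(self, w, g):
        # decay*w is folded into the gradient, as discussed above
        return w - self.lr * (g + self.decay * w)

rng = np.random.default_rng(1)
params = {"weight": rng.standard_normal((3, 2)), "bias": np.zeros(2)}
opts = {"weight": Sgd(decay=0.01),   # decay the weight matrix
        "bias":   Sgd(decay=0.0)}    # leave the bias undecayed

grads = {k: np.ones_like(v) for k, v in params.items()}
params = {k: opts[k].update(v, grads[k]) for k, v in params.items()}
```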

@Evizero

Evizero commented Jul 14, 2018

A very relevant paper on this topic: https://arxiv.org/abs/1711.05101

In particular, it argues that L2 regularization and weight decay are not identical for Adam, and furthermore that L2 regularization is not effective in Adam.
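The distinction the paper draws can be illustrated with a simplified Adam step (a sketch with illustrative names, not Knet code or the paper's exact schedule): when decay*w is added to the gradient, Adam's adaptive denominator rescales it along with everything else, whereas decoupled weight decay (the paper's "AdamW") subtracts the decay from the weights outside the adaptive update, so the two rules generally disagree.

```python
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam update on fresh moment buffers."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

w0 = np.ones(3)
g = np.full(3, 0.5)
decay, lr = 0.1, 1e-3
m = v = np.zeros(3)

# L2 regularization: decay*w enters the gradient and gets divided by
# Adam's adaptive denominator along with the data-loss gradient.
w_l2, _, _ = adam_step(w0, g + decay * w0, m, v, lr=lr)

# Decoupled weight decay: the decay acts on the weights directly,
# outside the adaptive rescaling.
w_adam, _, _ = adam_step(w0, g, m, v, lr=lr)
w_decoupled = w_adam - lr * decay * w0

# The two updates differ, even from identical starting points.
assert not np.allclose(w_l2, w_decoupled)
```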

@denizyuret denizyuret self-assigned this Jan 7, 2019
