This file is automatically generated from ../textclf/config/optimizer.py.
OptimizerConfig has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
embedding_lr | float | 1e-3 | learning rate for the embedding layer |
lr | float | 1e-3 | learning rate for the other layers |
static_epoch | int | 0 | the embedding layer is not trained during the first static_epoch epochs; from epoch static_epoch onwards it is trained with embedding_lr as its learning rate |
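The interaction of embedding_lr, lr and static_epoch can be illustrated with PyTorch parameter groups. The sketch below only shows the intended semantics under that assumption; the toy model, the choice of Adam, and the training loop are hypothetical and not taken from textclf's source.

```python
import torch
import torch.nn as nn

# Toy text classifier (hypothetical): an embedding layer plus one "other" layer.
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        return self.fc(self.embedding(tokens).mean(dim=1))

model = TinyClassifier()
embedding_params = list(model.embedding.parameters())
other_params = list(model.fc.parameters())

# Values from the table above; static_epoch raised to 2 so the freezing phase is visible.
embedding_lr, lr, static_epoch = 1e-3, 1e-3, 2

# Two parameter groups so the embedding layer can use its own learning rate.
optimizer = torch.optim.Adam([
    {"params": embedding_params, "lr": embedding_lr},
    {"params": other_params, "lr": lr},
])

for epoch in range(5):
    # The embedding layer is not trained during the first static_epoch epochs,
    # then trained with embedding_lr from epoch static_epoch onwards.
    for p in embedding_params:
        p.requires_grad_(epoch >= static_epoch)
    # ... usual forward / backward / optimizer.step() over batches here ...
```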
Adam algorithm. It has been proposed in Adam: A Method for Stochastic Optimization.
AdamConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
betas | Tuple[float, float] | (0.9, 0.999) | coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)) |
eps | float | 1e-8 | term added to the denominator to improve numerical stability (default: 1e-8) |
weight_decay | float | 0. | weight decay (L2 penalty) (default: 0) |
amsgrad | bool | False | whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond |
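These attributes appear to mirror the parameters of torch.optim.Adam. A minimal sketch of the assumed mapping, using the defaults from the table (the placeholder parameter list is hypothetical; in practice the parameters come from the model being trained):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.Adam(
    params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0, amsgrad=False
)
```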
Adadelta algorithm. It has been proposed in ADADELTA: An Adaptive Learning Rate Method.
AdadeltaConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
rho | float | 0.9 | coefficient used for computing a running average of squared gradients (default: 0.9) |
eps | float | 1e-6 | term added to the denominator to improve numerical stability (default: 1e-6) |
weight_decay | float | 0. | weight decay (L2 penalty) (default: 0) |
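Assuming the same kind of mapping, an AdadeltaConfig with the table defaults would correspond to a torch.optim.Adadelta call like this sketch:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.Adadelta(params, lr=1e-3, rho=0.9, eps=1e-6, weight_decay=0.0)
```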
Adagrad algorithm. It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
AdagradConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
lr_decay | float | 0.0 | learning rate decay (default: 0) |
weight_decay | float | 0.0 | weight decay (L2 penalty) (default: 0) |
eps | float | 1e-10 | term added to the denominator to improve numerical stability (default: 1e-10) |
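Likewise, a sketch of the assumed torch.optim.Adagrad call for the defaults in the table:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.Adagrad(params, lr=1e-3, lr_decay=0.0, weight_decay=0.0, eps=1e-10)
```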
AdamW algorithm. The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.
AdamWConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
betas | Tuple[float, float] | (0.9, 0.999) | coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)) |
eps | float | 1e-8 | term added to the denominator to improve numerical stability (default: 1e-8) |
weight_decay | float | 1e-2 | weight decay coefficient (default: 1e-2) |
amsgrad | bool | False | whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False) |
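A sketch of the assumed torch.optim.AdamW call with the table defaults; note the larger default weight_decay of 1e-2, which AdamW decouples from the gradient update:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.AdamW(
    params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2, amsgrad=False
)
```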
Adamax algorithm (a variant of Adam based on infinity norm). It has been proposed in Adam: A Method for Stochastic Optimization.
AdamaxConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
betas | Tuple[float, float] | (0.9, 0.999) | coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)) |
eps | float | 1e-8 | term added to the denominator to improve numerical stability (default: 1e-8) |
weight_decay | float | 1e-2 | weight decay coefficient (default: 1e-2) |
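A sketch of the assumed torch.optim.Adamax call with the table defaults:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.Adamax(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)
```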
Averaged Stochastic Gradient Descent. It has been proposed in Acceleration of stochastic approximation by averaging.
ASGDConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
lambd | float | 1e-4 | decay term (default: 1e-4) |
alpha | float | 0.75 | power for eta update (default: 0.75) |
t0 | float | 1e6 | point at which to start averaging (default: 1e6) |
weight_decay | float | 0 | weight decay (L2 penalty) (default: 0) |
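A sketch of the assumed torch.optim.ASGD call with the table defaults:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.ASGD(params, lr=1e-3, lambd=1e-4, alpha=0.75, t0=1e6, weight_decay=0.0)
```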
RMSprop algorithm. Proposed by G. Hinton in his course.
RMSpropConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
momentum | float | 0.0 | momentum factor (default: 0) |
alpha | float | 0.99 | smoothing constant (default: 0.99) |
eps | float | 1e-8 | term added to the denominator to improve numerical stability (default: 1e-8) |
centered | bool | False | if True, compute the centered RMSprop, where the gradient is normalized by an estimation of its variance (default: False) |
weight_decay | float | 0. | weight decay (L2 penalty) (default: 0) |
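A sketch of the assumed torch.optim.RMSprop call with the table defaults:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.RMSprop(
    params, lr=1e-3, alpha=0.99, eps=1e-8, weight_decay=0.0, momentum=0.0, centered=False
)
```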
The resilient backpropagation algorithm.
RpropConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
etas | Tuple[float, float] | (0.5, 1.2) | pair of (etaminus, etaplus), the multiplicative decrease and increase factors (default: (0.5, 1.2)) |
step_sizes | Tuple[float, float] | (1e-6, 50) | a pair of minimal and maximal allowed step sizes (default: (1e-6, 50)) |
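A sketch of the assumed torch.optim.Rprop call with the table defaults:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.Rprop(params, lr=1e-3, etas=(0.5, 1.2), step_sizes=(1e-6, 50))
```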
Stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
SGDConfig inherits all attributes of OptimizerConfig and additionally has the following attributes:
Attribute name | Type | Default | Description |
---|---|---|---|
momentum | float | 0. | momentum factor (default: 0) |
weight_decay | float | 0. | weight decay (L2 penalty) (default: 0) |
dampening | float | 0. | dampening for momentum (default: 0) |
nesterov | bool | False | enables Nesterov momentum (default: False) |
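A sketch of the assumed torch.optim.SGD call with the table defaults (note that PyTorch only accepts nesterov=True together with a non-zero momentum and zero dampening):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]  # hypothetical placeholder parameters
optimizer = torch.optim.SGD(
    params, lr=1e-3, momentum=0.0, dampening=0.0, weight_decay=0.0, nesterov=False
)
```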