AdamW optimizer and cosine learning rate annealing with restarts

Last update: Dec 20, 2022

Overview

This repository contains an implementation of the AdamW optimization algorithm and the cosine learning rate scheduler described in "Decoupled Weight Decay Regularization". The AdamW implementation is straightforward and differs little from the existing Adam implementation for PyTorch, except that it decouples weight decay from the gradient-based update. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period. Unlike the epoch-based schedulers in the standard PyTorch scheduler suite, this scheduler adjusts the optimizer's learning rate on every batch update rather than on every epoch, as the paper prescribes.
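To make the decoupling concrete, here is a minimal sketch of a single AdamW update for one scalar parameter, written in plain Python after the update rule in the paper. It is not the code from this repository: the function name `adamw_step`, its argument names and the default values are assumptions chosen for illustration. Note that in the paper's formulation the decay term is scaled by the schedule multiplier but not by `lr`.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2, schedule_multiplier=1.0):
    """One illustrative AdamW update for a single scalar parameter.

    t is the 1-based step count; m and v are the running moment estimates.
    """
    beta1, beta2 = betas
    # The moment estimates see only the raw gradient: weight decay is
    # NOT folded into grad, which is what "decoupled" refers to.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    # The decay acts directly on the parameter and is scaled by the
    # schedule multiplier together with the Adam update.
    decay = weight_decay * param
    param = param - schedule_multiplier * (adam_update + decay)
    return param, m, v


# Usage: a zero gradient still shrinks the parameter through the decay term.
p, m, v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1, weight_decay=0.1)
```

With plain L2 regularization the term `weight_decay * param` would instead be added to `grad` and then rescaled by the adaptive denominator; keeping it outside is the only structural difference from Adam in this sketch.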

Cyclical Learning Rates

Besides the "cosine" and "arccosine" policies (arccosine has a steeper profile at the limiting points), there are "triangular", "triangular2" and "exp_range" policies, which implement the schedules proposed in "Cyclical Learning Rates for Training Neural Networks". The ratio of the increasing and decreasing phases of the triangular policy can be adjusted with the triangular_step parameter. The minimum allowed lr is set by the min_lr parameter.

The triangular schedule is enabled by passing policy="triangular".
The triangular2 schedule halves the maximum lr on each restart cycle. It is enabled by passing policy="triangular2", or by combining policy="triangular" with eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5). The ratio parameter sets the factor by which the maximum lr is scaled on each restart.
The exp_range schedule is enabled by passing policy="exp_range". It scales the maximum lr exponentially with the iteration count. The base of the exponentiation is set by the gamma parameter.
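The three policies can be sketched with the formulas from "Cyclical Learning Rates for Training Neural Networks". This is a standalone illustration of the paper's symmetric triangle, not the scheduler in this repository: the function `cyclical_lr` and the names `step_size`, `base_lr` and `max_lr` are assumptions, and the triangular_step and min_lr parameters described above are not modelled.

```python
import math


def cyclical_lr(iteration, step_size, base_lr, max_lr,
                policy="triangular", gamma=1.0):
    """Learning rate at a given iteration under one of the three policies.

    step_size is the number of iterations in half a cycle (illustrative name).
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # 0 at the ends of a cycle, 1 at its midpoint.
    triangle = max(0.0, 1.0 - x)
    if policy == "triangular":
        scale = 1.0
    elif policy == "triangular2":
        # Halve the amplitude on every new cycle.
        scale = 1.0 / (2 ** (cycle - 1))
    elif policy == "exp_range":
        # Shrink the amplitude exponentially with the iteration count.
        scale = gamma ** iteration
    else:
        raise ValueError(f"unknown policy: {policy}")
    return base_lr + (max_lr - base_lr) * triangle * scale


# Usage: peak of the second cycle under each policy (step_size=10, so iteration 30).
peaks = {
    name: cyclical_lr(30, 10, base_lr=1e-4, max_lr=1e-2, policy=name, gamma=0.99)
    for name in ("triangular", "triangular2", "exp_range")
}
```

In this form triangular2 and exp_range differ from triangular only in the `scale` factor applied to the amplitude `max_lr - base_lr`.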

These schedules can be combined with shrinking or expanding restart periods and with weight decay normalization, and can be used with AdamW and other PyTorch optimizers.
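As a rough illustration of how restart periods and weight decay normalization interact, the sketch below computes a cosine multiplier for every batch update and the normalized weight decay from "Decoupled Weight Decay Regularization", which scales the decay by sqrt(b / (B * T)) for batch size b, B training samples per epoch and a period of T epochs. All function names are hypothetical, and rounding the expanded period up to whole epochs is an assumption of this sketch; the repository's scheduler may handle fractional periods differently.

```python
import math


def cosine_multiplier(t_cur, period):
    """Cosine annealing multiplier: 1 right after a restart, near 0 at the period's end.

    t_cur is the (fractional) number of epochs since the last restart.
    """
    return 0.5 * (1.0 + math.cos(math.pi * t_cur / period))


def normalized_weight_decay(weight_decay_norm, batch_size, epoch_size, period):
    """Scale the normalized decay by sqrt(b / (B * T)), following the paper."""
    return weight_decay_norm * math.sqrt(batch_size / (epoch_size * period))


def per_batch_multipliers(batch_size, epoch_size, restart_period, t_mult, epochs):
    """Return one (epoch, multiplier) pair per batch update."""
    batches_per_epoch = math.ceil(epoch_size / batch_size)
    period = restart_period
    epochs_since_restart = 0
    out = []
    for epoch in range(epochs):
        if epochs_since_restart >= period:
            # Restart: jump back to the maximum and stretch the next period.
            # Rounding up to whole epochs is an assumption of this sketch.
            epochs_since_restart = 0
            period = math.ceil(period * t_mult)
        for b in range(batches_per_epoch):
            t_cur = epochs_since_restart + b / batches_per_epoch
            out.append((epoch, cosine_multiplier(t_cur, period)))
        epochs_since_restart += 1
    return out


# Usage with the same settings as the example in this README.
schedule = per_batch_multipliers(batch_size=32, epoch_size=1024,
                                 restart_period=5, t_mult=1.2, epochs=12)
decay = normalized_weight_decay(1e-5, batch_size=32, epoch_size=1024, period=5)
```

A longer period yields a smaller effective decay per update, so the normalized value would need to be recomputed whenever the restart period changes.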

Example:

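    batch_size = 32
    epoch_size = 1024  # number of training samples per epoch
    model = resnet()
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size, restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(100):
        scheduler.step()  # once per epoch, before the batch loop
        for batch in train_loader:  # train_loader is a placeholder for your batch iterator
            ...  # forward pass and loss computation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()  # once per batch, after the optimizer update
        validate(...)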

Owner

Maksym Pyrozhok
