Keras implementation of AdaBound

Overview

AdaBound for Keras

Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Usage

Add the adabound.py script to your project, and import it. Can be a dropin replacement for Adam Optimizer.

Also supports AMSBound variant of the above, equivalent to AMSGrad from Adam.

from adabound import AdaBound

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)

Results

With a wide ResNet 34 and horizontal flips data augmentation, and 100 epochs of training with batchsize 128, it hits 92.16% (called v1).

Weights are available inside the Releases tab

NOTE

  • The smaller ResNet 20 models have been removed as they did not perform as expected and were depending on a flaw during the initial implementation. The ResNet 32 shows the actual performance of this optimizer.

With a small ResNet 20 and width + height data + horizontal flips data augmentation, and 100 epochs of training with batchsize 1024, it hits 89.5% (called v1).

On a small ResNet 20 with only width and height data augmentations, with batchsize 1024 trained for 100 epochs, the model gets close to 86% on the test set (called v3 below).

Train Set Accuracy

Train Set Loss

Test Set Accuracy

Test Set Loss

Requirements

  • Keras 2.2.4+ & Tensorflow 1.12+ (Only supports TF backend for now).
  • Numpy
Comments
  • suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    suggestion: allow to train x2 or x3 bigger networks on same vram with TF backend

    same as my PR https://github.com/keras-team/keras-contrib/pull/478 works only with TF backend

    class AdaBound(Optimizer):
        """AdaBound optimizer.
        Default parameters follow those provided in the original paper.
        # Arguments
            lr: float >= 0. Learning rate.
            final_lr: float >= 0. Final learning rate.
            beta_1: float, 0 < beta < 1. Generally close to 1.
            beta_2: float, 0 < beta < 1. Generally close to 1.
            gamma: float >= 0. Convergence speed of the bound function.
            epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
            decay: float >= 0. Learning rate decay over each update.
            weight_decay: Weight decay weight.
            amsbound: boolean. Whether to apply the AMSBound variant of this
                algorithm.
            tf_cpu_mode: only for tensorflow backend
                  0 - default, no changes.
                  1 - allows to train x2 bigger network on same VRAM consuming RAM
                  2 - allows to train x3 bigger network on same VRAM consuming RAM*2
                      and CPU power.
        # References
            - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
              (https://openreview.net/forum?id=Bkg3g2R9FX)
            - [Adam - A Method for Stochastic Optimization]
              (https://arxiv.org/abs/1412.6980v8)
            - [On the Convergence of Adam and Beyond]
              (https://openreview.net/forum?id=ryQu7f-RZ)
        """
    
        def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                     epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
            super(AdaBound, self).__init__(**kwargs)
    
            if not 0. <= gamma <= 1.:
                raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")
    
            with K.name_scope(self.__class__.__name__):
                self.iterations = K.variable(0, dtype='int64', name='iterations')
                self.lr = K.variable(lr, name='lr')
                self.beta_1 = K.variable(beta_1, name='beta_1')
                self.beta_2 = K.variable(beta_2, name='beta_2')
                self.decay = K.variable(decay, name='decay')
    
            self.final_lr = final_lr
            self.gamma = gamma
    
            if epsilon is None:
                epsilon = K.epsilon()
            self.epsilon = epsilon
            self.initial_decay = decay
            self.amsbound = amsbound
    
            self.weight_decay = float(weight_decay)
            self.base_lr = float(lr)
            self.tf_cpu_mode = tf_cpu_mode
    
        def get_updates(self, loss, params):
            grads = self.get_gradients(loss, params)
            self.updates = [K.update_add(self.iterations, 1)]
    
            lr = self.lr
            if self.initial_decay > 0:
                lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                          K.dtype(self.decay))))
    
            t = K.cast(self.iterations, K.floatx()) + 1
    
            # Applies bounds on actual learning rate
            step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                              (1. - K.pow(self.beta_1, t)))
    
            final_lr = self.final_lr * lr / self.base_lr
            lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
            upper_bound = final_lr * (1. + 1. / (self.gamma * t))
    
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
            if e: e.__enter__()
            ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            if self.amsbound:
                vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
            else:
                vhats = [K.zeros(1) for _ in params]
            if e: e.__exit__(None, None, None)
            
            self.weights = [self.iterations] + ms + vs + vhats
    
            for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
                # apply weight decay
                if self.weight_decay != 0.:
                    g += self.weight_decay * K.stop_gradient(p)
    
                e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
                if e: e.__enter__()                    
                m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
                v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
                if self.amsbound:
                    vhat_t = K.maximum(vhat, v_t)
                    self.updates.append(K.update(vhat, vhat_t))
                if e: e.__exit__(None, None, None)
                
                if self.amsbound:
                    denom = (K.sqrt(vhat_t) + self.epsilon)
                else:
                    denom = (K.sqrt(v_t) + self.epsilon)                        
    
                # Compute the bounds
                step_size_p = step_size * K.ones_like(denom)
                step_size_p_bound = step_size_p / denom
                bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                         lower_bound), upper_bound)
    
                p_t = p - bounded_lr_t
    
                self.updates.append(K.update(m, m_t))
                self.updates.append(K.update(v, v_t))
                new_p = p_t
    
                # Apply constraints.
                if getattr(p, 'constraint', None) is not None:
                    new_p = p.constraint(new_p)
    
                self.updates.append(K.update(p, new_p))
            return self.updates
    
        def get_config(self):
            config = {'lr': float(K.get_value(self.lr)),
                      'final_lr': float(self.final_lr),
                      'beta_1': float(K.get_value(self.beta_1)),
                      'beta_2': float(K.get_value(self.beta_2)),
                      'gamma': float(self.gamma),
                      'decay': float(K.get_value(self.decay)),
                      'epsilon': self.epsilon,
                      'weight_decay': self.weight_decay,
                      'amsbound': self.amsbound}
            base_config = super(AdaBound, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    
    opened by iperov 13
  • AdaBound.iterations

    AdaBound.iterations

    this param is not saved.

    I looked at official pytorch implementation from original paper. https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

    it has

    # State initialization
    if len(state) == 0:
        state['step'] = 0
    

    state is saved with the optimizer.

    also it has

    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
    

    these values should also be saved

    So your keras implementation is wrong.

    opened by iperov 10
  • Using SGDM with lr=0.1 leads to not learning

    Using SGDM with lr=0.1 leads to not learning

    Thanks for sharing your keras version of adabound and I found that when changing optimizer from adabound to SGDM (lr=0.1), the resnet doesn't learn at all like the fig below. image

    I remember that in the original paper it uses SGDM (lr=0.1) for comparisons and I'm wondering how this could be.

    opened by syorami 10
  • clip by value

    clip by value

    https://github.com/CyberZHG/keras-adabound/blob/master/keras_adabound/optimizers.py

    K.minimum(K.maximum(step, lower_bound), upper_bound)

    will not work?

    opened by iperov 2
  • Unexpected keyword argument passed to optimizer: amsbound

    Unexpected keyword argument passed to optimizer: amsbound

    I installed with pip install keras-adabound imported with: from keras_adabound import AdaBound and declared the optimizer as: opt = AdaBound(lr=1e-03,final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Then, I'm getting the error: TypeError: Unexpected keyword argument passed to optimizer: amsbound

    changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error: TypeError: __init__() missing 1 required positional argument: 'params'

    Am I mixing something up here or missing something?

    opened by stabilus 0
  • Unclear how to import and use tf.keras version

    Unclear how to import and use tf.keras version

    I have downloaded the files and placed them in a folder in the site packages for my virtual environment but I can't get this to work. I have added the folder path to sys.path and verified it is listed. I'm running Tensorflow 2.1.0. What am I doing wrong?

    opened by mnweaver1 0
  • about lr

    about lr

    Thanks for a good optimizer According to usage optm = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False) Does the learning rate gradually increase by the number of steps?


    final lr is described as Final learning rate. but it actually is leaning rate relative to base lr and current klearning rate? https://github.com/titu1994/keras-adabound/blob/5ce819b6ca1cd95e32d62e268bd2e0c99c069fe8/adabound.py#L72

    opened by tanakataiki 1
Releases(0.1)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general
Somshubra Majumdar
なりすまし検出(anti-spoof-mn3)のWebカメラ向けデモ

FaceDetection-Anti-Spoof-Demo なりすまし検出(anti-spoof-mn3)のWebカメラ向けデモです。 モデルはPINTO_model_zoo/191_anti-spoof-mn3からONNX形式のモデルを使用しています。 Requirement mediapipe

KazuhitoTakahashi 8 Nov 18, 2022
MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks

MEAL-V2 This is the official pytorch implementation of our paper: "MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tric

Zhiqiang Shen 653 Dec 19, 2022
A Python library for Deep Graph Networks

PyDGN Wiki Description This is a Python library to easily experiment with Deep Graph Networks (DGNs). It provides automatic management of data splitti

Federico Errica 194 Dec 22, 2022
Unified learning approach for egocentric hand gesture recognition and fingertip detection

Unified Gesture Recognition and Fingertip Detection A unified convolutional neural network (CNN) algorithm for both hand gesture recognition and finge

Mohammad 227 Dec 25, 2022
StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking

StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking Datasets You can download datasets that have been pre-pr

25 May 29, 2022
This is code to fit per-pixel environment map with spherical Gaussian lobes, using LBFGS optimization

Spherical Gaussian Optimization This is code to fit per-pixel environment map with spherical Gaussian lobes, using LBFGS optimization. This code has b

41 Dec 14, 2022
Accelerated deep learning R&D

Accelerated deep learning R&D PyTorch framework for Deep Learning research and development. It focuses on reproducibility, rapid experimentation, and

Catalyst-Team 3.1k Jan 06, 2023
Anonymous implementation of KSL

k-Step Latent (KSL) Implementation of k-Step Latent (KSL) in PyTorch. Representation Learning for Data-Efficient Reinforcement Learning [Paper] Code i

1 Nov 10, 2021
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Generating Digital Painting Lighting Effects via RGB-space Geometry (SIGGRAPH2020/TOG2020)

Project PaintingLight PaintingLight is a project conducted by the Style2Paints team, aimed at finding a method to manipulate the illumination in digit

651 Dec 29, 2022
Hierarchical User Intent Graph Network for Multimedia Recommendation

Hierarchical User Intent Graph Network for Multimedia Recommendation This is our Pytorch implementation for the paper: Hierarchical User Intent Graph

6 Jan 05, 2023
A New Approach to Overgenerating and Scoring Abstractive Summaries

We provide the source code for the paper "A New Approach to Overgenerating and Scoring Abstractive Summaries" accepted at NAACL'21. If you find the code useful, please cite the following paper.

Kaiqiang Song 4 Apr 03, 2022
Novel and high-performance medical image classification pipelines are heavily utilizing ensemble learning strategies

An Analysis on Ensemble Learning optimized Medical Image Classification with Deep Convolutional Neural Networks Novel and high-performance medical ima

14 Dec 18, 2022
In this work, we will implement some basic but important algorithm of machine learning step by step.

WoRkS continued English 中文 Français Probability Density Estimation-Non-Parametric Methods(概率密度估计-非参数方法) 1. Kernel / k-Nearest Neighborhood Density Est

liziyu0104 1 Dec 30, 2021
Code for Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games How to run our algorithm? Create the new environment using: conda

MARL @ SJTU 8 Dec 27, 2022
TensorFlow implementation of Deep Reinforcement Learning papers

Deep Reinforcement Learning in TensorFlow TensorFlow implementation of Deep Reinforcement Learning papers. This implementation contains: [1] Playing A

Taehoon Kim 1.6k Jan 03, 2023
Image-Stitching - Panorama composition using SIFT Features and a custom implementaion of RANSAC algorithm

About The Project Panorama composition using SIFT Features and a custom implementaion of RANSAC algorithm (Random Sample Consensus). Author: Andreas P

Andreas Panayiotou 3 Jan 03, 2023
Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World

Legged Robots that Keep on Learning Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World, whic

Laura Smith 70 Dec 07, 2022
Understanding the Generalization Benefit of Model Invariance from a Data Perspective

Understanding the Generalization Benefit of Model Invariance from a Data Perspective This is the code for our NeurIPS2021 paper "Understanding the Gen

1 Jan 15, 2022
Synthetic LiDAR sequential point cloud dataset with point-wise annotations

SynLiDAR dataset: Learning From Synthetic LiDAR Sequential Point Cloud This is official repository of the SynLiDAR dataset. For technical details, ple

78 Dec 27, 2022