Custom implementation of Correlation Module

Last update: Dec 12, 2022

Overview

Pytorch Correlation module

This is a custom C++/CUDA implementation of the Correlation module, used e.g. in FlowNetC.

The implementation is based on this tutorial, as well as on NVIDIA's CUDA code.

Build and Install C++ and CUDA extensions by executing python setup.py install,
Benchmark C++ vs. CUDA by running python benchmark.py {cpu, cuda},
Run gradient checks on the code by running python grad_check.py --backend {cpu, cuda}.

Requirements

This module is expected to compile for Pytorch 1.6.

Installation

This module is available on pip:

pip install spatial-correlation-sampler

For a cpu-only version, you can install from source with

python setup_cpu.py install

Known Problems

This module needs compatible versions of gcc and CUDA to compile. Namely, CUDA 9.1 and below need gcc5, while CUDA 9.2 and 10.0 need gcc7. See this issue for more information.

Usage

The API has a few differences from NVIDIA's module:

The output is now a 5D tensor, which reflects the horizontal and vertical shifts.

input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW)

Output sizes oH and oW no longer depend on the patch size, but only on kernel size and padding.
Patch size patch_size is now the whole patch, and not only the radii.
stride1 is now stride and stride2 is dilation_patch, which behaves like dilation in dilated convolutions.
The equivalent max_displacement is then dilation_patch * (patch_size - 1) / 2.
dilation is a new parameter; it acts on the correlation kernel the same way as in a dilated convolution.
To get the right parameters for FlowNetC, you would have:

kernel_size=1,
patch_size=21,
stride=1,
padding=0,
dilation=1,
dilation_patch=2
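The mapping from NVIDIA-style arguments to this module's arguments, as described in the list above, can be sketched as follows. This is a hypothetical helper, not part of the package: the names to_sampler_params and equivalent_max_displacement are made up for illustration, and it assumes that max_displacement is a multiple of stride2 and that padding and dilation keep the FlowNetC values shown above.

```python
def to_sampler_params(max_displacement, stride1, stride2, kernel_size=1):
    """Translate NVIDIA-style arguments into this module's keyword arguments.

    Assumption: the patch covers shifts from -max_displacement to
    +max_displacement in steps of stride2.
    """
    if max_displacement % stride2 != 0:
        raise ValueError("max_displacement must be a multiple of stride2")
    return {
        "kernel_size": kernel_size,
        # patch_size is the whole patch, not only the radius
        "patch_size": 2 * (max_displacement // stride2) + 1,
        "stride": stride1,
        "padding": 0,
        "dilation": 1,
        "dilation_patch": stride2,
    }


def equivalent_max_displacement(patch_size, dilation_patch):
    """Inverse direction: dilation_patch * (patch_size - 1) / 2."""
    return dilation_patch * (patch_size - 1) // 2


# FlowNetC: patch_size=21 and dilation_patch=2 correspond to a max displacement of 20.
params = to_sampler_params(max_displacement=20, stride1=1, stride2=2)
print(params["patch_size"], params["dilation_patch"])  # 21 2
print(equivalent_max_displacement(21, 2))              # 20
```

The returned dictionary can be passed as keyword arguments to SpatialCorrelationSampler.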

Example

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

#You can either use the function or the module. Note that the module doesn't contain any parameter tensor.

#function

out = spatial_correlation_sample(input1,
                                 input2,
                                 kernel_size=3,
                                 patch_size=1,
                                 stride=2,
                                 padding=0,
                                 dilation=2,
                                 dilation_patch=1)

#module

correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
out = correlation_sampler(input1, input2)
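To make the meaning of each parameter concrete, here is a slow pure-Python reference for a single sample. It is only a sketch of the operation, not the extension's code, and it rests on assumptions: the output is an unnormalised sum over channels and kernel positions, the patch shift is applied to the second input, out-of-bounds reads count as zero, and output sizes follow the usual dilated-convolution formula. The name naive_correlation is hypothetical.

```python
def naive_correlation(in1, in2, kernel_size, patch_size, stride, padding,
                      dilation, dilation_patch):
    """in1, in2: C x H x W nested lists. Returns patch_size x patch_size x oH x oW."""
    C, H, W = len(in1), len(in1[0]), len(in1[0][0])
    dilated_k = (kernel_size - 1) * dilation + 1
    oH = (H + 2 * padding - dilated_k) // stride + 1
    oW = (W + 2 * padding - dilated_k) // stride + 1
    radius = (patch_size - 1) // 2

    def at(tensor, c, row, col):
        # Out-of-bounds positions contribute zero (implicit zero padding).
        return tensor[c][row][col] if 0 <= row < H and 0 <= col < W else 0.0

    out = [[[[0.0] * oW for _ in range(oH)]
            for _ in range(patch_size)] for _ in range(patch_size)]
    for ph in range(patch_size):
        for pw in range(patch_size):
            # Displacement of the second input for this patch position.
            shift_h = (ph - radius) * dilation_patch
            shift_w = (pw - radius) * dilation_patch
            for h in range(oH):
                for w in range(oW):
                    acc = 0.0
                    for c in range(C):
                        for i in range(kernel_size):
                            for j in range(kernel_size):
                                row = h * stride - padding + i * dilation
                                col = w * stride - padding + j * dilation
                                acc += (at(in1, c, row, col)
                                        * at(in2, c, row + shift_h, col + shift_w))
                    out[ph][pw][h][w] = acc
    return out


# Same parameters as the example above, on a 1 x 10 x 10 input of ones.
ones = [[[1.0] * 10 for _ in range(10)]]
ref = naive_correlation(ones, ones, kernel_size=3, patch_size=1, stride=2,
                        padding=0, dilation=2, dilation_patch=1)
print(len(ref), len(ref[0]), len(ref[0][0]), len(ref[0][0][0]))  # 1 1 3 3
```

Under these assumptions, the example parameters give PatchH = PatchW = 1 and oH = oW = 3 for a 10 x 10 input, so out would have shape (1, 1, 1, 3, 3) once the batch dimension is included.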

Benchmark

Default parameters are from benchmark.py. FlowNetC parameters are the same as those used in FlowNetC with a batch size of 4, described in this paper, implemented here and here.
Feel free to file an issue to add entries for your hardware!

CUDA Benchmark

See here for a benchmark script working with NVIDIA's code, and Pytorch.
Benchmarks are launched with the environment variable CUDA_LAUNCH_BLOCKING set to 1.
Only float32 is benchmarked.
FlowNetC correlation parameters were launched with the following commands:

CUDA_LAUNCH_BLOCKING=1 python benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 cuda -d float

CUDA_LAUNCH_BLOCKING=1 python NV_correlation_benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256
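The min time and avg time columns reported in this section can be obtained with a timing loop of the following kind. This is a minimal sketch and not the content of benchmark.py; the helper name time_calls and its arguments are assumptions made for illustration, and the timed callable here is a placeholder workload.

```python
import time


def time_calls(fn, repeats=10, scale=1e3):
    """Call fn() `repeats` times; return (min, avg) wall-clock time.

    With scale=1e3 the result is in milliseconds.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * scale)
    return min(durations), sum(durations) / len(durations)


# Placeholder workload; replace with a forward or backward pass.
fastest, average = time_calls(lambda: sum(i * i for i in range(10_000)))
print(f"min {fastest:.3f} ms | avg {average:.3f} ms")
```

CUDA kernels are launched asynchronously, so a wall-clock loop like this only measures GPU work if launches are made blocking (CUDA_LAUNCH_BLOCKING=1, as in the commands above) or if torch.cuda.synchronize() is called inside fn.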

implementation	Correlation parameters	device	pass	min time	avg time
ours	default	980 GTX	forward	5.745 ms	5.851 ms
ours	default	980 GTX	backward	77.694 ms	77.957 ms
NVIDIA	default	980 GTX	forward	13.779 ms	13.853 ms
NVIDIA	default	980 GTX	backward	73.383 ms	73.708 ms

ours	FlowNetC	980 GTX	forward	26.102 ms	26.179 ms
ours	FlowNetC	980 GTX	backward	208.091 ms	208.510 ms
NVIDIA	FlowNetC	980 GTX	forward	35.363 ms	35.550 ms
NVIDIA	FlowNetC	980 GTX	backward	283.748 ms	284.346 ms

Notes

The overhead of our implementation during the backward pass when kernel_size > 1 needs some investigation; feel free to dive into the code to improve it!
The backward pass of NVIDIA is not entirely correct when stride1 > 1 and kernel_size > 1, because not everything is computed, see here.

CPU Benchmark

No other implementation is available on CPU.
It is obviously not recommended to run it on CPU if you have a GPU.

Correlation parameters	device	pass	min time	avg time
default	E5-2630 v3 @ 2.40GHz	forward	159.616 ms	188.727 ms
default	E5-2630 v3 @ 2.40GHz	backward	282.641 ms	294.194 ms
FlowNetC	E5-2630 v3 @ 2.40GHz	forward	2.138 s	2.144 s
FlowNetC	E5-2630 v3 @ 2.40GHz	backward	7.006 s	7.075 s


Owner

Clément Pinard
