Dense Prediction Transformers

Vision Transformers for Dense Prediction

This repository contains code and models for our paper:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Changelog

  • [March 2021] Initial release of inference code and models

Setup

  1. Download the model weights and place them in the weights folder:

Monodepth:

Segmentation:

  2. Set up dependencies:

    pip install -r requirements.txt

    The code was tested with Python 3.7, PyTorch 1.8.0, OpenCV 4.5.1, and timm 0.4.5

Usage

  1. Place one or more input images in the folder input.

  2. Run a monocular depth estimation model:

    python run_monodepth.py

    Or run a semantic segmentation model:

    python run_segmentation.py
  3. The results are written to the folders output_monodepth and output_semseg, respectively.

Use the flag -t to switch between different models. Possible options are dpt_hybrid (default) and dpt_large.

Additional models:

Run with

python run_monodepth.py -t [dpt_hybrid_kitti|dpt_hybrid_nyu]

Evaluation

Hints on how to evaluate monodepth models can be found here: https://github.com/intel-isl/DPT/blob/main/EVALUATION.md

Citation

Please cite our papers if you use this code or any of the models.

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ArXiv preprint},
	year      = {2021},
}
@article{Ranftl2020,
	author    = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
	title     = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer},
	journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
	year      = {2020},
}

Acknowledgements

Our work builds on and uses code from timm and PyTorch-Encoding. We'd like to thank the authors for making these libraries available.

License

MIT License

Comments
  • Can dpt models be traced?

    I tried to trace "dpt_hybrid_midas" by calling

    torch.jit.trace(model, example_input)

    However, it failed with the error messages below. Any pointers on how to do it properly?

    /usr/local/lib/python3.9/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)
    /mnt/data/git/DPT/dpt/vit.py:154: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
      gs_old = int(math.sqrt(len(posemb_grid)))
    /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py:3609: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
      warnings.warn(
    Traceback (most recent call last):
      File "/mnt/data/git/DPT/export_model.py", line 112, in <module>
        convert(in_model_path, out_model_path)
      File "/mnt/data/git/DPT/export_model.py", line 64, in convert
        sm = torch.jit.trace(model, example_input)
      File "/usr/local/lib/python3.9/dist-packages/torch/jit/_trace.py", line 735, in trace
        return trace_module(
      File "/usr/local/lib/python3.9/dist-packages/torch/jit/_trace.py", line 952, in trace_module
        module._c._create_method_from_trace(
      File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "/mnt/data/git/DPT/dpt/models.py", line 115, in forward
        inv_depth = super().forward(x).squeeze(dim=1)
      File "/mnt/data/git/DPT/dpt/models.py", line 72, in forward
        layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x)
      File "/mnt/data/git/DPT/dpt/vit.py", line 120, in forward_vit
        nn.Unflatten(
      File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/flatten.py", line 102, in __init__
        self._require_tuple_int(unflattened_size)
      File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/flatten.py", line 125, in _require_tuple_int
        raise TypeError("unflattened_size must be tuple of ints, " +
    TypeError: unflattened_size must be tuple of ints, but found element of type Tensor at pos 0

    opened by 3togo 18
  • Results on Kitti dataset are not reproducible

    Hi! Thanks again for your work!

    Recently, I tried to check the accuracy of the pre-trained models on KITTI (Eigen split) and found that it differs from the results in the paper.

    [Screenshot 2021-06-08 at 13 08 50]

    In this screenshot you can see the basic metrics used in depth prediction on the Eigen split (I took the split files from this repo). For ground truth I used raw Velodyne data (with a loader like this).

    I hope you can explain these results. Thanks!

    opened by RuslanOm 18
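
For reference when comparing numbers like these, the metrics usually reported for depth prediction can be sketched as below. This is a minimal pure-Python sketch, assuming the common definitions of absolute relative error, RMSE, and threshold accuracy (δ < 1.25), and assuming that ground-truth pixels outside a validity range are skipped, as with sparse LiDAR ground truth. The name `depth_metrics` and the default range are illustrative assumptions, not taken from this repository's evaluation code.

```python
import math


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute AbsRel, RMSE and delta < 1.25 over valid ground-truth pixels.

    pred, gt: flat sequences of depths in the same unit (assumed meters).
    min_depth, max_depth: assumed validity range; ground-truth values outside
    it (e.g. 0 for pixels without a LiDAR return) are ignored.
    """
    pairs = []
    for p, g in zip(pred, gt):
        if min_depth < g < max_depth:
            # Clamp the prediction so the ratio below is always defined.
            pairs.append((min(max(p, min_depth), max_depth), g))
    if not pairs:
        raise ValueError("no valid ground-truth pixels")

    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Toy example: the last ground-truth pixel is 0 and is therefore ignored.
metrics = depth_metrics([10.0, 22.0, 5.0, 7.0], [10.0, 20.0, 8.0, 0.0])
print(metrics)
```

Differences in the validity range, cropping, or how sparse ground truth is masked can change these numbers noticeably, so they are worth checking first when results do not match a paper.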
  • Unit of the Absolute Depth

    First, thank you very much for your awesome work. I am just wondering: what is the unit of the predicted absolute depth? Is it in meters? Thanks!

    opened by mohanhanmo 8
  • How to make depth estimation consistent for a sequence of images?

    Thanks for the great work. I would like to get some ideas from you.

    Here is my setup: a robot moves forward toward a chair. I capture one frame every half second, then estimate the depth between the robot and the chair with the provided DPT NYU model. The depth is supposed to become smaller and smaller, but the actual result does not behave like that.

    For the sample images, it works as shown below. The values are absolute depth in meters.

    I can think of three reasons why something might go wrong:

    1. Even people can hardly see the difference between images 1 and 2, so how would the computer know?
    2. Thanks @nicolasugrinovic: the output is an absolute value in meters, as shown above, so my earlier point no longer makes sense: "It is a relative metric each time, so it does not work well across a sequence of images."
    3. Images 4 and 5 contain barely any other reference, so it barely works.

    Here is the question:

    1. Are there any suggestions or reference papers I can look at for estimating depth along a sequence of images?

    Thank you very much!

    [Images: im2, im3, im4, im5, im6]

    opened by angrysword 8
  • Training Requirements

    Hi, I was trying to use your code to train on another dataset for the depth prediction task. I noticed that during training I could not increase the batch size beyond 2. With a batch size of 2 and images of size 224x448, it takes almost 9 GB of memory. Can you comment on the memory requirements? How did you train the model, and how much memory did it take? It would be really helpful if you could share some insights on training.

    Thanks

    opened by krantiparida 8
  • Reconcile depth maps from VT and usual midas model

    Hi! Thanks for the great work. I'm trying to reconcile the output of the usual MiDaS model with that of the VT model, but I have some problems. I need this for Open3D visualization: usually I take the inverse of the MiDaS output and get a normal 3D point cloud, but for VT this pipeline breaks.

    Can you please explain how I can fix this? Thanks!

    opened by RuslanOm 6
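
One possible way to make two such outputs comparable is to align them in inverse-depth space before inverting. The following is a minimal pure-Python sketch, assuming both models predict inverse depth only up to an unknown scale and shift; this is an assumption about the models in question, not something the post confirms. `align_scale_shift` is a hypothetical helper, not a function from this repository.

```python
def align_scale_shift(source, target):
    """Least-squares scale s and shift t such that s * source + t ~ target.

    source, target: flat sequences of inverse-depth values for the same pixels.
    """
    if len(source) != len(target) or not source:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((x - mean_s) ** 2 for x in source)
    if var_s == 0:
        raise ValueError("source is constant; scale is undetermined")
    cov = sum((x - mean_s) * (y - mean_t) for x, y in zip(source, target))
    scale = cov / var_s
    shift = mean_t - scale * mean_s
    return scale, shift


# Toy example: inv_b is inv_a with scale 2.0 and shift 0.5 applied.
inv_a = [0.1, 0.2, 0.4, 0.8]
inv_b = [2.0 * x + 0.5 for x in inv_a]
scale, shift = align_scale_shift(inv_a, inv_b)
aligned = [scale * x + shift for x in inv_a]

# Invert only after alignment, guarding against division by zero.
depth = [1.0 / max(v, 1e-6) for v in aligned]
print(scale, shift)
```

Inverting before alignment would distort the geometry whenever the shift is nonzero, which is one plausible reason a pipeline that worked for one model produces broken point clouds for another.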
  • Values in Tables 2 and 3 in paper

    I'm a bit confused by the absolute numbers in Tables 2 and 3 in the arxiv release. should be a percentage of pixels out of 100 and lower should be better. However, the range of numbers in the tables is very low and the higher numbers are highlighted. Could you please clarify what the numbers represent?

    opened by tarashakhurana 6
  • Training Error

    I want to train your model.

    When I don't use nn.DataParallel, training works.

    But when I use nn.DataParallel, I get this error:

    RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

    I want to train with multiple GPUs. How can I do that?

    opened by kimsunkyung 5
  • Question on the results on Pascal Context in Table 5

    Hi,

    Thank you for your great work and for sharing the code.

    I have a small question about the results on Pascal Context in Table 5: it seems that the DPT-Hybrid model is first pre-trained on the ADE20K dataset and then fine-tuned on Pascal Context, am I right?

    Thanks.

    opened by LayneH 5
  • Error running model on C++

    I converted the model using the dpt_scriptable branch, like this:

        model.eval()
    
        if optimize == True and device == torch.device("cuda"):
            model = torch.jit.script(model)
            model = model.to(memory_format=torch.channels_last)
            model = model.half()
    
        model.to(device)
    
        model.save(model_path[:-3] + ".torchscript.pt")
    

    Then I tried to use it in C++, loading a Mat image and converting it to a PyTorch tensor:

    	cv::Mat ch_first = data.clone();
    
    	if (data.type() != CV_32FC3) cout << "wrong type" << endl;
    
    	float* feed_data = (float*)data.data;
    	float* ch_first_data = (float*)ch_first.data;
    
    	for (int p = 0; p < (int)data.total(); ++p)
    	{
    		// R
    		ch_first_data[p] = feed_data[p * 3];
    		// G
    		ch_first_data[p + (int)data.total()] = feed_data[p * 3 + 1];
    		// B
    		ch_first_data[p + 2 * (int)data.total()] = feed_data[p * 3 + 2];
    	}
    
    
    	torch::Tensor image_input = torch::from_blob((float*)ch_first.data, { 1, data.rows, data.cols, 3 });
    	image_input = image_input.toType(torch::kFloat16);
    
    	image_input = image_input.to((*device));
    	
    
    
    	auto net_out = module.forward({ image_input });
    

    The data height is 384 and the width is 672. In the for loop I'm just repacking the values from OpenCV's memory layout to PyTorch's.

    In the forward function I receive an exception:

     	KernelBase.dll!00007ffcea784f99()	Unknown
     	vcruntime140d.dll!00007ffc5afab460()	Unknown
    >	torch_cpu.dll!torch::jit::InterpreterStateImpl::handleError(const torch::jit::ExceptionMessage & msg, bool is_jit_exception, c10::NotImplementedError * not_implemented_error) Line 665	C++
     	torch_cpu.dll!`torch::jit::InterpreterStateImpl::runImpl'::`1'::catch$81() Line 639	C++
     	[External Code]	
     	torch_cpu.dll!torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue,std::allocator<c10::IValue>> & stack) Line 251	C++
     	torch_cpu.dll!torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue,std::allocator<c10::IValue>> & stack) Line 728	C++
     	torch_cpu.dll!torch::jit::InterpreterState::run(std::vector<c10::IValue,std::allocator<c10::IValue>> & stack) Line 841	C++
     	torch_cpu.dll!torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue,std::allocator<c10::IValue>> & stack) Line 544	C++
     	torch_cpu.dll!torch::jit::GraphExecutor::run(std::vector<c10::IValue,std::allocator<c10::IValue>> & inputs) Line 767	C++
     	torch_cpu.dll!torch::jit::GraphFunction::run(std::vector<c10::IValue,std::allocator<c10::IValue>> & stack) Line 36	C++
     	torch_cpu.dll!torch::jit::GraphFunction::operator()(std::vector<c10::IValue,std::allocator<c10::IValue>> stack, const std::unordered_map<std::string,c10::IValue,std::hash<std::string>,std::equal_to<std::string>,std::allocator<std::pair<std::string const ,c10::IValue>>> & kwargs) Line 53	C++
     	torch_cpu.dll!torch::jit::Method::operator()(std::vector<c10::IValue,std::allocator<c10::IValue>> stack, const std::unordered_map<std::string,c10::IValue,std::hash<std::string>,std::equal_to<std::string>,std::allocator<std::pair<std::string const ,c10::IValue>>> & kwargs) Line 225	C++
     	torch_cpu.dll!torch::jit::Module::forward(std::vector<c10::IValue,std::allocator<c10::IValue>> inputs) Line 114	C++
     	pytorch_test.exe!main() Line 128	C++
    
    

    I attached the source files for debugging and got the exception string:

    The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__/dpt/models.py", line 14, in forward
      def forward(self: __torch__.dpt.models.DPTDepthModel,
        x: Tensor) -> Tensor:
        inv_depth = torch.squeeze((self).forward_features(x, ), 1)
                                   ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        if self.invert:
          depth = torch.add(torch.mul(inv_depth, self.scale), self.shift)
      File "code/__torch__/dpt/models.py", line 28, in forward_features
      def forward_features(self: __torch__.dpt.models.DPTDepthModel,
        x: Tensor) -> Tensor:
        layer_1, layer_2, layer_3, layer_4, = (self.pretrained).forward(x, )
                                               ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        layer_1_rn = (self.scratch.layer1_rn).forward(layer_1, )
        layer_2_rn = (self.scratch.layer2_rn).forward(layer_2, )
      File "code/__torch__/dpt/vit.py", line 22, in forward
        x: Tensor) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        _0, _1, h, w, = torch.size(x)
        layers = (self).forward_flex(x, )
                  ~~~~~~~~~~~~~~~~~~ <--- HERE
        layer_1, layer_2, layer_3, layer_4, = layers
        layer_10 = (self.readout_oper1).forward(layer_1, )
      File "code/__torch__/dpt/vit.py", line 54, in forward_flex
        _15 = torch.floordiv(H, (self.patch_size)[1])
        _16 = torch.floordiv(W, (self.patch_size)[0])
        pos_embed = (self)._resize_pos_embed(_14, _15, _16, )
                     ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        B0 = (torch.size(x))[0]
        _17 = (self.model.patch_embed.proj).forward(x, )
      File "code/__torch__/dpt/vit.py", line 220, in _resize_pos_embed
        _68 = torch.reshape(posemb_grid, [1, gs_old, gs_old, -1])
        posemb_grid0 = torch.permute(_68, [0, 3, 1, 2])
        posemb_grid1 = _67(posemb_grid0, [gs_h, gs_w], None, "bilinear", False, None, )
                       ~~~ <--- HERE
        _69 = torch.permute(posemb_grid1, [0, 2, 3, 1])
        posemb_grid2 = torch.reshape(_69, [1, torch.mul(gs_h, gs_w), -1])
      File "code/__torch__/torch/nn/functional/___torch_mangle_25.py", line 256, in interpolate
                        ops.prim.RaiseException("AssertionError: ")
                        align_corners6 = _25
                      _81 = torch.upsample_bilinear2d(input, output_size2, align_corners6, scale_factors5)
                            ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                      _79 = _81
                    else:
    
    Traceback of TorchScript, original code (most recent call last):
      File "D:\dev\Mars\DPT-dpt_scriptable\dpt\models.py", line 114, in forward
        def forward(self, x):
            inv_depth = self.forward_features(x).squeeze(dim=1)
                        ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        
            if self.invert:
      File "D:\dev\Mars\DPT-dpt_scriptable\dpt\vit.py", line 302, in forward
            _, _, h, w = x.shape
        
            layers = self.forward_flex(x)
                     ~~~~~~~~~~~~~~~~~ <--- HERE
        
            # HACK: this is to make TorchScript happy. Can't directly address modules,
      File "D:\dev\Mars\DPT-dpt_scriptable\dpt\vit.py", line 259, in forward_flex
            B, _, H, W = x.shape
        
            pos_embed = self._resize_pos_embed(
                        ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                self.model.pos_embed,
                int(H // self.patch_size[1]),
      File "D:\dev\Mars\DPT-dpt_scriptable\dpt\vit.py", line 247, in _resize_pos_embed
        
            posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
            posemb_grid = F.interpolate(
                          ~~~~~~~~~~~~~ <--- HERE
                posemb_grid, size=[gs_h, gs_w], mode="bilinear", align_corners=False
            )
      File "C:\Users\Ilya\anaconda3\envs\np\lib\site-packages\torch\nn\functional.py", line 3709, in interpolate
        if input.dim() == 4 and mode == "bilinear":
            assert align_corners is not None
            return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        if input.dim() == 5 and mode == "trilinear":
            assert align_corners is not None
    RuntimeError: Input and output sizes should be greater than 0, but got input (H: 24, W: 24) output (H: 42, W: 0)
    

    It seems something is wrong with the input size? Should it be 384x384? I am not sure what is wrong.

    opened by InfiniteLife 4
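
The reported output size (H: 42, W: 0) is consistent with the tensor shape { 1, data.rows, data.cols, 3 } being read as channel-first: the model unpacks `B, _, H, W = x.shape` and divides H and W by the patch size. The arithmetic can be sketched as follows, assuming a patch size of 16 (not stated in the post, but consistent with the 24x24 input grid for a 384-pixel model). `patch_grid` is a hypothetical helper for illustration only.

```python
def patch_grid(shape, patch_size=16):
    """Return (gs_h, gs_w) for a shape that is interpreted as (N, C, H, W)."""
    _, _, height, width = shape
    return height // patch_size, width // patch_size


rows, cols = 384, 672

# Shape passed to torch::from_blob in the post: read as N=1, C=384, H=672, W=3.
as_posted = patch_grid((1, rows, cols, 3))

# Shape that matches a buffer repacked to channel-first order.
channel_first = patch_grid((1, 3, rows, cols))

print(as_posted)      # (42, 0)
print(channel_first)  # (24, 42)
```

If this reading is right, the input size itself is not the problem. Describing the repacked buffer as { 1, 3, data.rows, data.cols } in the torch::from_blob call would give a non-empty 24x42 grid.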
  • Error in the run_segmentation.py file.

    I have a question about your run_segmentation.py. Looking at the structure of the DPT model, forward returns five outputs:

    [out, layer1, layer2, layer3, layer4]

    In run_segmentation.py, a sample is passed to the model and the list above is returned as out. If you then pass it to torch.nn.functional.interpolate, you encounter the following error:

    AttributeError: 'list' object has no attribute 'dim'

    How do you solve it?

    opened by sjg02122 4
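
The error message indicates that a Python list, not a tensor, reached the interpolation call. A minimal sketch of one workaround, assuming (as the post describes) that forward returns a list whose first element is the prediction; `primary_output` is a hypothetical helper, not part of this repository.

```python
def primary_output(model_output):
    """Return the prediction from a model that may return a list or tuple.

    Assumes the prediction is the first element when a sequence is returned;
    a bare tensor (or any other object) is passed through unchanged.
    """
    if isinstance(model_output, (list, tuple)):
        return model_output[0]
    return model_output


# Stand-in values instead of tensors, to show the behaviour only.
from_list = primary_output(["out", "layer1", "layer2", "layer3", "layer4"])
from_single = primary_output("out")
print(from_list, from_single)
```

The result of `primary_output(model(sample))` could then be passed to the interpolation step in place of the raw model output.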
  • Trying to understand the Dense Prediction Transformer architecture

    This is not really an issue, but I'm trying to understand the DPT architecture. I'm quite new to transformers, but I have previous experience with CNNs.

    Let's take an example where I would like to build the DPT-Large depth model with a patch size of 16 and an image size of 384. I can see that the pretrained weights for the original model variants are loaded from timm, but in this example I would like to build the model from scratch.

    I have gathered (hopefully) all the essential functions in the code snippet below. I don't fully understand where and how I should apply the functions forward_flex() and _resize_pos_embed(). My second question: if I would like to compute a multi-scale loss, on which layers should it be computed? Thanks in advance!

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    
    class Transpose(nn.Module):
        def __init__(self, dim0, dim1):
            super(Transpose, self).__init__()
            self.dim0 = dim0
            self.dim1 = dim1
    
        def forward(self, x):
            x = x.transpose(self.dim0, self.dim1)
            return x
    
    class Interpolate(nn.Module):
        def __init__(self, scale_factor):
            super(Interpolate, self).__init__()
            self.scale_factor = scale_factor
    
        def forward(self, x):
            x = nn.functional.interpolate(x, scale_factor=self.scale_factor, mode='bilinear', align_corners=True)
            return x
    
    class ResidualConvUnit(nn.Module):
        def __init__(self, features):
            super().__init__()
            self.conv1 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True)
            self.conv2 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True)
            self.relu = nn.ReLU(inplace=True)
    
        def forward(self, x):
            out = self.relu(x)
            out = self.conv1(out)
            out = self.relu(out)
            out = self.conv2(out)
            return out + x
    
    class FeatureFusionBlock(nn.Module):
        def __init__(self, features):
            super(FeatureFusionBlock, self).__init__()
            self.resConfUnit1 = ResidualConvUnit(features)
            self.resConfUnit2 = ResidualConvUnit(features)
    
        def forward(self, *xs):
            output = xs[0]
            if len(xs) == 2:
                output += self.resConfUnit1(xs[1])
            output = self.resConfUnit2(output)
            output = nn.functional.interpolate(output, scale_factor=2, mode="bilinear", align_corners=True)
            return output
    
    class PatchEmbed(nn.Module):
        def __init__(self, img_size=384, patch_size=16, in_chans=3, embed_dim=1024):
            super().__init__()
            self.img_size = img_size
            self.patch_size = patch_size
            self.grid_size = (img_size // patch_size, img_size // patch_size)
            self.num_patches = self.grid_size[0] * self.grid_size[1]
    
            self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
    
        def forward(self, x):
            B, C, H, W = x.shape
            x = self.proj(x)
            x = x.flatten(2).transpose(1, 2)
            return x
    
    class Attention(nn.Module):
        def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
            super().__init__()
            assert dim % num_heads == 0, 'dim should be divisible by num_heads'
            self.num_heads = num_heads
            head_dim = dim // num_heads
            self.scale = head_dim ** -0.5
    
            self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
            self.attn_drop = nn.Dropout(attn_drop)
            self.proj = nn.Linear(dim, dim)
            self.proj_drop = nn.Dropout(proj_drop)
    
        def forward(self, x):
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
            q, k, v = qkv.unbind(0)
    
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            attn = self.attn_drop(attn)
    
            x = (attn @ v).transpose(1, 2).reshape(B, N, C)
            x = self.proj(x)
            x = self.proj_drop(x)
            return x
    
    class LayerScale(nn.Module):
        def __init__(self, dim, init_values=1e-5, inplace=False):
            super().__init__()
            self.inplace = inplace
            self.gamma = nn.Parameter(init_values * torch.ones(dim))
    
        def forward(self, x):
            return x.mul_(self.gamma) if self.inplace else x * self.gamma
    
    class Mlp(nn.Module):
        def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, bias=True, drop=0.):
            super().__init__()
            out_features = out_features or in_features
            hidden_features = hidden_features or in_features
    
            self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
            self.act = act_layer()
            self.drop1 = nn.Dropout(drop)
            self.fc2 = nn.Linear(hidden_features, out_features, bias=bias)
            self.drop2 = nn.Dropout(drop)
    
        def forward(self, x):
            x = self.fc1(x)
            x = self.act(x)
            x = self.drop1(x)
            x = self.fc2(x)
            x = self.drop2(x)
            return x
    
    class Block(nn.Module):
    
        def __init__(self, embed_dim=1024, num_heads=16, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
            super().__init__()
            self.norm1 = norm_layer(embed_dim)
            self.attn = Attention(embed_dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
    
            self.norm2 = norm_layer(embed_dim)
            self.mlp = Mlp(in_features=embed_dim, hidden_features=int(embed_dim * mlp_ratio), act_layer=act_layer, drop=drop)
    
        def forward(self, x):
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
            return x
    
    class ProjectReadout(nn.Module):
        def __init__(self, in_features, start_index=1):
            super(ProjectReadout, self).__init__()
            self.start_index = start_index
    
            self.project = nn.Sequential(nn.Linear(2 * in_features, in_features), nn.GELU())
    
        def forward(self, x):
            readout = x[:, 0].unsqueeze(1).expand_as(x[:, self.start_index :])
            features = torch.cat((x[:, self.start_index :], readout), -1)
    
            return self.project(features)
    
    class Encoder(nn.Module):
        def __init__(self, embed_dim=1024, num_heads=16, norm_layer=nn.LayerNorm):
            super(Encoder, self).__init__()
    
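            self.patch_embed = PatchEmbed(embed_dim=embed_dim)
            self.num_patches = self.patch_embed.num_patches
            # Learnable class (readout) token; ProjectReadout expects it at index 0.
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            # One position embedding per patch plus one for the class token.
            self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim) * .02)
            self.pos_drop = nn.Dropout(p=0.1)
    
            # NOTE: a single Block is reused below, so all 24 layers share weights.
            self.encoder_block = Block(embed_dim=embed_dim, num_heads=num_heads)
    
            self.norm = norm_layer(embed_dim)
    
        def forward(self, x):
            x = self.patch_embed(x)
    
            # Prepend the class token, then add (not call) the position embedding.
            cls_token = self.cls_token.expand(x.shape[0], -1, -1)
            x = torch.cat((cls_token, x), dim=1)
            x = x + self.pos_embed
            x = self.pos_drop(x)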
    
            for i in range(6):       
                x = self.encoder_block(x)
    
            layer1 = x
    
            for i in range(6):       
                x = self.encoder_block(x)
    
            layer2 = x
    
            for i in range(6):       
                x = self.encoder_block(x)
    
            layer3 = x
    
            for i in range(6):       
                x = self.encoder_block(x)
    
            layer4 = self.norm(x)
            return layer1, layer2, layer3, layer4
    
    class DPT_VITL_16_384(nn.Module):
        def __init__(self, img_size=384, features=256, embed_dim=1024, in_shape=[256, 512, 1024, 1024]):
            super(DPT_VITL_16_384, self).__init__()
    
            self.encoder = Encoder(embed_dim=embed_dim)
    
            self.refinenet1 = FeatureFusionBlock(features)
            self.refinenet2 = FeatureFusionBlock(features)
            self.refinenet3 = FeatureFusionBlock(features)
            self.refinenet4 = FeatureFusionBlock(features)
    
            self.layer1_rn = nn.Conv2d(in_shape[0], features, kernel_size=3, stride=1, padding=1, bias=False)
            self.layer2_rn = nn.Conv2d(in_shape[1], features, kernel_size=3, stride=1, padding=1, bias=False)
            self.layer3_rn = nn.Conv2d(in_shape[2], features, kernel_size=3, stride=1, padding=1, bias=False)
            self.layer4_rn = nn.Conv2d(in_shape[3], features, kernel_size=3, stride=1, padding=1, bias=False)
    
            self.readout_oper0 = ProjectReadout(embed_dim, start_index=1)
            self.readout_oper1 = ProjectReadout(embed_dim, start_index=1)
            self.readout_oper2 = ProjectReadout(embed_dim, start_index=1)
            self.readout_oper3 = ProjectReadout(embed_dim, start_index=1)
    
            self.act_postprocess1 = nn.Sequential(self.readout_oper0, Transpose(1, 2), nn.Unflatten(2, torch.Size([img_size // 16, img_size // 16])),
                                                  nn.Conv2d(in_channels=embed_dim, out_channels=in_shape[0], kernel_size=1, stride=1, padding=0),
                                                  nn.ConvTranspose2d(in_channels=in_shape[0], out_channels=in_shape[0], kernel_size=4, stride=4, padding=0, bias=True, dilation=1, groups=1))
    
            self.act_postprocess2 = nn.Sequential(self.readout_oper1, Transpose(1, 2), nn.Unflatten(2, torch.Size([img_size // 16, img_size // 16])),
                                                  nn.Conv2d(in_channels=embed_dim, out_channels=in_shape[1],kernel_size=1,stride=1,padding=0),
                                                  nn.ConvTranspose2d(in_channels=in_shape[1], out_channels=in_shape[1], kernel_size=2, stride=2, padding=0, bias=True, dilation=1, groups=1))
    
            self.act_postprocess3 = nn.Sequential(self.readout_oper2, Transpose(1, 2), nn.Unflatten(2, torch.Size([img_size // 16, img_size // 16])),
                                                  nn.Conv2d(in_channels=embed_dim, out_channels=in_shape[2], kernel_size=1, stride=1, padding=0))
    
            self.act_postprocess4 = nn.Sequential(self.readout_oper3, Transpose(1, 2), nn.Unflatten(2, torch.Size([img_size // 16, img_size // 16])),
                                                  nn.Conv2d(in_channels=embed_dim, out_channels=in_shape[3], kernel_size=1, stride=1, padding=0),
                                                  nn.Conv2d(in_channels=in_shape[3], out_channels=in_shape[3], kernel_size=3, stride=2, padding=1))
    
    
            self.head = nn.Sequential(
                nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1),
                Interpolate(scale_factor=2),
                nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
                nn.ReLU(True),
                nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
                nn.ReLU(True)
            )
    
        def forward(self, x):
    
            layer_1, layer_2, layer_3, layer_4 = self.encoder(x)
    
            # Each encoder stage has its own post-processing module.
            layer_1 = self.act_postprocess1(layer_1)
            layer_2 = self.act_postprocess2(layer_2)
            layer_3 = self.act_postprocess3(layer_3)
            layer_4 = self.act_postprocess4(layer_4)
    
            layer_1_rn = self.layer1_rn(layer_1)
            layer_2_rn = self.layer2_rn(layer_2)
            layer_3_rn = self.layer3_rn(layer_3)
            layer_4_rn = self.layer4_rn(layer_4)
    
            path_4 = self.refinenet4(layer_4_rn)
            path_3 = self.refinenet3(path_4, layer_3_rn)
            path_2 = self.refinenet2(path_3, layer_2_rn)
            path_1 = self.refinenet1(path_2, layer_1_rn)
    
            out = self.head(path_1)
    
            return out
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using {device} device")
    model = DPT_VITL_16_384().to(device)
    print(model)
    
    opened by tholmb 0
  • confidence of the prediction result

    Thanks for sharing this research and the models.

    How can I generate a confidence map for the prediction result? I want to measure the accuracy of the predicted depth at every pixel.

    opened by fuyuanhao 0
  • Questions about the BlendedMVS datasets used in DPT model

    Thanks for sharing this research and the models.

    I have several questions about the BlendedMVS datasets used in the DPT models:

    Q1: Do the DPT models (dpt_hybrid-midas.pt and dpt_large-midas.pt) use the BlendedMVS validation set in the training process?
    Q2: Do the DPT models (dpt_hybrid-midas.pt and dpt_large-midas.pt) use the BlendedMVS+ and BlendedMVS++ datasets in the training process?
    Q3: Could you kindly provide the dpt_hybrid-midas and dpt_large-midas model weights trained without the BlendedMVS dataset? I want to evaluate the results in a multi-view stereo setting.

    Thank you very much for your excellent work; I am looking forward to your reply.

    opened by fuyuanhao 0
  • Add pillow in requirements.txt

    Hi, as you can see from the following stack trace, pillow also needs to be provided in the environment.

    Traceback (most recent call last):
      File "run_segmentation.py", line 11, in <module>
        import util.io
      File "/home/s/DepthNets/DPT/util/io.py", line 9, in <module>
        from PIL import Image
    ModuleNotFoundError: No module named 'PIL'

    opened by Salvatore-tech 0
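A missing dependency like the one reported above can be caught before running the scripts. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the list of import names is an assumption based on the dependencies named in the README (Pillow is imported as `PIL`, OpenCV as `cv2`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip package names), assumed for illustration.
required = ["torch", "cv2", "timm", "PIL"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only looks up each module without importing it, so it stays fast even for large packages.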
  • ONNX Conversion Scripts

    This PR implements ONNX conversion scripts, along with scripts to run the resulting models on monodepth and segmentation tasks. Furthermore, the fixes from #42 are incorporated. The converted weights are available here and are verified to produce numerically similar results to the original models on example inputs. Please let me know if I should add anything to the README.

    opened by timmh 4
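The PR does not say how "numerically similar" was checked. One common approach is to compare the two outputs element by element against a combined relative and absolute tolerance. The sketch below illustrates that idea in plain Python; the function names, the tolerance values, and the sample numbers are all assumptions made for illustration and are not taken from the PR.

```python
def flatten(values):
    """Yield the scalar entries of an arbitrarily nested list or tuple."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Return (max_abs_diff, all_close) for two equally sized nested lists."""
    ref = list(flatten(reference))
    cand = list(flatten(candidate))
    if len(ref) != len(cand):
        raise ValueError("outputs differ in size")
    max_abs_diff = max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)
    # An element passes if its error is within atol + rtol * |reference value|.
    all_close = all(abs(r - c) <= atol + rtol * abs(r) for r, c in zip(ref, cand))
    return max_abs_diff, all_close


# Made-up values standing in for a PyTorch output and an ONNX output.
pytorch_out = [[1.0, 2.0], [3.0, 4.0]]
onnx_out = [[1.0, 2.0005], [3.0, 3.9995]]
print(compare_outputs(pytorch_out, onnx_out))
```

In practice the two inputs would be the flattened outputs of the original and converted models on the same image, and the tolerances would be chosen to suit the task.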
Owner
Intelligent Systems Lab Org