Image Manipulation Detection by Multi-View Multi-Scale Supervision

2022-04-21 17:04:00 【Kun Li】

Reference: 【Scanning】 ICCV 2021 | MVSS-Net: image tampering detection based on multi-view and multi-scale supervision https://mp.weixin.qq.com/s/Jkq2gQX-_Ss3kziIJU-oEg

1.abstract

The key challenge in image tampering detection is learning generalizable features that are sensitive to manipulations in novel data while preventing false alarms on authentic images. Current research emphasizes sensitivity and neglects specificity. This paper addresses both through multi-view feature learning and multi-scale supervision. The idea is common in this area: Minivision's liveness detection work also added a Fourier branch to supervise feature refinement. Tampering detection is about artifacts and edges in the manipulated image, and the core problem is telling tampered regions apart from the original content, so adding a strongly supervised branch is reasonable. Multi-view feature learning exploits the noise distribution and boundary artifacts surrounding tampered regions to learn semantic-agnostic, and therefore more generalizable, features. Multi-scale supervision makes it possible to learn from authentic images, which are non-trivial for current methods based on semantic segmentation networks to take into account.

2.introduction

There are three common types of image manipulation. Copy-move copies an element from one region of an image and moves it to another region of the same image. Splicing copies an element from one image and pastes it onto another image. Inpainting removes unwanted elements.

The task is often treated as a simplified case of image semantic segmentation, but a semantic segmentation model is suboptimal for it: it is designed to capture semantic information, which makes the network dataset-dependent rather than generalizable. This is well put. Early tampering detection work mostly designed its scenarios and defined its manipulation types on synthetic data, so the trained models depended strongly on that data and generalized very poorly.

To learn semantic-agnostic features, image content must be suppressed. This matters: when I previously worked on binary classification of tampered versus authentic images, the classifier picked up features unrelated to tampering, which is exactly what we do not want, so the interference of image content has to be suppressed. Current methods fall into two groups: noise-view methods and edge-supervised methods. The first group assumes that the new elements introduced by splicing and inpainting differ from the authentic part in noise distribution and tries to exploit that discrepancy. A noise map of the input image, generated by a predefined high-pass filter or its trainable counterpart, is fed to the deep network either alone or together with the input image. This is ineffective for copy-move, which introduces no new elements. The second group looks for boundary artifacts as manipulation traces around the tampered region, and does so by adding an auxiliary branch that reconstructs the region's edge.

3.related work

The figure above summarizes recent work surveyed by the authors. The views listed at the top (RGB, noise maps, and so on) match what I found in my own earlier survey. I have also seen Fourier spectra and ELA maps used to strengthen the features of an auxiliary branch.

This paper focuses on three manipulation types: copy-move, splicing and inpainting. For low-level manipulations such as Gaussian blur and JPEG compression, see the constrained CNN line of work. A constrained convolution layer such as BayarConv helps extract noise information, but relying on the noise map alone risks losing other useful information in the original RGB input. The two-stream Faster R-CNN uses SRM filters. MVSS-Net also uses a noise map and fuses the RGB and noise views late, and the fusion is done by dual attention rather than by non-trainable bilinear pooling.

Tampering with a region of an image inevitably leaves traces between the tampered region and its surroundings, so exploiting these edge artifacts is also important for tampering detection. MVSS-Net has an edge-supervised branch for this.

4.proposed model

Classification can only decide whether an image has been tampered with; it cannot localize the tampered region. Segmentation both decides whether the image is tampered and gives the specific region. The segmentation formulation really is better, and when doing classification or detection we should think about the problem at this finer granularity: every pixel gets a binary classification probability, and together these form a global segmentation map. MVSS-Net takes the RGB image and its noise map as input and is supervised with labels at three scales: pixel, edge and image. Tampering detection should be built on segmentation. I have long been thinking about how to remove the interference of factors unrelated to tampering. For now, classification methods will inevitably run into this problem because they are not fine-grained enough: their features are not pixel-level. That said, finding what is common after extracting features within a region is also important.

4.1 multi-view feature learning

ResNet-50 is the backbone. The edge-supervised branch (ESB) is specifically designed to exploit subtle boundary artifacts around the tampered region, while the noise-sensitive branch (NSB) aims to capture the inconsistency between tampered and authentic regions. Both branches are semantic-agnostic.

4.1.1 edge-supervised branch

Ideally, with edge supervision, the network's response area becomes more concentrated on the tampered region. Designing such an edge-supervised network is not easy. It is worth thinking about: should the model be made to pay more attention to these edge regions, the way db does? As noted in Section 2, the main challenge is how to construct an appropriate input for the edge detection head. On the one hand, using the features of the last ResNet block is problematic, because it forces the deep features to capture low-level edge patterns and thus hurts the main task of manipulation segmentation. On the other hand, using features from the initial block is also problematic, because the subtle edge patterns contained in these shallow features can easily vanish after multiple deep convolutions. Therefore both shallow and deep features are needed. However, the simple feature concatenation used in earlier work is suboptimal: the features are mixed together, and there is no guarantee that the deeper features receive adequate supervision from the edge head. To overcome this, the paper constructs the input of the edge head in a shallow-to-deep manner.

Features from different ResNet blocks are combined progressively for manipulation edge detection. To enhance edge-related patterns, a Sobel layer is introduced. The features of the i-th block first pass through the Sobel layer, then through an edge residual block (ERB), and are then combined by summation with their counterparts from the next block. To prevent accumulation effects, the combined features go through another ERB before the next round of feature combination. This mechanism helps prevent the extreme cases in which the edge head over-supervises or completely ignores the deeper features. Visualizing the feature maps of the last ResNet block (Fig. 4) shows that the proposed ESB does produce a more concentrated response near the tampered region.

In Fig. 2, the ESB has two outputs. The first goes through a sigmoid function and is the predicted edge map used for edge supervision. The second is used for the main segmentation map.
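The shallow-to-deep combination described above can be sketched in plain Python. This is only an illustration under strong assumptions: every "block feature" is a single-channel 2D list of the same size, a plain Sobel gradient magnitude stands in for the paper's Sobel layer, and `erb` is a hypothetical identity placeholder for the learned edge residual block.

```python
import math

# 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def conv3x3(img, kernel):
    """Zero-padded 3x3 cross-correlation; the output has the input's size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def sobel_magnitude(img):
    """Gradient magnitude, used here as a stand-in for the Sobel layer."""
    gx = conv3x3(img, SOBEL_X)
    gy = conv3x3(img, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


def erb(feat):
    """Hypothetical placeholder for an edge residual block (identity here)."""
    return feat


def progressive_edge_input(block_feats):
    """Combine block features shallow-to-deep: Sobel -> ERB -> sum -> ERB."""
    combined = erb(sobel_magnitude(block_feats[0]))
    for feat in block_feats[1:]:
        nxt = erb(sobel_magnitude(feat))
        summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(combined, nxt)]
        combined = erb(summed)  # extra ERB before the next round of combination
    return combined


# A vertical step edge: the response concentrates on the boundary columns.
img = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edge_input = progressive_edge_input([img, img])
```

In a real network the blocks differ in resolution and channel count, so the deeper features would have to be resized before the summation; that step is left out here.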

4.1.2 noise-sensitive branch

To make full use of the noise view, a noise-sensitive branch (NSB) is built in parallel to the edge-supervised branch. NSB is a standard FCN with a ResNet-50 backbone. BayarConv is chosen for noise extraction; it works better than SRM filters.
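As a rough illustration of why a constrained convolution acts as a noise extractor, the sketch below projects a kernel onto the constraint used by Bayar-style constrained convolutions: the center weight is fixed to -1 and the remaining weights sum to 1. The helper names `constrain_bayar` and `noise_residual` are made up for this example, and the all-ones starting kernel is just a placeholder for learned weights.

```python
def constrain_bayar(kernel):
    """Project a square kernel so that center = -1 and the other weights sum to 1."""
    k = len(kernel)
    c = k // 2
    off_sum = sum(
        kernel[i][j] for i in range(k) for j in range(k) if (i, j) != (c, c)
    )
    if off_sum == 0:
        raise ValueError("off-center weights must not sum to zero")
    out = [[kernel[i][j] / off_sum for j in range(k)] for i in range(k)]
    out[c][c] = -1.0
    return out


def noise_residual(img, kernel):
    """'Valid' cross-correlation of a 2D list with a square kernel."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + i][x + j] * kernel[i][j] for i in range(k) for j in range(k))
            for x in range(w - k + 1)
        ]
        for y in range(h - k + 1)
    ]


kernel = constrain_bayar([[1.0] * 5 for _ in range(5)])

# Flat content is cancelled: the kernel predicts the center from its neighbours.
flat = [[7.0] * 5 for _ in range(5)]
flat_response = noise_residual(flat, kernel)[0][0]

# A pixel that deviates from its neighbourhood leaves a residual.
spiky = [row[:] for row in flat]
spiky[2][2] = 8.0
spiky_response = noise_residual(spiky, kernel)[0][0]
```

Because the weights sum to zero, smooth image content produces almost no response and only the prediction error of the center pixel survives. In training, the projection would be re-applied after every weight update.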

4.1.3 branch fusion by dual attention

The feature maps output by ESB and NSB are fused by a trainable dual attention (DA) module. Bilinear pooling is not used: the two-stream Faster R-CNN uses bilinear pooling, which has nothing to train.

DA has two parallel branches. In the figure, blue is channel attention (CA) and green is position attention (PA). CA associates channel features to selectively emphasize interdependent channel feature maps. Meanwhile, PA selectively updates the features at each position with a weighted sum of the features at all positions. The outputs of CA and PA are aggregated and converted by a 1x1 convolution into a single-channel map with the spatial size unchanged, followed by parameter-free bilinear upsampling and then a sigmoid, which gives the final segmentation map. The input to DA is two 2048-channel feature maps, concatenated into 4096 channels; after DA attention the output has 1 channel.
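The data flow through the fusion module can be sketched on tiny tensors. This is a simplified illustration, not the paper's module: features are `C x N` nested lists (channels by flattened positions), the learned projections inside the attention branches are dropped, a plain softmax is used in both branches, the 1x1 convolution is a hand-supplied weight vector, and the bilinear upsampling step is omitted. All function names are made up for the example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_attention(feat, gamma=1.0):
    """Update each position with a weighted sum over all positions, plus a residual."""
    C, N = len(feat), len(feat[0])
    attn = [
        softmax([sum(feat[c][i] * feat[c][j] for c in range(C)) for j in range(N)])
        for i in range(N)
    ]
    return [
        [gamma * sum(attn[i][j] * feat[c][j] for j in range(N)) + feat[c][i]
         for i in range(N)]
        for c in range(C)
    ]


def channel_attention(feat, gamma=1.0):
    """Update each channel with a weighted sum over all channels, plus a residual."""
    C, N = len(feat), len(feat[0])
    attn = [
        softmax([sum(feat[a][n] * feat[b][n] for n in range(N)) for b in range(C)])
        for a in range(C)
    ]
    return [
        [gamma * sum(attn[a][b] * feat[b][n] for b in range(C)) + feat[a][n]
         for n in range(N)]
        for a in range(C)
    ]


def dual_attention_fuse(esb_feat, nsb_feat, weights, bias=0.0):
    """Concatenate channels, sum the CA and PA outputs, reduce to one channel, sigmoid."""
    feat = esb_feat + nsb_feat  # list concatenation = channel-wise concatenation
    ca, pa = channel_attention(feat), position_attention(feat)
    C, N = len(feat), len(feat[0])
    summed = [[ca[c][n] + pa[c][n] for n in range(N)] for c in range(C)]
    # A 1x1 convolution to one channel is a per-position weighted sum over channels.
    logits = [sum(weights[c] * summed[c][n] for c in range(C)) + bias for n in range(N)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


esb = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]  # 2 channels, 3 positions
nsb = [[0.3, 0.1, 0.2], [0.2, 0.2, 0.1]]
probs = dual_attention_fuse(esb, nsb, weights=[0.5, -0.5, 0.25, 0.25])
```

With 2048-channel inputs the same flow yields the 4096-channel concatenation and single-channel output mentioned above; only the sizes change.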

4.2 multi-scale supervision

Supervision is applied at three scales: a pixel-scale loss, an edge loss for learning semantic-agnostic features, and an image-scale loss.

pixel-scale loss. Uses Dice loss. Tampered pixels are usually a small minority in a given image, and Dice loss is effective for learning from such extremely imbalanced data. It is computed at the original image size.

edge loss. Also uses Dice loss. It is an auxiliary loss, so it is not computed at the original image size but at 1/4 of it, which reduces the computational cost of training and improves performance at the same time.

image-scale loss. Uses BCE loss.

Dice loss is similar to an IoU loss, while BCE is per-pixel binary classification. In image segmentation, softmax cross-entropy loss predicts the class of each pixel and then averages over all pixels. In essence it learns from every pixel equally, so when the classes in an image are imbalanced, training is dominated by the majority class: the network leans toward the majority class and its ability to extract features for minority classes drops. With BCE, the usual remedy is to weight the positive and negative samples. Dice loss instead divides the intersection of the prediction and the ground truth by their total pixels, treating all pixels of a class as a whole and measuring the share of the intersection in that whole, so it is not swamped by the large number of majority pixels and gives better results. In practice, Dice loss is often combined with BCE loss to make training more stable.
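A small numeric sketch makes the imbalance argument concrete. It assumes one common Dice formulation with squared terms in the denominator and an unweighted mean BCE; the 100-pixel image with 2 tampered pixels is an invented toy case.

```python
import math


def dice_loss(pred, target, eps=1e-8):
    """1 - 2*sum(p*t) / (sum(p^2) + sum(t^2)) over flattened pixels."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - 2.0 * inter / (denom + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; every pixel contributes equally."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


# 100 pixels, only 2 of them tampered.
target = [1.0, 1.0] + [0.0] * 98

# A lazy prediction that says "probably authentic" everywhere.
lazy = [0.1] * 100
lazy_bce = bce_loss(lazy, target)    # small: dominated by the 98 easy pixels
lazy_dice = dice_loss(lazy, target)  # large: the tampered pixels are missed

# A perfect prediction drives the Dice loss to (almost) zero.
perfect_dice = dice_loss(target, target)
```

The lazy prediction looks acceptable under the averaged BCE but is clearly penalized by Dice, which is the behaviour the paragraph above describes.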

The paper's final loss is a weighted combination of these three losses.
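One way such a combination could be written is as a convex weighted sum, sketched below. The form, the weight names `alpha` and `beta`, and the example values are assumptions for illustration; this post does not state the weights, so check them against the paper before reusing them.

```python
def total_loss(loss_seg, loss_clf, loss_edg, alpha, beta):
    """Convex combination of pixel-scale, image-scale and edge losses (assumed form)."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0 and alpha + beta < 1.0):
        raise ValueError("alpha and beta must be positive and sum to less than 1")
    return alpha * loss_seg + beta * loss_clf + (1.0 - alpha - beta) * loss_edg


# Illustrative weights only: the edge term receives the remaining weight.
example = total_loss(loss_seg=0.5, loss_clf=0.3, loss_edg=0.8, alpha=0.2, beta=0.1)
```

Tying the third weight to `1 - alpha - beta` keeps the total on the same scale as the individual losses, so only two hyperparameters have to be tuned.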