
StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis

2022-04-21 12:50:00 【_ Summer tree】


Abstract

StyleNeRF:

A 3D-aware generative model with multi-view consistency.
Trained on unstructured 2D images.
Combines NeRF with a style-based generator to improve rendering efficiency and 3D consistency for high-resolution image generation (the goals).
Volume rendering is used only to produce a low-resolution feature map, which is then progressively upsampled in 2D; this addresses the rendering efficiency problem.
Ways to mitigate the resulting inconsistencies:
- a better upsampler
- a new regularization loss
- ……
Result: StyleNeRF synthesizes high-resolution images quickly while preserving 3D consistency.
Camera poses and different levels of style can be controlled, which makes it possible to generate unseen views.
It also supports challenging tasks, including zoom-in and zoom-out, style mixing, inversion, and semantic editing.

Problems with existing methods ：

They cannot synthesize high-resolution images.
They produce noticeable 3D-inconsistent artifacts.
They lack control over style attributes and explicit camera poses.

Method

Figure: comparison of the proposed upsampling method with other upsampling methods. The proposed method maintains good 3D consistency.

3.1 IMAGE SYNTHESIS AS NEURAL IMPLICIT FIELD RENDERING

Style-based generative NeRF

To model high-frequency details, x and d are mapped, dimension by dimension, to Fourier features.
StyleNeRF is formalized by modulating the NeRF MLP with a style vector w, using the following components:
- $f$ is the mapping network, which maps a noise vector from the spherical Gaussian space Z to the style space W.
- $g_w^i(\cdot)$ is the i-th MLP layer, modulated by the input style vector $w$.
- $\phi_w^n(x)$ is the n-th layer feature at point x.
Density and color are predicted from the extracted features by the heads $h_\sigma$ and $h_c$.
Each head can be a linear projection or a 2-layer MLP.
The first $\min(n_\sigma, n_c)$ layers of the network are shared.
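The formulation above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the layer widths, the scale-only modulation, the activation choices, and all function names (`fourier_features`, `style_nerf_point`, `init_params`) are made up for the example, and the view direction d is left out for brevity.

```python
import numpy as np

def fourier_features(v, num_freqs=4):
    """Map every coordinate of v to sin/cos features at num_freqs frequencies."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = v[..., None] * freqs                                      # (..., C, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., C, 2F)
    return feats.reshape(*v.shape[:-1], -1)

def leaky_relu(h, slope=0.2):
    return np.where(h > 0, h, slope * h)

def style_nerf_point(x, w, layers, h_sigma, h_c, n_sigma, n_c):
    """Density and color at points x (N, 3) for one style vector w.

    layers is a list of (weight, affine) pairs. affine maps w to a per-channel
    scale that modulates the layer input (a simplified stand-in for g_w^i).
    """
    phi = fourier_features(x)
    feats = {}
    for i, (weight, affine) in enumerate(layers, start=1):
        scale = 1.0 + w @ affine                   # style-dependent scale
        phi = leaky_relu((phi * scale) @ weight)
        feats[i] = phi                             # phi_w^i(x)
    # Both heads read from the same layer stack, so the first
    # min(n_sigma, n_c) layers are shared between density and color.
    density = np.maximum(feats[n_sigma] @ h_sigma, 0.0)[..., 0]
    color = 1.0 / (1.0 + np.exp(-(feats[n_c] @ h_c)))
    return density, color

def init_params(rng, style_dim=8, width=32, depth=4, num_freqs=4):
    """Random parameters, for illustration only."""
    dims = [3 * 2 * num_freqs] + [width] * depth
    layers = [(rng.normal(0, 0.3, (a, b)), rng.normal(0, 0.1, (style_dim, a)))
              for a, b in zip(dims[:-1], dims[1:])]
    return layers, rng.normal(0, 0.3, (width, 1)), rng.normal(0, 0.3, (width, 3))

rng = np.random.default_rng(0)
layers, h_sigma, h_c = init_params(rng)
points = rng.uniform(-1, 1, (5, 3))
density, color = style_nerf_point(points, rng.normal(size=8),
                                  layers, h_sigma, h_c, n_sigma=2, n_c=4)
```

Here the density head taps the features after layer 2 and the color head after layer 4, so the first two layers are the shared part.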

Volume Rendering

We assume the camera lies on the unit sphere and points at the origin with a fixed field of view (FOV).
The camera pitch and yaw are sampled from a uniform or Gaussian distribution, depending on the dataset.
The image I is then rendered with the standard NeRF volume rendering formula.
- As in NeRF, stratified and hierarchical sampling are used.
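The rendering steps above can be sketched for a single ray as below. This is a minimal NumPy sketch under assumptions: the Gaussian standard deviations, the axis convention, and the function names are illustrative, the compositing is the usual discrete NeRF quadrature, and hierarchical sampling is omitted.

```python
import numpy as np

def sample_camera(rng, yaw_std=0.3, pitch_std=0.15):
    """Sample a camera on the unit sphere that looks at the origin."""
    yaw = rng.normal(0.0, yaw_std)
    pitch = rng.normal(0.0, pitch_std)
    position = np.array([np.cos(pitch) * np.sin(yaw),
                         np.sin(pitch),
                         np.cos(pitch) * np.cos(yaw)])
    return position, -position       # position and viewing direction (both unit length)

def stratified_depths(rng, near, far, num_samples):
    """Draw one uniformly jittered depth from each equal-width bin in [near, far]."""
    edges = np.linspace(near, far, num_samples + 1)
    return edges[:-1] + rng.random(num_samples) * (edges[1:] - edges[:-1])

def composite(sigma, values, depths, far):
    """Alpha-composite per-sample values (colors or features) along one ray."""
    deltas = np.diff(depths, append=far)                 # distance to the next sample
    alpha = 1.0 - np.exp(-sigma * deltas)                # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance so far
    weights = trans * alpha
    return weights @ values, weights

rng = np.random.default_rng(0)
origin, direction = sample_camera(rng)
depths = stratified_depths(rng, near=0.5, far=1.5, num_samples=8)
points = origin + depths[:, None] * direction            # sample positions on the ray
```

The `weights` returned by `composite` are the per-sample contributions; they are reused whenever features, rather than colors, are aggregated along a ray.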

Challenges

Rendering an image at the exact target resolution costs much more computation.
Training consumes much more memory to cache the intermediate results for gradient back-propagation.

3.2 APPROXIMATION FOR HIGH-RESOLUTION IMAGE GENERATION

Why 2D image generation is fast

Each pixel only needs a single forward pass through the network.
Image features are generated from coarse to fine, and higher-resolution feature maps have fewer channels, which saves memory.

The first point is partially achieved by aggregating features into 2D space early, before the final colors are computed; the rendering equation (formula 4) is adjusted accordingly.

Upsampling is then used to approximate the high-resolution feature space from the low-resolution one.

Recursively inserting upsampling operators enables efficient high-resolution image synthesis, because the computationally expensive volume rendering only has to produce a low-resolution feature map.
Using fewer channels at higher resolutions improves efficiency further.
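The effect of early aggregation can be illustrated with a small NumPy sketch. The function names and the toy color heads are assumptions for the example; the per-sample `weights` stand for the volume rendering weights along one ray.

```python
import numpy as np

def render_per_sample(weights, feats, color_head):
    """NeRF path: run the color head at every sample, then composite the colors."""
    return weights @ color_head(feats)            # head evaluated S times per ray

def render_early_aggregation(weights, feats, color_head):
    """Approximate path: composite the features first, then run the head once."""
    return color_head(weights @ feats)            # head evaluated once per ray

rng = np.random.default_rng(0)
weights = rng.random(16)
weights /= weights.sum()                          # toy rendering weights for 16 samples
feats = rng.normal(size=(16, 8))                  # per-sample features
W = rng.normal(size=(8, 3))

linear_head = lambda f: f @ W                     # linear projection, no bias
mlp_head = lambda f: np.tanh(f @ W)               # stand-in for a non-linear head

exact = render_per_sample(weights, feats, linear_head)
approx = render_early_aggregation(weights, feats, linear_head)
gap = np.abs(render_per_sample(weights, feats, mlp_head)
             - render_early_aggregation(weights, feats, mlp_head))
```

For a bias-free linear head the two paths agree by linearity, so `exact` and `approx` match up to rounding. With a non-linear head they generally differ (`gap` is not zero), which is why early aggregation is an approximation and one source of the inconsistency discussed next.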

Although early aggregation and upsampling accelerate rendering for high-resolution image synthesis, they break NeRF's inherent consistency.
How do the inconsistencies arise?

First, the resulting model contains non-linear transformations that can capture spurious correlations in the 2D observations, mainly when substantial ambiguity exists.
Second, a pixel-space operation such as up-sampling compromises 3D consistency.

3.3 PRESERVING 3D CONSISTENCY

Upsampler design

We balance consistency and image quality by combining two approaches, a learnable upsampler and a fixed interpolation filter (see Figure 2).
For any input feature map $X \in \mathbb{R}^{N \times N \times D}$, the upsampler uses two components:

$\psi_{\theta}: \mathbb{R}^D \rightarrow \mathbb{R}^{4D}$ is a learnable 2-layer MLP.
$K$ is a fixed blur kernel.
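A possible reading of this upsampler, sketched in NumPy: repeat the input four times along the channel axis, add the learned residual from the MLP, rearrange channels into a 2×2 sub-pixel grid (pixel shuffle), and smooth with the fixed kernel. The order of operations, the `[1, 2, 1]` kernel, the edge padding, and the function names are assumptions for this sketch, not taken from the authors' code.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange (N, M, r*r*D) channels into an (r*N, r*M, D) feature map."""
    n, m, c = x.shape
    d = c // (r * r)
    x = x.reshape(n, m, r, r, d)                  # split channels into an r x r grid
    return x.transpose(0, 2, 1, 3, 4).reshape(n * r, m * r, d)

def blur(x, kernel_1d=(1.0, 2.0, 1.0)):
    """Depthwise blur with a fixed, normalized, separable kernel (edge padding)."""
    k = np.outer(kernel_1d, kernel_1d)
    k /= k.sum()
    pad = len(kernel_1d) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = x.shape[:2]
    out = np.zeros_like(x)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * xp[i:i + h, j:j + w]
    return out

def upsample(x, psi, kernel_1d=(1.0, 2.0, 1.0)):
    """Double the resolution of x (N, N, D) with a fixed path plus a learned residual."""
    repeated = np.tile(x, (1, 1, 4))              # fixed path: nearest-neighbor copy
    learned = psi(x)                              # learnable path: (N, N, 4D)
    return blur(pixel_shuffle(repeated + learned), kernel_1d)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (8, 16)), rng.normal(0, 0.1, (16, 32))
psi = lambda f: np.maximum(f @ W1, 0.0) @ W2      # toy 2-layer MLP, R^8 -> R^32
high_res = upsample(rng.normal(size=(4, 4, 8)), psi)   # shape (8, 8, 8)
```

With `psi` returning zeros, the result reduces to a blurred nearest-neighbor upsampling, so the MLP only has to learn a correction on top of a smooth baseline.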

NeRF path regularization

The model output is regularized to match the original NeRF path (Equation (4)).
This is done by subsampling pixels of the output and comparing them with the pixels generated by NeRF.

In the loss, S is the set of randomly sampled pixels.
$R_{in}$ and $R_{out}$ are the rays of the corresponding pixels in the low-resolution image generated by NeRF and in the high-resolution image generated by StyleNeRF.
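A minimal sketch of such a regularizer, assuming a mean squared difference over the sampled pixels. `nerf_render` is a hypothetical callback that renders the exact NeRF color for the rays through the given pixels; it stands in for the un-approximated path and is not a real API.

```python
import numpy as np

def nerf_path_loss(approx_image, nerf_render, num_pixels, rng):
    """Compare randomly sampled output pixels with NeRF-rendered pixels.

    approx_image: (H, W, 3) image from the fast, upsampled path.
    nerf_render:  callable (rows, cols) -> (num_pixels, 3), hypothetical.
    """
    h, w, _ = approx_image.shape
    rows = rng.integers(0, h, size=num_pixels)    # the sampled pixel set S
    cols = rng.integers(0, w, size=num_pixels)
    target = nerf_render(rows, cols)
    diff = approx_image[rows, cols] - target
    return np.mean(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
biased_nerf = lambda r, c: image[r, c] + 0.5      # toy stand-in that is off by 0.5
loss = nerf_path_loss(image, biased_nerf, num_pixels=128, rng=rng)
```

Because only a small pixel set is rendered through the exact path, the extra cost per training step stays far below rendering a full high-resolution image with NeRF.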

Remove view direction condition

Predicting colors with view direction condition would give the model additional freedom to capture spurious correlations and dataset bias, especially if only a single-view target is provided.

We therefore remove the view direction condition to improve consistency, as shown in Figure 8.

Fix 2D noise injection

Studies have shown that injecting per-pixel noise improves a model's ability to capture stochastic variation (such as hair and stubble).
By default, we remove noise injection, trading away some of the model's ability to capture such variation in exchange for consistency.
We also propose a novel geometry-aware noise injection based on the surface estimated by StyleNeRF (see Appendix A3).

3.4 StyleNeRF framework

Mapping Network

Latent codes are sampled from a standard Gaussian distribution and processed by the mapping network. The output vector is then broadcast to the synthesis network.
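As a rough sketch of this step: a latent code is drawn from a standard Gaussian, passed through a small MLP, and the resulting style vector is shared by every layer of the synthesis network. The network depth, width, and activation here are assumptions for illustration.

```python
import numpy as np

def mapping_network(z, weights, biases, slope=0.2):
    """f: Z -> W, a plain MLP with leaky ReLU activations."""
    h = z
    for w_mat, b in zip(weights, biases):
        h = h @ w_mat + b
        h = np.where(h > 0, h, slope * h)
    return h

def broadcast_styles(w, num_layers):
    """Give every synthesis layer the same style vector: (B, D) -> (B, L, D)."""
    return np.broadcast_to(w[:, None, :], (w.shape[0], num_layers, w.shape[1]))

rng = np.random.default_rng(0)
dim = 16
weights = [rng.normal(0, 0.3, (dim, dim)) for _ in range(2)]
biases = [np.zeros(dim) for _ in range(2)]
z = rng.standard_normal((4, dim))                 # batch of latent codes
styles = broadcast_styles(mapping_network(z, weights, biases), num_layers=6)
```

Keeping one style slot per layer is what later makes style mixing possible: different layers can be fed style vectors from different latent codes.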

Synthesis Network

We use NeRF++ as the backbone of StyleNeRF.
NeRF++ consists of a foreground NeRF inside a unit sphere and a background NeRF parameterized with an inverted sphere.
Two MLPs predict density; the background (BG) MLP has fewer parameters than the foreground (FG) one.
A shared MLP then predicts color.
Each style-conditioned block consists of an affine transformation layer and a 1×1 convolution layer (Conv).
The Conv weights are modulated by the affine-transformed style.
Leaky ReLU is used as the non-linear activation.
The number of blocks depends on the resolutions of the input and the target image.
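One style-conditioned block might look like the NumPy sketch below. The modulation and demodulation follow the StyleGAN2 recipe, which is an assumption here; the notes only state that the Conv weights are modulated by the affine-transformed style. Names and shapes are illustrative.

```python
import numpy as np

def style_block(x, w, affine_w, affine_b, conv_w, eps=1e-8, slope=0.2):
    """Affine layer + modulated 1x1 convolution + leaky ReLU.

    x: (H, W, C_in) feature map, w: (style_dim,) style vector,
    affine_w: (style_dim, C_in), affine_b: (C_in,), conv_w: (C_in, C_out).
    """
    s = w @ affine_w + affine_b                   # per-input-channel scale from the style
    mod = conv_w * s[:, None]                     # modulate the convolution weights
    mod = mod / np.sqrt(np.sum(mod ** 2, axis=0, keepdims=True) + eps)  # demodulate
    y = x @ mod                                   # a 1x1 conv is a per-pixel matmul
    return np.where(y > 0, y, slope * y)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 6))
w = rng.normal(size=5)
affine_w, affine_b = rng.normal(size=(5, 6)), np.zeros(6)
conv_w = rng.normal(size=(6, 10))
y = style_block(x, w, affine_w, affine_b, conv_w)  # shape (4, 4, 10)
```

Since a 1×1 convolution touches each pixel independently, the same block can be applied to individual 3D sample points before aggregation and to 2D feature maps after it.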

Discriminator & Objectives

StyleNeRF uses the non-saturating GAN objective with R1 regularization.
The new NeRF path regularization is added to improve 3D consistency.
The final loss function combines these terms.

G is the generator, which includes the mapping and synthesis networks.
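The pieces of this objective can be sketched as plain functions of the discriminator outputs. This is a sketch under assumptions: the logits and the gradient of D with respect to real images are taken as given arrays, and the weights `gamma` and `beta` are placeholder values, not the settings used in the paper.

```python
import numpy as np

def softplus(u):
    return np.logaddexp(0.0, u)                   # log(1 + exp(u)), numerically stable

def generator_loss(d_fake):
    """Non-saturating generator loss on discriminator logits for fake images."""
    return np.mean(softplus(-d_fake))

def r1_penalty(grad_real):
    """Mean squared gradient norm of D with respect to real images."""
    return np.mean(np.sum(grad_real.reshape(len(grad_real), -1) ** 2, axis=1))

def discriminator_loss(d_real, d_fake, grad_real, gamma):
    gan = np.mean(softplus(d_fake)) + np.mean(softplus(-d_real))
    return gan + 0.5 * gamma * r1_penalty(grad_real)

def total_generator_loss(d_fake, nerf_path_term, beta):
    """GAN term plus the weighted NeRF path regularization."""
    return generator_loss(d_fake) + beta * nerf_path_term

d_real, d_fake = np.array([2.0, 1.5]), np.array([-1.0, 0.5])
grad_real = np.zeros((2, 3, 8, 8))                # would come from autodiff in practice
loss_d = discriminator_loss(d_real, d_fake, grad_real, gamma=10.0)
loss_g = total_generator_loss(d_fake, nerf_path_term=0.1, beta=0.2)
```

In a real training loop these would be written in an autodiff framework so that the R1 gradient and the parameter updates are computed automatically.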

Progressive training

Training starts at low resolution and progresses to high resolution.
We propose a new three-stage progressive training strategy:

For the first T1 images, the model is trained at low resolution without approximation.
From T1 to T2 images, the generator and the discriminator increase their output resolution until the target resolution is reached.
Finally, the architecture is fixed and the model continues training at high resolution until T3 images.
See Appendix A4 for details.
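The three stages can be expressed as a small schedule function. The exact growth rule is not given in these notes, so this sketch assumes the resolution doubles at evenly spaced points between T1 and T2; the function name and the target resolution of 256 are illustrative.

```python
def resolution_at(images_seen, t1, t2, start_res, target_res):
    """Output resolution of generator and discriminator after images_seen images."""
    if images_seen < t1:                          # stage 1: low resolution only
        return start_res
    if images_seen >= t2:                         # stage 3: fixed architecture
        return target_res
    doublings = (target_res // start_res).bit_length() - 1
    step = (images_seen - t1) * doublings // (t2 - t1) + 1
    return start_res * 2 ** min(step, doublings)  # stage 2: progressive growth

# Image counts in thousands, from 32x32 up to an assumed 256x256 target.
schedule = [resolution_at(k, t1=500, t2=5000, start_res=32, target_res=256)
            for k in (0, 500, 2000, 3500, 25000)]
```

Training simply stops once T3 images have been seen; between T2 and T3 the function keeps returning the target resolution.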

Experiments

StyleNeRF is evaluated on FFHQ, MetFaces, AFHQ, and CompCars.
Baselines:

HoloGAN
GRAF
pi-GAN
GIRAFFE
Batch size 64, T1 = 500k, T2 = 5000k, T3 = 25000k.
The input resolution is fixed to 32×32.

Results

High-resolution synthesis

Controllable image synthesis
Camera control (the results here are not very good).
Style mixing and interpolation.

Important references

Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453–11464, 2021.

Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.