Knowledge Distillation Thesis Learning
2022-08-10 05:46:00 【a program circle】
Supervised Learning
Supervised learning: the training set contains not only samples but also the labels corresponding to those samples; that is, samples and labels appear in pairs. The goal of supervised learning is to learn an effective sample-to-label mapping from the training samples so that it can predict the labels of unseen samples. Common supervised learning methods include neural networks and support vector machines.
All regression and classification algorithms are supervised learning. The difference between regression and classification lies in the type of output variable: quantitative output (continuous-variable prediction) is called regression, while qualitative output (discrete-variable prediction) is called classification.
Unsupervised Learning
Unsupervised learning: the training samples carry no labels. The category of each sample is not known in advance; similar samples are grouped together by some measure (e.g., clustering).
Semi-Supervised Learning
Semi-supervised learning: with only a small number of labeled training samples, the learner starts from the knowledge obtained from those samples and, combining it with the distribution of the unlabeled test samples, gradually corrects its existing knowledge and predicts the categories of the test samples.
Knowledge Distillation
Knowledge distillation (KD) is a common method for model compression. A lightweight small model is trained under the supervision signal of a larger model with better performance, so that the small model achieves better performance and accuracy. The large model is called the teacher model and the small model is called the student model.
Classification of Knowledge Distillation: Offline Distillation, Semi-Supervised Distillation, Self-Supervised Distillation
The core idea of knowledge distillation is to first train a complex network model, and then train a smaller network using both the output of this complex network and the true labels of the data. A knowledge distillation framework therefore usually includes a complex model (the teacher) and a small model (the student).
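To make the framework concrete, below is a minimal sketch of the usual distillation loss (in the spirit of Hinton et al.'s soft-target formulation), assuming PyTorch; the function name `distillation_loss` and the values of the temperature `T` and mixing weight `alpha` are illustrative choices, not taken from the text above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps gradient magnitudes comparable
    # Hard-target term: ordinary cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

During training, the teacher's logits are computed with gradients disabled (e.g., inside `torch.no_grad()`), so only the student's parameters are updated.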
Features of Knowledge Distillation
1. Improve model accuracy
If the user is not satisfied with the accuracy of the current network model A, they can first train a higher-accuracy teacher model B (usually with more parameters and higher latency), and then use this trained teacher model B to perform knowledge distillation on student model A, obtaining a higher-accuracy model.
2. Reduce model latency and compress network parameters.
3. Domain transfer between image labels
Suppose the user trains a teacher model A on a cat-and-dog dataset and a teacher model B on a banana-and-apple dataset. The two models can then be used together to distill a single student model that recognizes dogs, cats, bananas, and apples, integrating and transferring the two datasets from different domains.
4. Reduce the amount of labeling
This is achieved by semi-supervised distillation: the user uses the trained teacher network to distill knowledge into the student on an unlabeled dataset, so that far less human labeling is needed.
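As a hedged illustration of this semi-supervised use, the sketch below (assuming PyTorch; `teacher`, `student`, `unlabeled_loader`, and `optimizer` are hypothetical objects supplied by the user) lets the trained teacher produce soft targets for unlabeled images, which then supervise the student without human labels.

```python
import torch
import torch.nn.functional as F

def distill_unlabeled(teacher, student, unlabeled_loader, optimizer, T=4.0):
    # Hypothetical sketch: the teacher generates soft targets for data
    # that has no human labels, and the student learns from them.
    teacher.eval()
    student.train()
    for images in unlabeled_loader:      # no ground-truth labels needed
        with torch.no_grad():            # teacher provides the supervision
            soft_targets = F.softmax(teacher(images) / T, dim=1)
        loss = F.kl_div(
            F.log_softmax(student(images) / T, dim=1),
            soft_targets,
            reduction="batchmean",
        ) * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```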
Feature Distillation
In deep learning, logits are the outputs of the last layer of the model, i.e., the raw, unnormalized scores, which can then be squashed by softmax or sigmoid.
The range of logits is $(-\infty, +\infty)$.
**Logits: unnormalized scores, generally the input of the softmax layer.** Logits therefore have the same shape as the labels. They can also be used as the input of a sigmoid.
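A small PyTorch snippet may make the shape relationship concrete (the batch size and class count here are made up for illustration):

```python
import torch

logits = torch.randn(8, 10)           # batch of 8, 10 classes: raw, unnormalized scores
probs = torch.softmax(logits, dim=1)  # each row now sums to 1; shape is still (8, 10)
binary = torch.sigmoid(logits)        # element-wise squashing into (0, 1); same shape
print(logits.shape == probs.shape == binary.shape)  # True
```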
Feature distillation differs from the logits method: instead of learning only the result knowledge in the teacher's final logits, the student learns the intermediate-layer features of the teacher network. The correspondence between the teacher's intermediate feature layers and the student's is the knowledge passed to the student.
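A minimal sketch of this idea, assuming PyTorch and written in the spirit of FitNets-style hint learning: a small 1x1-convolution adapter maps the student's intermediate feature map to the teacher's channel width, and an MSE loss pulls the two feature maps together. All module names and feature-map sizes below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Match a student intermediate feature map to the teacher's (hint loss)."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv adapter: lets the student feature map match the
        # teacher's channel dimension before comparison.
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # detach() keeps gradients from flowing into the teacher
        return F.mse_loss(self.adapter(student_feat), teacher_feat.detach())

# Illustrative usage with made-up feature-map sizes:
hint = FeatureDistillLoss(student_channels=64, teacher_channels=256)
s_feat = torch.randn(4, 64, 14, 14)   # student intermediate features
t_feat = torch.randn(4, 256, 14, 14)  # teacher intermediate features
loss = hint(s_feat, t_feat)
```

In practice this hint loss is usually added to the logits-based distillation loss shown earlier, with a weighting coefficient tuned on a validation set.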