Deep learning -- Summary of Feature Engineering
2022-04-23 19:25:00 【Try not to lie flat】
For machine learning, the general workflow is:

Data collection — Data cleaning — Feature engineering — Data modeling

As we know, feature engineering includes feature construction, feature extraction, and feature selection. Feature engineering is essentially the process of transforming raw data into the training data a model can use.
Feature construction
Another blogger's explanation of normalization: https://zhuanlan.zhihu.com/p/424518359

In feature construction, we start from a pile of raw, messy data. The first step is to normalize it, so that the data follows the distribution we want to see. After normalization comes data preprocessing, in particular the handling of missing values, categorical features, and continuous features, as sketched below.
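A hedged pandas sketch of those preprocessing steps; the data frame and column names are invented for illustration:

```python
import pandas as pd

# Toy data; column names are made up for this example
df = pd.DataFrame({
    "age":  [25.0, None, 40.0, 31.0],                  # continuous, one missing value
    "city": ["Beijing", "Shanghai", "Beijing", None],  # categorical, one missing value
})

# Missing values: fill continuous columns with the median, categorical with a placeholder
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("unknown")

# Categorical features: one-hot encode
df = pd.get_dummies(df, columns=["city"])

# Continuous features: discretize into bins
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

print(df)
```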
Data normalization methods: min-max normalization and Z-score standardization.

So what is the biggest difference between them? Whether they change the distribution of the feature data:

- Min-max normalization: changes the distribution of the feature data
- Z-score standardization: does not change the distribution of the feature data
Min-max normalization:

- A linear transformation maps the raw data into the range [0, 1]; the result of the computation is the normalized data, where X is the raw value (see the formula after this list)
- This normalization method is better suited to cases where the values are concentrated
- Drawback: if max and min are unstable, the normalized result is easily unstable too, which makes downstream results unstable; empirical constants can be used in place of max and min
- Application scenarios: when distance measures or covariance computations are involved, or when the data does not follow a normal distribution, this method (or another normalization method, excluding Z-score) can be used. For example, in image processing, converting an RGB image to grayscale confines its values to the range [0, 255]
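The min-max formula described by the first bullet (presumably shown as an image in the original post) is:

```latex
X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
```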
Z-score standardization:

- Here μ and σ are the mean and standard deviation of the original data set, and the standardized value is z = (X − μ) / σ
- It standardizes the original data set into one with mean 0 and variance 1
- This method requires that the distribution of the original data is approximately Gaussian; otherwise the standardization performs poorly
- Application scenarios: in classification and clustering algorithms, when distance is used to measure similarity, or when PCA is used for dimensionality reduction, Z-score standardization performs better
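A minimal numpy sketch of both methods, following the formulas above; the function names are mine, not from the original post:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Linearly rescale x into [0, 1]: (x - min) / (max - min)."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

def z_score_standardize(x: np.ndarray) -> np.ndarray:
    """Shift and scale x to mean 0, variance 1: (x - mu) / sigma."""
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0, 10.0])
print(min_max_normalize(x))    # values in [0, 1]
print(z_score_standardize(x))  # mean ~0, std ~1
```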
Feature extraction

Under feature extraction, we first look at data partitioning: what the data sets are, and, given a pile of data, how to split it. There is also the important family of dimensionality-reduction methods, above all PCA. Other methods such as ICA exist, but since they are not on my final exam I won't take detailed notes on them, hahaha.
Data sets: training set, validation set, test set

- Training set: the data used for training; it adjusts model parameters, trains the model weights, and builds the machine learning model
- Validation set: data held out from the training set to check the model's performance; it serves as the evaluation metric for the model
- Test set: new data the model has never trained on, used to verify the quality of the trained model
Split methods: the hold-out method and K-fold cross validation (see the sketch after this list)

- Hold-out method: split the data set into mutually exclusive sets while keeping the data distribution of the splits consistent
- K-fold cross validation: split the data set into K mutually exclusive subsets of similar size, each preserving the consistency of the data distribution
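A small scikit-learn sketch of both split strategies; the split ratio and K value are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Hold-out method: one split into mutually exclusive train / test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K-fold cross validation: K mutually exclusive subsets of similar size;
# each fold takes a turn as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={train_idx}, val={val_idx}")
```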
To turn raw data into features with clear physical or statistical meaning, we need to construct new features; the usual methods are PCA, ICA, LDA, and so on.
So why do we need to reduce the dimensionality of features?

- Eliminate noise
- Compress the data
- Eliminate data redundancy
- Improve the accuracy of the algorithm
- Reduce the data to 2 or 3 dimensions so that it can be visualized
PCA (principal component analysis): transform the coordinate axes to find the optimal subspace of the data distribution (a sketch follows this list)

- Input the original data with shape (m, n); the original n feature vectors span an n-dimensional space
- Decide K, the number of feature dimensions to keep after reduction
- Through some transformation (matrix decomposition), find n new feature vectors, i.e. a new n-dimensional space V
- Compute the values of the original data on the n new feature vectors of the space V, i.e. map the data into the new space
- Keep the top K most informative features and drop the unselected ones, successfully reducing the n-dimensional space to K dimensions
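A minimal numpy sketch following the five steps above, via eigendecomposition of the covariance matrix (in practice sklearn.decomposition.PCA does the same job); the data here is random:

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Reduce X of shape (m, n) to (m, k) via principal component analysis."""
    X_centered = X - X.mean(axis=0)          # center each feature
    cov = np.cov(X_centered, rowvar=False)   # (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]        # sort components by explained variance
    V = eigvecs[:, order[:k]]                # top-K directions: the new feature space
    return X_centered @ V                    # map the data into the new space

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca(X, k=2)
print(X_reduced.shape)  # (100, 2)
```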
For feature selection there are several approaches: filter, wrapper, and embedded (a general understanding is enough; a sketch of each follows).
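For concreteness, one hedged scikit-learn sketch per family; the dataset and parameter values are placeholders, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Filter: score each feature independently (here, ANOVA F-test), keep the top k
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: repeatedly train a model and eliminate the weakest features
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: selection happens inside model training (L1-penalized coefficients)
l1_model = LogisticRegression(penalty="l1", solver="liblinear")
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```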
Finally, let's look at the difference between hyperparameters and parameters:

- Hyperparameters: parameters set before model training begins, chosen by hand, e.g. padding, stride, the k in k-means, network depth, the number and size of convolution kernels, and the learning rate
- Parameters: values obtained through model training, such as the weight w and the bias b in wx + b.
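A tiny gradient-descent sketch to make the distinction concrete: the learning rate and epoch count are hand-set hyperparameters, while w and b in wx + b are the parameters that training produces. All values are illustrative:

```python
import numpy as np

# Hyperparameters: chosen by a human before training starts
learning_rate = 0.1
epochs = 500

# Parameters: initialized arbitrarily, then learned from the data
w, b = 0.0, 0.0

# Toy data generated from y = 2x + 1
x = np.linspace(0, 1, 50)
y = 2 * x + 1

for _ in range(epochs):
    y_pred = w * x + b
    grad_w = 2 * np.mean((y_pred - y) * x)  # d(MSE)/dw
    grad_b = 2 * np.mean(y_pred - y)        # d(MSE)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # approaches 2 and 1
```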
Copyright notice

This article was created by [Try not to lie flat]. Please include the original link when reprinting, thank you.

https://yzsam.com/2022/04/202204231859372120.html