Interview Summary: Feature Engineering
2022-04-23 13:16:00 【DCGJ666】
- What does feature engineering include
- How to handle missing values
- Handling class imbalance
- Why NaN appears
- Feature screening: how to find and remove highly similar features
- How to handle data with millions or hundreds of millions of features in deep learning
- What methods are there for computing the correlation between features?
What does feature engineering include
- Data preprocessing
  1. Handling missing values
  2. Image data augmentation
  3. Handling outliers
  4. Handling class imbalance
- Feature scaling
  1. Normalization
  2. Regularization
- Feature encoding
  1. Ordinal encoding
  2. One-hot encoding
  3. Binary encoding
  4. Discretization
- Feature selection
  1. Filter: features are selected on the dataset first, and the process is independent of the subsequent learner. In other words, some statistic is designed to filter the features without considering the downstream learner. Examples: variance threshold, chi-square test, mutual information.
  2. Wrapper: the performance of the subsequent learner (in practice, a classifier) is used directly as the criterion for evaluating a feature subset. Example: the Las Vegas Wrapper (LVW) algorithm.
  3. Embedded: the learner selects features on its own during training. Examples: selection based on penalty terms, and tree-based selection such as GBDT.
- Feature extraction
  1. Dimensionality reduction
  2. Image feature extraction
  3. Text feature extraction
- Feature construction
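As a rough illustration of the scaling and encoding items above, here is a minimal sketch in plain Python. The function names and the toy data are assumptions made for this example, not part of the original notes. In practice a library would normally handle this.

```python
def min_max_scale(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ordinal_encode(values, order):
    """Ordinal encoding: map each category to its position in `order`."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """One-hot encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy data.
heights = [150.0, 160.0, 170.0, 190.0]
scaled = min_max_scale(heights)

sizes = ["small", "large", "medium", "small"]
size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])

colors = ["red", "green", "red", "blue"]
color_categories, color_rows = one_hot_encode(colors)
```

Ordinal encoding suits categories with a natural order (small < medium < large), while one-hot encoding avoids implying an order between unrelated categories such as colors.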
How to handle missing values
- Use the feature with missing values directly: worth trying when only a small number of samples lack the feature.
- Delete the feature with missing values: generally appropriate when most samples lack the feature and only a small number of valid values remain.
- Fill in the missing values by imputation:
  - Mean, mode, median, fixed value, manual filling, nearest-neighbor filling
  - Model-based prediction: regression, decision trees
  - High-dimensional mapping, compressed sensing
  - Multiple imputation
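The simple statistical fills listed above can be sketched as follows. This is a minimal example using only the standard library. The `impute` and `missing_fraction` helpers and the sample column are hypothetical names introduced for illustration, and missing values are assumed to be represented as NaN.

```python
import math
import statistics


def missing_fraction(values):
    """Share of entries that are NaN; useful when deciding whether to drop a feature."""
    return sum(math.isnan(v) for v in values) / len(values)


def impute(values, strategy="mean", fill_value=0.0):
    """Replace NaN entries using a statistic of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]


# Hypothetical feature column with one missing entry.
column = [1.0, float("nan"), 2.0, 2.0, 7.0]

fraction = missing_fraction(column)
mean_filled = impute(column, "mean")
median_filled = impute(column, "median")
mode_filled = impute(column, "mode")
constant_filled = impute(column, "constant", fill_value=-1.0)
```

The mean is pulled toward the outlying value 7.0, while the median and mode are not, which is one reason the choice of fill statistic matters for skewed features.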
Handling class imbalance
- Enlarge the dataset
- Try other evaluation metrics
- Resample the dataset
  - Oversampling (over-sampling): sample the minority class to increase its number of samples, so that more samples are drawn than the class originally contains
  - Undersampling (under-sampling): sample the majority class to reduce its number of samples, so that fewer samples are drawn than the class originally contains
- Try different classification algorithms: decision trees, for example, often perform well on class-imbalanced data
- Try penalizing the model: if the task is to identify the minority classes, increase the weight of minority-class samples in the classifier and reduce the weight of majority-class samples; focal loss is one example
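The resampling and penalty ideas above can be sketched in a few lines. This is an illustrative example under stated assumptions: the function names are hypothetical, the class weights use the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, and the focal loss is the binary form with the commonly quoted defaults `gamma=2.0` and `alpha=0.25`.

```python
import math
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until every class is equally large."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_samples.extend(xs + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical dataset: 8 majority samples (label 0), 2 minority samples (label 1).
samples = list(range(10))
labels = [0] * 8 + [1] * 2

balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
weights = balanced_class_weights(labels)
easy_loss = binary_focal_loss(0.9, 1)
hard_loss = binary_focal_loss(0.1, 1)
```

Oversampling changes the data and class weighting changes the loss. Both push the model to pay more attention to the minority class, and they can be combined.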
Why NaN appears
- NaN means "not a number". Operations such as 0/0, Inf/Inf, Inf-Inf, and Inf*0 have indeterminate results, so they produce NaN.
- During data processing: in practical engineering, data is often missing or incomplete, and the missing entries can be set to NaN.
- When reading data: if a field is not valid numeric data, it is treated as NaN.
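The cases above can be reproduced with Python floats. This is a small sketch, and `parse_number` is a hypothetical helper written for this example. One caveat: plain Python raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning NaN, so that case is omitted here.

```python
import math

inf = math.inf

# Indeterminate floating-point operations produce NaN.
indeterminate = {
    "inf / inf": inf / inf,
    "inf - inf": inf - inf,
    "inf * 0": inf * 0.0,
}

# NaN is the only float that is not equal to itself, so test with math.isnan.
nan_equals_itself = math.nan == math.nan


def parse_number(token):
    """Parse a raw field; anything that is not a valid number becomes NaN."""
    try:
        return float(token)
    except ValueError:
        return math.nan


raw_fields = ["3.5", "n/a", "", "42"]
parsed = [parse_number(field) for field in raw_fields]
```

Because NaN propagates through arithmetic, a single unparsed field can turn a whole loss value into NaN, which is why checking inputs with `math.isnan` early is usually worthwhile.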
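Feature screening: how to find and remove highly similar features
Use a filter method from feature selection: either the variance threshold method or the correlation coefficient method. For example, compute the pairwise correlation coefficients and drop one feature from each pair whose absolute correlation exceeds a chosen threshold.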
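A minimal sketch of such a filter is shown below, assuming features are stored as named columns of equal length. The thresholds, the function names, and the greedy keep-the-first-seen rule are illustrative choices, not the only way to do this.

```python
import math


def variance(column):
    """Population variance of one feature column."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_features(columns, min_variance=0.0, max_abs_corr=0.95):
    """Drop near-constant features, then features too similar to one already kept.

    Assumes min_variance >= 0 so that pearson never divides by zero.
    """
    kept = []
    for name, column in columns.items():
        if variance(column) <= min_variance:
            continue  # variance threshold: the feature barely changes
        if any(abs(pearson(column, columns[k])) > max_abs_corr for k in kept):
            continue  # correlation filter: redundant with a kept feature
        kept.append(name)
    return kept


# Hypothetical feature table.
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_times_2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "constant": [5.0, 5.0, 5.0, 5.0],    # zero variance
    "b": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_features(features)
```

Here `a_times_2` is dropped as a duplicate of `a`, `constant` is dropped for having zero variance, and `b` survives because its correlation with `a` is only -0.4.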
How to handle data with millions or hundreds of millions of features in deep learning
With many features and little data, the model overfits easily. Options:
- Dimensionality reduction: PCA or LDA
- Regularization: L1 or L2
- Sample augmentation
- Feature selection: remove unimportant features
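To make the difference between L1 and L2 regularization concrete, the sketch below applies the shrinkage (proximal) step of each penalty to a weight vector. This is an illustration under assumptions rather than a training loop: the weights and the strength `lam` are made up, and the L2 penalty is taken in the form `(lam / 2) * w**2`.

```python
def l1_shrink(weights, lam):
    """Soft-thresholding: the shrinkage step associated with the penalty lam * |w|."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - lam, 0.0)
        shrunk.append(magnitude if w >= 0 else -magnitude)
    return shrunk


def l2_shrink(weights, lam):
    """Shrinkage step associated with the penalty (lam / 2) * w**2."""
    return [w / (1.0 + lam) for w in weights]


# Hypothetical weights: two are small enough to be treated as noise.
weights = [0.05, -0.3, 2.0, -0.02]
lam = 0.1

l1_result = l1_shrink(weights, lam)
l2_result = l2_shrink(weights, lam)

l1_zeroed = sum(1 for w in l1_result if w == 0.0)
l2_zeroed = sum(1 for w in l2_result if w == 0.0)
```

L1 sets the two small weights exactly to zero, which effectively removes those features, while L2 only scales every weight down. This is why L1 is often paired with feature selection when the feature count is very large.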
What methods are there for computing the correlation between features?
- Pearson correlation coefficient: computed on interval-scale continuous variables. It measures linear correlation and takes values between -1 and 1.
- Spearman rank correlation coefficient: a measure of the statistical dependence between two variables. It assesses how well the relationship between the two variables can be described by a monotonic function.
- Kendall correlation coefficient: a rank-based statistic that measures the correlation between two random variables by comparing the numbers of concordant and discordant pairs.
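The three coefficients can be compared on a small example. The sketch below implements them directly in plain Python for illustration. The function names are assumptions, and Kendall's coefficient is computed in its tau-b form, which accounts for ties. In real projects a statistics library would usually be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: based on concordant and discordant pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + tied_x) * (pairs + tied_y))


# Monotonic but nonlinear relationship: y = x ** 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
r_kendall = kendall_tau_b(x, y)
```

For this data the Pearson coefficient is below 1 because the relationship is not linear, while the Spearman and Kendall coefficients both reach 1 because the relationship is perfectly monotonic.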
Copyright notice
This article was written by [DCGJ666]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204230611343376.html