logistic regression model - based on R
2022-08-08 14:00:00 【Ah Qiangzhen】
Logistic regression, also called logistic regression analysis, is a generalized linear regression model commonly used in data mining, automated disease diagnosis, economic forecasting, and other fields. For example, it can be used to explore the risk factors for a disease and to predict the probability of the disease occurring based on those risk factors. Take the analysis of gastric cancer as an example: we select two groups of people, one with gastric cancer and one without; the two groups will necessarily differ in physical signs, lifestyle, and so on. The dependent variable is therefore whether a person has gastric cancer, taking the values "yes" or "no", while the independent variables can include many factors, such as age, sex, eating habits, and Helicobacter pylori infection. The independent variables can be either continuous or categorical. Logistic regression analysis then yields weights for the independent variables, giving a rough picture of which factors are risk factors for gastric cancer; these weights can in turn be used to predict a person's probability of developing the disease.
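Under the hood, logistic regression maps a linear combination of the predictors through the logistic (sigmoid) function to a probability between 0 and 1. A minimal sketch in R (the coefficients b0 and b1 are illustrative values, not fitted from any data):

```r
# logistic function: p = 1 / (1 + exp(-(b0 + b1 * x)))
logistic <- function(x, b0 = -3, b1 = 0.8) {
  1 / (1 + exp(-(b0 + b1 * x)))
}
logistic(0)   # small probability: the linear predictor is -3
logistic(10)  # probability close to 1: the linear predictor is 5
```

A fitted model simply replaces b0 and b1 with estimated coefficients, one per predictor.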
Data understanding and preparation
library(MASS)
data(biopsy)
biopsy
The dataset contains 699 tissue samples from patients, stored in a data frame with 11 variables, as follows:
- ID: sample ID
- V1: clump thickness
- V2: uniformity of cell size
- V3: uniformity of cell shape
- V4: marginal adhesion
- V5: single epithelial cell size
- V6: bare nuclei (16 observations are missing)
- V7: bland chromatin
- V8: normal nucleoli
- V9: mitoses
- class: tumor diagnosis, benign or malignant: this is the outcome variable we want to predict
First, delete the sample ID column:
biopsy$ID <- NULL
1. Handling missing values
Since there are only 16 missing values, accounting for about 2% of all observations, we simply remove them:
biopsy <- na.omit(biopsy)
Of course, you can also use imputation to handle missing values; see this article on handling missing values.
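As an alternative to deletion, here is a minimal sketch of median imputation for V6, the only variable with missing values (the median is an illustrative choice; more sophisticated imputation methods exist):

```r
library(MASS)
data(biopsy)
# replace the 16 missing V6 values with the median of the observed values
biopsy$V6[is.na(biopsy$V6)] <- median(biopsy$V6, na.rm = TRUE)
sum(is.na(biopsy$V6))  # now 0
```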
2. Coding the outcome as a dummy variable
We assign the two levels of class, malignant and benign, the values 1 and 0 respectively, using the ifelse function:
y <- ifelse(biopsy$class=="malignant",1,0);y
3. Boxplots
library(tidyverse)
library(reshape2)
biopsy1 <- melt(biopsy,id.var="class");biopsy1
# boxplots
ggplot(biopsy1,aes(class,value))+geom_boxplot(aes(color=variable))+
facet_wrap(~variable,ncol=3)
4. Correlation analysis
library(corrplot)
cor(biopsy[,1:9])%>% corrplot.mixed()
The correlation coefficients show that we will have a collinearity problem; in particular, the correlation between V2 and V3 is as high as 0.91, indicating obvious collinearity.
Splitting into training and test sets
There are various ways to partition the data into training and test sets: 50/50, 60/40, 70/30, 80/20, and so on. You should use your own experience and judgment to choose the split. In this example, I use a 70/30 split, as follows:
set.seed(123)
ind <- sample(2, nrow(biopsy), replace = TRUE, prob = c(0.7, 0.3))  # note: nrow(biopsy), not the melted biopsy1
ind
train <- biopsy[ind==1,];train
test <- biopsy[ind==2,];test
To check that the outcome variable is balanced between the two datasets, we run:
table(train$class)
table(test$class)
The benign/malignant ratio is similar in the two sets, which is acceptable.
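The comparison is easier to read as proportions than as raw counts; assuming the train and test sets created above, prop.table converts the counts:

```r
prop.table(table(train$class))  # share of benign vs malignant in the training set
prop.table(table(test$class))   # and in the test set
```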
Model building and evaluation
1. The logistic regression model
fit <- glm(class~.,family=binomial,data=train)
summary(fit)
#Check for multicollinearity
library(car)
vif(fit)
Since all VIF values are below 5, serious multicollinearity can be ruled out.
2. Model performance on the training and test datasets
The first thing to do is create a vector of predicted probabilities:
train.probs <- predict(fit,type="response")
The next step is to evaluate the model's performance on the training set, and then how well it fits on the test set. For this we need 0/1 predictions. The default cutoff used to distinguish benign from malignant results is 0.5; that is, when the predicted probability is greater than 0.5, the result is classified as malignant.
train.probs <- ifelse(train.probs >=0.5,1,0)
trainy <- y[ind==1]
testy <- y[ind==2]
#install.packages("caret")
library(caret)
confusionMatrix(table(trainy, train.probs))
The off-diagonal cells of the confusion matrix give the misclassified observations:
error=(7+8)/474;error
The training-set prediction error rate is about 0.03.
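Rather than reading the counts off by hand, the same error rate can be computed directly from the confusion table (this assumes the trainy and train.probs vectors created above):

```r
cm <- table(trainy, train.probs)
1 - sum(diag(cm)) / sum(cm)  # misclassification rate on the training set
```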
Next, the performance on the test set:
length(testy)
test.probs <- predict(fit,newdata=test,type="response") # note: newdata must be supplied for the test set
test.probs <- ifelse(test.probs>=0.5,1,0)
length(test.probs)
confusionMatrix(table(testy, test.probs))
error1 <- (2+3)/209;error1
The test-set prediction error rate is only about 0.02.
Logistic regression with cross-validation
Cross-validation
We split the raw data into a training set, used to fit the model, and a test set, used to measure accuracy. The error measured on the test set is called the validation error. For a given algorithm and dataset, we cannot simply make one random train/test split, obtain one validation error, and use that as the measure of the algorithm's quality, because a single split is subject to chance. Instead, we should repeat the random split many times, compute the validation error on each split, and use the resulting set of validation errors to measure the algorithm's quality more accurately.
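The idea above can be sketched as a simple k-fold cross-validation loop in base R (k = 5 and the random fold assignment are illustrative choices; this assumes the train data frame created earlier):

```r
set.seed(123)
k <- 5
# randomly assign each training row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(train)))
cv.errors <- sapply(1:k, function(i) {
  # fit on all folds except fold i
  fit.i <- glm(class ~ ., family = binomial, data = train[folds != i, ])
  # predict on the held-out fold and convert probabilities to labels
  probs <- predict(fit.i, newdata = train[folds == i, ], type = "response")
  pred  <- ifelse(probs >= 0.5, "malignant", "benign")
  mean(pred != train[folds == i, "class"])  # validation error for fold i
})
mean(cv.errors)  # average validation error over the k folds
```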
For more background, see: Logistic regression explained; the concept of cross-validation.
We will use the bestglm package for subset selection with cross-validation.
#install.packages("bestglm")
library(bestglm)
X <- train[,1:9]
XY <- data.frame(cbind(X,trainy))
bestglm(Xy=XY,IC="BIC",family=binomial)
In the code above, Xy=XY refers to our already formatted data frame, and IC="BIC" tells the program to use the Bayesian information criterion for best-subset selection (bestglm also accepts IC="CV" for cross-validation).
Next, let's look at the predictive performance of the BIC-selected best subset:
bicfit <- glm(class~V1+V4+V6+V8,family=binomial,data=train)
test.bic.probs <- predict(bicfit, newdata = test, type = "response")
test.bic.probs <- ifelse(test.bic.probs >= 0.5, 1, 0)
confusionMatrix(table(testy,test.bic.probs))
You can see that the accuracy is 0.97.