Learning to solve machine learning problems through a Kaggle example

<h4>6. Prediction</h4>

Cross-validation

Normally, the training data should be split into two parts, one used for training and the other for validation; alternatively, k-fold cross-validation can be used. This article trains on all of the training data and then randomly draws 30% of it as a validation set (a k-fold sketch follows the validation output below).

cv.summarize <- function(data.true, data.predict) {
  # Argument orders follow MLmetrics: Recall/Precision take (y_true, y_pred),
  # while Accuracy/AUC take (y_pred, y_true)
  print(paste('Recall:', Recall(data.true, data.predict)))
  print(paste('Precision:', Precision(data.true, data.predict)))
  print(paste('Accuracy:', Accuracy(data.predict, data.true)))
  print(paste('AUC:', AUC(data.predict, data.true)))
}
set.seed(415)
cv.test.sample <- sample(1:nrow(train), as.integer(0.3 * nrow(train)), replace = TRUE) # 30% validation sample; note replace = TRUE samples with replacement
cv.test <- data[cv.test.sample,]
cv.prediction <- predict(model, cv.test, OOB=TRUE, type = "response")
cv.summarize(cv.test$Survived, cv.prediction)

## [1] "Recall: 0.947976878612717"
## [1] "Precision: 0.841025641025641"
## [1] "Accuracy: 0.850187265917603"
## [1] "AUC: 0.809094822285082"


Passengers sharing a ticket number had a higher survival rate

The Ticket variable has very low repetition, so it cannot be used directly. First compute the number of passengers holding each ticket.

ticket.count <- aggregate(data$Ticket, by = list(data$Ticket), function(x) sum(!is.na(x)))
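
For reference, the same per-ticket count can also be written with dplyr (loaded in the data-loading section). This is an optional sketch, not the article's own code, and ticket.count.dp is a name introduced here for illustration.

ticket.count.dp <- data %>%
  group_by(Ticket) %>%
  summarise(PassengerCount = n())  # number of passengers per ticket number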

Here is a hypothesis: passengers with the same ticket number were family members, and very likely survived or perished together. Split all passengers into two groups by Ticket, one holding a unique ticket number and the other sharing a ticket number with someone else, and count the survivors and victims in each group.

# For each passenger, look up how many passengers share their ticket number
data$TicketCount <- apply(data, 1, function(x) ticket.count[which(ticket.count[, 1] == x['Ticket']), 2])
# Collapse the count into two levels: shared ticket vs. unique ticket
data$TicketCount <- factor(sapply(data$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
ggplot(data = data[1:nrow(train),], mapping = aes(x = TicketCount, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('TicketCount') + 
  ylab('Count') + 
  ggtitle('How TicketCount impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 2: survival counts for shared vs. unique ticket numbers]

As the figure above shows, only 130/(130+351) = 27% of passengers who did not share a ticket number survived, while 212/(212+198) = 51.7% of those who shared a ticket number survived. The WOE and IV of TicketCount are computed below: the IV is 0.2751882, "Highly Predictive".

WOETable(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])

##      CAT GOODS BADS TOTAL    PCT_G     PCT_B        WOE        IV
## 1  Share   212  198   410 0.619883 0.3606557  0.5416069 0.1403993
## 2 Unique   130  351   481 0.380117 0.6393443 -0.5199641 0.1347889

IV(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])

## [1] 0.2751882
## attr(,"howgood")
## [1] "Highly Predictive"
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Tuning

train.tail()
test.head()
train.describe()

Loading the data

Before loading the data, first load all of the R libraries that will be used later:

library(readr) # File read / write
library(ggplot2) # Data visualization
library(ggthemes) # Data visualization
library(scales) # Data visualization
library(plyr)
library(stringr) # String manipulation
library(InformationValue) # IV / WOE calculation
library(MLmetrics) # Machine learning metrics, e.g. Recall, Precision, Accuracy, AUC
library(rpart) # Decision tree utils
library(randomForest) # Random Forest
library(dplyr) # Data manipulation
library(e1071) # SVM
library(Amelia) # Missing value utils
library(party) # Conditional inference trees
library(gbm) # AdaBoost
library(class) # KNN

Load the training data and the test data into data.frames named train and test:

train <- read_csv("train.csv")
test <- read_csv("test.csv")

Since the same transformations must later be applied to both the training and test data, combining the two avoids duplicated operations and inconsistencies, and also sidesteps the potential problem of a categorical variable having new levels in the test set. It is therefore recommended to merge the training and test data and transform them together.

data <- bind_rows(train, test)
train.row <- 1:nrow(train)
test.row <- (1 + nrow(train)):(nrow(train) + nrow(test))
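
With the saved row indices, either half can be recovered from the combined frame at any time; note that Survived is NA on the test rows. A trivial usage sketch:

train.part <- data[train.row, ]  # the original 891 training rows
test.part  <- data[test.row, ]   # the 418 test rows (Survived is NA here)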
Model scores:
KNN: 0.816143497758
Random Forest: 0.829596412556
Random Forest with only the more important features: 0.834080717489
Logistic Regression: 0.811659192825
SVC: 0.838565022422
XGBoost: 0.820627802691

Removing correlated features

Since FamilySize combines the information of SibSp and Parch, we can try removing SibSp and Parch from the feature variables.

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit2.csv", row.names = FALSE)

This model's predictions again score 0.80383 on Kaggle.

Solving a machine learning problem generally involves the following workflow:

Preview of the next article

The next article will focus on the general approach and methods for solving engineering problems with machine learning.

This article walks through a complete process of solving an analysis problem with machine learning: the general problem-solving workflow, common methods of descriptive statistics and data cleaning, how to reason heuristically from the given raw variables to other influencing factors, the general sklearn modeling workflow, and how to use the highly popular ensemble learning.

Filling the missing Embarked values via the Fare median

As the following data show, the passengers missing Embarked information both have Pclass 1 and Fare 80.

data[is.na(data$Embarked), c('PassengerId', 'Pclass', 'Fare', 'Embarked')]

## # A tibble: 2 × 4
##   PassengerId Pclass  Fare Embarked
##         <int>  <int> <dbl>    <chr>
## 1          62      1    80     <NA>
## 2         830      1    80     <NA>

As the figure below shows, the median Fare of passengers with Embarked = C and Pclass = 1 is 80.

ggplot(data[!is.na(data$Embarked),], aes(x=Embarked, y=Fare, fill=factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept=80), color='red', linetype='dashed', lwd=2) +
  scale_y_continuous(labels=dollar_format()) + theme_few()

[Figure 3: Fare boxplots by Embarked and Pclass; the dashed line marks Fare = 80]

The missing Embarked values can therefore be set to 'C'.

data$Embarked[is.na(data$Embarked)] <- 'C'
data$Embarked <- as.factor(data$Embarked)
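
A quick sanity check (a sketch) that the fill worked and the factor levels look right:

sum(is.na(data$Embarked))  # should now be 0
table(data$Embarked)       # counts per port of embarkation: C, Q, S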

Data preview

First, take a look at the data.

str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

As shown above, the dataset contains 12 variables and 1309 rows, of which 891 are training data and 418 are test data.

Now to the main topic:

Filling missing values

The distribution of each variable is roughly the same in the training and test sets.

Setting missing Cabin values to a default value

A large number of records are missing Cabin, so filling with the median or mean is not suitable; such values are usually filled either by predicting them from other variables or by setting them directly to a default value. Since Cabin is hard to predict from the other variables, and since in the previous section its IV was already fairly high when NA was treated as its own level, the missing Cabin values are simply set to a default value here.

data$Cabin <- as.factor(sapply(data$Cabin, function(x) ifelse(is.na(x), 'X', str_sub(x, start = 1, end = 1))))
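
A quick check (a sketch) that the default was applied; 'X' should now be the most frequent Cabin level:

table(data$Cabin, useNA = "ifany")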

<h4>5. Ensemble</h4>

Passengers with 1 to 3 parents/children aboard were more likely to survive

For the Parch variable, count the survivors and victims at each value.

ggplot(data = data[1:nrow(train),], mapping = aes(x = Parch, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  labs(title = "How Parch impact survivor", x = "Parch", y = "Count", fill = "Survived") + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 4: survival counts by Parch]

As the figure above shows, passengers with Parch = 0 had a survival rate below 50%; those with Parch between 1 and 3 had survival rates at or above 50%; and those with Parch >= 4 had quite low survival rates. The contribution of Parch to the prediction can be quantified by computing WOE and IV: the IV is 0.1166611, "Highly Predictive".

WOETable(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   233  445   678 0.671469741 0.810564663 -0.1882622 0.026186312
## 2   1    65   53   118 0.187319885 0.096539162  0.6628690 0.060175728
## 3   2    40   40    80 0.115273775 0.072859745  0.4587737 0.019458440
## 4   3     3    2     5 0.008645533 0.003642987  0.8642388 0.004323394
## 5   4     4    4     4 0.011527378 0.007285974  0.4587737 0.001945844
## 6   5     1    4     5 0.002881844 0.007285974 -0.9275207 0.004084922
## 7   6     1    1     1 0.002881844 0.001821494  0.4587737 0.000486461

IV(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

## [1] 0.1166611
## attr(,"howgood")
## [1] "Highly Predictive"

Then you can first inspect the structure of the data table with the following commands:

This is an original article, published simultaneously on my personal blog. If you repost it, please clearly credit the source at the top of the article.

I am 不会停的蜗牛 (Alice), a post-85 full-time homemaker who loves artificial intelligence and getting things done, continuously working on creativity, thinking, and learning skills. Your likes, follows, and comments are welcome!

Prediction

predict.result <- predict(model, data[(1+nrow(train)):(nrow(data)), ], OOB=TRUE, type = "response")
output <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(output, file = "cit1.csv", row.names = FALSE)

This model's predictions score 0.80383 on Kaggle, ranking 992nd, i.e., within the top 992/6292 = 15.8%.

<h4>1. Data Exploration</h4>

Other notes

Experiments show that filling the missing Embarked values with the most common value, S, rather than C improves the score somewhat. But this approach has little theoretical backing, and the improved score is only a Public leaderboard score rather than the final result, so it does not prove that this method is definitely better than the alternatives. This article therefore does not recommend it; it is mentioned only as a possible idea for reference.

data$Embarked[c(62,830)] = "S"
data$Embarked <- factor(data$Embarked)

set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit5.csv", row.names = FALSE)

This model's predictions score 0.82775 on Kaggle, ranking 114th, i.e., within the top 114/6292 = 1.8%.

Passengers with a FamilySize of 2 to 4 were more likely to survive

Both SibSp and Parch show that passengers with no relatives aboard had a very low survival rate, that passengers with a few relatives had survival rates above 50%, and that with too many relatives the survival rate drops again. This suggests adding SibSp and Parch together to derive a new variable, FamilySize.

data$FamilySize <- data$SibSp + data$Parch + 1
ggplot(data = data[1:nrow(train),], mapping = aes(x = FamilySize, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('FamilySize') + 
  ylab('Count') + 
  ggtitle('How FamilySize impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 6: survival counts by FamilySize]

Computing the WOE and IV of FamilySize shows an IV of 0.3497672, "Highly Predictive". The IV of the derived variable FamilySize is higher than that of either SibSp or Parch, so FamilySize can be used as a feature variable.

WOETable(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

##   CAT GOODS BADS TOTAL       PCT_G      PCT_B        WOE           IV
## 1   1   163  374   537 0.459154930 0.68123862 -0.3945249 0.0876175539
## 2   2    89   72   161 0.250704225 0.13114754  0.6479509 0.0774668616
## 3   3    59   43   102 0.166197183 0.07832423  0.7523180 0.0661084057
## 4   4    21    8    29 0.059154930 0.01457195  1.4010615 0.0624634998
## 5   5     3   12    15 0.008450704 0.02185792 -0.9503137 0.0127410643
## 6   6     3   19    22 0.008450704 0.03460838 -1.4098460 0.0368782940
## 7   7     4    8    12 0.011267606 0.01457195 -0.2571665 0.0008497665
## 8   8     6    6     6 0.016901408 0.01092896  0.4359807 0.0026038712
## 9  11     7    7     7 0.019718310 0.01275046  0.4359807 0.0030378497

IV(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

## [1] 0.3497672
## attr(,"howgood")
## [1] "Highly Predictive"

Then look at how each variable affects the class label:

Passengers with a moderate number of spouses and siblings aboard were more likely to survive

For the SibSp variable, count the survivors and victims at each value.

ggplot(data = data[1:nrow(train),], mapping = aes(x = SibSp, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  labs(title = "How SibSp impact survivor", x = "Sibsp", y = "Count", fill = "Survived") + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 7: survival counts by SibSp]

As the figure above shows, passengers with SibSp = 0 had a survival rate below 50%; those with SibSp of 1 or 2 had survival rates around 50% or higher; and those with SibSp >= 3 had quite low survival rates. The contribution of SibSp can be quantified via WOE and IV: the IV is 0.1448994, "Highly Predictive".

WOETable(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   210  398   608 0.593220339 0.724954463 -0.2005429 0.026418349
## 2   1   112   97   209 0.316384181 0.176684882  0.5825894 0.081387334
## 3   2    13   15    28 0.036723164 0.027322404  0.2957007 0.002779811
## 4   3     4   12    16 0.011299435 0.021857923 -0.6598108 0.006966604
## 5   4     3   15    18 0.008474576 0.027322404 -1.1706364 0.022063953
## 6   5     5    5     5 0.014124294 0.009107468  0.4388015 0.002201391
## 7   8     7    7     7 0.019774011 0.012750455  0.4388015 0.003081947

IV(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

## [1] 0.1448994
## attr(,"howgood")
## [1] "Highly Predictive"

Next, you can examine the distribution of each variable:

Passengers who embarked at S had a lower survival rate

The Embarked variable denotes the port of embarkation. Compute the survival rate of passengers from each port to judge whether Embarked can be used to predict survival.

ggplot(data[1:nrow(train), ], mapping = aes(x = Embarked, y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Embarked') +
  ylab('Count') +
  ggtitle('How Embarked impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 8: survival counts by Embarked]

As the figure above shows, only 217/(217+427) = 33.7% of the passengers with Embarked = S survived, while passengers with Embarked = C or NA survived at rates above 50%. The preliminary judgment is that Embarked can be used to predict survival. Its WOE and IV are computed below.

WOETable(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

##   CAT GOODS BADS TOTAL      PCT_G     PCT_B        WOE           IV
## 1   C    93   75   168 0.27352941 0.1366120  0.6942642 9.505684e-02
## 2   Q    30   47    77 0.08823529 0.0856102  0.0302026 7.928467e-05
## 3   S   217  427   644 0.63823529 0.7777778 -0.1977338 2.759227e-02

IV(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

## [1] 0.1227284
## attr(,"howgood")
## [1] "Highly Predictive"

From the results above, the IV is 0.1227284, "Highly Predictive".

For example, the effect of sex: visualization shows that among the survivors there were more women than men.

Summary

This article has described in detail how, through data preview, exploratory data analysis, missing-value imputation, removal of correlated features, and derivation of new features, to reach the top 2% in Kaggle's Titanic survival prediction, a classification competition.

An earlier article on how to do feature engineering only introduced some concepts; this example is more instructive: it shows how to expand the given variables into more influential features, how to bring in new factors suggested by the actual situation, and how to express them in numeric form.

The higher the fare paid, the higher the survival rate

For the Fare variable, the figure below shows that the larger the Fare, the higher the survival rate.

ggplot(data = data[(!is.na(data$Fare)) & row(data[, 'Fare']) <= 891, ], aes(x = Fare, color=Survived)) + 
  geom_line(aes(label=..count..), stat = 'bin', binwidth=10)  + 
  labs(title = "How Fare impact survivor", x = "Fare", y = "Count", fill = "Survived")

[Figure 9: Fare distribution of survivors vs. victims]

Listing all missing data

attach(data)
  missing <- list(Pclass=nrow(data[is.na(Pclass), ]))
  missing$Name <- nrow(data[is.na(Name), ])
  missing$Sex <- nrow(data[is.na(Sex), ])
  missing$Age <- nrow(data[is.na(Age), ])
  missing$SibSp <- nrow(data[is.na(SibSp), ])
  missing$Parch <- nrow(data[is.na(Parch), ])
  missing$Ticket <- nrow(data[is.na(Ticket), ])
  missing$Fare <- nrow(data[is.na(Fare), ])
  missing$Cabin <- nrow(data[is.na(Cabin), ])
  missing$Embarked <- nrow(data[is.na(Embarked), ])
  for (name in names(missing)) {
    if (missing[[name]][1] > 0) {
      print(paste('', name, ' miss ', missing[[name]][1], ' values', sep = ''))
    }
  }
detach(data)

## [1] "Age miss 263 values"
## [1] "Fare miss 1 values"
## [1] "Cabin miss 1014 values"
## [1] "Embarked miss 2 values"

Training the model

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
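
Optionally (a sketch, not part of the original article), party's varimp() reports the variable importance of the fitted conditional inference forest; it can be slow with ntree = 2000.

imp <- varimp(model)          # permutation-based variable importance
sort(imp, decreasing = TRUE)  # most important features first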
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'
deck = full[~full.Cabin.isnull()].Cabin.map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
full['Group_num'] = full.Parch + full.SibSp + 1

Women survived at a much higher rate than men

For the Sex variable, the background of the Titanic's sinking tells us that evacuation followed the rule of "women and children first", so we can guess that Sex should help predict passenger survival.

The data confirm this guess: most women (233/(233+81) = 74.2%) survived, while only a small fraction of men (109/(109+468) = 18.89%) did.

data$Sex <- as.factor(data$Sex)
ggplot(data = data[1:nrow(train),], mapping = aes(x = Sex, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Sex') + 
  ylab('Count') + 
  ggtitle('How Sex impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 10: survival counts by Sex]

Computing WOE and IV shows that Sex has an IV of 1.34 and is "Highly Predictive", so Sex can tentatively be used as a feature variable.

WOETable(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])

##      CAT GOODS BADS TOTAL     PCT_G    PCT_B        WOE        IV
## 1 female   233   81   314 0.6812865 0.147541  1.5298770 0.8165651
## 2   male   109  468   577 0.3187135 0.852459 -0.9838327 0.5251163

IV(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])

## [1] 1.341681
## attr(,"howgood")
## [1] "Highly Predictive"

Below are the raw variables in the data; let us see what factors they suggest.

Removing Cabin, whose IV is low

Since Cabin's IV is relatively low, it can be considered for removal from the model.

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit3.csv", row.names = FALSE)

This model's predictions still score 0.80383 on Kaggle.

In this part, you can tally the missing values of each variable:

Passengers in different cabin sections had different survival rates

The values of the Cabin variable start with a letter followed by digits. A hypothesis: the letter denotes an area of the ship and the digits the position within that area, much as a train ticket carries both a car number and a seat number. So try extracting the first letter of Cabin and computing the survival rate of passengers in each first-letter cabin group.

ggplot(data[1:nrow(train), ], mapping = aes(x = as.factor(sapply(data$Cabin[1:nrow(train)], function(x) str_sub(x, start = 1, end = 1))), y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Cabin') +
  ylab('Count') +
  ggtitle('How Cabin impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 11: survival counts by first letter of Cabin]

As the figure above shows, passengers in cabins whose first letter is B, C, D, E, or F had survival rates above 50%, while passengers in the remaining cabin groups survived at far lower rates. The WOE and IV of the cabin variable are computed below: Cabin's IV is 0.1866526, "Highly Predictive".

data$Cabin <- sapply(data$Cabin, function(x) str_sub(x, start = 1, end = 1))
WOETable(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

##   CAT GOODS BADS TOTAL      PCT_G      PCT_B        WOE          IV
## 1   A     7    8    15 0.05109489 0.11764706 -0.8340046 0.055504815
## 2   B    35   12    47 0.25547445 0.17647059  0.3699682 0.029228917
## 3   C    35   24    59 0.25547445 0.35294118 -0.3231790 0.031499197
## 4   D    25    8    33 0.18248175 0.11764706  0.4389611 0.028459906
## 5   E    24    8    32 0.17518248 0.11764706  0.3981391 0.022907100
## 6   F     8    5    13 0.05839416 0.07352941 -0.2304696 0.003488215
## 7   G     2    2     4 0.01459854 0.02941176 -0.7004732 0.010376267
## 8   T     1    1     1 0.00729927 0.01470588 -0.7004732 0.005188134

IV(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])

## [1] 0.1866526
## attr(,"howgood")
## [1] "Highly Predictive"

In Titanic: Machine Learning from Disaster, the task is to use the provided data on age, sex, and other factors to determine which passengers were more likely to survive, so this is a classification problem.

How to reach the top 2%


Passengers with different Titles had different survival rates

Passenger names have too little repetition to be used directly. But a name contains a title such as Mr., Mrs., or Dr., which carries information about the passenger, and this can be extracted.

This article extracts each passenger's Title from the name as follows:

data$Title <- sapply(data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
data$Title <- sub(' ', '', data$Title)
data$Title[data$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
data$Title[data$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
data$Title[data$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
data$Title <- factor(data$Title)
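
A quick look (a sketch) at the resulting Title distribution on the training rows:

table(data$Title[1:nrow(train)])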

After extracting the passengers' Titles, count the survivors and victims for each Title:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Title, y = ..count.., fill=Survived)) + 
  geom_bar(stat = "count", position='stack') + 
  xlab('Title') + 
  ylab('Count') + 
  ggtitle('How Title impacts survival') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

[Figure 13: survival counts by Title]

As the figure above shows, a very small proportion of passengers titled Mr survived, while very large proportions of those titled Mrs and Miss did. WOE and IV quantify how useful the Title variable is for the final prediction: the IV is 1.487853, "Highly Predictive". Title can therefore tentatively be used as a feature variable in the prediction model.

WOETable(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])

##       CAT GOODS BADS TOTAL       PCT_G       PCT_B         WOE            IV
## 1     Col     1    1     2 0.002873563 0.001808318  0.46315552  4.933741e-04
## 2      Dr     3    4     7 0.008620690 0.007233273  0.17547345  2.434548e-04
## 3    Lady     2    1     3 0.005747126 0.001808318  1.15630270  4.554455e-03
## 4  Master    23   17    40 0.066091954 0.030741410  0.76543639  2.705859e-02
## 5    Miss   127   55   182 0.364942529 0.099457505  1.30000942  3.451330e-01
## 6    Mlle     3    3     3 0.008620690 0.005424955  0.46315552  1.480122e-03
## 7      Mr    81  436   517 0.232758621 0.788426763 -1.22003757  6.779360e-01
## 8     Mrs    99   26   125 0.284482759 0.047016275  1.80017883  4.274821e-01
## 9      Ms     1    1     1 0.002873563 0.001808318  0.46315552  4.933741e-04
## 10    Rev     6    6     6 0.017241379 0.010849910  0.46315552  2.960244e-03
## 11    Sir     2    3     5 0.005747126 0.005424955  0.05769041  1.858622e-05

IV(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])

## [1] 1.487853
## attr(,"howgood")
## [1] "Highly Predictive"

The first step is to split the data into a training set and a test set, using train_test_split.

Competition overview

Titanic survival prediction is one of the Kaggle competitions with the largest number of participants. It asks contestants to analyze, from the training data, what kinds of people were more likely to survive, and to predict whether each passenger in the test data survived.

This project is a binary classification problem.