The Titanic!

The sinking of the Titanic, although taking place over 100 years ago, seems to keep popping up in many different contexts. The most significant of course is the nice data set regarding descriptions of passengers and whether or not they survived. This type of data set lends itself nicely to supervised machine learning classification models. Here we will use R to try to build a model to predict passenger survival based on their attributes.

For this modelling we will use the titanic data set in the ‘titanic’ package.

(Side note - you can find all of this code on my Github: https://github.com/JTDean123)

Data Cleanup

First we load the data set and get a big picture overview of it.

#install.packages("titanic")
library(titanic)
data(titanic_train)
kable(head(titanic_train, 5), format="markdown", align = 'c')
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S

It appears there are some features that will not be useful for this problem. Using my extensive domain knowledge of oceanic catastrophes I imagine we should remove ‘name’ and ‘PassengerID’ and ‘ticket’ as these are unlikely to provide and predictive power. Although making ‘name’ a factor based on origin (European, Asian, etc) would be an interesting thing to try.

tt.reduced <- titanic_train[,-c(1,4,9)]
kable(head(tt.reduced, 5), format="markdown", align = 'c')
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 3 male 22 1 0 7.2500 S
1 1 female 38 1 0 71.2833 C85 C
1 3 female 26 0 0 7.9250 S
1 1 female 35 1 0 53.1000 C123 S
0 3 male 35 0 0 8.0500 S

Now that we have the data loaded and a suitable (at least for now) set of features chosen we can do some data cleanup. First we can look for the always worrisome ‘NA’ entries:

sapply(tt.reduced, function(x){sum(is.na(x))})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare    Cabin 
##        0        0        0      177        0        0        0        0 
## Embarked 
##        0

Yikes, we are missing a lot of age data. More quantitatively, the fraction that we are missing is:

sum(is.na(tt.reduced$Age))/nrow(tt.reduced)
## [1] 0.1986532

We will address this below. We next look for features that contain missing data that is blank, as opposed to ‘NA’.

sapply(tt.reduced, function(x){sum(x=='')})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare    Cabin 
##        0        0        0       NA        0        0        0      687 
## Embarked 
##        2

The feature ‘Cabin’ is over 3/4 blank, and two values are missing in the ‘Embarked’ data set. Removing all of the ‘Cabin’ observations that are missing an entry would drastically reduce the size of our data set, therefore i propose that we remove ‘Cabin’ as a feature.

tt.reduced <- tt.reduced[,-8]

Furthermore, we will remove observations (rows) that are missing data for ‘Embarked’

tt.reduced <- tt.reduced[tt.reduced$Embarked != '',]

Now, back to the missing ‘age’ data and what to do about this. It seems we have three options:

  1. create a model to predict missing ‘age’ and use these values.

  2. delete the observations that contain a ‘NA’ ‘age’ entry

  3. remove age as a predictor

All of these seem to have their drawbacks, but going about route 1) seems worrisome since we will introducing an extra layer of error. Also, route 3) is subobtimal as I would expect ‘age’ to be highly correlated with survival and it is probably important to include in the model. So I choose to remove missing age data, option 2. I justify this because it is less than 20% of the observations so bias should be limited. Furthermore, the predict() function excludes observations that are missing data, so it is a good idea to ensure that the training and test data do not contain missing data.

We can now remove observations that contain ‘NA’ entries for ‘age’.

tt.reduced <- tt.reduced[is.na(tt.reduced$Age) == FALSE,]
sapply(tt.reduced, function(x){sum(is.na(x))})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
nrow(tt.reduced)
## [1] 712
kable(head(tt.reduced, 5), format="markdown", align = 'c')
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22 1 0 7.2500 S
1 1 female 38 1 0 71.2833 C
1 3 female 26 0 0 7.9250 S
1 1 female 35 1 0 53.1000 S
0 3 male 35 0 0 8.0500 S

Looking good. We have 712 observations and we have handled missing data. Since we will building a classification model we need to format the appropriate features as factors. The features that make sense to format as factors are ‘Survived’, ‘Pclass’, ‘Sex’, and ‘Embarked’.

tt.reduced$Survived <- as.factor(tt.reduced$Survived)
tt.reduced$Pclass <- as.factor(tt.reduced$Pclass)
tt.reduced$Sex <- as.factor(tt.reduced$Sex)
tt.reduced$Embarked <- as.factor(tt.reduced$Embarked)

We can verify that this worked before moving forward:

str(tt.reduced)
## 'data.frame':    712 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp   : int  1 1 0 1 0 0 3 0 1 1 ...
##  $ Parch   : int  0 0 0 0 0 0 1 2 0 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...

Importantly, we see that ‘Sex’ and ‘Survived’ are now factors with two levels. This makes sense because passengers can either survive or perish or be male or female. The final thing for us to do before diving into model building is split the data into training and test sets. We will train our model with the training set and evaluate performance on the test set. For this project we will split 70% of the data into a training set and 30% into a test set.

splitMe <- sample(2, nrow(tt.reduced), replace = TRUE, prob = c(0.7,0.3))
tt.reduced.train <- tt.reduced[splitMe==1,]
tt.reduced.test <- tt.reduced[splitMe==2,]

Data Exploration

Now that we have dealt with missing data and split the data into training and test sets we can start to explore the data. We first examine survival (1) and death (2) for the entire population:

summary(tt.reduced.train$Survived)
##   0   1 
## 288 185

As shown above, it appears about 60% of the population perished. We can next look at distribution of sex on the Titanic:

summary(tt.reduced.train$Sex)
## female   male 
##    176    297

The ship contained about twice as many males as females. A more interesting question is if survival depended on the sex of the passenger.

table(tt.reduced.train$Survived, tt.reduced.train$Sex)
##    
##     female male
##   0     53  235
##   1    123   62

Bad news for the men of the Titanic - they were much more likely to perish than females! We next inspect the ‘Age’ feature.

library(dplyr)
tt.reduced.train %>% group_by(Sex) %>% summarise(avgAge = mean(Age), stdev = sd(Age))
## # A tibble: 2 × 3
##      Sex   avgAge    stdev
##   <fctr>    <dbl>    <dbl>
## 1 female 27.94034 14.13971
## 2   male 29.98064 14.37065

Males and females aboard the Titanic were approximately the same average age and the standard deviations were about half the mean. We can further explore this feature with a plot.

library(ggplot2)
ggplot(data=tt.reduced.train, aes(x=Age, fill=Sex)) + geom_density(alpha=0.3)

As shown above, the distributions of male and female age have a mean of 31 and 29 years old, respectively. Interestingly, the female group has a greater fraction of members between zero and 20 years old compared to the male group. Also, the male group has a greater fraction of members over the age of 60 than the female group. We next move to the ‘Fare’ feature. One would think that passengers that paid more money would get better access to life boats, right? Or maybe passengers that paid less tended to be in better physical shape and this increased their chances of survival? We start with summary statistics.

summary(tt.reduced.train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.05   16.00   35.61   33.00  512.30

We find that the mean fare is 34 and the median is 15. Since the mean is double the median the data must be right skewed. This is confirmed by plotting the data on a histogram.

ggplot(data=tt.reduced.train, aes(x=Fare)) + geom_histogram(binwidth = 5, col = "black", aes(y=..density..)) + theme_bw()

As shown above, there is quite a bit of right skew in the Fare. I would be interested to know what a 500+ fare gets you on the Titanic. But back to our original question - is Fare related to survival?

ggplot(tt.reduced.train, aes(x=Fare, y=Survived)) + geom_point() + theme_bw()

It is difficult to interpret this plot with the outliers, so we will focus on the Fare that contains the majority of the data.

ggplot(tt.reduced.train, aes(x=Fare, y=Survived)) + geom_point() + theme_bw() + xlim(0,65)

It is difficult to say if there is a relationship between Fare and Survived. We could next build a logistic regression model…. but focus! Now that we have a general idea of what the data looks like we can start model building.

Decision Trees

We will next build a decision tree classification model to predict survival from the 8 features that we have chosen. I like decision trees for supervised machine learning with this type of data set because they are not sensitive to the scale of features and they can handle features that are both quantitative and qualitative. We start with hyperparameter tuning to determine the optimal tree structure.

library(rpart)
library(rattle)
library(rpart.plot)

tree.survival = rpart(Survived~., data=tt.reduced.train)
print(tree.survival$cptable)
##           CP nsplit rel error    xerror       xstd
## 1 0.37837838      0 1.0000000 1.0000000 0.05736933
## 2 0.12432432      1 0.6216216 0.6216216 0.05042990
## 3 0.03783784      2 0.4972973 0.4972973 0.04653226
## 4 0.01441441      4 0.4216216 0.4648649 0.04534218
## 5 0.01351351      7 0.3783784 0.5243243 0.04746544
## 6 0.01081081      9 0.3513514 0.5243243 0.04746544
## 7 0.01000000     10 0.3405405 0.5027027 0.04672303

We will choose model parameters corresponding to the lowest xerror, the cross validation error. We can visualize our tree as follows.

fancyRpartPlot(tree.survival)

What do we see? The first feature for splitting the data is the sex. Also, if you are a male, your Fare was < 52, and older than 13 years old you have a 85% chance of dying. That would be me! Thanks for ruining boat rides for me, decision trees. We can next prune this tree to prevent overfitting. We will do this using the Cp that generated the lowest xerror.

cp <- min(tree.survival$cptable[,1])
pruned.tree.survival <- prune(tree.survival, cp=cp)

Now we can use our pruned decision tree to make a prediction on the test data set.

tree.survival.predict <- predict(pruned.tree.survival, tt.reduced.test, type="class")

The results of the prediction can be visualized in a confusion matrix:

library(caret)
confusionMatrix(tree.survival.predict, tt.reduced.test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 127  42
##          1   9  61
##                                           
##                Accuracy : 0.7866          
##                  95% CI : (0.7292, 0.8368)
##     No Information Rate : 0.569           
##     P-Value [Acc > NIR] : 1.315e-12       
##                                           
##                   Kappa : 0.5473          
##  Mcnemar's Test P-Value : 7.433e-06       
##                                           
##             Sensitivity : 0.9338          
##             Specificity : 0.5922          
##          Pos Pred Value : 0.7515          
##          Neg Pred Value : 0.8714          
##              Prevalence : 0.5690          
##          Detection Rate : 0.5314          
##    Detection Prevalence : 0.7071          
##       Balanced Accuracy : 0.7630          
##                                           
##        'Positive' Class : 0               
## 

78.6% - Not bad! However, i am interested to know if a random forest model will improve predictive accuracy. I am suspicious that some of the variables induced in the model may not be necessary and lead to overfitting, and random forests can possibly address this issue since each tree in the forest does not use the entire set of features. We begin with hyperparameter tuning by first calculating the optimal number of trees.

library(randomForest)
rf.tree.survival <- randomForest(Survived~., data=tt.reduced.train)
plot(rf.tree.survival)

ntrees <- which.min(rf.tree.survival$err.rate[,1])

It looks like the optimal number of trees is around 30. We next built a random forest with this model parameter.

rf.tree.survival <- randomForest(Survived~., data=tt.reduced.train, ntree=ntrees)
rf.trees.predict <- predict(rf.tree.survival, tt.reduced.test, type="class")
confusionMatrix(rf.trees.predict, tt.reduced.test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 127  36
##          1   9  67
##                                           
##                Accuracy : 0.8117          
##                  95% CI : (0.7563, 0.8592)
##     No Information Rate : 0.569           
##     P-Value [Acc > NIR] : 1.7e-15         
##                                           
##                   Kappa : 0.6035          
##  Mcnemar's Test P-Value : 0.0001063       
##                                           
##             Sensitivity : 0.9338          
##             Specificity : 0.6505          
##          Pos Pred Value : 0.7791          
##          Neg Pred Value : 0.8816          
##              Prevalence : 0.5690          
##          Detection Rate : 0.5314          
##    Detection Prevalence : 0.6820          
##       Balanced Accuracy : 0.7922          
##                                           
##        'Positive' Class : 0               
## 

Now our predictive accuracy is over 81% - We were able to improve our predictive accuracy over a single decision tree by using a random forest model.

Model Performance

Before we release our model to the cruise ship companies of the world we need to evaluate it’s performance. A good way to do this is with a receiver operating characteristic curve. This curve is a plot of true positive vs false positive rate and allows us to visualize the trade off between sensitivity and specificity. An ideal model will show a maxium at (1,0), indicating perfect prediction of true positives and an absence of false positives. The area under such a curve will be equal to one. To generate a ROC curve for our random forest model we use the code below.

require(ROCR)
tree.survival.predict2 <- predict(rf.tree.survival, tt.reduced.test, type="prob")
predROC <- prediction(tree.survival.predict2[,2], tt.reduced.test$Survived)
perfROC <- performance(predROC, "tpr", "fpr")
plot(perfROC)
abline(a=0, b=1)

And we calculate the area under the curve as follows.

perfROC <- performance(predROC, "auc")
perfROC@y.values[[1]]
## [1] 0.871502

0.87 AUC! I’ll take it.

Conclusions

In conclusion we were able to use decision trees and random forests to build predictive models for survial aboard the Titanic with >80% accuracy. The random forest model was found to have an AUC of 0.87, and if you have made it this far - Thanks!