The Titanic!

The sinking of the Titanic, although taking place over 100 years ago, seems to keep popping up in many different contexts. The most significant, of course, is the well-known data set describing the passengers and whether or not they survived. This type of data set lends itself nicely to supervised machine learning classification models. Here we will use R to try to build a model to predict passenger survival based on their attributes.

For this modelling we will use the titanic data set in the ‘titanic’ package.

(Side note - you can find all of this code on my Github: https://github.com/JTDean123)

Data Cleanup

First we load the data set and get a big picture overview of it.

#install.packages("titanic")
library(titanic)
library(knitr)   # for kable()
data(titanic_train)
kable(head(titanic_train, 5), format="markdown", align = 'c')
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |

It appears there are some features that will not be useful for this problem. Using my extensive domain knowledge of oceanic catastrophes, I imagine we should remove 'PassengerId', 'Name', and 'Ticket', as these are unlikely to provide any predictive power. (Although engineering a feature from 'Name', for example based on origin (European, Asian, etc.), would be an interesting thing to try.)
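As a sketch of that idea (not pursued here), a title could be pulled out of 'Name' with a regular expression, since names in this data set follow the pattern "Surname, Title. Given names". The example names below are copied from the table above:

```r
# Extract the title (the word ending in '.') from a passenger name.
extractTitle <- function(name) {
  sub(".*,\\s*([A-Za-z]+)\\..*", "\\1", name)
}

exampleNames <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina")
extractTitle(exampleNames)
# [1] "Mr"   "Mrs"  "Miss"
```

Titles like "Master" or "Rev" might then serve as a rough proxy for age and social class.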

# drop PassengerId (column 1), Name (4), and Ticket (9)
tt.reduced <- titanic_train[,-c(1,4,9)]
kable(head(tt.reduced, 5), format="markdown", align = 'c')
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22 | 1 | 0 | 7.2500 | | S |
| 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C85 | C |
| 1 | 3 | female | 26 | 0 | 0 | 7.9250 | | S |
| 1 | 1 | female | 35 | 1 | 0 | 53.1000 | C123 | S |
| 0 | 3 | male | 35 | 0 | 0 | 8.0500 | | S |

Now that we have the data loaded and a suitable (at least for now) set of features chosen we can do some data cleanup. First we can look for the always worrisome ‘NA’ entries:

sapply(tt.reduced, function(x){sum(is.na(x))})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare    Cabin 
##        0        0        0      177        0        0        0        0 
## Embarked 
##        0

Yikes, we are missing a lot of age data. More quantitatively, the fraction that we are missing is:

sum(is.na(tt.reduced$Age))/nrow(tt.reduced)
## [1] 0.1986532

We will address this below. Next we look for missing data recorded as blank strings rather than 'NA':

sapply(tt.reduced, function(x){sum(x == '', na.rm = TRUE)})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare    Cabin 
##        0        0        0        0        0        0        0      687 
## Embarked 
##        2

The feature 'Cabin' is over three-quarters blank, and two values are missing from 'Embarked'. Removing all of the observations that are missing a 'Cabin' entry would drastically reduce the size of our data set, therefore I propose that we remove 'Cabin' as a feature.

# drop Cabin (column 8)
tt.reduced <- tt.reduced[,-8]
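An alternative worth mentioning, though not pursued here, is to keep a binary has-cabin flag instead of dropping the column, since whether a cabin was recorded at all might carry some signal. A minimal sketch on stand-in data mirroring the table above:

```r
# Encode whether a cabin was recorded at all, rather than
# dropping the mostly-blank 'Cabin' column.
df <- data.frame(Cabin = c("", "C85", "", "C123", ""))
df$HasCabin <- factor(ifelse(df$Cabin == "", 0, 1))
summary(df$HasCabin)
# 0 1 
# 3 2
```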

Furthermore, we will remove observations (rows) that are missing data for 'Embarked'.

tt.reduced <- tt.reduced[tt.reduced$Embarked != '',]

Now, back to the missing ‘age’ data and what to do about this. It seems we have three options:

  1. create a model to predict missing ‘age’ and use these values.

  2. delete the observations that contain a ‘NA’ ‘age’ entry

  3. remove age as a predictor

All of these options have drawbacks, but route 1) seems worrisome since we would be introducing an extra layer of error, and route 3) is suboptimal as I would expect 'age' to be highly correlated with survival and probably important to include in the model. So I choose option 2) and remove the observations with missing age data. I justify this because it affects less than 20% of the observations, so bias should be limited. Furthermore, the predict() function excludes observations that are missing data, so it is a good idea to ensure that the training and test data do not contain missing data.
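For completeness, option 1) might look something like the sketch below, which imputes missing ages with the median of the observed ones; a real version would use a regression model, but the wiring is the same:

```r
# Median imputation for a numeric vector containing NAs.
ages <- c(22, 38, NA, 35, NA, 54)          # stand-in values
imputed <- ifelse(is.na(ages), median(ages, na.rm = TRUE), ages)
imputed
# [1] 22.0 38.0 36.5 35.0 36.5 54.0
```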

We can now remove observations that contain ‘NA’ entries for ‘age’.

tt.reduced <- tt.reduced[!is.na(tt.reduced$Age),]
sapply(tt.reduced, function(x){sum(is.na(x))})
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
nrow(tt.reduced)
## [1] 712
kable(head(tt.reduced, 5), format="markdown", align = 'c')
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22 1 0 7.2500 S
1 1 female 38 1 0 71.2833 C
1 3 female 26 0 0 7.9250 S
1 1 female 35 1 0 53.1000 S
0 3 male 35 0 0 8.0500 S

Looking good. We have 712 observations and we have handled the missing data. Since we will be building a classification model we need to format the appropriate features as factors. The features that make sense to format as factors are 'Survived', 'Pclass', 'Sex', and 'Embarked'.

tt.reduced$Survived <- as.factor(tt.reduced$Survived)
tt.reduced$Pclass <- as.factor(tt.reduced$Pclass)
tt.reduced$Sex <- as.factor(tt.reduced$Sex)
tt.reduced$Embarked <- as.factor(tt.reduced$Embarked)
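As an aside, repeated as.factor() calls like these can be collapsed into a single lapply() over the relevant columns; a self-contained sketch with toy data:

```r
# Convert several columns to factors in one step.
df <- data.frame(Survived = c(0, 1), Pclass = c(3, 1),
                 Sex = c("male", "female"), Embarked = c("S", "C"),
                 Age = c(22, 38))
factorCols <- c("Survived", "Pclass", "Sex", "Embarked")
df[factorCols] <- lapply(df[factorCols], as.factor)
sapply(df, class)
```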

We can verify that this worked before moving forward:

str(tt.reduced)
## 'data.frame':    712 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp   : int  1 1 0 1 0 0 3 0 1 1 ...
##  $ Parch   : int  0 0 0 0 0 0 1 2 0 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...

Importantly, we see that 'Sex' and 'Survived' are now factors with two levels. This makes sense, since a passenger either survived or perished, and is recorded as either male or female. The final thing to do before diving into model building is to split the data into training and test sets. We will train our model with the training set and evaluate performance on the test set. For this project we will put 70% of the data into a training set and 30% into a test set.

# note: call set.seed() first if you want a reproducible split
splitMe <- sample(2, nrow(tt.reduced), replace = TRUE, prob = c(0.7,0.3))
tt.reduced.train <- tt.reduced[splitMe==1,]
tt.reduced.test <- tt.reduced[splitMe==2,]
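It is worth sanity-checking that the realized split is close to 70/30; it will not be exact, since sample() draws each row's assignment independently. A standalone illustration of the same call:

```r
# Reproduce the splitting logic on its own and check the proportions.
set.seed(1)   # seed chosen arbitrarily, for illustration only
splitMe <- sample(2, 712, replace = TRUE, prob = c(0.7, 0.3))
round(prop.table(table(splitMe)), 2)
```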

Data Exploration

Now that we have dealt with missing data and split the data into training and test sets, we can start to explore the data. We first examine survival (1) and death (0) for the entire population:

summary(tt.reduced.train$Survived)
##   0   1 
## 288 185
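prop.table() converts such counts into fractions; plugging in the counts shown above:

```r
# Fraction perished (0) vs. survived (1), from the counts above.
counts <- c(`0` = 288, `1` = 185)
round(prop.table(counts), 2)
#    0    1 
# 0.61 0.39
```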

As shown above, it appears about 60% of the population perished. We can next look at the distribution of sex on the Titanic:

summary(tt.reduced.train$Sex)
## female   male 
##    176    297

The ship contained about twice as many males as females. A more interesting question is whether survival depended on the sex of the passenger.

table(tt.reduced.train$Survived, tt.reduced.train$Sex)
##    
##     female male
##   0     53  235
##   1    123   62
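Column-wise proportions make the gap explicit; rebuilding the cross-tabulation above as a matrix:

```r
# Survival counts by sex, from the table above.
tab <- matrix(c(53, 123, 235, 62), nrow = 2,
              dimnames = list(Survived = c("0", "1"),
                              Sex = c("female", "male")))
# margin = 2 normalizes each column (each sex) to sum to 1.
round(prop.table(tab, margin = 2), 2)
```

About 70% of the women survived, versus roughly 21% of the men.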

Bad news for the men of the Titanic - they were much more likely to perish than females! We next inspect the ‘Age’ feature.

library(dplyr)
tt.reduced.train %>% group_by(Sex) %>% summarise(avgAge = mean(Age), stdev = sd(Age))
## # A tibble: 2 × 3
##      Sex   avgAge    stdev
##   <fctr>    <dbl>    <dbl>
## 1 female 27.94034 14.13971
## 2   male 29.98064 14.37065

Males and females aboard the Titanic were approximately the same average age and the standard deviations were about half the mean. We can further explore this feature with a plot.

library(ggplot2)
ggplot(data=tt.reduced.train, aes(x=Age, fill=Sex)) + geom_density(alpha=0.3)

As shown above, the male and female age distributions have means of roughly 30 and 28 years old, respectively. Interestingly, the female group has a greater fraction of members between zero and 20 years old compared to the male group. Also, the male group has a greater fraction of members over the age of 60 than the female group. We next move to the 'Fare' feature. One would think that passengers that paid more money would get better access to life boats, right? Or maybe passengers that paid less tended to be in better physical shape and this increased their chances of survival? We start with summary statistics.

summary(tt.reduced.train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.05   16.00   35.61   33.00  512.30

We find that the mean fare is about 36 and the median is 16. Since the mean is more than double the median, the data must be right skewed. This is confirmed by plotting the data on a histogram.

ggplot(data=tt.reduced.train, aes(x=Fare)) + geom_histogram(binwidth = 5, col = "black", aes(y=..density..)) + theme_bw()

As shown above, there is quite a bit of right skew in the Fare. I would be interested to know what a 500+ fare gets you on the Titanic. But back to our original question - is Fare related to survival?

ggplot(tt.reduced.train, aes(x=Fare, y=Survived)) + geom_point() + theme_bw()

It is difficult to interpret this plot with the outliers, so we will focus on the Fare that contains the majority of the data.

ggplot(tt.reduced.train, aes(x=Fare, y=Survived)) + geom_point() + theme_bw() + xlim(0,65)
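Finally, a side note on the skew observed above: one common trick for such a heavy right tail is a log scale. A base-R sketch on stand-in fares taken from the summary statistics (the +1 avoids log(0) for the zero-fare tickets):

```r
# Log-transforming fares compresses the long right tail.
fare <- c(0, 7.25, 8.05, 16, 35.61, 71.28, 512.3)   # stand-in values
logFare <- log10(fare + 1)
round(range(logFare), 2)
# [1] 0.00 2.71
```

The same idea could be applied to the histogram above via ggplot2's scale_x_log10().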