Jason Dean

July 6th, 2017

jtdean@gmail.com

https://github.com/jtdean123

Here we build a simple decision tree to predict whether a wine is white or red based on the following attributes:

● Fixed acidity

● Free sulfur dioxide

● Volatile acidity

● Total sulfur dioxide

● Citric acid

● Residual sugar

● pH

In [117]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
import itertools
import time
from sklearn.tree import export_graphviz
import subprocess
from IPython.display import Image
```

In [2]:

```
# read the data
wine = pd.read_csv('wine.csv')
wine.head()
```

Out[2]:

In [3]:

```
# quick sanity check to make sure we have numerical data and not str
string_alert = False
for i in range(0, wine.shape[1]):
    for j in range(0, wine.shape[0]):
        if type(wine.iloc[j, i]) is str:
            print 'type str at : ', j, i
            string_alert = True
if string_alert == False:
    print 'congrats, no strings'
```
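The cell-by-cell loop works, but pandas can answer the same question directly from the column dtypes. A more concise alternative (a sketch on a toy frame with hypothetical values standing in for the wine data):

```python
import pandas as pd

# toy frame standing in for the wine data (hypothetical values)
df = pd.DataFrame({'fixed_acidity': [7.0, 6.3],
                   'residual_sugar': [1.9, 1.6]})

# True only if every column already has a numeric dtype,
# so no per-cell type checking is needed
all_numeric = df.dtypes.apply(pd.api.types.is_numeric_dtype).all()
print('congrats, no strings' if all_numeric else 'string columns present')
```

This checks column dtypes rather than individual cells, so it also catches an `object` column even when most of its values happen to be numbers.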

In [4]:

```
type(wine.iloc[3,2])
```

Out[4]:

In [5]:

```
print 'Number of observations: ', wine.shape[0]
```

In [6]:

```
print 'Number of red wine observations: ', wine[wine['class'] == 1].shape[0]
print 'Number of white wine observations: ', wine[wine['class'] == 0].shape[0]
```

In [7]:

```
wine.isnull().values.any()
```

Out[7]:

In [8]:

```
wine.duplicated().values.any()
```

Out[8]:

In [9]:

```
# evaluate mean grouped by class
wine.groupby(['class']).mean()
```

Out[9]:

In [10]:

```
wine.groupby(['class']).median()
```

Out[10]:

In [14]:

```
# pH
red_pH = list(wine[wine['class'] == 1]['fixed_acidity'])
white_pH = list(wine[wine['class'] == 0]['fixed_acidity'])
font = {'weight': 'bold', 'size': 18}
plt.rc('font', **font)
weights_r = np.ones_like(red_pH)/len(red_pH)
weights_w = np.ones_like(white_pH)/len(white_pH)
bins = np.arange(0,14,0.2)
plt.figure(figsize=(10, 6))
plt.hist(red_pH, bins, alpha=0.5, label='red', color='red', edgecolor = "black", weights=weights_r)
plt.hist(white_pH, bins, alpha=0.5, label='white', color='lightblue', edgecolor = "black", weights=weights_w)
plt.legend(loc='upper right')
plt.xlabel('fixed_acidity')
plt.ylabel('Fraction')
plt.show()
plt.rcdefaults()
```

In [13]:

```
# sugar
red_sugar = list(wine[wine['class'] == 1]['residual_sugar'])
white_sugar = list(wine[wine['class'] == 0]['residual_sugar'])
font = {'weight': 'bold', 'size': 18}
plt.rc('font', **font)
weights_r = np.ones_like(red_sugar)/len(red_sugar)
weights_w = np.ones_like(white_sugar)/len(white_sugar)
bins = np.arange(0,wine['residual_sugar'].max(),0.5)
plt.figure(figsize=(10, 6))
plt.hist(red_sugar, bins, alpha=0.5, label='red', color='red', edgecolor = "black", weights=weights_r)
plt.hist(white_sugar, bins, alpha=0.5, label='white', color='lightblue', edgecolor = "black", weights=weights_w)
plt.legend(loc='upper right')
plt.xlabel('residual_sugar')
plt.ylabel('Fraction')
plt.show()
plt.rcdefaults()
```

In [15]:

```
# sugar vs. acidity (class 1 = red, class 0 = white)
font = {'weight': 'bold', 'size': 18}
plt.rc('font', **font)
ax = wine[wine['class'] == 1].plot(kind='scatter', x='residual_sugar', y='fixed_acidity',
                                   color='red', label='red', figsize=(10, 6))
wine[wine['class'] == 0].plot(kind='scatter', x='residual_sugar', y='fixed_acidity',
                              color='lightblue', label='white', ax=ax)
plt.show()
```

The goal of this project is to build a decision tree to classify wine as white or red based on the features that we have available to us. Feature engineering, the process of eliminating, combining, or transforming features prior to model construction, is perhaps the hardest and most important part of a machine learning task. How do we identify which features to include in this model? The easiest place to start is to include all of them, so that is what we will do.

First we will build a decision tree using default parameters and all of the features. Before we begin, though, we will split the data into a test and training set.

In [16]:

```
# split the data into test and training sets
wine_train, wine_test, class_train, class_test = train_test_split(wine.drop(['class'], axis=1),
                                                                  wine['class'],
                                                                  test_size=0.3,
                                                                  random_state=321)
```

In [17]:

```
# make a few functions that will be useful for model building
def accuracy(actual, predictions):
    '''
    calculate the accuracy as: # correct / # total
    '''
    correct = 0
    for i, j in zip(list(actual), predictions):
        if i == j:
            correct += 1.
    return correct/len(actual)

def confusion(actual, predictions):
    '''
    generate a confusion matrix
    '''
    confusion_mat = pd.DataFrame(confusion_matrix(actual, predictions))
    confusion_mat.columns = ['Predicted White', 'Predicted Red']
    confusion_mat.index = ['Actual White', 'Actual Red']
    return confusion_mat

def area_under_curve(actual, predictions):
    '''
    calculate the AUC of a single-point ROC using basic geometry
    '''
    tp = 0.; fp = 0.; tn = 0.; fn = 0.
    for i, j in zip(actual, predictions):
        if j == 1 and i == 1:
            tp += 1.
        if j == 1 and i == 0:
            fp += 1.
        if j == 0 and i == 0:
            tn += 1.
        if j == 0 and i == 1:
            fn += 1.
    # calculate true positive and false positive rates
    tpr = tp/(tp + fn)
    fpr = fp/(fp + tn)
    # area under the curve: two triangles and a rectangle
    # (note 0.5, not 1/2, which is integer division in Python 2)
    area = 0.5*fpr*tpr + (1 - fpr)*tpr + 0.5*(1 - tpr)*(1 - fpr)
    return area
```
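Before pointing these helpers at real predictions, it helps to see the confusion-matrix layout on a toy label set. A self-contained sketch (the labels here are made up; scikit-learn puts actual classes on the rows and predicted classes on the columns):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# hypothetical labels: 0 = white, 1 = red
actual      = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

# rows are actual classes, columns are predicted classes
cm = pd.DataFrame(confusion_matrix(actual, predictions),
                  columns=['Predicted White', 'Predicted Red'],
                  index=['Actual White', 'Actual Red'])

# accuracy is the diagonal (correct predictions) over the total
acc = (cm.iloc[0, 0] + cm.iloc[1, 1]) / float(len(actual))
print(cm)
print('accuracy:', acc)  # 4 of 6 correct -> 0.666...
```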

In [18]:

```
# build a decision tree with all of the features
tree_all = DTC()
tree_all_DCT = tree_all.fit(wine_train, class_train)
preds_all = tree_all_DCT.predict(wine_test)
print 'Accuracy: ', accuracy(class_test, preds_all)
confusion(class_test, preds_all)
```

Out[18]:

Not bad! These practice data sets are good for my ego. As shown above, without doing any cross validation or hyperparameter tuning we achieved 96.1% accuracy on the test set. Since this is an unbalanced data set, however, accuracy alone is not that informative, so next we calculate the AUC. The AUC is defined as the area under a ROC curve. The ROC curve is a plot of the true positive rate vs. the false positive rate, and for a non-probabilistic classifier like a decision tree the ROC is defined by a single point.

The true positive rate is defined as: true positives / (true positives + false negatives)

The false positive rate is defined as: false positives / (false positives + true negatives)

Therefore, to calculate the AUC we determine the area under a curve bounded by [0,0], [fpr, tpr], and [1,1], and we do this with basic geometry by adding the areas of two triangles and a rectangle.
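For a single-point ROC this geometric construction collapses to the closed form AUC = (1 + TPR - FPR) / 2, which makes a handy sanity check. A quick sketch with made-up rates:

```python
def single_point_auc(tpr, fpr):
    # triangle under (0,0)->(fpr,tpr), then rectangle plus
    # triangle from (fpr,tpr)->(1,1)
    return 0.5*fpr*tpr + (1 - fpr)*tpr + 0.5*(1 - fpr)*(1 - tpr)

# hypothetical rates
tpr, fpr = 0.9, 0.1
area = single_point_auc(tpr, fpr)
print(area)  # 0.9, i.e. (1 + tpr - fpr) / 2
```

A random classifier sits on the diagonal (tpr == fpr), where the formula gives 0.5, and a perfect one (tpr=1, fpr=0) gives 1.0, as expected.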

In [19]:

```
print 'AUC for model with all features included: ', area_under_curve(class_test, preds_all)
```

In [57]:

```
def cross_validation_model(x_train, x_class, y_test, y_class):
    '''
    tune a model via 10-fold CV and return its AUC and accuracy
    '''
    # set of parameters to test
    param_grid = {"criterion": ["gini", "entropy"],
                  "min_samples_split": [2, 5, 10, 20, 50],
                  "max_depth": [None, 2, 4, 5, 10, 20],
                  "min_samples_leaf": [1, 2, 3, 5, 10, 20, 50],
                  "max_leaf_nodes": [None, 5, 10, 20, 50],
                  }
    # randomized search with 10-fold CV
    dtc = DTC()
    clf = RandomizedSearchCV(dtc, param_grid, cv=10)
    clf.fit(x_train, x_class)
    # predict with the refit best model
    preds = clf.predict(y_test)
    return [area_under_curve(y_class, preds), accuracy(y_class, preds)]
```

In [59]:

```
# wine_train, wine_test, class_train, class_test
all_features_cv = cross_validation_model(wine_train, class_train, wine_test, class_test)
print 'AUC, all features, 10 fold CV: ', all_features_cv[0]
print 'Accuracy, all features, 10 fold CV: ', all_features_cv[1]
```

In [60]:

```
# evaluate all possible combinations of the 7 features
combinations = []
features = [0, 1, 2, 3, 4, 5, 6]
for i in features:
    combos = list(itertools.combinations(features, i))
    for j in combos:
        combinations.append(j)
print 'total # of combinations: ', len(combinations)
```
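The count printed above follows from the binomial theorem: summing C(7, i) for i = 0..6 gives 2^7 - 1 = 127 (every subset of the 7 features except the full set, since `range` stops before 7; the empty subset is included here and skipped during model fitting). A standalone check:

```python
import itertools

features = range(7)
# all subsets of size 0 through 6, mirroring the loop above
combos = [c for i in range(7) for c in itertools.combinations(features, i)]
print('total # of combinations:', len(combos))  # 2**7 - 1 = 127
```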

In [61]:

```
# generate all possible models and evaluate performance
start_time = time.time()
combos_performance = []
for i in combinations:
    if len(i) == 0: continue
    i = list(i)
    combos_train = wine_train.iloc[:, i]
    combos_test = wine_test.iloc[:, i]
    combos_cv = cross_validation_model(combos_train, class_train, combos_test, class_test)
    combos_performance.append(combos_cv)
print("--- %s seconds ---" % (time.time() - start_time))
```

In [73]:

```
# plot accuracy vs. AUC for each of the models
auc_combos = [i[0] for i in combos_performance]
acc_combos = [i[1] for i in combos_performance]
plt.scatter(auc_combos, acc_combos, edgecolors='black')
plt.xlabel('AUC')
plt.ylabel('Accuracy')
plt.grid()
plt.show()
```

In [126]:

```
# find the model that generated the highest accuracy
max_acc = 0; index = 0;
for i, j in enumerate(acc_combos):
    if j > max_acc:
        max_acc = j
        index = i
# determine what features were included in this model
# (index + 1 because the empty feature set was skipped when fitting)
print 'Features included in best model: ', combinations[index + 1]
print 'Best accuracy: ', max_acc
```

Finally, we can visualize the best decision tree.

In [112]:

```
# set of parameters to test
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 5, 10, 20, 50],
              "max_depth": [None, 2, 4, 5, 10, 20],
              "min_samples_leaf": [1, 2, 3, 5, 10, 20, 50],
              "max_leaf_nodes": [None, 5, 10, 20, 50],
              }
# randomized search with 10-fold CV, then refit a model w/ the best parameters
# (index + 1 because the empty feature set was skipped during model evaluation)
best_features = list(combinations[index + 1])
dtc = DTC()
clf = RandomizedSearchCV(dtc, param_grid, cv=10)
clf.fit(wine_train.iloc[:, best_features], class_train)
dtc = DTC(criterion=clf.best_estimator_.criterion,
          min_samples_split=clf.best_estimator_.min_samples_split,
          max_depth=clf.best_estimator_.max_depth,
          min_samples_leaf=clf.best_estimator_.min_samples_leaf,
          max_leaf_nodes=clf.best_estimator_.max_leaf_nodes)
dtc.fit(wine_train.iloc[:, best_features], class_train)
# write the tree to a .dot file for rendering with graphviz
output_dot = 'decision_tree.dot'
feature_names = ['fixed_acidity', 'volatile_acidity', 'residual_sugar', 'free_sulfur_dioxide']
with open(output_dot, 'w') as f:
    f = export_graphviz(dtc, out_file=f, feature_names=feature_names)
# convert the .dot file to a png in the next cell
```

In [123]:

```
%%bash
dot -Tpng decision_tree.dot -o decision_tree.png
```

In [125]:

```
Image("decision_tree.png")
```

Out[125]: