Naive Bayes Classifier

A Naive Bayes Classifier is a popular probabilistic machine learning algorithm, and it is one of my favorite models because it is fast to build and conceptually easy to understand. I feel that the second point, ease of understanding, is an often underappreciated quality when deciding on the 'best' algorithm for a machine learning application. In my experience there are situations where a sacrifice in model performance is worth the trade-off of having a model you can explain to other people in a way that appeals to their intuition.

The key assumption in Naive Bayes Classifiers is that the features are independent of one another given the class, and this will be explained in more detail below. Probabilistic machine learning models predict the probability that an observation falls under a given label. For example, consider the binary classification task of predicting the class (0 or 1) of observation y based on n features (x). In this example we wish to generate the conditional probability that an observation belongs to class 1 given its features, or:


\begin{equation*} P(y = 1 | x_1, x_2, ... , x_n) \end{equation*}


Using Bayes' theorem we can rewrite this conditional probability, or posterior probability, as the product of a prior and a likelihood divided by a marginal probability:


\begin{equation*} P(y = 1 | x_1, x_2, ... , x_n) = \frac{P(y=1) * P(x_1, x_2, ... , x_n | y=1)}{P(x_1, x_2, ... , x_n)} \end{equation*}


As mentioned above, the key assumption in a Naive Bayes Classifier is that the features are independent of one another given the class. Recall that for independent events A and B:


\begin{equation*} P(A, B) = P(A) * P(B) \end{equation*}


Now, applying this factorization to the features, conditional on the class (this is the 'naive' independence assumption), we can expand the likelihood above as:


\begin{equation*} P(x_1, x_2, ... , x_n | y=1) = P(x_1 | y=1) * P(x_2 | y=1) ... P(x_n | y=1) \end{equation*}


Thus, we have expanded the likelihood into a product of the conditional probabilities of each feature given a class. The next question is: how does one calculate these conditional probabilities? Given a set of training data, we can imagine the data for a given feature and class as a histogram and approximate it with a distribution. Using the probability density function (pdf) of the normal distribution, we can calculate the likelihood for feature x1 and observation i as follows:


\begin{equation*} P(x_{1,i} | y=1) = \frac{1}{\sqrt{2\pi\sigma_{x_1}^{2}}} * e^{-\frac{(x_{1,i}-\overline{x_1})^{2}}{2\sigma_{x_1}^{2}}} \end{equation*}


So, to calculate this probability we only need to know the mean and variance of the feature, given a class, in the training data. This example assumes that the feature data is normally distributed, but the same rationale extends easily to other feature distributions. We next turn to the prior. The prior is calculated by simply determining the fraction of the training data belonging to each class. Thus, to assign a class to an observation of unknown class we calculate the posterior for each class and pick the class that gives the highest value. Easy! The denominator in Bayes' formula, the marginal probability, is not needed for this type of classification since it has the same value for every class and therefore does not change which posterior is largest.
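
Putting these pieces together, the classification rule is simply to pick the class with the largest product of the prior and the per-feature likelihoods (the marginal denominator is dropped, as discussed above):


\begin{equation*} \hat{y} = \underset{y}{\operatorname{argmax}} \; P(y) * \prod_{j=1}^{n} P(x_j | y) \end{equation*}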

Now we can build a model...

In [49]:
import pandas as pd
from scipy.stats import norm
import numpy as np

Now that we have the framework of our Naive Bayes classifier we can build a model. For this example we will build a classifier that predicts whether an unknown sports player is a member of the Los Angeles Lakers (basketball) or the Los Angeles Dodgers (baseball) from their height and salary. Since basketball players are generally taller and command higher salaries than baseball players, these two features should be useful for determining an unknown player's team. For this analysis I picked the first four Lakers players in alphabetical order and the first four Dodgers pitchers available on the ESPN.com team pages and obtained their heights (inches) and salaries. The four players from each team included in this study are:

Lakers:
Tarik Black, Corey Brewer, Jordan Clarkson, Luol Deng

Dodgers:
Luis Avilan, Pedro Baez, Josh Fields, Chris Hatcher

First we put this data into a pandas DataFrame and define two hypothetical players to classify (player 1, who resembles a Laker, and player 2, who resembles a Dodger).

In [192]:
teams = pd.DataFrame()

teams['Team'] = ['lakers', 'lakers', 'lakers', 'lakers', 'dodgers', 'dodgers', 'dodgers', 'dodgers']
teams['Height'] = [81, 81, 77, 81, 74, 74, 72, 71]
teams['Salary'] = [6191000, 7612172, 12500000, 18000000, 1390000, 520000, 900000, 1065000]

player1 = pd.DataFrame()
player1['Height'] = [79]
player1['Salary'] = [8000000]

player2 = pd.DataFrame()
player2['Height'] = [73]
player2['Salary'] = [900000]

teams.head(8)
Out[192]:
Team Height Salary
0 lakers 81 6191000
1 lakers 81 7612172
2 lakers 77 12500000
3 lakers 81 18000000
4 dodgers 74 1390000
5 dodgers 74 520000
6 dodgers 72 900000
7 dodgers 71 1065000

As shown above, we see that Lakers players are generally taller than Dodgers players and, as expected, they get paid quite a bit more. As mentioned above, we are interested in calculating the posterior probability for each player as follows:

\begin{equation*} P(\text{Team} | \text{Height}, \text{Salary}) = \frac{P(\text{Team}) * P(\text{Height} | \text{Team}) * P(\text{Salary} | \text{Team})}{P(\text{Height}, \text{Salary})} \end{equation*}

To do this we need three functions:

  • prior: calculates the prior probability of being on either the Lakers or the Dodgers. Since we have four members of each team in our sample data, this will be equal to 0.5 for both.

  • likelihood: determines the conditional probability of obtaining an unknown player's height or salary given that they are a member of a team (see the quick spot check after this list). For this calculation we assume that height and salary are independent and that each is normally distributed. I expect salary to be right-skewed across all salaries, but we will move forward with the normality assumption.

  • posterior: the product of the prior and the likelihoods. Easy!
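
To make the likelihood term concrete, here is a quick spot check (separate from the functions we are about to write) that fits a normal distribution to the four Lakers heights in the DataFrame above and evaluates its pdf at a hypothetical height of 79 inches. It uses the population standard deviation (ddof=0) to match the np.std default used in the model below.

In [ ]:
# Spot check: likelihood of a 79-inch height given the Lakers,
# assuming heights are normally distributed
laker_heights = teams.loc[teams.Team == 'lakers', 'Height']

mu = laker_heights.mean()          # mean of the four Lakers heights (80.0)
sigma = laker_heights.std(ddof=0)  # population standard deviation, ~1.7

# evaluate the normal pdf at 79 inches; this comes out to roughly 0.19
print(norm(mu, sigma).pdf(79))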

We can now build these three functions as follows.

In [194]:
def prior(label, total):
    # label = team name
    # total = Series of team labels from the training data
    
    count = [t for t in total if t == label]
    return(len(count)/len(total))

def likelihood(test, features, team, attribute):    
    # test = a player's attribute value (height or salary)
    # features = training data set
    # team = team to calculate the likelihood against
    # attribute = the attribute (column) to be evaluated
    
    # fit a normal distribution to this attribute for the given team
    stdev = np.std(features[attribute].loc[(features.Team == team)])
    mean = np.mean(features[attribute].loc[(features.Team == team)])
    like = norm(mean, stdev).pdf(test)
    return float(like)


def posterior(train, player, team):
    # train = training set
    # player = player of unknown team, to be classified
    # team = team to calculate the posterior against
    
    # calculate the prior
    priors = prior(team, train['Team'])
    
    # calculate prob(Height | Team)
    likeH = likelihood(player['Height'], train, team, 'Height')    
    
    # calculate prob(Salary | Team)
    likeS = likelihood(player['Salary'], train, team, 'Salary')
    
    # unnormalized posterior: prior times the product of the likelihoods
    post = priors*likeH*likeS
    return(post)

We can now calculate the posterior probabilities of each player being a member of each team based on their feature data. We will start with player 1. As shown above, player 1 has a height and a salary that fit those of a Laker, therefore we expect the posterior probability for player 1 to be higher for the Lakers class.

In [195]:
player1.head()
Out[195]:
Height Salary
0 79 8000000
In [196]:
# Posterior of (player 1 = Laker | player 1 features)
post1_Laker = posterior(teams, player1, 'lakers')

# Posterior of (player 1 = Dodger | player 1 features)
post1_Dodger = posterior(teams, player1, 'dodgers')

print("Prob (player 1 = Laker | player 1 features):  ", post1_Laker)
print("Prob (player 1 = Dodger | player 1 features):  ", post1_Dodger)
Prob (player 1 = Laker | player 1 features):   6.73468324350167e-09
Prob (player 1 = Dodger | player 1 features):   8.864993615935072e-122

Great! Based on this, we correctly predict that player 1 is a member of the Lakers. We next make a prediction for player 2. In contrast to player 1, player 2 has a height and salary that are consistent with a Dodgers player.

In [197]:
player2.head()
Out[197]:
Height Salary
0 73 900000
In [198]:
# Posterior of (player 2 = Laker | player 2 features)
post2_Laker = posterior(teams, player2, 'lakers')

# Posterior of (player 2 = Dodger | player 2 features)
post2_Dodger = posterior(teams, player2, 'dodgers')

print("Prob (player 2 = Laker | player 2 features):  ", post2_Laker)
print("Prob (player 2 = Dodger | player 2 features):  ", post2_Dodger)
Prob (player 2 = Laker | player 2 features):   2.522491995250135e-13
Prob (player 2 = Dodger | player 2 features):   1.873304660063009e-07

We now find that the highest posterior probability for player 2 is when they are assigned to the Dodgers. The simple model we built allows us to correctly predict which team a player is on based on their height and salary. This is a small, simple example, but the beauty of Naive Bayes, other than its simplicity, is that it is easily scaled to larger data sets.
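
As a closing note, this Gaussian flavor of Naive Bayes is available off the shelf: scikit-learn's GaussianNB implements the same scheme (with some additional numerical smoothing). The sketch below, which assumes scikit-learn is installed, fits it to the teams DataFrame above and classifies both hypothetical players; its predictions should agree with the hand-rolled posteriors.

In [ ]:
from sklearn.naive_bayes import GaussianNB

# features and labels from the training DataFrame built above
X = teams[['Height', 'Salary']]
y = teams['Team']

# fit a Gaussian Naive Bayes model: one normal distribution per feature and class
model = GaussianNB()
model.fit(X, y)

# classify the two hypothetical players
# (expected: player 1 -> lakers, player 2 -> dodgers)
print(model.predict(player1[['Height', 'Salary']]))
print(model.predict(player2[['Height', 'Salary']]))

# predict_proba returns normalized posteriors, i.e. each class's
# unnormalized posterior divided by the sum over classes
print(model.predict_proba(player1[['Height', 'Salary']]))

For problems with many features it is also common to work in log space, summing log-likelihoods (for example with norm.logpdf) instead of multiplying raw pdfs, which avoids the vanishingly small products we saw in the posteriors above.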