Posted on March 12, 2017 at 9:00 PM

Gradient boosted models are a popular class of machine learning algorithms that can be used for both regression and classification. In this short introduction I will attempt to explain the basic principles behind this powerful technique and how it can be used.

The goal of supervised learning is to find a function that maps a set of input features (x) to a desired output response variable (y). More formally, given a set of training pairs -

(x_1, y_1), (x_2, y_2), ... (x_n, y_n)

Find a function -
F(x) = y

that minimizes a loss function -
L(y, F(x))

Our model,
F(x)

will not fit the response variable perfectly, and we can capture the deviations between the actual and predicted values of the response variable with a function -
h(x)

such that -
F(x) + h(x) = y

and -
h(x) = y - F(x)

As shown above, the residuals, contained in -
h(x)

are the difference between the actual and predicted values of the output variable. In gradient boosting we improve the initial model by fitting an additional model to the residuals. In this way we can build the model in a stagewise manner, where each stage introduces a weak learner that models the residuals of the previous stage, and the final model is simply a linear combination of the weak learners. This can be expressed as -
F_1(x) + h_1(x) = y

h_1(x) = F_2(x) + h_2(x)

h_2(x) = F_3(x) + h_3(x)

\vdots

h_{n-1}(x) = F_n(x) + h_n(x)

We can now express this model as a linear combination of weak learners as -
F(x, \beta) = \sum_{i=1}^{n}\beta_i h_i(x)
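The stagewise procedure described above can be sketched in a few lines of Python. Everything in this sketch is an illustrative assumption (the stump fitter, the toy data, and the number of stages): each stage fits a depth-one regression stump to the current residuals, and the ensemble prediction is the running sum of the stages.

```python
def fit_stump(x, r):
    """Fit a depth-one regression stump to the residuals r: choose the split
    on x that minimizes squared error, predicting the mean on each side."""
    best = None
    for s in sorted(set(x))[:-1]:                  # a split must leave both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi, s=s, ml=ml, mr=mr: ml if xi <= s else mr

# Toy one-dimensional data (an assumption for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

stages = []
pred = [0.0] * len(x)
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # h(x) = y - F(x)
    stump = fit_stump(x, residuals)                    # weak learner fit to the residuals
    stages.append(stump)
    pred = [pi + stump(xi) for pi, xi in zip(pred, x)]

# The final model is the sum of the stages; the squared error shrinks as they accumulate.
print(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)))
```

Because each new stump is fit to what the current ensemble still gets wrong, the training error decreases monotonically as stages are added.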

The weak learners used to construct this ensemble are typically decision trees. Do not worry about the coefficients for now; I'll get to those. So how is this related to gradient descent? The principle of gradient descent is that a local minimum of a function can be found by taking steps proportional to the negative of the gradient (derivative) of the function at the current point. As mentioned above, a supervised machine learning model is generated by finding a function that minimizes a loss function. Consider the mean squared error loss function, popular in regression and corresponding to an assumption of Gaussian errors -
loss = \sum_{i=1}^{n}(y_i - F(x_i))^2

Thus, we can find a local minimum for the loss function by moving against the gradient, or -
-\frac{d(loss)}{d(F(x_i))} = 2(y_i - F(x_i))
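This gradient can be checked numerically. In the small sketch below (the targets and current predictions are arbitrary assumptions), perturbing a single prediction F(x_i) and finite-differencing the squared loss recovers a gradient whose negative is 2(y_i - F(x_i)), i.e. twice the residual.

```python
y = [3.0, -1.0, 2.5]    # toy targets (an assumption)
F = [2.0, 0.5, 2.0]     # current model predictions (an assumption)

def loss(preds):
    # Sum of squared errors over the whole data set.
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

eps = 1e-6
for i in range(len(y)):
    bumped = list(F)
    bumped[i] += eps
    grad_i = (loss(bumped) - loss(F)) / eps    # d(loss)/d(F(x_i)) by finite differences
    # The negative gradient is twice the residual y_i - F(x_i).
    assert abs(-grad_i - 2 * (y[i] - F[i])) < 1e-4
print("negative gradient matches the residuals")
```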

This result is significant, because the negative gradient of this loss function is proportional to the residuals. This means that by introducing weak learners that fit the residuals we are essentially moving toward a local minimum of the loss function. This strategy can be applied to any differentiable loss function. For example, consider the cross-entropy, or negative log loss, of a binary classifier -
logloss = \frac{-1}{n}\sum_{i=1}^{n}[y_i log(f(x_i)) + (1-y_i)log(1-f(x_i))]

It can be shown that minimizing the negative log loss of a binary classifier, such as a logistic regression model, is equivalent to maximum likelihood estimation. The likelihood of n independent events is -
L = \prod_{i=1}^{n} p_i

For a logistic regression model, the class probabilities are -
P(y=1 | x) = \frac{1}{1+e^{-f(x)}}

and -
P(y=-1 | x) = 1 - \frac{1}{1+e^{-f(x)}} = \frac{1}{1+e^{f(x)}}

Combining these we obtain -
P(y | x) = \frac{1}{1+e^{-y*f(x)}}
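A quick sanity check of this combined form (the model scores below are arbitrary assumptions): for y = 1 it reduces to the sigmoid of f(x), and for y = -1 to one minus the sigmoid.

```python
import math

def sigmoid(f):
    # Standard logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-f))

def p(y, f):
    # Combined form P(y | x) = 1 / (1 + e^{-y f(x)}) for labels y in {-1, +1}.
    return 1.0 / (1.0 + math.exp(-y * f))

for f in [-2.0, -0.3, 0.0, 1.5]:        # arbitrary model scores (assumptions)
    assert abs(p(1, f) - sigmoid(f)) < 1e-12
    assert abs(p(-1, f) - (1.0 - sigmoid(f))) < 1e-12
print("combined form matches both class probabilities")
```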

and the likelihood now becomes -
L = \prod_{i=1}^{n}\frac{1}{1+e^{-y_i*f(x_i)}}

Maximizing the likelihood is equivalent to minimizing its negative log -
-log(L) = \sum_{i=1}^{n}log(1 + e^{-y_i*f(x_i)})

and for a single observation the loss is -
L = log(1 + e^{-y*f(x)})

This now becomes our loss function, and we are only left to take the derivative -
\frac{d(L)}{d(f(x))} = \frac{-y}{1+e^{y*f(x)}}
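This derivative can also be verified numerically. The sketch below (labels and scores are assumptions) compares a central finite difference of the per-observation loss L = log(1 + e^{-y f(x)}) with the closed form -y / (1 + e^{y f(x)}).

```python
import math

def L(y, f):
    # Per-observation log loss for labels y in {-1, +1}.
    return math.log(1.0 + math.exp(-y * f))

def dL(y, f):
    # Closed-form derivative of L with respect to f(x).
    return -y / (1.0 + math.exp(y * f))

eps = 1e-6
for y, f in [(1, 0.7), (-1, 0.7), (1, -1.3), (-1, 2.0)]:   # assumed test cases
    numeric = (L(y, f + eps) - L(y, f - eps)) / (2 * eps)   # central difference
    assert abs(numeric - dL(y, f)) < 1e-6
print("closed-form gradient matches finite differences")
```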

Therefore, by fitting a weak model to the negative of this derivative we are, as in the regression case above, moving toward a local minimum of the log loss for the classifier. Many other loss functions, such as the exponential loss used in the AdaBoost algorithm, are differentiable and adaptable to the gradient boosting methodology.

There are four model hyperparameters that are important when building gradient boosted models in R: the minimum number of observations in a terminal node, the number of trees, the interaction depth, and the shrinkage.

* n.minobsinnode - the minimum number of observations in a terminal node

* n.trees - the number of trees in the model (i.e. the number of gradient boosting iterations)

* shrinkage - reduces the impact of each tree and prevents the model from taking steps that are too large

* interaction.depth - the number of splits on a tree starting from a single node
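As a rough intuition for how shrinkage and the number of trees interact, here is a deliberately tiny Python sketch (everything in it is an illustrative assumption): the "weak learner" at each stage is just the mean of the current residuals, so every boosting iteration multiplies the residuals by (1 - shrinkage). Smaller shrinkage therefore takes smaller steps and needs more trees to reach the same fit.

```python
y = [2.0, 4.0, 6.0]   # toy targets (an assumption)

def boost(n_trees, shrinkage):
    """Boost with the simplest possible weak learner: a constant equal to the
    mean of the current residuals. Returns the final sum of squared residuals."""
    pred = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = sum(residuals) / len(residuals)     # the "weak learner"
        pred = [pi + shrinkage * step for pi in pred]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

# With few iterations, small shrinkage leaves more residual error...
print(boost(n_trees=10, shrinkage=0.1))
# ...but given enough trees it catches up.
print(boost(n_trees=200, shrinkage=0.1))
```

Real gradient boosted trees behave analogously: lowering the shrinkage generally requires raising the number of trees to compensate.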

To see an example of this in action, go see my next post!