This chapter explores boosting, another ensemble learning algorithm typically based on decision trees that often produces even better results the Random Forests.
The key difference is that boosting, in its original AdaBoost version, modifies the training data for each tree based on the cumulative errors made by the model before adding the new tree. Random Forests, in contrast, use bagging to train many trees independently from each other using randomized versions of the training set. While Random Forests can be trained in parallel, boosting proceeds sequentially using reweighted versions of the data. State-of-the-art boosting implementations also adopt the randomization strategies of random forests.
In this chapter, we will see how boosting has evolved into one of the most successful machine learning algorithms over the last three decades. At the time of writing, it has come to dominate machine learning competitions for structured data (as opposed to high-dimensional images or speech, for example, where the relationship between the input and output is more complex, and deep learning excels at). More specifically, in this chapter we will cover the following topics:
Like bagging, boosting combines base learners into an ensemble. Boosting was initially developed for classification problems, but can also be used for regression, and has been called one of the most potent learning ideas introduced in the last 20 years (as described in Elements of Statistical Learning by Trevor Hastie, et al). Like bagging, it is a general method or metamethod that can be applied to many statistical learning models.
The following sections briefly introduce AdaBoost and then focus on the gradient boosting model, as well as several state-of-the-art implementations of this algorithm.
AdaBoost is a significant departure from bagging, which builds ensembles on very deep trees to reduce bias. AdaBoost, in contrast, grows shallow trees as weak learners, often producing superior accuracy with stumps—that is, trees formed by a single split. The algorithm starts with an equal-weighted training set and then successively alters the sample distribution. After each iteration, AdaBoost increases the weights of incorrectly classified observations and reduces the weights of correctly predicted samples so that subsequent weak learners focus more on particularly difficult cases. Once trained, the new decision tree is incorporated into the ensemble with a weight that reflects its contribution to reducing the training error.
As part of its ensemble module, sklearn provides an AdaBoostClassifier implementation that supports two or more classes.
The code examples for this section are in the notebook gbm_baseline that compares the performance of various algorithms with a dummy classifier that always predicts the most frequent class.
The main idea behind the resulting Gradient Boosting Machines (GBM) algorithm is the training of the base learners to learn the negative gradient of the current loss function of the ensemble. As a result, each addition to the ensemble directly contributes to reducing the overall training error given the errors made by prior ensemble members. Since each new member represents a new function of the data, gradient boosting is also said to optimize over the functions hm in an additive fashion.
The two key drivers of gradient boosting performance are the size of the ensemble and the complexity of its constituent decision trees. The control of complexity for decision trees aims to avoid learning highly specific rules that typically imply a very small number of samples in leaf nodes. We covered the most effective constraints used to limit the ability of a decision tree to overfit to the training data in Chapter 4 on Decision Trees and Random Forests.
In addition to directly controlling the size of the ensemble, there are various regularization techniques, such as shrinkage, that we encountered in the context of the Ridge and Lasso linear regression models in Chapter 7, Linear Models – Regression and Classification. Furthermore, the randomization techniques used in the context of random forests are also commonly applied to gradient boosting machines.
The ensemble module of sklearn contains an implementation of gradient boosting trees for regression and classification, both binary and multiclass.
The notebook sklearn_gbm_tuning_results displays some of the results that can be obtained.
scikit-klearnGradient Boosting docs
Over the last few years, several new gradient boosting implementations have used various innovations that accelerate training, improve resource efficiency, and allow the algorithm to scale to very large datasets. The new implementations and their sources are as follows: - XGBoost (extreme gradient boosting), started in 2014 by Tianqi Chen at the University of Washington - LightGBM, first released in January 2017, by Microsoft - CatBoost, first released in April 2017 by Yandex
The book reviews the numerous algorithmic innovations that have emerged over time and subsequently converged (so that most features are available for all implementations) before illustrating their implementation.
In this section, we’ll design, implement, and evaluate a trading strategy for US equities driven by daily return forecasts produced by gradient boosting models.
As in the previous examples, we’ll lay out a framework and build a specific example that you can adapt to run your own experiments. There are numerous aspects that you can vary, from the asset class and investment universe to more granular aspects like the features, holding period, or trading rules. See, for example, the Alpha Factor Library in the Appendix for numerous additional features.
We use the Quandl Wiki data to engineer a few simple features (see notebook preparing_the_model_data for details) and select a model using 2015/16 as validation period and run an out-of-sample test for 2017.
We’ll keep the trading strategy simple and only use a single machine learning (ML) signal; a real-life application will likely use multiple signals from different sources, such as complementary ML models trained on different datasets or with different lookahead or lookback periods. It would also use sophisticated risk management from simple stop-loss to value-at-risk analysis.
XGBoost, LightGBM, and CatBoost offer interfaces for multiple languages, including Python, and have both a sklearn interface that is compatible with other sklearn features, such as GridSearchCV and their own methods to train and predict gradient boosting models. The notebook gbm_baseline illustrates the use of the sklearn interface for each implementation. The library methods are often better documented and are also easy to use, so we'll use them to illustrate the use of these models.
The process entails the creation of library-specific data formats, the tuning of various hyperparameters, and the evaluation of results that we will describe in the following sections.
The notebook evaluate_trading_signals demonstrates how to evaluate the trading signals.
The notebook making_out_of_sample_predictions shows how to create predictions for the best-performing models.
The notebook backtesting_with_zipline creates a strategy based on the model predictions, simulates its historical performance using [Zipline](https://www.zipline.io/, and evaluates the result using [Pyfolio](https://github.com/quantopian/pyfolio.
Understanding why a model predicts a certain outcome is very important for several reasons, including trust, actionability, accountability, and debugging. Insights into the nonlinear relationship between features and the outcome uncovered by the model, as well as interactions among features, are also of value when the goal is to learn more about the underlying drivers of the phenomenon under study.
The notebook model_interpretation illustrates the following methods.
There are three primary ways to compute global feature importance values: - Gain: This classic approach introduced by Leo Breiman in 1984 uses the total reduction of loss or impurity contributed by all splits for a given feature. The motivation is largely heuristic, but it is a commonly used method to select features. - Split count: This is an alternative approach that counts how often a feature is used to make a split decision, based on the selection of features for this purpose based on the resultant information gain. - Permutation: This approach randomly permutes the feature values in a test set and measures how much the model's error changes, assuming that an important feature should create a large increase in the prediction error. Different permutation choices lead to alternative implementations of this basic approach.
In addition to the summary contribution of individual features to the model's prediction, partial dependence plots visualize the relationship between the target variable and a set of features. The nonlinear nature of gradient boosting trees causes this relationship to depend on the values of all other features.
At the 2017 NIPS conference, Scott Lundberg and Su-In Lee from the University of Washington presented a new and more accurate approach to explaining the contribution of individual features to the output of tree ensemble models called SHapley Additive exPlanations, or SHAP values.
This new algorithm departs from the observation that feature-attribution methods for tree ensembles, such as the ones we looked at earlier, are inconsistent — that is, a change in a model that increases the impact of a feature on the output can lower the importance values for this feature.
SHAP values unify ideas from collaborative game theory and local explanations, and have been shown to be theoretically optimal, consistent, and locally accurate based on expectations. Most importantly, Lundberg and Lee have developed an algorithm that manages to reduce the complexity of computing these model-agnostic, additive feature-attribution methods from O(TLDM) to O(TLD2), where T and M are the number of trees and features, respectively, and D and L are the maximum depth and number of leaves across the trees.
This important innovation permits the explanation of predictions from previously intractable models with thousands of trees and features in a fraction of a second. An open source implementation became available in late 2017 and is compatible with XGBoost, LightGBM, CatBoost, and sklearn tree models.
Shapley values originated in game theory as a technique for assigning a value to each player in a collaborative game that reflects their contribution to the team's success. SHAP values are an adaptation of the game theory concept to tree-based models and are calculated for each feature and each sample. They measure how a feature contributes to the model output for a given observation. For this reason, SHAP values provide differentiated insights into how the impact of a feature varies across samples, which is important given the role of interaction effects in these nonlinear models.
SHAP values provide granular feature attribution at the level of each individual prediction, and enable much richer inspection of complex models through (interactive) visualization. The SHAP summary scatterplot displayed at the beginning of this section offers much more differentiated insights than a global feature-importance bar chart. Force plots of individual clustered predictions allow for more detailed analysis, while SHAP dependence plots capture interaction effects and, as a result, provide more accurate and detailed results than partial dependence plots.
This section and the notebooks will be updated once Algoseek makes the sample data available.
The notebook intraday_features creates features from minute-bar trade and quote data.
The notebook intraday_model optimizes a LightGBM boosting model, generates out-of-sample predictions, and evaluates the result.