What are XGBoost and LightGBM?
A decision tree ensemble is a collection of decision trees all built on the same dataset. Gradient boosting is a method for converting ensembles of weak learners (i.e., models for which predictions are little better than random guesses) into strong learners in a stepwise fashion.
How decision trees are built and how gradient boosting works are topics for separate articles. For now, it is enough to know that while XGBoost and LightGBM arise from the same idea, they achieve that end in different ways.
XGBoost, or Extreme Gradient Boosting, was originally authored by Tianqi Chen. To build trees, it makes use of two algorithms: Weighted Quantile Sketch and Sparsity-aware Split Finding. Both of these are methods for finding splits, i.e., decisions that split the data. Prior to splitting, the data has to be presorted according to feature value, generally in ascending order.1
Weighted Quantile Sketch is a means of distributing potential splits evenly across the data that can account for the weight, or importance, of a given data point.1 This bins continuous data, creating a histogram; this means splits only have to be created per bin rather than for each feature value. Sparsity-aware Split Finding handles missing data by defining a default direction at each tree node; depending on the feature, a missing value will direct the decision along a left or right path.1 In effect, this means XGBoost will skip over rows that contain missing data.
As with any machine learning algorithm, XGBoost is optimizing a loss function, which is a measurement of the error in a model. It does so by performing gradient descent on the loss function, meaning it works to minimize the slope of the function. Since it works with the gradient (or slope) of the function, XGBoost can utilize any differentiable loss function.
At a high level, XGBoost begins by creating an initial tree. The gradients of the loss function are iteratively computed for the tree. Subsequent trees are then fit to those gradients, thereby minimizing the gradient of those trees. This process is iterative and typically bounded by a user-defined number of trees to be built.
LightGBM, or Light Gradient Boosting Machine, was created at Microsoft.2 Much like XGBoost, it is a gradient boosted decision tree ensemble algorithm; however, its implementation is quite different and, in many ways, more efficient. Key differences arise in the two techniques it uses to handle creating splits: Gradient-based One-side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
GOSS holds that data points with larger gradients are more important to finding splits.2 At a high level, the idea is that a data point with a small gradient is already minimized, so focus should be put onto those points with larger gradients. This is related to the idea of information gains: which splits will best differentiate the data? It is important to note that small gradient points are not discarded; GOSS will perform a random sampling of them and weight them with a constant value so that, while focus is placed on the large gradient points, the original data distribution is mostly preserved.
EFB is the means by which LightGBM handles sparse data.2 It works by taking into account that features are often exclusive, meaning that they only rarely take non-zero values at the same time (see one-hot encoding for a simple example). Such features can be safely bundled or combined, which reduces the width (i.e., the number of columns) in a dataset.
In terms of building out and scoring ensembles, LightGBM works in a similar fashion to XGBoost—iteratively building decision trees, calculating their gradients, and then fitting new trees to those gradients in order to minimize them. The process is typically bounded by a user-defined number of trees to be built.
What does it all mean?
There are some key differences between the algorithms. Because of split methods that XGBoost uses—which require that data is presorted and binned before splitting—it is not very memory efficient and large datasets can cause out-of-memory problems. Additionally, because it builds trees depth-first (i.e., going down the tree’s levels), it can take longer to find lower-error trees. Also of note is that XGBoost cannot handle categorical variables natively; it is up to you to encode those before training.
By comparison, LightGBM tends to be more CPU- and memory-efficient and works well with large datasets. It achieves this with its splitting methods: GOSS emphasizes features that provide the most information for a split, thereby reducing data size without impacting accuracy. LightGBM also builds trees best-first (i.e., going across the tree’s leaves). LightGBM can handle categorical variables natively. However, the algorithm does make some approximations that can affect accuracy depending on the data; for example, efficient feature bundling is NP-hard (computationally intractable), so it has to tolerate overlap between non-zero values in order to achieve performance.
Which should you choose?
Both XGBoost and LightGBM are very powerful and flexible machine learning algorithms. They can achieve high accuracy on both classification and regression problems. And, they can achieve this accuracy across a broad range of data.
As can be seen in this Kaggle kernel, the latest implementations of both algorithms compare very well to one another. XGBoost now implements feature binning much like LightGBM to better handle sparse data. Additionally, XGBoost can grow decision trees in best-first fashion similar to LightGBM.
In the end, the one you choose will depend on your data and which algorithm you happen to be most familiar with. Each has a plethora of parameters for tuning to squeeze out those last bits of performance and accuracy. But, such tuning is difficult and time consuming.
DataRobot makes this easy by automating Machine Learning: not just the engineering of features, but also the exploring of parameters that allow a model to best fit the data. Rather than having to go all-in on one, DataRobot allows you to try both algorithms—along with many others—and gives you the information you need to judge which one works best for your problem.
2 G. Ke et al., “LightGBM: A Highly Efficient Gradient Boosting Decision Tree,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 3146–3154.