### Gradient Boosting Tree vs Random Forest

• FihopZz

5 years ago

Gradient tree boosting as proposed by Friedman uses decision trees as base learners. I'm wondering if we should make the base decision tree as complex as possible (fully grown) or simpler? Is there any explanation for the choice?

Random Forest is another ensemble method that uses decision trees as base learners. Based on my understanding, we generally use almost fully grown decision trees in each iteration. Am I right? You can find another very good reference for boosted trees here: https://xgboost.readthedocs.io/en/latest/model.html

> @Naghmeh - Dead link; it appears to have moved to https://xgboost.readthedocs.io/en/latest/tutorials/model.html

• Antoine

5 years ago

$\text{error} = \text{bias}^2 + \text{variance} + \text{irreducible error}$

• Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating the output from many models).
• On the other hand, Random Forest uses, as you said, fully grown decision trees (low bias, high variance). It tackles the error-reduction task in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unpruned trees, so that the bias is initially as low as possible.
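The two depth conventions above can be seen side by side in a small sketch (assuming scikit-learn; the dataset and hyperparameters are illustrative, not from the thread): the forest grows unpruned trees, while the booster uses shallow ones.

```python
# Sketch: RF with fully grown trees vs GBM with shallow (weak) trees
# on a synthetic classification task. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random Forest convention: unpruned trees (max_depth=None) keep bias low;
# averaging many decorrelated trees then reduces variance.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)

# Boosting convention: shallow trees (here depth 2) are weak learners;
# the sequential fitting reduces their high bias.
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=2, random_state=0)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
gbm_acc = cross_val_score(gbm, X, y, cv=5).mean()
print(f"RF  (deep trees):    {rf_acc:.3f}")
print(f"GBM (shallow trees): {gbm_acc:.3f}")
```

Both ensembles reach comparable accuracy here; the point is the opposite role the base learner's depth plays in each.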

Please note that unlike boosting (which is sequential), RF grows trees in parallel. The term "iterative" that you used is thus inappropriate.

> "The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest)" -- the part about "slightly higher than the bias of an individual tree in the forest" seems incorrect. See https://web.stanford.edu/~hastie/Papers/ESLII.pdf section 15.4.2: "As in bagging, the bias of a random forest is the same as the bias of any of the individual sampled trees." Maybe you mean "slightly higher than the bias of a single fully grown tree fit to the original data"?

> @gung I think there is a key question left unanswered in the OP: why not use a fully grown tree at the first step of GBM? Why is a sequence of weak learners better than one single fully grown tree? I am curious about that.
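One way to see why a sequence of weak learners works is to watch boosting's training error fall stage by stage: each new stump is fit to correct the residual errors of the ensemble so far. A sketch (assuming scikit-learn; the data is synthetic and illustrative):

```python
# Sketch: boosting decision stumps (max_depth=1) reduces error sequentially.
# Each stage fits a new stump to the residuals of the current ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=0)
gbm.fit(X, y)

# staged_predict yields the ensemble's predictions after each boosting stage.
errors = [1 - np.mean(pred == y) for pred in gbm.staged_predict(X)]
print(f"training error after 1 stump:    {errors[0]:.3f}")
print(f"training error after 100 stumps: {errors[-1]:.3f}")
```

A single stump has high bias (it can only split once), but the boosted sequence drives the error far below what any one stump could achieve.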

• jpmuc

5 years ago

This question is addressed in this very nice post. Please take a look at it and the references therein. http://fastml.com/what-is-better-gradient-boosted-trees-or-random-forest/

Notice that the article speaks about calibration, and links to another (nice) blog post about it. Still, I find that the paper Obtaining Calibrated Probabilities from Boosting gives you a better understanding of what calibration in the context of boosted classifiers is, and what the standard methods to perform it are.
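As a concrete illustration (assuming scikit-learn; the estimator choice and dataset are my own, not from the cited paper), post-hoc calibration such as Platt scaling can be applied to a boosted classifier like this:

```python
# Sketch: wrapping a boosted classifier with a post-hoc calibrator.
# method="sigmoid" is Platt scaling; method="isotonic" is the other
# standard choice discussed in the calibration literature.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="sigmoid", cv=5
).fit(X_tr, y_tr)

# Both produce probabilities; the calibrated ones should track observed
# frequencies more closely (compare with calibration_curve or Brier score).
p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]
```

Calibration does not change which class is predicted so much as how trustworthy the probability attached to it is.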

And finally, one aspect is missing (a bit more theoretical). Both RF and GBM are ensemble methods, meaning you build a classifier out of a large number of smaller classifiers. The fundamental difference lies in the method used:

1. RF uses decision trees, which are very prone to overfitting. To achieve higher accuracy, RF creates a large number of them via bagging. The basic idea is to resample the data over and over, and for each sample train a new classifier. Different classifiers overfit the data in different ways, and through voting those differences are averaged out.
2. GBM is a boosting method, which builds on weak classifiers. The idea is to add one classifier at a time, so that the next classifier is trained to improve the already trained ensemble. Notice that in RF, by contrast, each classifier is trained independently from the rest.

> Would it be a fair conclusion from your answer that RF overfits more than GBM?

> @8forty I wouldn't draw that conclusion - while a single tree in RF will overfit more than a single tree in GBM (because the latter are much smaller), in RF these overfits will be averaged out when employing a lot of trees, while in GBM the more trees you add, the higher the risk of overfitting. In short, as N (the number of trees used) goes to infinity, I expect RF to overfit much less than GBM.

> The references in the answer are useful. Also, https://scikit-learn.org/stable/modules/calibration.html is a thorough discussion of probability calibration. You can do this step as postprocessing to improve the results.
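The asymmetry in the comment above can be sketched empirically (assuming scikit-learn; the noise level and tree counts are illustrative): with many trees, RF's behaviour stabilizes, while GBM keeps fitting the training data ever more closely.

```python
# Sketch: train/test accuracy of RF vs GBM at small and large ensemble sizes.
# flip_y injects label noise so that overfitting is actually possible.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm_train = {}
for n in (10, 500):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    gbm = GradientBoostingClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    gbm_train[n] = gbm.score(X_tr, y_tr)
    print(f"n={n:3d}  RF train/test: {rf.score(X_tr, y_tr):.2f}/{rf.score(X_te, y_te):.2f}"
          f"  GBM train/test: {gbm.score(X_tr, y_tr):.2f}/{gbm.score(X_te, y_te):.2f}")
```

GBM's training accuracy keeps climbing as stages are added, which is why early stopping and learning-rate shrinkage matter for boosting in a way they do not for forests.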

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM