# Difference between revisions of "Tree-based..."

* [[Capabilities]]

https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/16533dca42cafce4b00d224727dc5d977ef7d67e/8-Figure3-1.png

<youtube>eKD5gxPPeY0</youtube>

<youtube>J4Wdy0Wc_xQ</youtube>

## Revision as of 18:31, 3 June 2018

Tree-based models are a supervised machine learning method commonly used in soil survey and ecology for exploratory data analysis and prediction because of their simple, nonparametric design. Rather than fitting a single global equation to the data, tree-based models recursively partition the data into increasingly homogeneous groups, choosing at each step the split value that minimizes a loss function (such as the Sum of Squared Errors (SSE) for regression or the Gini Index for classification) (McBratney et al., 2013).

The two most common packages for generating tree-based models in R are rpart and randomForest. The rpart package creates a regression or classification tree from binary splits that minimize node impurity. The output is a single decision tree that can be further "pruned", or trimmed back, using the cross-validation error statistic to reduce over-fitting.

The randomForest package builds on the same kind of tree but adds two layers of randomness: each tree is grown on a bootstrap sample of the observations, and each node is split using a random subset of the predictors. This process is repeated for hundreds of trees (as specified by the user). Unlike rpart, random forests do not produce a single graphical decision tree, since the predictions are averaged (or, for classification, decided by majority vote) across hundreds or thousands of trees. Instead, random forests produce a variable importance plot and a tabular statistical summary. [http://ncss-tech.github.io/stats_for_soil_survey/chapters/8_Tree_models/treemodels.html Tree-based Models | Katey Yoast]
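The partitioning step described above can be illustrated with a small, hedged sketch. It is written in plain Python rather than R, and it is not how rpart or randomForest are implemented; the function names (`sse`, `gini_weighted`, `best_split`) and the toy data are hypothetical, chosen only to show how a single binary split is picked by minimizing a loss over the two resulting groups.

```python
def sse(values):
    """Sum of squared errors of a group around its own mean (regression loss)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gini_weighted(labels):
    """Gini index of a group of class labels, weighted by group size.

    Weighting by size lets the two child groups be compared by simple addition.
    """
    if not labels:
        return 0.0
    n = len(labels)
    impurity = 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
    return n * impurity


def best_split(x, y, loss):
    """Find the threshold t for the split `x <= t` that minimizes
    loss(left group) + loss(right group).

    Returns (threshold, total_loss); threshold is None if x has no
    two distinct values to split between.
    """
    best = (None, float("inf"))
    distinct = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = loss(left) + loss(right)
        if total < best[1]:
            best = (t, total)
    return best


if __name__ == "__main__":
    # Toy data (made up for illustration): one predictor, two clear groups.
    predictor = [1, 2, 3, 10, 11, 12]
    response = [5.0, 5.5, 5.2, 9.8, 10.1, 10.0]
    classes = ["a", "a", "a", "b", "b", "b"]

    print(best_split(predictor, response, sse))           # regression split
    print(best_split(predictor, classes, gini_weighted))  # classification split
```

A full tree would apply `best_split` again to each resulting group, over every predictor, until a stopping rule is met. A random forest would repeat that on bootstrap samples of the observations while restricting each split to a random subset of predictors.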