Tree-based...

* [[Capabilities]]
An ensemble is a collection of models that do not predict the real objective field of the ensemble directly, but rather the improvements needed for the function that computes this objective. As shown in the image below, the modeling process starts by assigning initial values to this function and creates a model to predict the gradient that will improve the function's results. The next iteration takes both the initial values and these corrections as its starting state, and looks for the next gradient to improve the prediction function even further. The process stops when the prediction function's results match the real values or the number of iterations reaches a limit. As a consequence, every model in the ensemble has a numeric objective field: the gradient for this function. The real objective field of the problem is then computed by adding up the contributions of each model, weighted by some coefficients. For classification problems, each category (or class) in the objective field has its own subset of models in the ensemble whose goal is adjusting the function to predict that category. [http://blog.bigml.com/2017/03/14/introduction-to-boosted-trees/ Introduction to Boosted Trees | BigML]
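The boosting loop described above (fit a model to the current gradient, add its weighted contribution, repeat) can be sketched in plain Python. This is a minimal illustrative sketch, not any library's implementation: the depth-1 "stump" learner, the learning rate, and the toy data are all assumptions made for the example.

```python
# Minimal gradient boosting for regression (squared-error loss) on a
# single 1-D feature, using depth-1 decision "stumps" as base learners.

def fit_stump(x, residuals):
    """Find the split threshold minimizing the SSE of two leaf means."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi, t=t, ml=ml, mr=mr: ml if xi <= t else mr

def boost(x, y, n_rounds=20, lr=0.5):
    """Start from a constant prediction, then repeatedly fit the residuals
    (the negative gradient of the squared-error loss)."""
    f0 = sum(y) / len(y)                      # initial function values
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        s = fit_stump(x, resid)               # model that predicts the gradient
        stumps.append(s)
        pred = [pi + lr * s(xi) for pi, xi in zip(pred, x)]
    # Final prediction: initial value plus the weighted model contributions.
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)

# Toy data: a low plateau and a high plateau.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 7.8, 8.1, 8.0]
model = boost(x, y)
```

After a few rounds the summed stump corrections recover both plateaus, which is the "adding up contributions weighted by coefficients" step from the paragraph above (here all coefficients equal the learning rate).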
Tree-based models are a supervised machine learning method commonly used in soil survey and ecology for exploratory data analysis and prediction due to their simple nonparametric design. Instead of fitting a single global model to the data, tree-based models recursively partition the data into increasingly homogeneous groups based on values that minimize a loss function (such as Sum of Squared Errors (SSE) for regression or the Gini Index for classification) (McBratney et al., 2013). The two most common packages for generating tree-based models in R are rpart and randomForest. The rpart package creates a regression or classification tree based on binary splits that maximize homogeneity and minimize impurity. The output is a single decision tree that can be further "pruned" back using the cross-validation error statistic to reduce over-fitting. The randomForest package is similar to rpart, but is doubly random: each tree is grown on a random subset of observations, each node is split using a random subset of predictors, and this process is repeated hundreds of times (as specified by the user). Unlike rpart, random forests do not produce a graphical decision tree, since predictions are averaged across hundreds or thousands of trees. Instead, random forests produce a variable importance plot and a tabular statistical summary. [http://ncss-tech.github.io/stats_for_soil_survey/chapters/8_Tree_models/treemodels.html Tree-based Models | Katey Yoast]
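As a concrete illustration of the split criterion, the binary-split search that a CART-style tree performs at a single classification node can be sketched in plain Python. This is a hedged sketch of the idea, not the actual rpart or randomForest implementation; the helper names and toy data are assumptions for the example.

```python
# Score candidate binary splits on one feature with the Gini index and
# pick the threshold whose weighted child impurity is lowest.

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Return (threshold, weighted Gini) for the best split x <= threshold."""
    n = len(y)
    best_t, best_score = None, float("inf")
    for t in sorted(set(x))[:-1]:          # max value cannot split the node
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: class "a" at low feature values, class "b" at high ones.
x = [1, 2, 3, 10, 11, 12]
y = ["a", "a", "a", "b", "b", "b"]
t, score = best_split(x, y)    # clean split between 3 and 10, impurity 0
```

A tree applies this search recursively to each child node until the groups are homogeneous or a stopping rule fires; a random forest runs the same search, but only over a random subset of predictors at each node and on a bootstrap sample of the observations for each tree.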
  
https://littleml.files.wordpress.com/2017/03/boosted-trees-process.png?w=497
https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/16533dca42cafce4b00d224727dc5d977ef7d67e/8-Figure3-1.png
  
 
<youtube>eKD5gxPPeY0</youtube>
 
 
<youtube>J4Wdy0Wc_xQ</youtube>
 

Revision as of 19:31, 3 June 2018
