Random forest is a statistical algorithm that is used to cluster points of data in functional groups. When the data set is large and/or there are many variables it becomes difficult to cluster the data because not all variables can be taken into account, therefore the algorithm can also give a certain chance that a data point belongs in a certain group.
Steps of the algorithm[change | change source]
This is how the clustering takes place.
- Of the entire set of data a subset is taken (training set).
- The algorithm clusters the data in groups and subgroups. If you would draw lines between the data points in a subgroup, and lines that connect subgroups into group etc. the structure would look somewhat like a tree. This is called a decision tree.
- At each split or node in this cluster/tree/dendrogram variables are chosen at random by the program to judge whether datapoints have a close relationship or not.
- The program makes multiple trees a.k.a. a forest. Each tree is different because for each split in a tree, variables are chosen at random.
- Then the rest of the dataset (not the training set) is used to predict which tree in the forests makes the best classification of the datapoints (in the dataset the right classification is known).
- The tree with the most predictive power is shown as output by the algorithm.
Using the algorithm[change | change source]
In a random forest algorithm the number of trees grown (ntree) and the number of variables that are used at each split (mtry) can be chosen by hand; example settings are 500 trees, 71 variables.