6 a] Explain Random Forest Classifier.
The Random Forest Classifier is an ensemble machine learning algorithm that builds multiple decision trees and combines their predictions to produce a more accurate and stable prediction. It extends the single decision tree through bagging (bootstrap aggregating) combined with random feature selection.
Here’s how the Random Forest algorithm works:
1. Bootstrapping:
- The algorithm creates multiple subsets of the training data by sampling with replacement, known as bootstrap sampling. Each bootstrap sample is usually the same size as the original training set; because sampling is with replacement, roughly 63% of the distinct observations appear in any one sample (the sample size can be adjusted).
2. Building Trees:
- For each bootstrap sample, a decision tree is built. At each node of the tree, the algorithm selects a random subset of the features (say, 5 out of 100; a common default for classification is the square root of the number of features) to split the data on. This randomness reduces the correlation among the trees and increases robustness.
3. No Pruning:
- Unlike traditional decision trees, trees in a random forest are typically not pruned; each tree is allowed to grow as deep as it needs. Individual deep trees overfit (low bias, high variance), but averaging many decorrelated trees cancels out much of that variance.
4. Aggregation:
- Once all trees are built, the predictions of each tree are combined. For classification tasks, the final prediction is based on a majority vote among the trees, and for regression tasks, the prediction is the average of all trees’ outputs. (A small R sketch of these four steps follows this list.)
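A minimal from-scratch sketch of the four steps above, using rpart and the built-in iris data (illustrative only; to keep it short, the random feature subset is drawn once per tree rather than at every split, and names such as nTrees, nFeat and boot.idx are just placeholders):
require(rpart)
data(iris)
nTrees <- 25                               # N: number of trees
nFeat  <- 2                                # F: random features per tree
feats  <- setdiff(names(iris), "Species")
trees  <- vector("list", nTrees)
for (b in 1:nTrees) {
    boot.idx <- sample(nrow(iris), replace=TRUE)    # 1. bootstrap sample
    feat.sub <- sample(feats, nFeat)                # 2. random feature subset
    trees[[b]] <- rpart(reformulate(feat.sub, response="Species"),
                        data=iris[boot.idx, ],
                        control=rpart.control(cp=0, minsplit=2))   # 3. no pruning
}
# 4. aggregation: majority vote across all trees
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type="class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)                 # ensemble accuracy on the training data
The randomForest package used in the code listing below implements the same idea far more efficiently, drawing a fresh feature subset at every split.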
Hyperparameters (see the example call after this list):
- N (Number of Trees): You specify the number of decision trees in the forest.
- F (Number of Random Features): You specify the number of random features to be considered at each split in a tree.
- Sample Size: Optionally, you can adjust the size of the bootstrap sample.
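In the randomForest package these hyperparameters correspond to the ntree, mtry and sampsize arguments; the values below are arbitrary choices for illustration, again on the built-in iris data:
require(randomForest)
data(iris)
# N = 500 trees, F = 2 features tried at each split, bootstrap samples of 100 rows
modRF <- randomForest(Species ~ ., data=iris,
                      ntree=500, mtry=2, sampsize=100)
modRF   # printing the model shows the out-of-bag (OOB) error estimate and confusion matrix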
Advantages of Random Forest:
- Increased Accuracy: By averaging the predictions of many trees, Random Forest tends to have higher accuracy and better generalization than a single decision tree.
- Robustness: It reduces the risk of overfitting, as the randomness introduced during feature selection and bootstrapping helps mitigate the sensitivity to noise in the data.
Disadvantages:
- Loss of Interpretability: While a single decision tree is interpretable, a random forest with many trees is much harder to interpret, making it difficult to understand the decision-making process.
CODE: (optional)
# Author: Jared Lander
#
# we will be using the diamonds data from ggplot
require(ggplot2)
# load and view the diamonds data
data(diamonds)
head(diamonds)
# plot a histogram with a line marking $12,000
ggplot(diamonds) + geom_histogram(aes(x=price)) +
geom_vline(xintercept=12000)
# build a binary (1/0) variable indicating if the price is above our threshold
diamonds$Expensive <- ifelse(diamonds$price >= 12000, 1, 0)
head(diamonds)
# get rid of the price column
diamonds$price <- NULL
## glmnet
require(glmnet)
# build the predictor matrix; we are leaving out the last column, which is our response
x <- model.matrix(~., diamonds[, -ncol(diamonds)])
# build the response vector
y <- as.matrix(diamonds$Expensive)
# run the glmnet
system.time(modGlmnet <- glmnet(x=x, y=y, family="binomial"))
# plot the coefficient path
plot(modGlmnet, label=TRUE)
# this illustrates that setting a seed allows you to recreate random results; run them both a few times
set.seed(48872)
sample(1:10)
## decision tree
require(rpart)
# fit a simple decision tree
modTree <- rpart(Expensive ~ ., data=diamonds)
# plot the splits
plot(modTree)
text(modTree)
## bagging (or bootstrap aggregating)
require(boot)
mean(diamonds$carat)
sd(diamonds$carat)
# function for bootstrapping the mean
boot.mean <- function(x, i)
{
mean(x[i])
}
# allows us to find the variability of the mean
boot(data=diamonds$carat, statistic=boot.mean, R=120)
require(adabag)
## boosting
require(mboost)
system.time(modglmBoost <- glmboost(as.factor(Expensive) ~ .,
data=diamonds, family=Binomial(link="logit")))
summary(modglmBoost)
?blackboost
## random forests
require(randomForest)
system.time(modForest <- randomForest(Species ~ ., data=iris,
importance=TRUE, proximity=TRUE))
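# one possible follow-up: inspect the fitted forest and use it for prediction;
# printing the model reports the out-of-bag error estimate, importance() and
# varImpPlot() summarize variable importance, and predict() returns the
# majority-vote class (type="response") or per-class vote proportions (type="prob")
modForest
importance(modForest)
varImpPlot(modForest)
predict(modForest, newdata=iris[1:5, ], type="response")
predict(modForest, newdata=iris[1:5, ], type="prob")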
