6 a] Explain Random Forest Classifier.
The Random Forest Classifier is an ensemble machine learning algorithm that builds multiple decision trees and combines their predictions to produce a more accurate and stable prediction. It is a generalization of decision trees, enhanced through a process called bagging (bootstrap aggregating).
Here’s how the Random Forest algorithm works (a minimal R sketch follows this list):
1. Bootstrapping:
- The algorithm creates multiple subsets of the training data by sampling with replacement, known as bootstrap sampling. Each bootstrap sample is typically the same size as the original training set; because sampling is done with replacement, it contains roughly 63% of the unique original observations, with the remainder appearing as duplicates (the sample size can be adjusted).
2. Building Trees:
- For each bootstrap sample, a decision tree is built. At each node of the tree, the algorithm considers only a random subset of the features (say, 5 out of 100; a common default for classification is the square root of the total number of features) when choosing the best split. This randomness reduces correlation among the trees and increases robustness.
3. No Pruning:
- Unlike traditional decision trees, the trees in a random forest are typically not pruned. Each tree is allowed to grow as deep as it needs; the resulting individual trees have low bias and high variance, and averaging many of them cancels out much of that variance.
4. Aggregation:
- Once all trees are built, the predictions of each tree are combined. For classification tasks, the final prediction is based on a majority vote among the trees, and for regression tasks, the prediction is the average of all trees’ outputs.
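To make these four steps concrete, here is a minimal from-scratch sketch (not from the original notes) on the built-in iris data: it draws bootstrap samples, fits deep unpruned rpart trees, and combines their predictions by majority vote. As a simplification, the random feature subset is drawn once per tree rather than at every split, which is what a true random forest (and the randomForest package) does at each node.
# minimal from-scratch sketch of bootstrapping, tree building, and majority voting
require(rpart)
set.seed(42)
data(iris)
n_trees <- 25                                   # N: number of trees
n_feats <- 2                                    # F: features used per tree (simplified: per tree, not per split)
predictors <- setdiff(names(iris), "Species")
# 1. bootstrap samples + 2. build trees on random feature subsets + 3. no pruning
forest <- lapply(seq_len(n_trees), function(i) {
  boot_idx <- sample(nrow(iris), replace=TRUE)  # bootstrap sample of the rows
  feats <- sample(predictors, n_feats)          # random subset of the features
  form <- reformulate(feats, response="Species")
  # cp=0 and minsplit=2 let the tree grow deep instead of being pruned
  rpart(form, data=iris[boot_idx, ], method="class",
        control=rpart.control(cp=0, minsplit=2))
})
# 4. aggregation: majority vote over the individual trees' class predictions
votes <- sapply(forest, function(tree) {
  as.character(predict(tree, newdata=iris, type="class"))
})
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == iris$Species)                  # training accuracy of the ensemble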
Hyperparameters:
- N (Number of Trees): You specify the number of decision trees in the forest.
- F (Number of Random Features): You specify the number of random features to be considered at each split in a tree.
- Sample Size: Optionally, you can adjust the size of the bootstrap sample.
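In the randomForest package these three hyperparameters correspond to the ntree, mtry, and sampsize arguments. A minimal sketch, again using the built-in iris data for illustration:
# the three hyperparameters expressed as randomForest() arguments
require(randomForest)
modRF <- randomForest(Species ~ ., data=iris,
                      ntree=500,                # N: number of trees in the forest
                      mtry=2,                   # F: features considered at each split
                      sampsize=nrow(iris))      # size of each bootstrap sample
print(modRF)                                    # OOB error estimate and confusion matrix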
Advantages of Random Forest:
- Increased Accuracy: By averaging the predictions of many trees, Random Forest tends to have higher accuracy and better generalization than a single decision tree.
- Robustness: It reduces the risk of overfitting, as the randomness introduced during feature selection and bootstrapping helps mitigate the sensitivity to noise in the data.
Disadvantages:
- Loss of Interpretability: While a single decision tree is interpretable, a random forest with many trees is much harder to interpret, making it difficult to understand the decision-making process.
CODE: (optional)
# Author: Jared Lander
#
# we will be using the diamonds data from ggplot
require(ggplot2)
# load and view the diamonds data
data(diamonds)
head(diamonds)
# plot a histogram with a line marking $12,000
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
# build a TRUE/FALSE variable indicating if the price is above our threshold
diamonds$Expensive <- ifelse(diamonds$price >= 12000, 1, 0)
head(diamonds)
# get rid of the price column
diamonds$price <- NULL

## glmnet
require(glmnet)
# build the predictor matrix, leaving out the last column, which is our response
x <- model.matrix(~., diamonds[, -ncol(diamonds)])
# build the response vector
y <- as.matrix(diamonds$Expensive)
# run the glmnet
system.time(modGlmnet <- glmnet(x=x, y=y, family="binomial"))
# plot the coefficient path
plot(modGlmnet, label=TRUE)

# this illustrates that setting a seed allows you to recreate random results; run them both a few times
set.seed(48872)
sample(1:10)

## decision tree
require(rpart)
# fit a simple decision tree
modTree <- rpart(Expensive ~ ., data=diamonds)
# plot the splits
plot(modTree)
text(modTree)

## bagging (or bootstrap aggregating)
require(boot)
mean(diamonds$carat)
sd(diamonds$carat)
# function for bootstrapping the mean
boot.mean <- function(x, i) { mean(x[i]) }
# allows us to find the variability of the mean
boot(data=diamonds$carat, statistic=boot.mean, R=120)
require(adabag)

## boosting
require(mboost)
system.time(modglmBoost <- glmboost(as.factor(Expensive) ~ ., data=diamonds, family=Binomial(link="logit")))
summary(modglmBoost)
?blackboost

## random forests
require(randomForest)
system.time(modForest <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE))
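A short follow-up, not part of the original script, that inspects the fitted forest using standard randomForest functions:
# inspect the fitted random forest (follow-up, not in the original script)
print(modForest)                         # OOB error estimate and confusion matrix
importance(modForest)                    # per-feature importance measures
varImpPlot(modForest)                    # plot variable importance
predict(modForest, newdata=iris[1:5, ])  # class predictions for a few rows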