A machine learning model is an abstraction derived from training data that can make predictions on new, unseen data. The modeling process involves training a machine learning algorithm on a dataset, tuning it to improve performance, validating it, and finally using it for predictions.
Steps Involved in Modeling
The four basic steps in the machine learning modeling process are:
- Choose a machine learning algorithm suitable for the problem and dataset.
- Input the training dataset to help the algorithm learn and capture patterns.
- Tune the hyperparameters to improve learning and accuracy.
- Evaluate the learned model on unseen data.
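As a rough illustration of these four steps, the sketch below uses scikit-learn and the Iris dataset (both illustrative assumptions, not part of the original notes): choose an algorithm, feed it the training data, set a tuning parameter, and evaluate on held-out data.

```python
# A minimal sketch of the four modeling steps; library and dataset are
# illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 1: choose an algorithm suited to the problem (here, a decision tree).
# Step 3: set a hyperparameter that controls learning (tree depth).
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Step 2: input the training data so the algorithm can capture patterns.
model.fit(X_train, y_train)

# Step 4: evaluate the learned model on unseen data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```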
Types of Parameters in Modeling
There are two types of parameters in machine learning:
- Model Parameters:
- These are learned directly from the training data.
- Example: Coefficients in linear regression, weights in neural networks, split attributes in decision trees.
- Hyperparameters:
- These are high-level configuration settings not learned directly from the data.
- Example: Learning rate, regularization term, number of decision trees in a random forest.
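A minimal sketch of the distinction, assuming scikit-learn's ridge regression as the example model: the regularization strength is a hyperparameter we set before training, while the coefficient and intercept are model parameters learned from the data.

```python
# Hyperparameters are configured before training; model parameters are learned
# from the training data. scikit-learn and the toy data are assumptions.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Hyperparameter: the regularization strength alpha, chosen by us, not learned.
model = Ridge(alpha=0.5)
model.fit(X, y)

# Model parameters: the coefficient and intercept, learned from the data.
print("learned coefficient:", model.coef_)
print("learned intercept:", model.intercept_)
```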
Error and Loss Function
When predictions are made, errors can occur:
- Training Error (In-Sample Error):
Error when predicting on the training dataset.
- Test Error (Out-of-Sample Error):
Error when predicting on new, unseen data.
- Loss Function (Mean Squared Error – MSE):
Measures the average squared difference between predicted and actual values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
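A short numerical sketch of the two error types, computing MSE with NumPy (an illustrative choice); the data values are made up for demonstration.

```python
# MSE as the average squared difference between predictions and actual values.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_train_actual = np.array([3.0, 5.0, 7.0])
y_train_pred   = np.array([2.8, 5.1, 7.3])   # predictions on the training set
y_test_actual  = np.array([9.0, 11.0])
y_test_pred    = np.array([8.2, 11.9])       # predictions on unseen data

print("Training (in-sample) error:", mse(y_train_actual, y_train_pred))
print("Test (out-of-sample) error:", mse(y_test_actual, y_test_pred))
```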
Model Selection and Model Evaluation
Two main concerns during model selection are:
- Model Performance – How well it performs on training data.
- Model Complexity – How complex the model becomes after training.
Model selection involves choosing the best-suited model or tuning hyperparameters. Since no single model is perfect, we aim to choose a model that performs reasonably well.
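As a rough sketch of this trade-off, the example below (scikit-learn and the synthetic data are illustrative assumptions) fits polynomial models of increasing complexity and compares their training and test errors; the most complex model typically achieves the lowest training error but not necessarily the lowest test error.

```python
# Performance vs. complexity: polynomial regression at several degrees.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print("degree", degree,
          "| train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 4),
          "| test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 4))
```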
Approaches for Model Selection
- Resampling Methods (e.g., Train/Test Split, Cross-validation)
- Direct Performance Measures (e.g., accuracy, error rate)
- Scoring Methods (e.g., Minimum Description Length)
Resampling Methods
Used to tune and evaluate the model using subsets of the dataset:
- Holdout Method:
- Split the dataset into a training and testing set.
- Simple but can suffer from high variance.
- K-Fold Cross Validation:
- Split data into k folds.
- Use k-1 folds to train and 1 fold to test, repeated k times.
- Final score is the average of k test scores.
- Stratified K-Fold:
- Like K-Fold, but ensures class balance is maintained in all folds.
- Leave-One-Out Cross Validation (LOOCV):
- Each sample is tested once, using all other samples for training.
- Gives a nearly unbiased performance estimate but is computationally expensive.
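A minimal sketch of these resampling schemes using scikit-learn (an illustrative assumption, with the built-in breast-cancer dataset standing in for any dataset); LOOCV follows the same pattern with LeaveOneOut in place of KFold.

```python
# Resampling sketches: holdout split, k-fold, and stratified k-fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (train_test_split, KFold,
                                     StratifiedKFold, cross_val_score)
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout: a single train/test split (simple, but the score varies with the split).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print("Holdout score:", model.fit(X_tr, y_tr).score(X_te, y_te))

# K-fold: train on k-1 folds, test on the remaining fold, average the k scores.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean score:", kfold_scores.mean())

# Stratified k-fold: each fold preserves the overall class proportions.
strat_scores = cross_val_score(
    model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("Stratified 5-fold mean score:", strat_scores.mean())
```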
Scoring-Based Model Selection: MDL Principle
- Minimum Description Length (MDL):
Chooses the model that minimizes the total encoding length of:
- The model
- The data given the model

- It supports Occam’s Razor, favoring simpler models with sufficient performance.
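Expressed as a formula (a standard statement of the principle, assumed here rather than quoted from these notes), MDL selects the model $M^{*}$ that minimizes the combined description length:

$$M^{*} = \arg\min_{M} \left[ L(M) + L(D \mid M) \right]$$

where $L(M)$ is the number of bits needed to describe the model and $L(D \mid M)$ is the number of bits needed to describe the data when encoded with the help of that model.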