1.c) Describe the process of fitting a model to a dataset in detail.
Answer:
Fitting a Model
- Fitting a model means estimating the parameters of the model using the observed data. The data serves as evidence to help approximate the real-world mathematical process that generated it. A good model fit refers to a model that accurately approximates the output when it is provided with unseen inputs.
- Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to obtain the parameters. When you estimate the parameters, they are actually estimators, meaning they are themselves functions of the data.
- Fitting the model is when you start actually coding: your code will read in the data, and you'll specify the functional form that you wrote down on paper.
- Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data. You should understand that optimization is taking place and how it works, but you don't have to code this part yourself; it underlies the built-in R or Python functions.
- The process involves running an algorithm on data for which the target variable is known ("labeled" data) to produce a machine learning model. The model's predictions are then compared to the real, observed values of the target variable to assess its accuracy (see the sketch after this list).
- Overfitting means that you used a dataset to estimate the parameters of your model, but the model isn't good at capturing reality beyond your sampled data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the extent that it performs poorly on unseen data.
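As a concrete illustration, here is a minimal sketch of the workflow described above: read in (here, simulated) data, write down a functional form, let a built-in optimizer find the maximum likelihood estimates, and then compare the model's predictions against held-out observations. It assumes NumPy and SciPy are available; the simulated data, the linear functional form, and names such as neg_log_likelihood and slope_hat are illustrative choices, not part of the original answer.

```python
# Minimal sketch: fit a linear model by maximum likelihood and check it on unseen data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate data from a "real-world" process: y = 2x + 1 + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=200)

# Hold out part of the data so the fit can be judged on unseen inputs
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

def neg_log_likelihood(params, x, y):
    """Negative log-likelihood of a linear model with Gaussian noise."""
    slope, intercept, log_sigma = params
    sigma = np.exp(log_sigma)  # keep the noise scale positive
    mu = slope * x + intercept
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (y - mu) ** 2 / (2 * sigma**2))

# A built-in optimizer does the heavy lifting, as described above
result = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0],
                  args=(x_train, y_train))
slope_hat, intercept_hat, _ = result.x  # these are estimators: functions of the data

# Compare predictions to the observed values on held-out data
y_pred = slope_hat * x_test + intercept_hat
print("Estimated parameters:", slope_hat, intercept_hat)
print("Test MSE:", np.mean((y_test - y_pred) ** 2))
```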
Causes of Overfitting:
- Complex Models: Models with too many parameters relative to the size of the training data can capture noise instead of underlying patterns.
- Insufficient Data: When the amount of training data is limited, complex models may find patterns where none exist due to randomness.
- Feature Overfitting: Including irrelevant features or too many features in the model can lead to overfitting.
- Lack of Regularization: Without regularization techniques, such as L1 and L2 regularization, which penalize overly complex models, there is nothing to discourage the model from fitting noise (see the sketch below).
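The effect of an L2 penalty can be illustrated with a short sketch: a high-degree polynomial fit to a small noisy sample (a complex model with insufficient data) is compared with the same model class fit with a ridge penalty. It assumes scikit-learn is installed; the dataset, polynomial degree, and alpha value are arbitrary illustrative choices, and the regularized fit will typically, though not always, generalize better.

```python
# Minimal sketch: L2 (ridge) regularization on a small, noisy sample.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Small noisy sample: a degree-12 polynomial has many parameters relative to 25 points
x = np.sort(rng.uniform(0, 1, size=25)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, size=25)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()

# Unregularized fit: prone to chasing the noise in the training data
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
overfit.fit(x, y)

# Same model class with an L2 penalty on the coefficients
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=1e-3))
regularized.fit(x, y)

print("Test MSE without regularization:",
      mean_squared_error(y_test, overfit.predict(x_test)))
print("Test MSE with L2 regularization:",
      mean_squared_error(y_test, regularized.predict(x_test)))
```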