Explain data preprocessing with an example
Answer:-
Data preprocessing is the process of cleaning, transforming, and organizing raw data before feeding it into a machine learning model. It improves the quality of data and helps the model learn effectively.
Need for Data Preprocessing:
Real-world data is often:
- Incomplete (missing values)
- Noisy (errors or outliers)
- Inconsistent (conflicting values)
- Unstructured (not in a usable format)
Preprocessing helps convert such data into a structured and clean format.
Steps in Data Preprocessing:
- Data Cleaning:
  - Handle missing data (e.g., by replacing with mean/median).
  - Remove duplicates and outliers.
  - Correct errors.
- Data Integration:
  - Combine data from multiple sources into a single dataset.
- Data Transformation:
  - Normalize or scale data (e.g., range 0 to 1).
  - Encode categorical data (e.g., convert “Male” to 0, “Female” to 1).
- Data Reduction:
  - Reduce the size of data by feature selection or dimensionality reduction.
- Data Discretization (optional):
  - Convert continuous values into categorical bins.
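The transformation and discretization steps can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names (`min_max_scale`, `label_encode`, `discretize`), the sample ages, and the bin edges are assumptions made for this example and are not part of the original answer. Libraries such as pandas or scikit-learn offer equivalent operations.

```python
def min_max_scale(values):
    """Rescale numbers linearly into the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping


def discretize(values, edges, names):
    """Put each value in the first bin whose upper edge it does not exceed.

    `names` needs one more entry than `edges`; the last name is the
    catch-all bin for values above every edge.
    """
    binned = []
    for v in values:
        for edge, name in zip(edges, names):
            if v <= edge:
                binned.append(name)
                break
        else:
            binned.append(names[-1])
    return binned


# Hypothetical sample data, chosen only for illustration.
ages = [18, 35, 52, 70]
genders = ["Male", "Female", "Female", "Male"]

scaled = min_max_scale(ages)
codes, mapping = label_encode(genders)
groups = discretize(ages, edges=[30, 60], names=["young", "middle", "senior"])

print(scaled)
print(codes, mapping)
print(groups)
```

With this sample, "Male" is encoded as 0 and "Female" as 1, matching the encoding example in the list above.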
Example:
Suppose you have a dataset to predict house prices:
Area (sqft) | Bedrooms | Price ($) | Location |
---|---|---|---|
1200 | 3 | 300000 | Bangalore |
NaN | 2 | 200000 | Mumbai |
1500 | NaN | 350000 | Chennai |
1300 | 3 | NaN | Bangalore |
Preprocessing steps:
- Fill missing Area with average value.
- Fill missing Bedrooms with mode (most common).
- Remove or impute missing Price.
- Encode Location using label encoding.
- Normalize Area and Price.
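As a rough sketch, these steps could be applied to the four rows of the table as follows. The dictionary keys, the `known` helper, and the `_scaled` and `location_code` field names are assumptions made for this illustration. The row with the missing Price is removed rather than imputed, which is one of the two options listed; that choice is made here only because Price is the value being predicted.

```python
from statistics import mean, mode

# The four rows from the table; None stands in for NaN.
rows = [
    {"area": 1200, "bedrooms": 3, "price": 300000, "location": "Bangalore"},
    {"area": None, "bedrooms": 2, "price": 200000, "location": "Mumbai"},
    {"area": 1500, "bedrooms": None, "price": 350000, "location": "Chennai"},
    {"area": 1300, "bedrooms": 3, "price": None, "location": "Bangalore"},
]


def known(data, column):
    """Return the non-missing values of one column."""
    return [r[column] for r in data if r[column] is not None]


# Steps 1 and 2: fill Area with the mean and Bedrooms with the mode.
area_mean = mean(known(rows, "area"))
bedrooms_mode = mode(known(rows, "bedrooms"))
for r in rows:
    if r["area"] is None:
        r["area"] = area_mean
    if r["bedrooms"] is None:
        r["bedrooms"] = bedrooms_mode

# Step 3: drop rows whose Price (the prediction target) is missing.
rows = [r for r in rows if r["price"] is not None]

# Step 4: label-encode Location in order of first appearance.
location_codes = {}
for r in rows:
    r["location_code"] = location_codes.setdefault(r["location"], len(location_codes))

# Step 5: min-max normalize Area and Price into the range 0 to 1.
for column in ("area", "price"):
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    for r in rows:
        r[column + "_scaled"] = (r[column] - lo) / (hi - lo)

for r in rows:
    print(r)
```

Label encoding assigns integers to city names, which implies an order among them that does not really exist; whether that matters depends on the model used.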