1. Data Collection:
Data collection is the first step in the data analytics process. It involves gathering high-quality, relevant datasets. Good data is essential for accurate machine learning and analytics.
Qualities of Good Data:
- Timeliness – Data must be recent and up-to-date.
- Relevancy – Data should match the machine learning or business needs.
- Understandability – Data should be clear, interpretable, and aligned with domain knowledge.
Sources of Data:
- Open/Public Data
  - Free and available without restrictions.
  - Examples: Government census, scientific experiments, healthcare data.
- Social Media Data
  - Data from platforms like Twitter, Facebook, Instagram, YouTube.
- Multimodal Data
  - Data with multiple formats: text, audio, image, video.
  - Examples: Image archives, web pages.
2. Data Preprocessing:
Raw data is often “dirty”, meaning it contains errors, missing values, and inconsistencies. Preprocessing is essential to clean and prepare data for analytics or ML.
Common Data Issues:
- Incomplete data
- Inaccurate or inconsistent values
- Missing values
- Duplicate entries
- Outliers and noise
Example Table – Illustration of ‘Bad’ Data
| Patient ID | Name | Age | Date of Birth (DoB) | Fever | Salary |
|---|---|---|---|---|---|
| 1 | John | 21 | – | Low | –1500 |
| 2 | Andre | 36 | – | High | Yes |
| 3 | David | 5 | 10/10/1980 | Low | “ ” |
| 4 | Raju | 136 | – | High | Yes |
Explanation of Dirty Data
- Missing Data: Salary is missing for David. DoB is missing for John, Andre, and Raju.
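- Inconsistent Data: David’s age is 5, but his DoB is 10/10/1980, so the two fields contradict each other. Salary is recorded as “Yes” for Andre and Raju, which is not a numeric amount.
- Noisy Data: John’s salary is –1500, but a salary cannot be negative.
- Outlier: Raju’s age is 136, which is most likely a typographical error.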
These errors affect machine learning accuracy and must be handled during data cleaning.
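The checks described above can be sketched in plain Python. This is only an illustration: the `find_issues` helper, the field names, the age cutoff of 120, and the reference year 2024 are assumptions made for the example and do not come from the notes.

```python
from datetime import date

# Rows from the example table; None marks a missing value.
patients = [
    {"id": 1, "name": "John", "age": 21, "dob": None, "fever": "Low", "salary": -1500},
    {"id": 2, "name": "Andre", "age": 36, "dob": None, "fever": "High", "salary": "Yes"},
    {"id": 3, "name": "David", "age": 5, "dob": date(1980, 10, 10), "fever": "Low", "salary": None},
    {"id": 4, "name": "Raju", "age": 136, "dob": None, "fever": "High", "salary": "Yes"},
]

MAX_PLAUSIBLE_AGE = 120  # assumed cutoff for flagging an age as an outlier
REFERENCE_YEAR = 2024    # assumed "current" year for the age/DoB consistency check


def find_issues(row):
    """Return short labels describing the problems found in one record."""
    issues = []
    if row["dob"] is None:
        issues.append("missing dob")
    if row["salary"] is None:
        issues.append("missing salary")
    elif not isinstance(row["salary"], (int, float)):
        issues.append("non-numeric salary")
    elif row["salary"] < 0:
        issues.append("negative salary")
    if row["age"] > MAX_PLAUSIBLE_AGE:
        issues.append("age outlier")
    if row["dob"] is not None:
        expected_age = REFERENCE_YEAR - row["dob"].year
        # Allow one year of slack because the birthday may not have passed yet.
        if abs(expected_age - row["age"]) > 1:
            issues.append("age/dob mismatch")
    return issues


report = {row["name"]: find_issues(row) for row in patients}
for name, issues in report.items():
    print(name, issues)
```

Rule-based checks like these only flag suspicious records; deciding how to repair each one is the job of the cleaning techniques that follow.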
Data Cleaning Techniques:
A. Handling Missing Data:
- Ignore the tuple – Remove rows with missing data.
- Manual filling – Domain expert fills values.
- Global constant – Use ‘Unknown’ or ‘Infinity’.
- Attribute mean – Fill each missing value with the mean of that attribute.
- Class-wise mean – Fill each missing value with the attribute's mean among records of the same class.
- Prediction models – Use ML to predict missing values.
Note: Filled-in values are estimates. If they are inaccurate, they can bias the analysis.
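As a rough sketch of the attribute-mean and class-wise-mean strategies, the following uses only the standard library. The records and both function names are hypothetical, chosen for illustration.

```python
from statistics import mean

# Hypothetical records: (class label, age); None marks a missing age.
records = [
    ("High", 30), ("High", None), ("High", 40),
    ("Low", 20), ("Low", 10), ("Low", None),
]


def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    known = [value for _, value in rows if value is not None]
    overall = mean(known)
    return [(label, overall if value is None else value) for label, value in rows]


def fill_with_class_mean(rows):
    """Replace each missing value with the mean of known values in the same class.

    Assumes every class has at least one known value.
    """
    by_class = {}
    for label, value in rows:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_means = {label: mean(values) for label, values in by_class.items()}
    return [(label, class_means[label] if value is None else value) for label, value in rows]


print(fill_with_attribute_mean(records))  # both gaps become 25
print(fill_with_class_mean(records))      # gaps become 35 (High) and 15 (Low)
```

In this toy data the class-wise mean gives different fills for the two classes, while the attribute mean gives both gaps the same value.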
B. Noise Removal Techniques:
Noise is random error or variance in the data.
Binning smooths noisy data: sort the values, partition them into bins, then replace each value with a representative of its bin.
Types of Binning:
- Smoothing by mean – Replace values with bin average.
- Smoothing by median – Replace with bin median.
- Smoothing by bin boundaries – Replace each value with the closer of the bin's minimum and maximum.
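The three smoothing variants can be sketched as follows, assuming equal-frequency bins (each bin holds the same number of sorted values). The sample values and function names are illustrative and not taken from the notes.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the nearer of its bin's min and max (ties go to the min)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative values only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))      # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```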
3. Data Integration:
Data integration merges data from multiple sources. Merging can introduce redundant data, so the aim is to identify and remove duplicates during the merge.
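A minimal sketch of merging with duplicate removal, assuming the sources share a key that identifies the same entity. The `integrate` helper, the `patient_id` field, and the two source lists are hypothetical.

```python
def integrate(*sources, key):
    """Merge record lists, keeping the first record seen for each key value."""
    merged = {}
    for source in sources:
        for record in source:
            # setdefault leaves an existing entry untouched, so later duplicates are dropped.
            merged.setdefault(key(record), record)
    return list(merged.values())


# Hypothetical sources that overlap on patient 2.
clinic_a = [{"patient_id": 1, "name": "John"}, {"patient_id": 2, "name": "Andre"}]
clinic_b = [{"patient_id": 2, "name": "Andre"}, {"patient_id": 3, "name": "David"}]

combined = integrate(clinic_a, clinic_b, key=lambda r: r["patient_id"])
print(combined)  # three records: patient 2 appears once
```

Keeping the first record is the simplest policy. When sources disagree on a field for the same key, a real pipeline would also need a rule for resolving the conflict.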
4. Data Transformation:
Data transformation converts data into a form that improves model performance.
Normalization rescales attributes to a common scale so that attributes with large values do not dominate the others.
Normalization Techniques:
A. Min-Max Normalization
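Scales data to a specific range (usually 0 to 1):

$$v' = \frac{v - v_{\min}}{v_{\max} - v_{\min}}$$

Here $v_{\min}$ and $v_{\max}$ are the smallest and largest observed values of the attribute. For a target range $[a, b]$ other than 0 to 1, multiply the result by $(b - a)$ and add $a$.
Commonly used to scale inputs for neural networks.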
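A minimal sketch of min-max scaling, assuming the standard definition in which each value is shifted by the observed minimum and divided by the observed range. The function name and the `new_min`/`new_max` parameters are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values identical: the range is zero, so map everything to new_min.
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


ages = [21, 36, 5, 136]  # ages from the example table, outlier included
scaled = min_max_normalize(ages)
print(scaled)
```

Note how the outlier 136 maps to 1.0 and pushes the other ages below 0.25, which is why outliers are usually handled before min-max scaling.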
B. Z-Score Normalization
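Standardizes data using mean and standard deviation:

$$v' = \frac{v - \mu}{\sigma}$$

Here $\mu$ is the mean and $\sigma$ is the standard deviation of the attribute, so the transformed values have mean 0 and standard deviation 1.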
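A short sketch of z-score standardization using the standard library. It assumes the population standard deviation (`statistics.pstdev`); some tools use the sample standard deviation instead. The sample values are illustrative.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption of this sketch)
    if sigma == 0:
        # All values identical: every value sits exactly at the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z = z_score_normalize(scores)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```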

5. Data Reduction:
Data reduction shrinks the dataset while producing the same, or nearly the same, analysis results.
Techniques:
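- Data aggregation – Summarize detailed records into totals or averages (for example, daily sales into monthly sales).
- Feature selection – Keep only the features that matter for the analysis.
- Dimensionality reduction – Reduce the number of attributes, for example with PCA (principal component analysis).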
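Two of these techniques can be sketched in a few lines. The data is hypothetical, and the variance filter is just one simple feature-selection heuristic, assumed here for illustration. PCA is left out because it needs a numerical library.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical daily sales records: (month, amount).
daily_sales = [("Jan", 100), ("Jan", 150), ("Feb", 200), ("Feb", 50), ("Feb", 25)]


def aggregate_by_month(rows):
    """Data aggregation: collapse daily rows into one total per month."""
    totals = defaultdict(int)
    for month, amount in rows:
        totals[month] += amount
    return dict(totals)


def drop_low_variance(columns, threshold=0.0):
    """Feature selection heuristic: keep columns whose variance exceeds threshold."""
    return {name: vals for name, vals in columns.items() if pvariance(vals) > threshold}


monthly = aggregate_by_month(daily_sales)
print(monthly)  # {'Jan': 250, 'Feb': 275}

# Hypothetical feature columns; "country" is constant, so it carries no information.
features = {"age": [21, 36, 5], "country": [1, 1, 1]}
selected = drop_low_variance(features)
print(list(selected))  # ['age']
```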